F-Seq2: improving the feature density based peak caller with dynamic statistics

Superior Performance ◽

Open Chromatin ◽

Test Statistics ◽

Peak Calling ◽

User Input ◽

Distance Analysis ◽

Sequencing Technologies ◽

A Genome ◽

ABSTRACT Genomic and epigenomic features are captured at a genome-wide level by using high-throughput sequencing technologies. Peak calling is one of the first essential steps in analyzing these features, delineating regions such as open chromatin regions and transcription factor binding sites. Our original peak calling software, F-Seq, has been widely used and shown to be the most sensitive and accurate peak caller for DNase I hypersensitive sites sequencing (DNase-seq) data. However, F-Seq neither supports user-input control datasets nor reports test statistics, limiting its ability to capture systematic and experimental biases and to accurately estimate background distributions. Here we present an improved version, F-Seq2, which combines the power of kernel density estimation and a dynamic “continuous” Poisson distribution to robustly account for local biases and resolve ties when ranking candidate peaks. In F-score and motif distance analyses, we demonstrate the superior performance of F-Seq2 over competing peak callers used by the ENCODE Consortium on simulated and real ATAC-seq and ChIP-seq datasets. The output of F-Seq2 is suitable for irreproducible discovery rate (IDR) analysis, as test statistics are calculated for individual candidate summits and ties are robustly resolved.

F-Seq2: improving the feature density based peak caller with dynamic statistics

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab012 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Nanxiang Zhao ◽

Alan P Boyle

Keyword(s):

Open Chromatin ◽

Test Statistic ◽

User Input ◽

Rate Analysis ◽

Genome Wide ◽

Input Control ◽

A Genome ◽

Random Expectation ◽

Abstract Genomic and epigenomic features are captured at a genome-wide level by using high-throughput sequencing (HTS) technologies. Peak calling delineates features identified in HTS experiments, such as open chromatin regions and transcription factor binding sites, by comparing the observed read distributions to a random expectation. Since its introduction, F-Seq has been widely used and shown to be the most sensitive and accurate peak caller for DNase I hypersensitive site (DNase-seq) data. However, the first release (F-Seq1) has two key limitations: lack of support for user-input control datasets, and poor test statistic reporting. These constrain its ability to capture systematic and experimental biases inherent to the background distributions in peak prediction, and to subsequently rank predicted peaks by confidence. To address these limitations, we present F-Seq2, which combines kernel density estimation and a dynamic ‘continuous’ Poisson test to account for local biases and accurately rank candidate peaks. The output of F-Seq2 is suitable for irreproducible discovery rate analysis as test statistics are calculated for individual candidate summits, allowing direct comparison of predictions across replicates. These improvements significantly boost the performance of F-Seq2 for ATAC-seq and ChIP-seq datasets, outperforming competing peak callers used by the ENCODE Consortium in terms of precision and recall.
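
The two ingredients named in the abstract, kernel density estimation and a Poisson test against a local background, can be illustrated with a minimal sketch. This is not the F-Seq2 implementation: the function names, the Gaussian kernel, the bandwidth, the window size and the background rate are all assumptions chosen for illustration, and the ordinary discrete Poisson tail is used in place of the paper's "continuous" Poisson test.

```python
import math

def gaussian_kde_signal(read_positions, grid, bandwidth):
    """Sum of Gaussian kernels centred on each read, evaluated at each grid point."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    signal = []
    for x in grid:
        total = 0.0
        for r in read_positions:
            z = (x - r) / bandwidth
            total += norm * math.exp(-0.5 * z * z)
        signal.append(total)
    return signal

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam) with integer k (discrete simplification)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

# Hypothetical toy data: a cluster of reads near position 100 plus scattered background.
reads = [98, 99, 100, 100, 101, 102, 20, 150, 180]
grid = list(range(0, 201, 10))
signal = gaussian_kde_signal(reads, grid, bandwidth=5.0)

# Candidate summit: grid point with the highest smoothed density.
summit = grid[max(range(len(grid)), key=signal.__getitem__)]

# Test the read count in a window around the summit against an assumed
# local background rate of one read per window.
window_count = sum(1 for r in reads if abs(r - summit) <= 10)
p_value = poisson_upper_tail(window_count, lam=1.0)
```

In a real setting the background rate would be estimated from a control dataset or from the surrounding region rather than fixed by hand, which is the limitation of F-Seq1 that the abstract describes.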

WACS: Improving ChIP-seq Peak Calling by Optimally Weighting Controls

10.1101/582650 ◽

2019 ◽

Author(s):

Aseel Awdeh ◽

Marcel Turcotte ◽

Theodore J. Perkins

Keyword(s):

Chromatin Immunoprecipitation ◽

Noise Distribution ◽

Least Squares Regression ◽

Peak Calling ◽

Background Signal ◽

Weighted Analysis ◽

Different Types ◽

Calling Algorithm ◽

Abstract Motivation: Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is different types of bias in different ChIP-seq experiments. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating “smart” controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results. Results: We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS: First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses. Conclusion: This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.
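
The first step described above, estimating one non-negative weight per control, can be sketched as follows. This is an illustrative toy and not the WACS code: the function name, the binned coverage vectors and the projected-gradient solver are assumptions made for the example, and the solver is a simple stand-in for whatever non-negative least squares routine the authors use.

```python
def nnls_projected_gradient(controls, treatment, steps=2000):
    """Minimise ||sum_j w_j * controls[j] - treatment||^2 subject to w_j >= 0.

    controls: list of per-bin coverage vectors, one per control dataset.
    treatment: per-bin coverage vector of the ChIP-seq experiment.
    """
    n_controls = len(controls)
    n_bins = len(treatment)
    # Step size 1 / (sum of squared entries): the sum of squares is the trace
    # of A^T A, an upper bound on its largest eigenvalue, so the step is safe.
    total_sq = sum(v * v for c in controls for v in c)
    lr = 1.0 / total_sq if total_sq > 0 else 1.0
    weights = [0.0] * n_controls
    for _ in range(steps):
        residual = [
            sum(weights[j] * controls[j][i] for j in range(n_controls)) - treatment[i]
            for i in range(n_bins)
        ]
        for j in range(n_controls):
            grad = sum(residual[i] * controls[j][i] for i in range(n_bins))
            # Gradient step followed by projection onto the non-negative orthant.
            weights[j] = max(0.0, weights[j] - lr * grad)
    return weights

# Hypothetical binned read counts for two controls and one treatment sample.
control_a = [1.0, 0.0, 2.0, 1.0]
control_b = [0.0, 1.0, 1.0, 3.0]
treatment = [2.0 * a + 0.5 * b for a, b in zip(control_a, control_b)]

weights = nnls_projected_gradient([control_a, control_b], treatment)
```

With real data, a library routine such as `scipy.optimize.nnls` would be the more usual choice; the hand-written loop is used here only to keep the example dependency-free.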

LanceOtron: a deep learning peak caller for ATAC-seq, ChIP-seq, and DNase-seq

10.1101/2021.01.25.428108 ◽

2021 ◽

Author(s):

Lance D. Hentges ◽

Martin J. Sergeant ◽

Damien J. Downes ◽

Jim R. Hughes ◽

Stephen Taylor

Keyword(s):

Image Recognition ◽

Data Extraction ◽

Chromatin Modification ◽

Genomic Data ◽

Analytical Techniques ◽

Great Promise ◽

Open Chromatin ◽

Peak Calling ◽

General Utility ◽

Abstract Genomics technologies, such as ATAC-seq, ChIP-seq, and DNase-seq, have revolutionized molecular biology, generating a complete genome’s worth of signal in a single assay. Coupled with the use of genome browsers, researchers can now see and identify important DNA encoded elements as peaks in an analog signal. Despite the ease with which humans can visually identify peaks, converting these signals into meaningful genome-wide peak calls from such massive datasets requires complex analytical techniques. Current methods use statistical frameworks to identify peaks as sites of significant signal enrichment, discounting that the analog data do not follow any archetypal distribution. Recent advances in artificial intelligence have shown great promise in image recognition, on par with or exceeding human ability, providing an opportunity to reimagine and improve peak calling. We present an interactive and intuitive peak calling framework, LanceOtron, built around image recognition using a wide and deep neural network. We hand-labelled 499 Mb of genomic data, built 5,000 models, and tested with over 100 unique users from labs around the world. In benchmarking open chromatin, transcription factor binding, and chromatin modification datasets, LanceOtron outperforms the long-standing, gold-standard peak caller MACS2 with its increased selectivity and near perfect sensitivity. Additionally, this command-line optional approach allows researchers to easily generate optimal peak-calls using only a web interface. Together, the enhanced performance and usability of LanceOtron will improve the reliability and reproducibility of peak calls and subsequent data analysis. This tool highlights the general utility of applying machine learning to genomic data extraction and analysis.

A Convolutional Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

10.1101/2020.09.29.318642 ◽

2020 ◽

Author(s):

Ziqi Ke ◽

Haris Vikalo

Keyword(s):

Dimensional Space ◽

Superior Performance ◽

Stochastic Gradient Descent ◽

Viral Quasispecies ◽

Sequencing Data ◽

Consensus Sequences ◽

Haplotype Assembly ◽

Sequencing Technologies ◽

Low Dimensional

Abstract Haplotype assembly and viral quasispecies reconstruction are challenging tasks concerned with analysis of genomic mixtures using sequencing data. High-throughput sequencing technologies generate enormous amounts of short fragments (reads) which essentially oversample components of a mixture; the representation redundancy enables reconstruction of the components (haplotypes, viral strains). The reconstruction problem, known to be NP-hard, boils down to grouping together reads originating from the same component in a mixture. Existing methods struggle to solve this problem with the required level of accuracy and low runtimes; the problem is becoming increasingly challenging as the number and length of the components increase. This paper proposes a read clustering method based on a convolutional auto-encoder designed to first project sequenced fragments to a low-dimensional space and then estimate the probability of the read origin using learned embedded features. The components are reconstructed by finding consensus sequences that agglomerate reads from the same origin. Mini-batch stochastic gradient descent and dimension reduction of reads allow the proposed method to efficiently deal with massive numbers of long reads. Experiments on simulated, semi-experimental and experimental data demonstrate the ability of the proposed method to accurately reconstruct haplotypes and viral quasispecies, often demonstrating superior performance compared to state-of-the-art methods.
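
The final reconstruction step mentioned above, building a consensus sequence from reads assigned to the same component, can be sketched with a per-position majority vote. This covers only that last step under strong assumptions: the reads are taken to be already clustered and already placed on a common coordinate system, with `-` marking uncovered positions. The function name and the toy reads are hypothetical, and the clustering itself (the auto-encoder) is not shown.

```python
from collections import Counter

def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over reads aligned to the same coordinates."""
    length = max(len(r) for r in aligned_reads)
    out = []
    for i in range(length):
        # Collect the bases that actually cover position i.
        column = [r[i] for r in aligned_reads if i < len(r) and r[i] != gap]
        out.append(Counter(column).most_common(1)[0][0] if column else gap)
    return "".join(out)

# Hypothetical reads assigned to one haplotype; the third read carries one error.
cluster = ["ACGT-", "ACGTA", "-CTTA", "ACGTA"]
haplotype = consensus(cluster)
```

Because each position is covered by several reads, an isolated sequencing error such as the `T` in the third read is outvoted, which is the redundancy argument the abstract makes.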

WACS: improving ChIP-seq peak calling by optimally weighting controls

BMC Bioinformatics ◽

10.1186/s12859-020-03927-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Aseel Awdeh ◽

Marcel Turcotte ◽

Theodore J. Perkins

Keyword(s):

Chromatin Immunoprecipitation ◽

Noise Distribution ◽

Least Squares Regression ◽

Peak Calling ◽

Background Signal ◽

Weighted Analysis ◽

Different Types ◽

Calling Algorithm ◽

Abstract Background: Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is different types of bias in different ChIP-seq experiments. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating “smart” controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results. Results: We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS: First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses. Conclusions: This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.

ChIPdig: a comprehensive user-friendly tool for mining multi-sample ChIP-seq data

F1000Research ◽

10.12688/f1000research.20027.1 ◽

2019 ◽

Vol 8 ◽

pp. 1295 ◽

Author(s):

Ruben Esse

Keyword(s):

Data Analysis ◽

Enrichment Analysis ◽

Peak Calling ◽

Read Mapping ◽

Sequencing Technologies ◽

Genome Wide ◽

Wet Lab ◽

User Friendly ◽

Epigenetic Research

In recent years, epigenetic research has enjoyed explosive growth as high-throughput sequencing technologies become more accessible and affordable. However, this advancement has not been matched with similar progress in data analysis capabilities from the perspective of experimental biologists not versed in bioinformatic languages. For instance, chromatin immunoprecipitation followed by next-generation sequencing (ChIP-seq) is at present widely used to identify genomic loci of transcription factor binding and histone modifications. Basic ChIP-seq data analysis, including read mapping and peak calling, can be accomplished through several well-established tools, but more sophisticated analyses aimed at comparing data derived from different conditions or experimental designs constitute a significant bottleneck. We reason that the implementation of a single comprehensive ChIP-seq analysis pipeline could be beneficial for many experimental (wet lab) researchers who would like to generate genomic data. Here we present ChIPdig, a stand-alone application with adjustable parameters designed to allow researchers to perform several analyses, namely read mapping to a reference genome, peak calling, annotation of regions based on reference coordinates (e.g. transcription start and termination sites, exons, introns, and 5' and 3' untranslated regions), and generation of heatmaps and metaplots for visualizing coverage. Importantly, ChIPdig accepts multiple ChIP-seq datasets as input, allowing genome-wide differential enrichment analysis in regions of interest to be performed. ChIPdig is written in R and enables access to several existing and highly utilized packages through a simple user interface powered by the Shiny package. Here, we illustrate the utility and user-friendly features of ChIPdig by analyzing H3K36me3 and H3K4me3 ChIP-seq profiles generated by the modENCODE project as an example.
ChIPdig offers a comprehensive and user-friendly pipeline for analysis of multiple sets of ChIP-seq data by both experimental and computational researchers. It is open source and available at https://github.com/rmesse/ChIPdig.

Genome-Wide DNA Alterations in X-Irradiated Human Gingiva Fibroblasts

International Journal of Molecular Sciences ◽

10.3390/ijms21165778 ◽

2020 ◽

Vol 21 (16) ◽

pp. 5778 ◽

Author(s):

Neetika Nath ◽

Lisa Hagenau ◽

Stefan Weiss ◽

Ana Tzvetkova ◽

Lars R. Jensen ◽

...

Keyword(s):

Genetic Material ◽

Medical Diagnostics ◽

Relative Increase ◽

Experimental Conditions ◽

Sequencing Technologies ◽

Genome Wide ◽

A Genome ◽

Human Gingiva ◽

Dna Alterations

While ionizing radiation (IR) is a powerful tool in medical diagnostics, nuclear medicine, and radiology, it is also a serious threat to the integrity of genetic material. Mutagenic effects of IR on the human genome have long been the subject of research, yet comparatively little is known about the genome-wide effects of IR exposure at the DNA-sequence level. In this study, we employed high throughput sequencing technologies to investigate IR-induced DNA alterations in human gingiva fibroblasts (HGF) that were acutely exposed to 0.5, 2, and 10 Gy of 240 kV X-radiation followed by repair times of 16 h or 7 days before whole-genome sequencing (WGS). Our analysis of the obtained WGS datasets revealed patterns of IR-induced variant (SNV and InDel) accumulation across the genome, within chromosomes as well as around the borders of topologically associating domains (TADs). Chromosome 19 consistently accumulated the highest number of SNV and InDel events. Translocations showed variable patterns but with recurrent chromosomes of origin (e.g., Chr7 and Chr16). IR-induced InDels showed an increase in number relative to SNVs and a characteristic signature with respect to the frequency of triplet deletions in areas without repetitive or microhomology features. Over all experimental conditions and datasets, the majority of SNVs per genome had no or little predicted functional impact, with a maximum of 62 showing damaging potential. A dose-dependent effect of IR was surprisingly not apparent. We also observed a significant reduction in transition/transversion (Ti/Tv) ratios for IR-dependent SNVs, which could point to a contribution of the mismatch repair (MMR) system, which strongly favors the repair of transitions over transversions, to the IR-induced DNA-damage response in human cells.
Taken together, our results show the presence of distinguishable characteristic patterns of IR-induced DNA-alterations on a genome-wide level and implicate DNA-repair mechanisms in the formation of these signatures.
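
The transition/transversion (Ti/Tv) ratio discussed in this abstract is a simple count ratio, and a small sketch makes the definition concrete. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A<->G, C<->T); every other single-base substitution is a transversion. The function name and the variant list below are hypothetical and are not data from the study.

```python
# The four transition substitutions, written as (reference, alternate) pairs.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snvs):
    """Return transitions / transversions for a list of (ref, alt) base pairs."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = sum(1 for ref, alt in snvs if ref != alt and (ref, alt) not in TRANSITIONS)
    if tv == 0:
        raise ValueError("Ti/Tv is undefined when there are no transversions")
    return ti / tv

# Hypothetical SNV calls: three transitions and two transversions.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(example_snvs)
```

A lower ratio in one sample set than in another means transversions make up a larger share of its SNVs, which is the direction of change the abstract reports for IR-dependent variants.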

ChIPdig: a comprehensive user-friendly tool for mining multi-sample ChIP-seq data

10.1101/220079 ◽

2017 ◽

Cited By ~ 2

Author(s):

Ruben Esse ◽

Alla Grishok

Keyword(s):

Data Analysis ◽

Enrichment Analysis ◽

Peak Calling ◽

Read Mapping ◽

Sequencing Technologies ◽

Genome Wide ◽

Wet Lab ◽

User Friendly ◽

Epigenetic Research

Abstract Background: In recent years, epigenetic research has enjoyed explosive growth as high-throughput sequencing technologies become more accessible and affordable. However, this advancement has not been matched with similar progress in data analysis capabilities from the perspective of experimental biologists not versed in bioinformatic languages. For instance, chromatin immunoprecipitation followed by next-generation sequencing (ChIP-seq) is at present widely used to identify genomic loci of transcription factor binding and histone modifications. Basic ChIP-seq data analysis, including read mapping and peak calling, can be accomplished through several well-established tools, but more sophisticated analyses aimed at comparing data derived from different conditions or experimental designs constitute a significant bottleneck. We reason that the implementation of a single comprehensive ChIP-seq analysis pipeline could be beneficial for many experimental (wet lab) researchers who would like to generate genomic data. Results: Here we present ChIPdig, a stand-alone application with adjustable parameters designed to allow researchers to perform several analyses, namely read mapping to a reference genome, peak calling, annotation of regions based on reference coordinates (e.g. transcription start and termination sites, exons, introns, 5′ UTRs and 3′ UTRs), and generation of heatmaps and metaplots for visualizing coverage. Importantly, ChIPdig accepts multiple ChIP-seq datasets as input, allowing genome-wide differential enrichment analysis in regions of interest to be performed. ChIPdig is written in R and enables access to several existing and highly utilized packages through a simple user interface powered by the Shiny package.
Here, we illustrate the utility and user-friendly features of ChIPdig by analyzing H3K36me3 and H3K4me3 ChIP-seq profiles generated by the modENCODE project as an example. Conclusions: ChIPdig offers a comprehensive and user-friendly pipeline for analysis of multiple sets of ChIP-seq data by both experimental and computational researchers. It is open source and available at https://github.com/rmesse/ChIPdig.

Unraveling the Genome of a High Yielding Colombian Sugarcane Hybrid

Frontiers in Plant Science ◽

10.3389/fpls.2021.694859 ◽

2021 ◽

Vol 12 ◽

Author(s):

Jhon Henry Trujillo-Montenegro ◽

María Juliana Rodríguez Cubillos ◽

Cristian Darío Loaiza ◽

Manuel Quintero ◽

Héctor Fabio Espitia-Navarro ◽

...

Keyword(s):

Genome Assembly ◽

Saccharum Officinarum ◽

Entire Genome ◽

Protein Coding ◽

Sequencing Technologies ◽

Selection Processes ◽

A Genome ◽

Recent Developments ◽

Sugarcane Hybrids

Recent developments in High Throughput Sequencing (HTS) technologies and bioinformatics, including improved read lengths and genome assemblers, allow the reconstruction of complex genomes with unprecedented quality and contiguity. Sugarcane has one of the most complicated genomes among grasses, with a haploid length of 1 Gbp and a ploidy between 8 and 12. In this work, we present a genome assembly of the Colombian sugarcane hybrid CC 01-1940. Three types of sequencing technologies were combined for this assembly: PacBio long reads, Illumina paired short reads, and Hi-C reads. We achieved a median contig length of 34.94 Mbp and a total genome assembly of 903.2 Mbp. We annotated a total of 63,724 protein coding genes and performed a reconstruction and comparative analysis of the sucrose metabolism pathway. Nucleotide evolution measurements between orthologs with close species suggest that divergence between Saccharum officinarum and Saccharum spontaneum occurred <2 million years ago. Synteny analysis between CC 01-1940 and the S. spontaneum genome confirms the presence of translocation events between the species and a random contribution throughout the entire genome in current sugarcane hybrids. Analysis of RNA-Seq data from leaf and root tissue of contrasting sugarcane genotypes subjected to water stress treatments revealed 17,490 differentially expressed genes, of which 3,633 correspond to genes expressed exclusively in tolerant genotypes. We expect the resources presented here to serve as a source of information to improve the selection of new varieties in sugarcane breeding programs.

Reassortment of Genome Segments Creates Stable Lineages Among Strains of Orchid Fleck Virus Infecting Citrus in Mexico

Phytopathology ◽

10.1094/phyto-07-19-0253-fi ◽

2020 ◽

Vol 110 (1) ◽

pp. 106-120 ◽

Author(s):

Avijit Roy ◽

Andrew L. Stone ◽

Gabriel Otero-Colina ◽

Gang Wei ◽

Ronald H. Brlansky ◽

...

Keyword(s):

Sensu Stricto ◽

Genome Segment ◽

Rt Pcr ◽

Sequence Comparisons ◽

Orchid Fleck Virus ◽

Reverse Transcription Pcr ◽

Sequencing Technologies ◽

Negative Sense

The genus Dichorhavirus contains viruses with bipartite, negative-sense, single-stranded RNA genomes that are transmitted by flat mites to hosts that include orchids, coffee, the genus Clerodendrum, and citrus. A dichorhavirus infecting citrus in Mexico is classified as a citrus strain of orchid fleck virus (OFV-Cit). We previously used RNA sequencing technologies on OFV-Cit samples from Mexico to develop an OFV-Cit–specific reverse transcription PCR (RT-PCR) assay. During assay validation, OFV-Cit–specific RT-PCR failed to produce an amplicon from some samples with clear symptoms of OFV-Cit. Characterization of this virus revealed that dichorhavirus-like particles were found in the nucleus. High-throughput sequencing of small RNAs from these citrus plants revealed a novel citrus strain of OFV, OFV-Cit2. Sequence comparisons with known orchid and citrus strains of OFV showed variation in the protein products encoded by genome segment 1 (RNA1). Strains of OFV clustered together based on host of origin, whether orchid or citrus, and were clearly separated from other dichorhaviruses described from infected citrus in Brazil. The variation in RNA1 between the original (now OFV-Cit1) and the new (OFV-Cit2) strain was not observed with genome segment 2 (RNA2), but instead, a common RNA2 molecule was shared among strains of OFV-Cit1 and -Cit2, a situation strikingly similar to OFV infecting orchids. We also collected mites at the affected groves, identified them as Brevipalpus californicus sensu stricto, and confirmed that they were infected by OFV-Cit1 or with both OFV-Cit1 and -Cit2. OFV-Cit1 and -Cit2 have coexisted at the same site in Toliman, Queretaro, Mexico since 2012. OFV strain-specific diagnostic tests were developed.