Cloud-BS: A MapReduce-based bisulfite sequencing aligner on cloud

2018 ◽  
Vol 16 (06) ◽  
pp. 1840028 ◽  
Author(s):  
Joungmin Choi ◽  
Yoonjae Park ◽  
Sun Kim ◽  
Heejoon Chae

In recent years, there have been many studies utilizing DNA methylome data to answer fundamental biological questions. Bisulfite sequencing (BS-seq) has enabled measurement of a genome-wide absolute level of DNA methylation at single-nucleotide resolution. However, due to the ambiguity introduced by bisulfite treatment, the alignment step is still considered a huge burden, especially in large-scale epigenetic research. We present Cloud-BS, an efficient BS-seq aligner designed for parallel execution in a distributed environment. Utilizing the Apache Hadoop framework, Cloud-BS splits sequencing reads into multiple blocks and transfers them to distributed nodes. By designing each alignment procedure as separate map and reduce tasks, with an internal key-value structure optimized for the MapReduce programming model, the algorithm significantly improves alignment performance without sacrificing mapping accuracy. In addition, Cloud-BS minimizes the innate burden of configuring a distributed environment by providing a pre-configured cloud image. Cloud-BS shows significantly improved bisulfite alignment performance compared to other existing BS-seq aligners. We believe our algorithm facilitates large-scale methylome data analysis. The algorithm is freely available at https://paryoja.github.io/Cloud-BS/ .
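The map/reduce split described in this abstract can be illustrated with a toy, single-process sketch. It is not the Cloud-BS implementation: the function names, the exact-match k-mer index, and the tiny reference are all assumptions made for illustration, and only the forward strand (C-to-T conversion) is handled. The "ambiguity" mentioned above shows up as the in-silico conversion step: reads and reference are both reduced to a three-letter alphabet before lookup.

```python
from collections import defaultdict


def bisulfite_convert(seq):
    """In-silico full conversion: every C becomes T (forward strand only)."""
    return seq.upper().replace("C", "T")


def build_index(reference, k):
    """Index every converted k-mer of the reference by its start position."""
    converted = bisulfite_convert(reference)
    index = defaultdict(list)
    for i in range(len(converted) - k + 1):
        index[converted[i:i + k]].append(i)
    return index


def map_block(block, index):
    """Map task (hypothetical): emit (read_id, position) key-value pairs."""
    for read_id, read in block:
        for pos in index.get(bisulfite_convert(read), []):
            yield read_id, pos


def reduce_hits(pairs):
    """Reduce task (hypothetical): keep reads with exactly one hit."""
    grouped = defaultdict(list)
    for read_id, pos in pairs:
        grouped[read_id].append(pos)
    return {rid: hits[0] for rid, hits in grouped.items() if len(hits) == 1}


# Toy data: r2 carries one converted C (read as T) and one unconverted C.
reference = "ACGTTCGAACGGTCAT"
reads = [("r1", "ATGTTTGA"), ("r2", "TGAACGGT")]

# Split the reads into blocks, as a distributed framework would.
blocks = [reads[0:1], reads[1:2]]
index = build_index(reference, k=8)

emitted = [pair for block in blocks for pair in map_block(block, index)]
alignments = reduce_hits(emitted)
print(alignments)
```

In a real Hadoop job the blocks would be processed on separate nodes and the grouping by read identifier would be done by the framework's shuffle phase rather than by an in-memory dictionary.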

2021 ◽  
Vol 11 ◽  
Author(s):  
Matthew J. Rybin ◽  
Melina Ramic ◽  
Natalie R. Ricciardi ◽  
Philipp Kapranov ◽  
Claes Wahlestedt ◽  
...  

Genome instability is associated with myriad human diseases and is a well-known feature of both cancer and neurodegenerative disease. Until recently, the ability to assess DNA damage—the principal driver of genome instability—was limited to relatively imprecise methods or restricted to studying predefined genomic regions. Recently, new techniques for detecting DNA double strand breaks (DSBs) and single strand breaks (SSBs) with next-generation sequencing on a genome-wide scale with single nucleotide resolution have emerged. With these new tools, efforts are underway to define the “breakome” in normal aging and disease. Here, we compare the relative strengths and weaknesses of these technologies and their potential application to studying neurodegenerative diseases.


2010 ◽  
Vol 13 (03) ◽  
pp. 383-390 ◽  
Author(s):  
R.P. Batycky ◽  
M. Förster ◽  
M.R. Thiele ◽  
K. Stüben

Summary We present the parallelization of a commercial streamline simulator to multicore architectures based on the OpenMP programming model and its performance on various field examples. This work is a continuation of recent work by Gerritsen et al. (2009) in which a research streamline simulator was extended to parallel execution. We identified that the streamline-transport step represents approximately 40-80% of the total run time. It is exactly this step that is straightforward to parallelize owing to the independent solution of each streamline that is at the heart of streamline simulation. Because we are working with an existing large serial code, we used specialty software to quickly and easily identify variables that required particular handling for implementing the parallel extension. Minimal rewriting of existing code was required to extend the streamline-transport step to OpenMP. As part of this work, we also parallelized additional run-time code, including the gravity-line solver and some simple routines required for constructing the pressure matrix. Overall, the run-time fraction of code parallelized ranged from 0.50 to 0.83, depending on the transport physics being considered. We tested our parallel simulator on a variety of large models including SPE 10; Forties, a UK oil/water model; Judy Creek, a Canadian waterflood/water-alternating-gas (WAG) model; and a South American black-oil model. We noted overall speedup factors from 1.8 to 3.3x for eight threads. In terms of real time, this implies that large-scale streamline simulation models as tested here can be simulated in less than 4 hours. We found speedup results to be reasonable when compared with Amdahl's ideal scaling law. Beyond eight threads, we observed minimal speedups because of memory bandwidth limits on our test machine.
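The comparison with Amdahl's law can be reproduced from the two parallel fractions quoted in the abstract. The short calculation below is a sketch using the standard form of the law, S = 1 / ((1 - p) + p / n); the function name is ours, and the only inputs taken from the abstract are the fractions 0.50 and 0.83 and the thread count of eight.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Ideal speedup when only `parallel_fraction` of the run time scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


for fraction in (0.50, 0.83):
    ideal = amdahl_speedup(fraction, threads=8)
    print(f"p = {fraction:.2f}: ideal speedup on 8 threads = {ideal:.2f}x")
```

Under these assumptions the ideal eight-thread speedups are about 1.78x and 3.65x, which brackets the reported range of 1.8 to 3.3x closely.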


2021 ◽  
Author(s):  
Sabrina Lehmann ◽  
Bibi Atika ◽  
Daniela Grossmann ◽  
Christian Schmitt-Engel ◽  
Nadi Strohlein ◽  
...  

Abstract Background Functional genomics uses unbiased systematic genome-wide gene disruption or analyzes natural variations such as gene expression profiles of different tissues from multicellular organisms to link gene functions to particular phenotypes. Functional genomics approaches are of particular importance to identify large sets of genes that are specifically important for a particular biological process beyond known candidate genes, or when the process has not been studied with genetic methods before. Results Here, we present a large set of genes whose disruption interferes with the function of the odoriferous defensive stink glands of the red flour beetle Tribolium castaneum. This gene set is the result of a large-scale systematic phenotypic screen using a reverse genetics strategy based on RNA interference applied in a genome-wide forward genetics manner. In this first-pass screen, 130 genes were identified, of which 69 genes could be confirmed to cause knock-down gland phenotypes, which vary from necrotic tissue and irregular reservoir size to irregular color or separation of the secreted gland compounds. The knock-down of 13 genes caused specifically a strong reduction of para-benzoquinones, suggesting a specific function in the synthesis of these toxic compounds. Only 14 of the 69 confirmed gland genes are differentially overexpressed in stink gland tissue and thus could have been detected in a transcriptome-based analysis. Moreover, of the 29 previously transcriptomics-identified genes causing a gland phenotype, only one gene was recognized by this phenotypic screen despite the fact that 13 of them were covered by the screen. Conclusion Our results indicate the importance of combining diverse and independent methodologies to identify genes necessary for the function of a certain biological tissue, as the different approaches do not deliver redundant results but rather complement each other. 
The presented phenotypic screen, together with a transcriptomics approach, now provides a set of close to a hundred genes important for odoriferous defensive stink gland physiology in beetles.


2019 ◽  
Vol 7 (6) ◽  
pp. 161 ◽  
Author(s):  
Ming-Hsin Tsai ◽  
Yen-Yi Liu ◽  
Von-Wun Soo ◽  
Chih-Chieh Chen

Microbial diversity has always presented taxonomic challenges. With the popularity of next-generation sequencing technology, more unculturable bacteria have been sequenced, facilitating the discovery of additional new species and complicating current microbial classification. The major challenge is to assign appropriate taxonomic names. Hence, assessing the consistency between taxonomy and genomic relatedness is critical. We proposed and applied a genome comparison approach to a large-scale survey to investigate the distribution of genomic differences among microorganisms. The approach applies a genome-wide criterion, homologous coverage ratio (HCR), for describing the homology between species. The survey included 7861 microbial genomes that excluded plasmids, and 1220 pairs of genera exhibited ambiguous classification. In this study, we also compared the performance of HCR and average nucleotide identity (ANI). The results indicated that HCR and ANI analyses yield comparable results, but a few examples suggested that HCR has a superior clustering effect. In addition, we used the Genome Taxonomy Database (GTDB), the gold standard for taxonomy, to validate our analysis. The GTDB offers 120 ubiquitous single-copy proteins as marker genes for species classification. We determined that the analysis of the GTDB still leaves classification boundaries blurred between some genera and that the marker gene-based approach has limitations. Although the choice of marker genes has been quite rigorous, the bias of marker gene selection remains unavoidable. Therefore, methods based on genomic alignment should be considered for species classification in order to avoid the bias of marker gene selection. On the basis of our observations of microbial diversity, microbial classification should be re-examined using genome-wide comparisons.


2019 ◽  
Vol 11 (8) ◽  
pp. 2078-2098 ◽  
Author(s):  
Shu-Ye Jiang ◽  
Jingjing Jin ◽  
Rajani Sarojam ◽  
Srinivasan Ramachandran

Abstract Terpenes are organic compounds and play important roles in plant growth and development as well as in mediating interactions of plants with the environment. Terpene synthases (TPSs) are the key enzymes responsible for the biosynthesis of terpenes. Although some species were employed for the genome-wide identification and characterization of the TPS family, limited information is available regarding the evolution, expansion, and retention mechanisms occurring in this gene family. We performed a genome-wide identification of the TPS family members in 50 sequenced genomes. Additionally, we also characterized the TPS family from aromatic spearmint and basil plants using RNA-Seq data. No TPSs were identified in algae genomes but the remaining plant species encoded various numbers of the family members ranging from 2 to 79 full-length TPSs. Some species showed lineage-specific expansion of certain subfamilies, which might have contributed toward species or ecotype divergence or environmental adaptation. A large-scale family expansion was observed mainly in dicot and monocot plants, which was accompanied by frequent domain loss. Both tandem and segmental duplication significantly contributed toward family expansion and expression divergence and played important roles in the survival of these expanded genes. Our data provide new insight into the TPS family expansion and evolution and suggest that TPSs might have originated from isoprenyl diphosphate synthase genes.


Science ◽  
2019 ◽  
Vol 365 (6456) ◽  
pp. eaat7693 ◽  
Author(s):  
Andrea Ganna ◽  
Karin J. H. Verweij ◽  
Michel G. Nivard ◽  
Robert Maier ◽  
Robbee Wedow ◽  
...  

Twin and family studies have shown that same-sex sexual behavior is partly genetically influenced, but previous searches for specific genes involved have been underpowered. We performed a genome-wide association study (GWAS) on 477,522 individuals, revealing five loci significantly associated with same-sex sexual behavior. In aggregate, all tested genetic variants accounted for 8 to 25% of variation in same-sex sexual behavior, only partially overlapped between males and females, and do not allow meaningful prediction of an individual’s sexual behavior. Comparing these GWAS results with those for the proportion of same-sex to total number of sexual partners among nonheterosexuals suggests that there is no single continuum from opposite-sex to same-sex sexual behavior. Overall, our findings provide insights into the genetics underlying same-sex sexual behavior and underscore the complexity of sexuality.


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Suhua Feng ◽  
Zhenhui Zhong ◽  
Ming Wang ◽  
Steven E. Jacobsen

Abstract Background 5′ methylation of cytosines in DNA molecules is an important epigenetic mark in eukaryotes. Bisulfite sequencing is the gold standard of DNA methylation detection, and whole-genome bisulfite sequencing (WGBS) has been widely used to detect methylation at single-nucleotide resolution on a genome-wide scale. However, sodium bisulfite is known to severely degrade DNA, which, in combination with biases introduced during PCR amplification, leads to unbalanced base representation in the final sequencing libraries. Enzymatic conversion of unmethylated cytosines to uracils can achieve the same end product for sequencing as does bisulfite treatment and does not affect the integrity of the DNA; enzymatic methylation sequencing may, thus, provide advantages over bisulfite sequencing. Results Using an enzymatic methyl-seq (EM-seq) technique to selectively deaminate unmethylated cytosines to uracils, we generated and sequenced libraries based on different amounts of Arabidopsis input DNA and different numbers of PCR cycles, and compared these data to results from traditional whole-genome bisulfite sequencing. We found that EM-seq libraries were more consistent between replicates and had higher mapping and lower duplication rates, lower background noise, higher average coverage, and higher coverage of total cytosines. Differential methylation region (DMR) analysis showed that WGBS tended to over-estimate methylation levels especially in CHG and CHH contexts, whereas EM-seq detected higher CG methylation levels in certain highly methylated areas. These phenomena can be mostly explained by a correlation of WGBS methylation estimation with GC content and methylated cytosine density. We used EM-seq to compare methylation between leaves and flowers, and found that CHG methylation level is greatly elevated in flowers, especially in pericentromeric regions. Conclusion We suggest that EM-seq is a more accurate and reliable approach than WGBS to detect methylation. 
Compared to WGBS, the results of EM-seq are less affected by differences in library preparation conditions or by the skewed base composition in the converted DNA. It may therefore be more desirable to use EM-seq in methylation studies.
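For readers unfamiliar with the CG, CHG and CHH contexts mentioned in this abstract, the sketch below classifies a cytosine by the two bases that follow it (H stands for A, C or T) and computes a per-site methylation level as the share of reads in which the cytosine was left unconverted. The function names and the example counts are illustrative assumptions, not taken from the paper; only the forward strand is considered and ambiguous bases such as N are not handled.

```python
from typing import Optional


def cytosine_context(seq: str, i: int) -> Optional[str]:
    """Return 'CG', 'CHG' or 'CHH' for the cytosine at index i.

    Returns None when the sequence ends before the context is determined.
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position is not a cytosine")
    if i + 1 >= len(seq):
        return None
    if seq[i + 1] == "G":
        return "CG"
    if i + 2 >= len(seq):
        return None
    return "CHG" if seq[i + 2] == "G" else "CHH"


def methylation_level(unconverted: int, converted: int) -> Optional[float]:
    """Fraction of reads reporting C (protected from conversion) at a site."""
    coverage = unconverted + converted
    return unconverted / coverage if coverage else None


seq = "ACGTCAGTCTTC"
contexts = {i: cytosine_context(seq, i) for i, base in enumerate(seq) if base == "C"}
print(contexts)
print(methylation_level(unconverted=18, converted=2))
```

The same per-site ratio applies whether the conversion was done with bisulfite or enzymatically, since both produce the same C-versus-T readout.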


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Karol Szlachta ◽  
Heather M. Raimer ◽  
Laurey D. Comeau ◽  
Yuh-Hwa Wang

Abstract Background DNA double-stranded breaks (DSBs) are potentially deleterious events in a cell. The end structures (blunt, 3′- and 5′-overhangs) at DSB sites contribute to the fate of their repair and provide critical information concerning the consequences of the damage. Therefore, there has been a recent eruption of DNA break mapping and sequencing methods that aim to map at single-nucleotide resolution where breaks are generated genome-wide. These methods provide high resolution data for the location of DSBs, which can encode the type of end-structure present at these breaks. However, genome-wide analysis of the resulting end structures has not been investigated following these sequencing methods. Results To address this analysis gap, we develop the use of a coverage-normalized cross correlation analysis (CNCC) to process the high-precision genome-wide break mapping data, and determine genome-wide break end structure distributions at single-nucleotide resolution. We take advantage of the single-nucleotide position and the knowledge of strandness from every mapped break to analyze the relative shifts between positive and negative strand encoded break nucleotides. By applying CNCC we can identify the most abundant end structures captured by a break mapping technique, and further can make comparisons between different samples and treatments. We validate our analysis with restriction enzyme digestions of genomic DNA and establish the sensitivity of the analysis using end structures that only exist as a minor fraction of total breaks. Finally, we demonstrate the versatility of our analysis by applying CNCC to the breaks resulting after treatment with etoposide and study the variety of resulting end structures. Conclusion For the first time, on a genome-wide scale, our analysis revealed the increase in the 5′ to 3′ end resection following etoposide treatment, and the global progression of the resection. 
Furthermore, our method distinguished the change in the pattern of DSB end structure with increasing doses of the drug. The ability of this method to determine DNA break end structures without a priori knowledge of break sequences or genomic position should have broad applications in understanding genome instability.
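The idea of reading end structures from the relative shift between plus-strand and minus-strand break positions can be sketched with a plain cross-correlation. This is not the authors' CNCC code: the normalization used here (dividing by the product of total plus and minus coverage) is an assumption chosen for illustration, as are the function name and the toy coverage vectors. How the sign of the peak shift maps to a 5′ or 3′ overhang depends on the break-mapping protocol and is not asserted here.

```python
def cross_correlation(plus, minus, max_shift):
    """Correlate plus- and minus-strand break counts over a range of shifts.

    plus[i] and minus[i] are break counts at genomic position i.
    Returns {shift: normalized score}; the normalization is an assumption.
    """
    if len(plus) != len(minus):
        raise ValueError("coverage vectors must have equal length")
    n = len(plus)
    norm = sum(plus) * sum(minus)
    profile = {}
    for shift in range(-max_shift, max_shift + 1):
        total = 0
        for i, p in enumerate(plus):
            j = i + shift
            if 0 <= j < n:
                total += p * minus[j]
        profile[shift] = total / norm if norm else 0.0
    return profile


# Toy data: every minus-strand break sits 4 nt downstream of a plus-strand one.
plus = [0] * 50
minus = [0] * 50
plus[10], plus[30] = 5, 3
minus[14], minus[34] = 5, 3

profile = cross_correlation(plus, minus, max_shift=10)
peak_shift = max(profile, key=profile.get)
print(peak_shift, profile[peak_shift])
```

A dominant peak at a non-zero shift indicates a preferred offset between the two strands' break positions, while a peak at zero would correspond to breaks recorded at the same coordinate on both strands.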


2018 ◽  
Vol 116 (3) ◽  
pp. 900-908 ◽  
Author(s):  
Hamutal Arbel ◽  
Sumanta Basu ◽  
William W. Fisher ◽  
Ann S. Hammonds ◽  
Kenneth H. Wan ◽  
...  

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation driving enhancers (SDE). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs using whole-mount embryonic imaging of stably integrated reporter constructs chosen throughout our prediction rank-list showed that >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.


2009 ◽  
Vol 191 (10) ◽  
pp. 3203-3211 ◽  
Author(s):  
Karla D. Passalacqua ◽  
Anjana Varadarajan ◽  
Brian D. Ondov ◽  
David T. Okou ◽  
Michael E. Zwick ◽  
...  

ABSTRACT Although gene expression has been studied in bacteria for decades, many aspects of the bacterial transcriptome remain poorly understood. Transcript structure, operon linkages, and information on absolute abundance all provide valuable insights into gene function and regulation, but none has ever been determined on a genome-wide scale for any bacterium. Indeed, these aspects of the prokaryotic transcriptome have been explored on a large scale in only a few instances, and consequently little is known about the absolute composition of the mRNA population within a bacterial cell. Here we report the use of a high-throughput sequencing-based approach in assembling the first comprehensive, single-nucleotide resolution view of a bacterial transcriptome. We sampled the Bacillus anthracis transcriptome under a variety of growth conditions and showed that the data provide an accurate and high-resolution map of transcript start sites and operon structure throughout the genome. Further, the sequence data identified previously nonannotated regions with significant transcriptional activity and enhanced the accuracy of existing genome annotations. Finally, our data provide estimates of absolute transcript abundance and suggest that there is significant transcriptional heterogeneity within a clonal, synchronized bacterial population. Overall, our results offer an unprecedented view of gene expression and regulation in a bacterial cell.

