Cloud-BS: A MapReduce-based bisulfite sequencing aligner on cloud

In recent years, many studies have used DNA methylome data to answer fundamental biological questions. Bisulfite sequencing (BS-seq) has enabled genome-wide measurement of absolute DNA methylation levels at single-nucleotide resolution. However, because of the ambiguity introduced by bisulfite treatment, alignment is still considered a heavy burden, especially in large-scale epigenetic research. We present Cloud-BS, an efficient BS-seq aligner designed for parallel execution in a distributed environment. Using the Apache Hadoop framework, Cloud-BS splits sequencing reads into multiple blocks and transfers them to distributed nodes. By dividing each alignment procedure into separate map and reduce tasks, with an internal key-value structure optimized for the MapReduce programming model, the algorithm significantly improves alignment performance without sacrificing mapping accuracy. In addition, Cloud-BS minimizes the innate burden of configuring a distributed environment by providing a pre-configured cloud image. Cloud-BS shows significantly improved bisulfite alignment performance compared to other existing BS-seq aligners. We believe our algorithm facilitates large-scale methylome data analysis. The algorithm is freely available at https://paryoja.github.io/Cloud-BS/ .
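
The map/reduce split described in the abstract can be illustrated with a small, single-process sketch. This is not Cloud-BS's actual key-value design, which the abstract does not spell out. It assumes a common bisulfite-alignment trick (converting C to T in both read and reference before matching), handles the forward strand only, and uses hypothetical names (`map_read`, `reduce_hits`, `run`) with a plain dictionary standing in for Hadoop's shuffle phase.

```python
from collections import defaultdict


def c_to_t(seq):
    """In-silico bisulfite conversion: every C becomes T."""
    return seq.replace("C", "T")


def map_read(read_id, read, reference, converted_ref):
    """Map step (hypothetical): emit (read_id, (position, mismatches)) per candidate hit."""
    converted_read = c_to_t(read)
    start = converted_ref.find(converted_read)
    while start != -1:
        window = reference[start:start + len(read)]
        # A read T over a reference C is explained by bisulfite conversion,
        # so it is not counted; any other difference is a mismatch.
        mismatches = sum(
            1 for r, g in zip(read, window)
            if r != g and not (r == "T" and g == "C")
        )
        yield read_id, (start, mismatches)
        start = converted_ref.find(converted_read, start + 1)


def reduce_hits(read_id, hits):
    """Reduce step (hypothetical): keep the unique best hit, or None if ambiguous."""
    ranked = sorted(hits, key=lambda hit: hit[1])
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return read_id, None
    return read_id, ranked[0][0]


def run(reads, reference):
    """Drive map, shuffle (group by key), and reduce in one process."""
    converted_ref = c_to_t(reference)
    shuffled = defaultdict(list)
    for read_id, read in reads.items():
        for key, value in map_read(read_id, read, reference, converted_ref):
            shuffled[key].append(value)
    return dict(reduce_hits(key, values) for key, values in shuffled.items())


# The read comes from reference positions 4..9 ("ACCGTT") with one
# unmethylated C read as T.
print(run({"r1": "ATCGTT"}, "ACGTACCGTTAGCA"))
```

Keying the intermediate pairs by read identifier means all candidate positions for one read reach the same reducer, which is what allows the ambiguity check to run independently per read.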

Emerging Technologies for Genome-Wide Profiling of DNA Breakage

Frontiers in Genetics ◽

10.3389/fgene.2020.610386 ◽

2021 ◽

Vol 11 ◽

Author(s):

Matthew J. Rybin ◽

Melina Ramic ◽

Natalie R. Ricciardi ◽

Philipp Kapranov ◽

Claes Wahlestedt ◽

...

Keyword(s):

Genome Instability ◽

Dna Double Strand Breaks ◽

Single Nucleotide ◽

Strand Breaks ◽

Single Strand Breaks ◽

Genome Wide ◽

A Genome ◽

Wide Scale ◽

Nucleotide Resolution ◽

Genomic Regions

Genome instability is associated with myriad human diseases and is a well-known feature of both cancer and neurodegenerative disease. Until recently, the ability to assess DNA damage—the principal driver of genome instability—was limited to relatively imprecise methods or restricted to studying predefined genomic regions. Recently, new techniques for detecting DNA double strand breaks (DSBs) and single strand breaks (SSBs) with next-generation sequencing on a genome-wide scale with single nucleotide resolution have emerged. With these new tools, efforts are underway to define the “breakome” in normal aging and disease. Here, we compare the relative strengths and weaknesses of these technologies and their potential application to studying neurodegenerative diseases.

Parallelization of a Commercial Streamline Simulator and Performance on Practical Models

SPE Reservoir Evaluation & Engineering ◽

10.2118/118684-pa ◽

2010 ◽

Vol 13 (03) ◽

pp. 383-390 ◽

Cited By ~ 5

Author(s):

R.P. Batycky ◽

M. Förster ◽

M.R. Thiele ◽

K. Stüben

Keyword(s):

Large Scale ◽

Programming Model ◽

Scaling Law ◽

Independent Solution ◽

Parallel Execution ◽

Water Model ◽

Test Machine ◽

Multicore Architectures ◽

Streamline Simulation ◽

Run Time

Summary: We present the parallelization of a commercial streamline simulator to multicore architectures based on the OpenMP programming model and its performance on various field examples. This work is a continuation of recent work by Gerritsen et al. (2009) in which a research streamline simulator was extended to parallel execution. We identified that the streamline-transport step represents approximately 40-80% of the total run time. It is exactly this step that is straightforward to parallelize owing to the independent solution of each streamline that is at the heart of streamline simulation. Because we are working with an existing large serial code, we used specialty software to quickly and easily identify variables that required particular handling for implementing the parallel extension. Minimal rewrite to existing code was required to extend the streamline-transport step to OpenMP. As part of this work, we also parallelized additional run-time code, including the gravity-line solver and some simple routines required for constructing the pressure matrix. Overall, the run-time fraction of code parallelized ranged from 0.50 to 0.83, depending on the transport physics being considered. We tested our parallel simulator on a variety of large models including SPE 10, Forties (a UK oil/water model), Judy Creek (a Canadian waterflood/water-alternating-gas (WAG) model), and a South American black-oil model. We noted overall speedup factors from 1.8 to 3.3x for eight threads. In terms of real time, this implies that large-scale streamline simulation models as tested here can be simulated in less than 4 hours. We found speedup results to be reasonable when compared with Amdahl's ideal scaling law. Beyond eight threads, we observed minimal speedups because of memory bandwidth limits on our test machine.
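
The comparison with Amdahl's law can be reproduced from the figures quoted in the summary. The sketch below applies the standard formula, speedup = 1 / ((1 - p) + p / n), to the reported parallel fractions (0.50 and 0.83) and thread count (eight). The function name `amdahl_speedup` is chosen here for illustration and does not come from the paper.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Ideal speedup when only `parallel_fraction` of the run time scales with threads."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Parallel fractions and thread count as quoted in the summary.
for p in (0.50, 0.83):
    print(f"p = {p:.2f}: ideal speedup on 8 threads = {amdahl_speedup(p, 8):.2f}x")
```

The ideal values come out at roughly 1.78x and 3.65x, which brackets the observed 1.8 to 3.3x range closely and is consistent with the summary's statement that the measured speedups are reasonable.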

Phenotypic Screen and Transcriptomics Approach Complement Each Other in Functional Genomics of Defensive Stink Gland Physiology

10.21203/rs.3.rs-1117784/v1 ◽

2021 ◽

Author(s):

Sabrina Lehmann ◽

Bibi Atika ◽

Daniela Grossmann ◽

Christian Schmitt-Engel ◽

Nadi Strohlein ◽

...

Keyword(s):

Functional Genomics ◽

Large Scale ◽

Reverse Genetics ◽

Expression Profiles ◽

Forward Genetics ◽

Large Set ◽

Knock Down ◽

Genome Wide ◽

Phenotypic Screen ◽

A Genome

Abstract Background Functional genomics uses unbiased systematic genome-wide gene disruption or analyzes natural variations such as gene expression profiles of different tissues from multicellular organisms to link gene functions to particular phenotypes. Functional genomics approaches are of particular importance to identify large sets of genes that are specifically important for a particular biological process beyond known candidate genes, or when the process has not been studied with genetic methods before. Results Here, we present a large set of genes whose disruption interferes with the function of the odoriferous defensive stink glands of the red flour beetle Tribolium castaneum. This gene set is the result of a large-scale systematic phenotypic screen using a reverse genetics strategy based on RNA interference applied in a genome-wide forward genetics manner. In this first-pass screen, 130 genes were identified, of which 69 genes could be confirmed to cause knock-down gland phenotypes, which vary from necrotic tissue and irregular reservoir size to irregular color or separation of the secreted gland compounds. The knock-down of 13 genes caused specifically a strong reduction of para-benzoquinones, suggesting a specific function in the synthesis of these toxic compounds. Only 14 of the 69 confirmed gland genes are differentially overexpressed in stink gland tissue and thus could have been detected in a transcriptome-based analysis. Moreover, of the 29 previously transcriptomics-identified genes causing a gland phenotype, only one gene was recognized by this phenotypic screen despite the fact that 13 of them were covered by the screen. Conclusion Our results indicate the importance of combining diverse and independent methodologies to identify genes necessary for the function of a certain biological tissue, as the different approaches do not deliver redundant results but rather complement each other. 
Together, the presented phenotypic screen and a transcriptomics approach now provide a set of close to a hundred genes important for odoriferous defensive stink gland physiology in beetles.

A New Genome-to-Genome Comparison Approach for Large-Scale Revisiting of Current Microbial Taxonomy

Microorganisms ◽

10.3390/microorganisms7060161 ◽

2019 ◽

Vol 7 (6) ◽

pp. 161 ◽

Cited By ~ 1

Author(s):

Ming-Hsin Tsai ◽

Yen-Yi Liu ◽

Von-Wun Soo ◽

Chih-Chieh Chen

Keyword(s):

Microbial Diversity ◽

Large Scale ◽

Gene Selection ◽

Marker Gene ◽

Genome Comparison ◽

Marker Genes ◽

Species Classification ◽

Genome Wide ◽

A Genome ◽

Comparison Approach

Microbial diversity has always presented taxonomic challenges. With the popularity of next-generation sequencing technology, more unculturable bacteria have been sequenced, facilitating the discovery of additional new species and complicating current microbial classification. The major challenge is to assign appropriate taxonomic names. Hence, assessing the consistency between taxonomy and genomic relatedness is critical. We proposed and applied a genome comparison approach to a large-scale survey to investigate the distribution of genomic differences among microorganisms. The approach applies a genome-wide criterion, homologous coverage ratio (HCR), for describing the homology between species. The survey included 7861 microbial genomes that excluded plasmids, and 1220 pairs of genera exhibited ambiguous classification. In this study, we also compared the performance of HCR and average nucleotide identity (ANI). The results indicated that HCR and ANI analyses yield comparable results, but a few examples suggested that HCR has a superior clustering effect. In addition, we used the Genome Taxonomy Database (GTDB), the gold standard for taxonomy, to validate our analysis. The GTDB offers 120 ubiquitous single-copy proteins as marker genes for species classification. We determined that the analysis of the GTDB still results in blurred classification boundaries between some genera and that the marker gene-based approach has limitations. Although the choice of marker genes has been quite rigorous, the bias of marker gene selection remains unavoidable. Therefore, methods based on genomic alignment should be considered for species classification in order to avoid the bias of marker gene selection. On the basis of our observations of microbial diversity, microbial classification should be re-examined using genome-wide comparisons.

A Comprehensive Survey on the Terpene Synthase Gene Family Provides New Insight into Its Evolutionary Patterns

Genome Biology and Evolution ◽

10.1093/gbe/evz142 ◽

2019 ◽

Vol 11 (8) ◽

pp. 2078-2098 ◽

Cited By ~ 8

Author(s):

Shu-Ye Jiang ◽

Jingjing Jin ◽

Rajani Sarojam ◽

Srinivasan Ramachandran

Keyword(s):

Gene Family ◽

Large Scale ◽

Family Members ◽

Terpene Synthase ◽

Limited Information ◽

Terpene Synthases ◽

Genome Wide ◽

A Genome ◽

Family Expansion ◽

Insight Into

Abstract Terpenes are organic compounds and play important roles in plant growth and development as well as in mediating interactions of plants with the environment. Terpene synthases (TPSs) are the key enzymes responsible for the biosynthesis of terpenes. Although some species were employed for the genome-wide identification and characterization of the TPS family, limited information is available regarding the evolution, expansion, and retention mechanisms occurring in this gene family. We performed a genome-wide identification of the TPS family members in 50 sequenced genomes. Additionally, we also characterized the TPS family from aromatic spearmint and basil plants using RNA-Seq data. No TPSs were identified in algae genomes but the remaining plant species encoded various numbers of the family members ranging from 2 to 79 full-length TPSs. Some species showed lineage-specific expansion of certain subfamilies, which might have contributed toward species or ecotype divergence or environmental adaptation. A large-scale family expansion was observed mainly in dicot and monocot plants, which was accompanied by frequent domain loss. Both tandem and segmental duplication significantly contributed toward family expansion and expression divergence and played important roles in the survival of these expanded genes. Our data provide new insight into the TPS family expansion and evolution and suggest that TPSs might have originated from isoprenyl diphosphate synthase genes.

Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior

Science ◽

10.1126/science.aat7693 ◽

2019 ◽

Vol 365 (6456) ◽

pp. eaat7693 ◽

Cited By ~ 53

Author(s):

Andrea Ganna ◽

Karin J. H. Verweij ◽

Michel G. Nivard ◽

Robert Maier ◽

Robbee Wedow ◽

...

Keyword(s):

Sexual Behavior ◽

Genetic Architecture ◽

Large Scale ◽

Genome Wide Association Study ◽

Same Sex ◽

Genome Wide ◽

A Genome ◽

Number Of Sexual Partners ◽

Opposite Sex ◽

Males And Females

Twin and family studies have shown that same-sex sexual behavior is partly genetically influenced, but previous searches for specific genes involved have been underpowered. We performed a genome-wide association study (GWAS) on 477,522 individuals, revealing five loci significantly associated with same-sex sexual behavior. In aggregate, all tested genetic variants accounted for 8 to 25% of variation in same-sex sexual behavior, only partially overlapped between males and females, and do not allow meaningful prediction of an individual’s sexual behavior. Comparing these GWAS results with those for the proportion of same-sex to total number of sexual partners among nonheterosexuals suggests that there is no single continuum from opposite-sex to same-sex sexual behavior. Overall, our findings provide insights into the genetics underlying same-sex sexual behavior and underscore the complexity of sexuality.

Efficient and accurate determination of genome-wide DNA methylation patterns in Arabidopsis thaliana with enzymatic methyl sequencing

Epigenetics & Chromatin ◽

10.1186/s13072-020-00361-9 ◽

2020 ◽

Vol 13 (1) ◽

Cited By ~ 1

Author(s):

Suhua Feng ◽

Zhenhui Zhong ◽

Ming Wang ◽

Steven E. Jacobsen

Keyword(s):

Dna Methylation ◽

Bisulfite Sequencing ◽

Accurate Determination ◽

Gc Content ◽

Epigenetic Mark ◽

Whole Genome ◽

Whole Genome Bisulfite Sequencing ◽

Genome Wide ◽

A Genome ◽

Genome Bisulfite Sequencing

Abstract Background 5′ methylation of cytosines in DNA molecules is an important epigenetic mark in eukaryotes. Bisulfite sequencing is the gold standard of DNA methylation detection, and whole-genome bisulfite sequencing (WGBS) has been widely used to detect methylation at single-nucleotide resolution on a genome-wide scale. However, sodium bisulfite is known to severely degrade DNA, which, in combination with biases introduced during PCR amplification, leads to unbalanced base representation in the final sequencing libraries. Enzymatic conversion of unmethylated cytosines to uracils can achieve the same end product for sequencing as does bisulfite treatment and does not affect the integrity of the DNA; enzymatic methylation sequencing may, thus, provide advantages over bisulfite sequencing. Results Using an enzymatic methyl-seq (EM-seq) technique to selectively deaminate unmethylated cytosines to uracils, we generated and sequenced libraries based on different amounts of Arabidopsis input DNA and different numbers of PCR cycles, and compared these data to results from traditional whole-genome bisulfite sequencing. We found that EM-seq libraries were more consistent between replicates and had higher mapping and lower duplication rates, lower background noise, higher average coverage, and higher coverage of total cytosines. Differential methylation region (DMR) analysis showed that WGBS tended to over-estimate methylation levels especially in CHG and CHH contexts, whereas EM-seq detected higher CG methylation levels in certain highly methylated areas. These phenomena can be mostly explained by a correlation of WGBS methylation estimation with GC content and methylated cytosine density. We used EM-seq to compare methylation between leaves and flowers, and found that CHG methylation level is greatly elevated in flowers, especially in pericentromeric regions. Conclusion We suggest that EM-seq is a more accurate and reliable approach than WGBS to detect methylation. 
Compared to WGBS, the results of EM-seq are less affected by differences in library preparation conditions or by the skewed base composition in the converted DNA. It may therefore be more desirable to use EM-seq in methylation studies.

CNCC: an analysis tool to determine genome-wide DNA break end structure at single-nucleotide resolution

BMC Genomics ◽

10.1186/s12864-019-6436-0 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 3

Author(s):

Karol Szlachta ◽

Heather M. Raimer ◽

Laurey D. Comeau ◽

Yuh-Hwa Wang

Keyword(s):

Mapping Technique ◽

Nucleotide Position ◽

Analysis Tool ◽

Single Nucleotide ◽

Genome Wide ◽

A Genome ◽

Cross Correlation Analysis ◽

Dna Break ◽

Nucleotide Resolution ◽

Single Nucleotide Resolution

Abstract Background DNA double-stranded breaks (DSBs) are potentially deleterious events in a cell. The end structures (blunt, 3′- and 5′-overhangs) at DSB sites contribute to the fate of their repair and provide critical information concerning the consequences of the damage. Therefore, there has been a recent eruption of DNA break mapping and sequencing methods that aim to map at single-nucleotide resolution where breaks are generated genome-wide. These methods provide high resolution data for the location of DSBs, which can encode the type of end-structure present at these breaks. However, genome-wide analysis of the resulting end structures has not been investigated following these sequencing methods. Results To address this analysis gap, we develop the use of a coverage-normalized cross correlation analysis (CNCC) to process the high-precision genome-wide break mapping data, and determine genome-wide break end structure distributions at single-nucleotide resolution. We take advantage of the single-nucleotide position and the knowledge of strandness from every mapped break to analyze the relative shifts between positive and negative strand encoded break nucleotides. By applying CNCC we can identify the most abundant end structures captured by a break mapping technique, and further can make comparisons between different samples and treatments. We validate our analysis with restriction enzyme digestions of genomic DNA and establish the sensitivity of the analysis using end structures that only exist as a minor fraction of total breaks. Finally, we demonstrate the versatility of our analysis by applying CNCC to the breaks resulting after treatment with etoposide and study the variety of resulting end structures. Conclusion For the first time, on a genome-wide scale, our analysis revealed the increase in the 5′ to 3′ end resection following etoposide treatment, and the global progression of the resection. 
Furthermore, our method distinguished the change in the pattern of DSB end structure with increasing doses of the drug. The ability of this method to determine DNA break end structures without a priori knowledge of break sequences or genomic position should have broad applications in understanding genome instability.
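
The core idea of analyzing relative shifts between plus-strand and minus-strand break positions can be sketched as a cross-correlation over a small range of offsets. This is a simplified illustration, not the published CNCC procedure: the normalization used here (dividing by the product of the two strands' count-vector norms) is an assumption, the function name `strand_shift_profile` is hypothetical, and the mapping from shift sign to blunt, 3′- or 5′-overhang ends depends on the break-mapping protocol and is left open.

```python
from collections import Counter
from math import sqrt


def strand_shift_profile(plus_breaks, minus_breaks, max_shift=10):
    """Score each offset between plus- and minus-strand break positions.

    Positions are single-nucleotide coordinates on one chromosome.
    The score for a shift s sums plus[pos] * minus[pos + s] over all
    plus-strand positions, scaled by an assumed cosine-style norm.
    """
    plus = Counter(plus_breaks)
    minus = Counter(minus_breaks)
    norm = sqrt(
        sum(c * c for c in plus.values()) * sum(c * c for c in minus.values())
    )
    profile = {}
    for shift in range(-max_shift, max_shift + 1):
        raw = sum(count * minus[pos + shift] for pos, count in plus.items())
        profile[shift] = raw / norm if norm else 0.0
    return profile


# Toy data: every plus-strand break has a minus-strand partner 4 nt away,
# plus one unpaired minus-strand break.
plus_breaks = [100, 200, 300]
minus_breaks = [104, 204, 304, 500]
profile = strand_shift_profile(plus_breaks, minus_breaks)
best_shift = max(profile, key=profile.get)
print(best_shift, round(profile[best_shift], 3))
```

In this toy input the profile peaks at a shift of 4, the offset built into the data. With real break-mapping data, the position and height of such peaks would be compared between samples or treatments.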

Exploiting regulatory heterogeneity to systematically identify enhancers with high accuracy

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1808833115 ◽

2018 ◽

Vol 116 (3) ◽

pp. 900-908 ◽

Cited By ~ 4

Author(s):

Hamutal Arbel ◽

Sumanta Basu ◽

William W. Fisher ◽

Ann S. Hammonds ◽

Kenneth H. Wan ◽

...

Keyword(s):

Large Scale ◽

Scale Validation ◽

Expression Patterns ◽

Prediction Method ◽

High Accuracy ◽

Genome Wide ◽

Rank List ◽

A Genome ◽

Improved Accuracy ◽

Genome Wide Scan

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation driving enhancers (SDE). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs using whole-mount embryonic imaging of stably integrated reporter constructs chosen throughout our prediction rank-list showed >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.

Structure and Complexity of a Bacterial Transcriptome

Journal of Bacteriology ◽

10.1128/jb.00122-09 ◽

2009 ◽

Vol 191 (10) ◽

pp. 3203-3211 ◽

Cited By ~ 155

Author(s):

Karla D. Passalacqua ◽

Anjana Varadarajan ◽

Brian D. Ondov ◽

David T. Okou ◽

Michael E. Zwick ◽

...

Keyword(s):

Gene Expression ◽

Bacterial Cell ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Transcript Abundance ◽

Growth Conditions ◽

A Genome ◽

Nucleotide Resolution ◽

Bacterial Transcriptome

ABSTRACT Although gene expression has been studied in bacteria for decades, many aspects of the bacterial transcriptome remain poorly understood. Transcript structure, operon linkages, and information on absolute abundance all provide valuable insights into gene function and regulation, but none has ever been determined on a genome-wide scale for any bacterium. Indeed, these aspects of the prokaryotic transcriptome have been explored on a large scale in only a few instances, and consequently little is known about the absolute composition of the mRNA population within a bacterial cell. Here we report the use of a high-throughput sequencing-based approach in assembling the first comprehensive, single-nucleotide resolution view of a bacterial transcriptome. We sampled the Bacillus anthracis transcriptome under a variety of growth conditions and showed that the data provide an accurate and high-resolution map of transcript start sites and operon structure throughout the genome. Further, the sequence data identified previously nonannotated regions with significant transcriptional activity and enhanced the accuracy of existing genome annotations. Finally, our data provide estimates of absolute transcript abundance and suggest that there is significant transcriptional heterogeneity within a clonal, synchronized bacterial population. Overall, our results offer an unprecedented view of gene expression and regulation in a bacterial cell.
