scholarly journals CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data

2019 ◽  
Author(s):  
Vanessa R. Marcelino ◽  
Philip T.L.C. Clausen ◽  
Jan P. Buchmann ◽  
Michelle Wille ◽  
Jonathan R. Iredell ◽  
...  

AbstractHigh-throughput sequencing of DNA and RNA from environmental and host-associated samples (metagenomics and metatranscriptomics) is a powerful tool to assess which organisms are present in a sample. Taxonomic identification software usually align individual short sequence reads to a reference database, sometimes containing taxa with complete genomes only. This is a challenging task given that different species can share identical sequence regions and complete genome sequences are only available for a fraction of organisms. A recently developed approach to map sequence reads to reference databases involves weighing all high scoring read-mappings to the data base as a whole to produce better-informed alignments. We used this novel concept in read mapping to develop a highly accurate metagenomic classification pipeline named CCMetagen. Using simulated fungal and bacterial metagenomes, we demonstrate that CCMetagen substantially outperforms other commonly used metagenome classifiers, attaining a 3 – 1580 fold increase in precision and a 2 – 922 fold increase in F1 scores for species-level classifications when compared to Kraken2, Centrifuge and KrakenUniq. CCMetagen is sufficiently fast and memory efficient to use the entire NCBI nucleotide collection (nt) as reference, enabling the assessment of species with incomplete genome sequence data from all biological kingdoms. Our pipeline efficiently produced a comprehensive overview of the microbiome of two biological data sets, including both eukaryotes and prokaryotes. CCMetagen is user-friendly and the results can be easily integrated into microbial community analysis software for streamlined and automated microbiome studies.

2014 ◽  
Author(s):  
Jai Ram Rideout ◽  
Yan He ◽  
Jose Antonio Navas-Molina ◽  
William A Walters ◽  
Luke K Ursell ◽  
...  

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 the time than would be required of “classic” open reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.


2014 ◽  
Author(s):  
Jai Ram Rideout ◽  
Yan He ◽  
Jose Antonio Navas-Molina ◽  
William A Walters ◽  
Luke K Ursell ◽  
...  

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because parts of our algorithm can be run in parallel, it makes open-reference OTU picking tractable on massive amplicon sequence data sets. We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “legacy” open-reference OTU picking, where less of the process can be parallelized, through comparisons on three well-studied datasets. We therefore recommend that subsampled open-reference OTU picking always be applied in favor of “legacy” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters.


2014 ◽  
Author(s):  
Jai Ram Rideout ◽  
Yan He ◽  
Jose Antonio Navas-Molina ◽  
William A Walters ◽  
Luke K Ursell ◽  
...  

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 the time than would be required of “classic” open reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.


2020 ◽  
Vol 14 (4) ◽  
pp. 476-486
Author(s):  
Tingting Liu ◽  
Caoping Pang ◽  
Fengcai Ye ◽  
Dafei Gong ◽  
Jieling Luo ◽  
...  

Four mine contaminated soils located in northwest of Guangxi autonomous region were selected for microbial community analysis. These mine soils were contaminated by chromium (Cr) and cadmium (Cd). Microbial communities were described by high-throughput sequencing technology, which showed 39 different phyla in four samples. Among these phyla, Proteobacteria was the most abundant phylum in all samples. Acidobacteria, Actinobacteria, Planctomycetes, Firmicutes, Gemmatimonadetes, Bacteroidetes and Chloroflexi showed higher relative abundances than other phyla. In addition, a wide diversity of bacteria with the potential of bioremediation, such as Sphingomonas, Lysobacter and Gemmatimonas were detected in the tested mine contaminated soils. The results of microbial community analysis will provide a new target for isolation of microorganisms with the potential of bioremediation and lay the foundation for a great enhancement of bioremediation ability through the genetic engineering modification of indigenous microorganisms in future.


2021 ◽  
Vol 18 (4) ◽  
pp. 733-743
Author(s):  
Doan Thi Nhung ◽  
Bui Van Ngoc

Recent advances in metagenomics and bioinformatics allow the robust analysis of the composition and abundance of microbial communities, functional genes, and their metabolic pathways. So far, there has been a variety of computational/statistical tools or software for analyzing microbiome, the common problems that occurred in its implementation are, however, the lack of synchronization and compatibility of output/input data formats between such software. To overcome these challenges, in this study context, we aim to apply the DADA2 pipeline (written in R programming language) instead of using a set of different bioinformatics tools to create our own workflow for microbial community analysis in a continuous and synchronous manner. For the first effort, we tried to investigate the composition and abundance of coral-associated bacteria using their 16S rRNA gene amplicon sequences. The workflow or framework includes the following steps: data processing, sequence clustering, taxonomic assignment, and data visualization. Moreover, we also like to catch readers’ attention to the information about bacterial communities living in the ocean as most marine microorganisms are unculturable, especially residing in coral reefs, namely, bacteria are associated with the coral Acropora tenuis in this case. The outcomes obtained in this study suggest that the DADA2 pipeline written in R programming language is one of the potential bioinformatics approaches in the context of microbiome analysis other than using various software. Besides, our modifications for the workflow execution help researchers to illustrate metagenomic data more easily and systematically, elucidate the composition, abundance, diversity, and relationship between microorganism communities as well as to develop other bioinformatic tools more effectively.


2016 ◽  
Author(s):  
Shea N Gardner ◽  
Sasha K Ames ◽  
Maya B Gokhale ◽  
Tom R Slezak ◽  
Jonathan Allen

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library) which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


2021 ◽  
Vol 9 (6) ◽  
pp. 1162
Author(s):  
Seyedbehnam Hashemi ◽  
Sayed Ebrahim Hashemi ◽  
Kristian M. Lien ◽  
Jacob J. Lamb

The microbial diversity in anaerobic digestion (AD) is important because it affects process robustness. High-throughput sequencing offers high-resolution data regarding the microbial diversity and robustness of biological systems including AD; however, to understand the dynamics of microbial processes, knowing the microbial diversity is not adequate alone. Advanced meta-omic techniques have been established to determine the activity and interactions among organisms in biological processes like AD. Results of these methods can be used to identify biomarkers for AD states. This can aid a better understanding of system dynamics and be applied to producing comprehensive models for AD. The paper provides valuable knowledge regarding the possibility of integration of molecular methods in AD. Although meta-genomic methods are not suitable for on-line use due to long operating time and high costs, they provide extensive insight into the microbial phylogeny in AD. Meta-proteomics can also be explored in the demonstration projects for failure prediction. However, for these methods to be fully realised in AD, a biomarker database needs to be developed.


Author(s):  
Taylor Reiter ◽  
Phillip T. Brooks ◽  
Luiz Irber ◽  
Shannon E.K. Joslin ◽  
Charles M. Reid ◽  
...  

AbstractAs the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.Author SummaryWe present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.


Author(s):  
Jennifer Lu ◽  
Steven L Salzberg

AbstractFor decades, 16S ribosomal RNA sequencing has been the primary means for identifying the bacterial species present in a sample with unknown composition. One of the most widely-used tools for this purpose today is the QIIME (Quantitative Insights Into Microbial Ecology) package. Recent results have shown that the newest release, QIIME 2, has higher accuracy than QIIME, MAPseq, and mothur when classifying bacterial genera from simulated human gut, ocean, and soil metagenomes, although QIIME 2 also proved to be the most computationally expensive method. Kraken, first released in 2014, has been shown to provide exceptionally fast and accurate classification for shotgun metagenomics sequencing projects. Bracken, released in 2016, then provided users with the ability to accurately estimate species or genus abundances using Kraken classification results. Kraken 2, which matches the accuracy and speed of Kraken 1, now supports 16S rRNA databases, allowing for direct comparisons to QIIME and similar systems. Here we show that, using the same simulated 16S rRNA metagenomic data as previous studies, Kraken 2 and Bracken are up to 300 times faster and also more accurate at 16S profiling than QIIME 2.


Sign in / Sign up

Export Citation Format

Share Document