CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411v1, 2014

Author(s): Jai Ram Rideout, Yan He, Jose Antonio Navas-Molina, William A Walters, Luke K Ursell, ...

Keyword(s): 16S rRNA, De Novo, Sequence Data, Marker Gene, Community Analysis, Microbial Community Analysis, Reference Database, Computationally Efficient, Highly Correlated, Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because parts of our algorithm can be run in parallel, it makes open-reference OTU picking tractable on massive amplicon sequence data sets. We illustrate this by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed. Through comparisons on three well-studied datasets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “legacy” open-reference OTU picking, in which less of the process can be parallelized. We therefore recommend that subsampled open-reference OTU picking always be used in preference to “legacy” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters.
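To make the relationship between closed-reference, de novo, and subsampled open-reference OTU picking more concrete, the following is a minimal Python sketch of how such a workflow might be organised. It is an illustration under stated assumptions, not QIIME's implementation: the four-step layout, the function names, the `fraction` parameter, and the toy `identity` measure (a stand-in for a real clustering tool such as uclust) are all hypothetical and are not taken from the abstract.

```python
import random


def identity(a, b):
    """Fraction of matching positions (toy stand-in for a real aligner)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, centroids, threshold):
    """Assign each read to the first matching centroid; collect the failures.

    Each read is handled independently, so this step could be parallelized.
    """
    otus = {cid: [] for cid in centroids}
    failures = []
    for read in reads:
        for cid, seq in centroids.items():
            if identity(read, seq) >= threshold:
                otus[cid].append(read)
                break
        else:
            failures.append(read)
    return otus, failures


def de_novo(reads, threshold, prefix):
    """Greedy serial clustering: a read that matches no centroid founds a new OTU."""
    centroids, otus = {}, {}
    for read in reads:
        for cid, seq in centroids.items():
            if identity(read, seq) >= threshold:
                otus[cid].append(read)
                break
        else:
            cid = f"{prefix}{len(centroids)}"
            centroids[cid] = read
            otus[cid] = [read]
    return centroids, otus


def subsampled_open_reference(reads, reference, threshold=0.97,
                              fraction=0.1, seed=0):
    """Hypothetical four-step layout; assumes reference IDs do not start with 'new' or 'denovo'."""
    # Step 1: closed-reference picking against the reference database.
    otus, failures = closed_reference(reads, reference, threshold)
    # Step 2: cluster a random subsample of the failures de novo
    # to obtain new centroids.
    rng = random.Random(seed)
    k = max(1, int(len(failures) * fraction)) if failures else 0
    new_centroids, _ = de_novo(rng.sample(failures, k), threshold, "new")
    # Step 3: closed-reference picking of all failures against the new centroids.
    new_otus, leftovers = closed_reference(failures, new_centroids, threshold)
    # Step 4: de novo clustering of whatever still fails to match.
    _, leftover_otus = de_novo(leftovers, threshold, "denovo")
    otus.update(new_otus)
    otus.update(leftover_otus)
    return {cid: members for cid, members in otus.items() if members}


reference = {"ref0": "AAAAAAAAAA"}
reads = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC", "CCCCCCCCCC", "GGGGGGGGGG"]
result = subsampled_open_reference(reads, reference, threshold=0.9, fraction=0.5)
for otu_id, members in result.items():
    print(otu_id, len(members))
```

In this sketch only the subsample (step 2) and the final leftovers (step 4) go through the serial de novo routine, while steps 1 and 3 treat every read independently. That is one way to picture why more of the work can be run in parallel than when every unmatched read is clustered de novo.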

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411v2, 2014

Author(s): Jai Ram Rideout, Yan He, Jose Antonio Navas-Molina, William A Walters, Luke K Ursell, ...

Keyword(s): 16S rRNA, De Novo, Sequence Data, Marker Gene, Community Analysis, Microbial Community Analysis, Reference Database, Data Sets, Computationally Efficient, Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate this by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that “classic” open-reference OTU picking would require. Through comparisons on three well-studied datasets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

Bacterial Community Analysis of Mine Contaminated Soils in Hechi City

Journal of Biobased Materials and Bioenergy, 10.1166/jbmb.2020.1967, 2020, Vol 14 (4), pp. 476-486

Author(s): Tingting Liu, Caoping Pang, Fengcai Ye, Dafei Gong, Jieling Luo, ...

Keyword(s): Microbial Community, Genetic Engineering, Bacterial Community, Contaminated Soils, High Throughput Sequencing, Community Analysis, Microbial Community Analysis, Mine Soils, Sequencing Technology, Autonomous Region

Four mine-contaminated soils in the northwest of the Guangxi Autonomous Region were selected for microbial community analysis. These mine soils were contaminated by chromium (Cr) and cadmium (Cd). Microbial communities were characterized by high-throughput sequencing, which revealed 39 different phyla across the four samples. Among these, Proteobacteria was the most abundant phylum in all samples. Acidobacteria, Actinobacteria, Planctomycetes, Firmicutes, Gemmatimonadetes, Bacteroidetes and Chloroflexi showed higher relative abundances than the other phyla. In addition, a wide diversity of bacteria with bioremediation potential, such as Sphingomonas, Lysobacter and Gemmatimonas, was detected in the tested soils. The results of the microbial community analysis will provide new targets for the isolation of microorganisms with bioremediation potential and lay the foundation for greatly enhancing bioremediation ability through genetic engineering of indigenous microorganisms in the future.

Bioinformatic approaches for analysis of coral-associated bacteria using R programming language

Vietnam Journal of Biotechnology, 10.15625/1811-4989/18/4/15320, 2021, Vol 18 (4), pp. 733-743

Author(s): Doan Thi Nhung, Bui Van Ngoc

Keyword(s): Programming Language, Community Analysis, Microbial Community Analysis, Metagenomic Data, rRNA Gene, Marine Microorganisms, Taxonomic Assignment, Associated Bacteria, R Programming Language, R Programming

Recent advances in metagenomics and bioinformatics allow robust analysis of the composition and abundance of microbial communities, functional genes, and their metabolic pathways. Although a variety of computational and statistical tools exist for microbiome analysis, a common problem in using them together is the lack of synchronization and compatibility between the input and output data formats of different software. To overcome this challenge, we apply the DADA2 pipeline (written in the R programming language), rather than a set of separate bioinformatics tools, to create our own workflow for microbial community analysis in a continuous and synchronous manner. As a first effort, we investigated the composition and abundance of coral-associated bacteria using their 16S rRNA gene amplicon sequences. The workflow includes the following steps: data processing, sequence clustering, taxonomic assignment, and data visualization. We also wish to draw readers’ attention to bacterial communities living in the ocean, since most marine microorganisms are unculturable, especially those residing in coral reefs; in this case, the bacteria are associated with the coral Acropora tenuis. The outcomes of this study suggest that the DADA2 pipeline is a promising bioinformatics approach for microbiome analysis and an alternative to combining various software packages. In addition, our modifications to the workflow execution help researchers to illustrate metagenomic data more easily and systematically, to elucidate the composition, abundance, diversity, and relationships of microbial communities, and to develop other bioinformatic tools more effectively.

Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681, 2016

Author(s): Shea N Gardner, Sasha K Ames, Maya B Gokhale, Tom R Slezak, Jonathan Allen

Keyword(s): Large Scale, Genomic Sequence, Sequence Data, Low Cost, False Negative, Human Microbiome, Human Microbiome Project, Metagenomic Data, Reference Database, Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large-scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of DRAM, an amount available on many clusters, or with 16 GB of DRAM plus a 24 GB low-cost commodity flash drive (NVRAM), a cost-effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis on 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution for reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real-world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.

Perchlorate reduction by hydrogen autotrophic bacteria and microbial community analysis using high-throughput sequencing

Biodegradation, 10.1007/s10532-015-9754-1, 2015, Vol 27 (1), pp. 47-57, Cited By ~ 16

Author(s): Dongjin Wan, Yongde Liu, Zhenhua Niu, Shuhu Xiao, Daorong Li

Keyword(s): Microbial Community, High Throughput, High Throughput Sequencing, Community Analysis, Microbial Community Analysis, Perchlorate Reduction, Autotrophic Bacteria

Molecular Microbial Community Analysis as an Analysis Tool for Optimal Biogas Production

Microorganisms, 10.3390/microorganisms9061162, 2021, Vol 9 (6), pp. 1162

Author(s): Seyedbehnam Hashemi, Sayed Ebrahim Hashemi, Kristian M. Lien, Jacob J. Lamb

Keyword(s): Microbial Diversity, High Throughput Sequencing, Biogas Production, Operating Time, Community Analysis, Microbial Community Analysis, Analysis Tool, High Resolution Data, On Line, Insight Into

The microbial diversity in anaerobic digestion (AD) is important because it affects process robustness. High-throughput sequencing offers high-resolution data on the microbial diversity and robustness of biological systems, including AD; however, knowledge of microbial diversity alone is not sufficient to understand the dynamics of microbial processes. Advanced meta-omic techniques have been established to determine the activity of, and interactions among, organisms in biological processes like AD. The results of these methods can be used to identify biomarkers for AD states. This can support a better understanding of system dynamics and be applied to producing comprehensive models for AD. The paper provides valuable knowledge regarding the possibility of integrating molecular methods in AD. Although meta-genomic methods are not suitable for on-line use because of their long operating times and high costs, they provide extensive insight into the microbial phylogeny in AD. Meta-proteomics can also be explored in demonstration projects for failure prediction. However, for these methods to be fully realised in AD, a biomarker database needs to be developed.

Streamlining Data-Intensive Biology With Workflow Systems

10.1101/2020.06.30.178673, 2020, Cited By ~ 1

Author(s): Taylor Reiter, Phillip T. Brooks, Luiz Irber, Shannon E.K. Joslin, Charles M. Reid, ...

Keyword(s): Data Analysis, Large Scale, High Throughput Sequencing, Sequence Data, Open Science, Biological Data, Data Generation, Biological Sequence, Sequencing Data, Workflow Systems

Abstract: As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.

Author Summary: We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.

Ultrafast and accurate 16S microbial community analysis using Kraken 2

10.1101/2020.03.27.012047, 2020, Cited By ~ 2

Author(s): Jennifer Lu, Steven L Salzberg

Keyword(s): 16S rRNA, Ribosomal RNA, Bacterial Species, Community Analysis, Microbial Community Analysis, Metagenomic Data, Human Gut, Shotgun Metagenomics, Primary Means, Unknown Composition

Abstract: For decades, 16S ribosomal RNA sequencing has been the primary means for identifying the bacterial species present in a sample with unknown composition. One of the most widely-used tools for this purpose today is the QIIME (Quantitative Insights Into Microbial Ecology) package. Recent results have shown that the newest release, QIIME 2, has higher accuracy than QIIME, MAPseq, and mothur when classifying bacterial genera from simulated human gut, ocean, and soil metagenomes, although QIIME 2 also proved to be the most computationally expensive method. Kraken, first released in 2014, has been shown to provide exceptionally fast and accurate classification for shotgun metagenomics sequencing projects. Bracken, released in 2016, then provided users with the ability to accurately estimate species or genus abundances using Kraken classification results. Kraken 2, which matches the accuracy and speed of Kraken 1, now supports 16S rRNA databases, allowing for direct comparisons to QIIME and similar systems. Here we show that, using the same simulated 16S rRNA metagenomic data as previous studies, Kraken 2 and Bracken are up to 300 times faster and also more accurate at 16S profiling than QIIME 2.
