Bayesian Classification of Microbial Communities Based on 16S rRNA Metagenomic Data

Abstract: We propose a Bayesian method for classifying 16S rRNA metagenomic profiles of bacterial abundance. The method introduces a Poisson-Dirichlet-Multinomial hierarchical model for the sequencing data, constructs a prior distribution from sample data, calculates the posterior distribution in closed form, and derives an Optimal Bayesian Classifier (OBC). The proposed algorithm is compared to state-of-the-art classification methods for 16S rRNA metagenomic data, including Random Forests and the phylogeny-based Metaphyl algorithm, for varying sample size, classification difficulty, and dimensionality (number of OTUs), using both synthetic and real metagenomic data sets. The results demonstrate that the proposed OBC method, with either noninformative or constructed priors, is competitive with or superior to the other methods. In particular, when the ratio of sample size to dimensionality is small, the proposed method can vastly outperform the others.

Author summary: Recent studies have highlighted the interplay between host genetics, gut microbes, and colorectal tumor initiation and progression. The characterization of microbial communities using metagenomic profiling has therefore received renewed interest. In this paper, we propose a method for classification, i.e., prediction of different outcomes, based on 16S rRNA metagenomic data. The proposed method employs a Bayesian approach, which is suitable for data sets with a small ratio of the number of available instances to the dimensionality. Results using both synthetic and real metagenomic data show that the proposed method can outperform other state-of-the-art metagenomic classification algorithms.
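
The idea of a closed-form posterior feeding a Bayesian classifier can be illustrated with a much simpler model than the one in the paper. The sketch below is an assumption-laden simplification, not the authors' OBC: it uses a plain Dirichlet-multinomial (no Poisson layer for read depth), pools all training samples of a class into one set of pseudo-counts, and scores a new OTU count vector by its posterior predictive probability. The function names and the toy data are hypothetical.

```python
import math


def log_dirmult_predictive(x, alpha):
    """Log probability of count vector x under a Dirichlet-multinomial(alpha)."""
    n = sum(x)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_i!)
    out = math.lgamma(n + 1) - sum(math.lgamma(xi + 1) for xi in x)
    # Ratio of Dirichlet normalising constants
    out += math.lgamma(a) - math.lgamma(a + n)
    out += sum(math.lgamma(ai + xi) - math.lgamma(ai) for ai, xi in zip(alpha, x))
    return out


def fit_posterior(train_counts, prior_alpha):
    """Closed-form Dirichlet posterior: prior pseudo-counts plus pooled OTU counts.

    Assumption: every sample in a class shares one proportion vector.
    """
    post = list(prior_alpha)
    for sample in train_counts:
        for i, c in enumerate(sample):
            post[i] += c
    return post


def classify(x, posteriors, class_priors):
    """Pick the class with the highest log prior plus log predictive score."""
    scores = {
        label: math.log(class_priors[label]) + log_dirmult_predictive(x, alpha)
        for label, alpha in posteriors.items()
    }
    return max(scores, key=scores.get)


# Toy data: three OTUs, two classes, noninformative prior alpha = (1, 1, 1).
prior = [1.0, 1.0, 1.0]
posteriors = {
    "healthy": fit_posterior([[30, 5, 1], [28, 7, 2]], prior),
    "disease": fit_posterior([[3, 10, 40], [5, 8, 35]], prior),
}
class_priors = {"healthy": 0.5, "disease": 0.5}
print(classify([25, 6, 3], posteriors, class_priors))
```

Because the posterior is available in closed form, no sampling is needed, which is one reason such models remain usable when few samples are available relative to the number of OTUs.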

NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa900 ◽

2020 ◽

Author(s):

Héctor Rodríguez-Pérez ◽

Laura Ciuffreda ◽

Carlos Flores

Keyword(s):

16S rRNA ◽

Species Level ◽

Supplementary Information ◽

Sequencing Data ◽

Abundance Profile ◽

Profile Estimation ◽

Mock Communities ◽

Better Than

Abstract: Summary. NanoCLUST is an analysis pipeline for the classification of amplicon-based full-length 16S rRNA nanopore reads. It is characterized by an unsupervised read clustering step, based on Uniform Manifold Approximation and Projection (UMAP), followed by the construction of a polished read and subsequent Blast classification. Here, we demonstrate that NanoCLUST performs better than other state-of-the-art software in the characterization of two commercial mock communities, enabling accurate bacterial identification and abundance profile estimation at species-level resolution. Availability and implementation. Source code, test data and documentation of NanoCLUST are freely available at https://github.com/genomicsITER/NanoCLUST under MIT License. Supplementary information. Supplementary data are available at Bioinformatics online.
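
The final step named in the summary, abundance profile estimation, can be sketched under some loudly stated assumptions. The code below is not taken from NanoCLUST: it assumes each read already carries a cluster label, that unclustered reads are marked with -1 (a convention chosen here for illustration), and that each cluster has been given one species name from classifying its polished read. All names are hypothetical.

```python
from collections import Counter


def abundance_profile(read_clusters, cluster_species, noise_label=-1):
    """Relative abundance per species from per-read cluster labels.

    read_clusters:   one cluster label per read
    cluster_species: mapping cluster label -> species assigned to that cluster
    noise_label:     label for reads left unclustered (assumed convention)
    """
    counts = Counter(
        cluster_species[c] for c in read_clusters if c != noise_label
    )
    total = sum(counts.values())
    # Normalise by assigned reads only, so the profile sums to 1.
    return {species: n / total for species, n in counts.items()}


# Toy example: five reads, one of them unclustered.
labels = [0, 0, 0, 1, -1]
species = {0: "Escherichia coli", 1: "Bacillus subtilis"}
profile = abundance_profile(labels, species)
print(profile)
```

Normalising over assigned reads only is a design choice; reporting the unclustered fraction separately would be an equally reasonable alternative.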

Reevaluating the Salty Divide: Phylogenetic Specificity of Transitions between Marine and Freshwater Systems

mSystems ◽

10.1128/msystems.00232-18 ◽

2018 ◽

Vol 3 (6) ◽

Cited By ~ 10

Author(s):

Sara F. Paver ◽

Daniel Muratore ◽

Ryan J. Newton ◽

Maureen L. Coleman

Keyword(s):

16S rRNA ◽

Microbial Communities ◽

Relative Abundance ◽

Environmental Changes ◽

Habitat Type ◽

Freshwater Ecosystems ◽

rRNA Gene ◽

Data Sets ◽

Sequencing Data ◽

Habitat Types

ABSTRACT: Marine and freshwater microbial communities are phylogenetically distinct, and transitions between habitat types are thought to be infrequent. We compared the phylogenetic diversity of marine and freshwater microorganisms and identified specific lineages exhibiting notably high or low similarity between marine and freshwater ecosystems using a meta-analysis of 16S rRNA gene tag-sequencing data sets. As expected, marine and freshwater microbial communities differed in the relative abundance of major phyla and contained habitat-specific lineages. At the same time, and contrary to expectations, many shared taxa were observed in both habitats. Based on several metrics, we found that Gammaproteobacteria, Alphaproteobacteria, Bacteroidetes, and Betaproteobacteria contained the highest number of closely related marine and freshwater sequences, suggesting comparatively recent habitat transitions in these groups. Using the abundant alphaproteobacterial group SAR11 as an example, we found evidence that new lineages, beyond the recognized LD12 clade, are detected in freshwater at low but reproducible abundances; this evidence extends beyond the 16S rRNA locus to core genes throughout the genome. Our results suggest that shared taxa are numerous, but tend to occur sporadically and at low relative abundance in one habitat type, leading to an underestimation of transition frequency between marine and freshwater habitats. Rare taxa with abundances near or below detection, including lineages that appear to have crossed the salty divide relatively recently, may possess adaptations enabling them to exploit opportunities for niche expansion when environments are disturbed or conditions change.

IMPORTANCE: The distribution of microbial diversity across environments yields insight into processes that create and maintain this diversity as well as potential to infer how communities will respond to future environmental changes. We integrated data sets from dozens of freshwater lake and marine samples to compare diversity across open water habitats differing in salinity. Our novel combination of sequence-based approaches revealed lineages that likely experienced a recent transition across habitat types. These taxa are promising targets for studying physiological constraints on salinity tolerance. Our findings contribute to understanding the ecological and evolutionary controls on microbial distributions, and open up new questions regarding the plasticity and adaptability of particular lineages.

NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data

10.1101/2020.05.14.087353 ◽

2020 ◽

Cited By ~ 1

Author(s):

Héctor Rodríguez-Pérez ◽

Laura Ciuffreda ◽

Carlos Flores

Keyword(s):

16S rRNA ◽

Species Level ◽

Sequencing Data ◽

Abundance Profile ◽

Profile Estimation ◽

Level Analysis ◽

Mock Communities ◽

Better Than

Abstract: Summary. NanoCLUST is an analysis pipeline for classification of amplicon-based full-length 16S rRNA nanopore reads. It is characterized by an unsupervised read clustering step, based on Uniform Manifold Approximation and Projection (UMAP), followed by the construction of a polished read and subsequent Blast classification. Here we demonstrate that NanoCLUST performs better than other state-of-the-art software in the characterization of two commercial mock communities, enabling accurate bacterial identification and abundance profile estimation at species-level resolution. Availability and implementation. Source code, test data and documentation of NanoCLUST are freely available at https://github.com/genomicsITER/NanoCLUST under MIT license.

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research ◽

10.3390/microbiolres12020022 ◽

2021 ◽

Vol 12 (2) ◽

pp. 317-334

Author(s):

Omar Alaqeeli ◽

Li Xing ◽

Xuekui Zhang

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Classification Tree ◽

Area Under The Curve ◽

Data Sets ◽

Sequencing Data ◽

Single Cell RNA Sequencing ◽

Tree Algorithms ◽

R Packages

Classification trees are a widely used machine learning method with multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations differ, and hence their performance varies from one application to another. We are interested in their performance in classifying cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compared the packages' prediction performance based on Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
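
For readers unfamiliar with the four prediction metrics named above, a minimal sketch of how they are commonly computed for a binary problem follows. It is written in Python rather than R, is not the benchmark's own code, and uses hypothetical function names and toy labels; the AUC here is the rank-based form (the fraction of positive/negative pairs ordered correctly, with ties counted as half).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc(y_true, scores, positive=1):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels, hard predictions and predicted scores for five cells.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.3, 0.6, 0.8]
print(precision_recall_f1(y_true, y_pred), auc(y_true, scores))
```

In a cross-validated benchmark these values would be computed per fold and then averaged; the sketch shows only the per-fold calculation.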

Dynamics of Soil Microbial Communities During Diazepam and Oxazepam Biodegradation in Soil Flooded by Water From a WWTP

Frontiers in Microbiology ◽

10.3389/fmicb.2021.742000 ◽

2021 ◽

Vol 12 ◽

Author(s):

Marc Crampon ◽

Coralie Soulier ◽

Pauline Sidoli ◽

Jennifer Hellal ◽

Catherine Joulian ◽

...

Keyword(s):

16S rRNA ◽

16S rRNA Gene ◽

Microbial Communities ◽

Soil Microbial Communities ◽

Gene Sequencing ◽

16S rRNA Gene Sequencing ◽

Treated Wastewater ◽

rRNA Gene ◽

Sequencing Data ◽

rRNA Gene Sequencing

The demand for energy and chemicals is constantly growing, increasing the amounts of contaminants discharged into the environment. Among these, pharmaceutical molecules are frequently found in treated wastewater that is discharged into surface waters. Indeed, wastewater treatment plants (WWTPs) are designed to remove organic pollution from urban effluents but are not specific, especially toward contaminants of emerging concern (CECs), which ultimately reach the natural environment. In this context, it is important to study the fate of micropollutants, especially in a soil aquifer treatment (SAT) context for water from WWTPs, and for the most persistent molecules such as benzodiazepines. In the present study, soils sampled in a reed bed frequently flooded by water from a WWTP were spiked with diazepam and oxazepam in microcosms, and their concentrations were monitored for 97 days. The two molecules were completely degraded after 15 days of incubation. Samples were collected during the experiment to follow the dynamics of the microbial communities, based on 16S rRNA gene sequencing for Archaea and Bacteria, and on the ITS2 region for Fungi. The evolution of diversity and of specific operational taxonomic units (OTUs) highlighted an impact of the addition of benzodiazepines, a rapid resilience of the fungal community and an evolution of the bacterial community. OTUs from the Brevibacillus genus were more abundant at the beginning of the biodegradation process, for both the diazepam and oxazepam conditions. Additionally, the Tax4Fun tool was applied to the 16S rRNA gene sequencing data to infer the evolution of specific metabolic functions during biodegradation. Finally, the microbial community in soils frequently exposed to water from a WWTP, potentially containing CECs such as diazepam and oxazepam, appears to be adapted to the degradation of persistent contaminants.

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets is required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM

Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

Abstract: Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short-read genome sequencing data; however, no methods exist for long-read sequence data such as those from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows near real-time spoligotyping from long-read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find that its results are identical to those obtained from short-read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.

Mumame: a software tool for quantifying gene-specific point-mutations in shotgun metagenomic data

Metabarcoding and Metagenomics ◽

10.3897/mbmg.3.36236 ◽

2019 ◽

Vol 3 ◽

Cited By ~ 1

Author(s):

Shruthi Magesh ◽

Viktor Jonsson ◽

Johan Bengtsson-Palme

Keyword(s):

Microbial Communities ◽

Point Mutations ◽

Software Tool ◽

Metagenomic Data ◽

Data Sets ◽

Resistance Mutations ◽

Shotgun Metagenomics ◽

Key Factor ◽

Detection Of Mutations ◽

And Function

Metagenomics has emerged as a central technique for studying the structure and function of microbial communities. Often the functional analysis is restricted to classification into broad functional categories. However, important phenotypic differences, such as resistance to antibiotics, are often the result of just one or a few point mutations in otherwise identical sequences. Bioinformatic methods for metagenomic analysis have generally been poor at accounting for this fact, resulting in a somewhat limited picture of important aspects of microbial communities. Here, we address this problem by providing a software tool called Mumame, which can distinguish between wildtype and mutated sequences in shotgun metagenomic data and quantify their relative abundances. We demonstrate the utility of the tool by quantifying antibiotic resistance mutations in several publicly available metagenomic data sets. We also found that sequencing depth is a key factor in detecting rare mutations. Therefore, much larger numbers of sequences may be required for reliable detection of mutations than for most other applications of shotgun metagenomics. Mumame is freely available online (http://microbiology.se/software/mumame).
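
The point about sequencing depth can be made concrete with a small worked calculation. The sketch below is not Mumame's method: it assumes reads covering the mutated position are sampled independently, so the number of mutant reads follows a binomial distribution, and it asks how likely it is to see at least a given number of mutant reads at a given depth. Function names and the example frequency are hypothetical.

```python
import math


def mutant_fraction(wildtype_reads, mutant_reads):
    """Relative abundance of the mutated variant among reads at the site."""
    return mutant_reads / (wildtype_reads + mutant_reads)


def detection_probability(depth, freq, min_reads=1):
    """P(at least min_reads mutant reads) under a binomial sampling assumption."""
    p_fewer = sum(
        math.comb(depth, k) * freq**k * (1 - freq) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


def depth_for_detection(freq, target=0.95):
    """Smallest depth at which P(at least one mutant read) reaches target."""
    return math.ceil(math.log(1 - target) / math.log(1 - freq))


# A mutation carried by 1% of the sequences covering the site.
print(round(detection_probability(100, 0.01), 3))
print(depth_for_detection(0.01))
```

Under this simplified model, coverage of the site has to rise roughly in inverse proportion to the mutation frequency, which is consistent with the abstract's remark that reliable mutation detection may need far more sequences than other shotgun applications.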

Characterization of the depth-related changes in the microbial communities in Lake Hovsgol sediment by 16S rRNA gene-based approaches

The Journal of Microbiology ◽

10.1007/s12275-007-0189-1 ◽

2008 ◽

Vol 46 (2) ◽

pp. 125-136 ◽

Cited By ~ 20

Author(s):

Young-Do Nam ◽

Youlboong Sung ◽

Ho-Won Chang ◽

Seong Woon Roh ◽

Kyoung-Ho Kim ◽

...

Keyword(s):

16S rRNA ◽

16S rRNA Gene ◽

Microbial Communities ◽

rRNA Gene ◽

Lake Hovsgol

Highly accurate long-read HiFi sequencing data for five complex genomes

Scientific Data ◽

10.1038/s41597-020-00743-4 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Ting Hon ◽

Kristin Mars ◽

Greg Young ◽

Yu-Chih Tsai ◽

Joseph W. Karalius ◽

...

Keyword(s):

Sequence Data ◽

Genome Structure ◽

Data Sets ◽

Sequencing Data ◽

Complex Samples ◽

Bioinformatic Tools ◽

Long Reads ◽

Sequencing Method ◽

Sample Data ◽

Long Read

Abstract: The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample data sets both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools, including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples, including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.
