Differential richness inference for 16S rRNA marker gene surveys

Individual and environmental health outcomes are frequently linked to changes in the diversity of associated microbial communities. This makes deriving health indicators based on microbiome diversity measures essential. While microbiome data generated using high-throughput 16S rRNA marker gene surveys are appealing for this purpose, 16S surveys also generate a plethora of spurious microbial taxa. When this artificial inflation in the observed number of taxa (i.e., richness, a diversity measure) is ignored, we find that changes in the abundance of detected taxa confound current methods for inferring differences in richness. Here we argue that the evidence from our own experiments, theory-guided exploratory data analyses, and the existing literature supports the conclusion that most sub-genus discoveries are spurious artifacts of clustering 16S sequencing reads. Building on this finding, we model a 16S survey's systematic patterns of sub-genus taxa generation as a function of genus abundance to derive a robust control for false taxa accumulation. Such controls unlock classical regression approaches for highly flexible differential richness inference at various levels of the surveyed microbial assemblage: from sample groups to specific taxa collections. The proposed methodology for differential richness inference is available through an R package, Prokounter. Package availability: https://github.com/mskb01/prokounter
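
To make the richness terminology above concrete, here is a minimal Python sketch that counts detected taxa in one sample at the sub-genus level and at the genus level. It is not Prokounter's method or API. The table layout, the function names, and the read counts are all made up for illustration.

```python
from typing import Dict, Tuple

# Hypothetical layout: a taxon is keyed by (genus, sub-genus cluster label),
# and the value is its read count in a single sample.
CountTable = Dict[Tuple[str, str], int]


def observed_richness(counts: CountTable) -> int:
    """Number of sub-genus taxa detected (read count > 0) in the sample."""
    return sum(1 for reads in counts.values() if reads > 0)


def genus_richness(counts: CountTable) -> int:
    """Number of distinct genera with at least one detected taxon."""
    return len({genus for (genus, _), reads in counts.items() if reads > 0})


# Toy sample with made-up counts: one abundant genus carries several
# low-count sub-genus clusters, which raises sub-genus richness only.
sample: CountTable = {
    ("Bacteroides", "otu_1"): 950,
    ("Bacteroides", "otu_2"): 3,
    ("Bacteroides", "otu_3"): 1,
    ("Prevotella", "otu_4"): 40,
    ("Roseburia", "otu_5"): 0,
}

print("sub-genus richness:", observed_richness(sample))
print("genus richness:", genus_richness(sample))
```

In this toy sample the two counts differ only because of the low-count clusters under the abundant genus, which is the kind of abundance-dependent inflation the abstract describes.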

Uncovering thematic structure to link co-occurring taxa and predicted functional content in 16S rRNA marker gene surveys

10.1101/146126 ◽

2017 ◽

Cited By ~ 3

Author(s):

Stephen Woloszynek ◽

Joshua Chang Mell ◽

Gideon Simpson ◽

Michael P. O’Connor ◽

Gail L. Rosen

Keyword(s):

16S Rrna ◽

Topic Model ◽

Marker Gene ◽

Amplicon Sequencing ◽

R Package ◽

Bayesian Regression ◽

Thematic Structure ◽

Microbiome Data ◽

Functional Components ◽

Functional Content

Abstract Background Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). But elucidating key associations is often difficult since microbiome data are compositional, high dimensional, and sparse. Also, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to, for example, host status. Preserving the configuration of co-occurring microbes rather than detecting specific indicator species is more likely to facilitate biologically meaningful interpretations. In addition, analyses that utilize both taxonomic and predicted functional abundances typically independently characterize the taxonomic and functional profiles before linking them to sample information. This prevents investigators from identifying which functional components are associated with which subsets of co-occurring taxa. Results We provide an approach to explore co-occurring taxa using “topics” generated via a topic model and then link these topics to specific sample classes (e.g., diseased versus healthy). Rather than inferring predicted functional content independently from taxonomic abundances, we instead focus on inference of functional content within topics, which we parse by estimating pathway-topic interactions through a multilevel, fully Bayesian regression model. We apply our methods to two large publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease (IBD) dataset from Gevers et al. and data from the American Gut (AG) project. When applied to the Gevers et al. IBD study, we demonstrate that a topic highly associated with Crohn’s disease (CD) diagnosis is (1) dominated by a cluster of bacteria known to be linked with CD and (2) uniquely enriched for a subset of lipopolysaccharide (LPS) synthesis genes. In the AG data, our approach found that individuals with plant-based diets were enriched with Lachnospiraceae, Roseburia, Blautia, and Ruminococcaceae, as well as fluorobenzoate degradation pathways, whereas pathways involved in LPS biosynthesis were depleted. Conclusions We introduce an approach for uncovering latent thematic structure in the context of sample features for 16S rRNA surveys. Using our topic-model approach, investigators can (1) capture groups of co-occurring taxa termed topics, (2) uncover within-topic functional potential, and (3) identify gene sets that may guide future inquiry. These methods have been implemented in a freely available R package https://github.com/EESI/themetagenomics.

metagenomeFeatures: An R package for working with 16S rRNA reference databases and marker-gene survey feature data

10.1101/339812 ◽

2018 ◽

Cited By ~ 1

Author(s):

Nathan D. Olson ◽

Nidhi Shah ◽

Jayaram Kancherla ◽

Justin Wagner ◽

Joseph N. Paulson ◽

...

Keyword(s):

16S Rrna ◽

Marker Gene ◽

R Package ◽

Bioconductor Package ◽

Rrna Sequence ◽

16S Rrna Sequence ◽

Crucial Step ◽

Reference Databases ◽

Database Comparison ◽

Sequence Databases

Abstract We developed the metagenomeFeatures R Bioconductor package along with annotation packages for the three primary 16S rRNA databases (Greengenes, RDP, and SILVA) to facilitate working with 16S rRNA sequence databases and marker-gene survey feature data. The metagenomeFeatures package defines two classes, MgDb for working with 16S rRNA sequence databases, and mgFeatures for working with marker-gene survey feature data. The associated annotation packages provide a consistent interface to the different 16S rRNA databases facilitating database comparison and exploration. The mgFeatures class represents a crucial step in the development of a common data structure for working with 16S marker-gene survey data in R. Availability https://bioconductor.org/packages/release/bioc/html/metagenomeFeatures.html

metagenomeFeatures: an R package for working with 16S rRNA reference databases and marker-gene survey feature data

Bioinformatics ◽

10.1093/bioinformatics/btz136 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3870-3872 ◽

Cited By ~ 1

Author(s):

Nathan D Olson ◽

Nidhi Shah ◽

Jayaram Kancherla ◽

Justin Wagner ◽

Joseph N Paulson ◽

...

Keyword(s):

16S Rrna ◽

Marker Gene ◽

R Package ◽

Supplementary Information ◽

Bioconductor Package ◽

Rrna Sequence ◽

16S Rrna Sequence ◽

Reference Databases ◽

Supplementary Material ◽

Database Comparison

Abstract Summary We developed the metagenomeFeatures R Bioconductor package along with annotation packages for three 16S rRNA databases (Greengenes, RDP and SILVA) to facilitate working with 16S rRNA databases and marker-gene survey feature data. The metagenomeFeatures package defines two classes, MgDb for working with 16S rRNA sequence databases, and mgFeatures for marker-gene survey feature data. The associated annotation packages provide a consistent interface to the different databases facilitating database comparison and exploration. The mgFeatures-class represents a crucial step in the development of a common data structure for working with 16S marker-gene survey data in R. Availability and implementation https://bioconductor.org/packages/release/bioc/html/metagenomeFeatures.html. Supplementary information Supplementary material is available at Bioinformatics online.

Metabolic pathways inferred from a bacterial marker gene illuminate ecological changes across South Pacific frontal boundaries

Nature Communications ◽

10.1038/s41467-021-22409-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Eric J. Raes ◽

Kristen Karsh ◽

Swan L. S. Sow ◽

Martin Ostrowski ◽

Mark V. Brown ◽

...

Keyword(s):

16S Rrna ◽

Metabolic Pathways ◽

Low Cost ◽

Marker Gene ◽

South Pacific ◽

Rrna Gene ◽

South Pacific Ocean ◽

Bacterial Marker ◽

Gene 16S Rrna ◽

Gene Data

Abstract Global oceanographic monitoring initiatives originally measured abiotic essential ocean variables but are currently incorporating biological and metagenomic sampling programs. There is, however, a large knowledge gap on how to infer bacterial functions, the information sought by biogeochemists, ecologists, and modelers, from the bacterial taxonomic information (produced by bacterial marker gene surveys). Here, we provide a correlative understanding of how a bacterial marker gene (16S rRNA) can be used to infer latitudinal trends for metabolic pathways in global monitoring campaigns. From a transect spanning 7000 km in the South Pacific Ocean we infer ten metabolic pathways from 16S rRNA gene sequences and 11 corresponding metagenome samples, which relate to metabolic processes of primary productivity, temperature-regulated thermodynamic effects, coping strategies for nutrient limitation, energy metabolism, and organic matter degradation. This study demonstrates that low-cost, high-throughput bacterial marker gene data can be used to infer shifts in the metabolic strategies at the community scale.

Specialized Pro-Resolving Mediator Lipidome and 16S rRNA Bacterial Microbiome Data Associated with Human Chronic Rhinosinusitis

Data in Brief ◽

10.1016/j.dib.2021.107023 ◽

2021 ◽

pp. 107023

Author(s):

Thad W. Vickery ◽

Michael Armstrong ◽

Jennifer M. Kofonow ◽

Charles E. Robertson ◽

Miranda E. Kroehl ◽

...

Keyword(s):

16S Rrna ◽

Chronic Rhinosinusitis ◽

Bacterial Microbiome ◽

Microbiome Data

PPIT: an R package for inferring microbial taxonomy from nifH sequences

Bioinformatics ◽

10.1093/bioinformatics/btab100 ◽

2021 ◽

Author(s):

Bennett J Kapili ◽

Anne E Dekas

Keyword(s):

Gene Transfer ◽

Horizontal Gene Transfer ◽

Query Sequence ◽

Marker Gene ◽

R Package ◽

Supplementary Information ◽

Marker Genes ◽

Pairwise Identity ◽

Metabolic Marker ◽

Microbial Taxonomy

Abstract Motivation Linking microbial community members to their ecological functions is a central goal of environmental microbiology. When assigned taxonomy, amplicon sequences of metabolic marker genes can suggest such links, thereby offering an overview of the phylogenetic structure underpinning particular ecosystem functions. However, inferring microbial taxonomy from metabolic marker gene sequences remains a challenge, particularly for the frequently sequenced nitrogen fixation marker gene, nitrogenase reductase (nifH). Horizontal gene transfer in recent nifH evolutionary history can confound taxonomic inferences drawn from the pairwise identity methods used in existing software. Other methods for inferring taxonomy are not standardized and require manual inspection that is difficult to scale. Results We present Phylogenetic Placement for Inferring Taxonomy (PPIT), an R package that infers microbial taxonomy from nifH amplicons using both phylogenetic and sequence identity approaches. After users place query sequences on a reference nifH gene tree provided by PPIT (n = 6317 full-length nifH sequences), PPIT searches the phylogenetic neighborhood of each query sequence and attempts to infer microbial taxonomy. An inference is drawn only if references in the phylogenetic neighborhood are: (1) taxonomically consistent and (2) share sufficient pairwise identity with the query, thereby avoiding erroneous inferences due to known horizontal gene transfer events. We find that PPIT returns a higher proportion of correct taxonomic inferences than BLAST-based approaches at the cost of fewer total inferences. We demonstrate PPIT on deep-sea sediment and find that Deltaproteobacteria are the most abundant potential diazotrophs. Using this dataset we show that emending PPIT inferences based on visual inspection of query sequence placement can achieve taxonomic inferences for nearly all sequences in a query set. 
We additionally discuss how users can apply PPIT to the analysis of other marker genes. Availability PPIT is freely available to non-commercial users at https://github.com/bkapili/ppit. Installation includes a vignette that demonstrates package use and reproduces the nifH amplicon analysis discussed here. The raw nifH amplicon sequence data have been deposited in the GenBank, EMBL, and DDBJ databases under BioProject number PRJEB37167. Supplementary information Supplementary data are available at Bioinformatics online.
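
The two-condition rule described in the abstract (neighboring references must be taxonomically consistent and share sufficient pairwise identity with the query) can be sketched in Python as follows. This is only an illustration of the stated logic, not PPIT's code or API. The `Neighbor` type, the `infer_taxon` function, the 0.90 threshold, and the choice to require every neighbor to pass the identity check are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Neighbor:
    """A reference sequence in the query's phylogenetic neighborhood (hypothetical type)."""
    taxon: str       # taxonomic label of the reference at the rank being inferred
    identity: float  # pairwise identity with the query, in [0, 1]


def infer_taxon(neighbors: List[Neighbor], min_identity: float) -> Optional[str]:
    """Return a taxon only when both conditions hold; otherwise make no inference."""
    if not neighbors:
        return None
    taxa = {n.taxon for n in neighbors}
    # Condition 1: the neighborhood must be taxonomically consistent.
    if len(taxa) != 1:
        return None
    # Condition 2 (assumed form): every neighbor must meet the identity threshold.
    if any(n.identity < min_identity for n in neighbors):
        return None
    return next(iter(taxa))


# Made-up neighborhoods and an illustrative threshold, not values from the paper.
consistent = [Neighbor("Deltaproteobacteria", 0.95), Neighbor("Deltaproteobacteria", 0.93)]
mixed = [Neighbor("Deltaproteobacteria", 0.95), Neighbor("Firmicutes", 0.94)]
distant = [Neighbor("Deltaproteobacteria", 0.70)]

for name, hood in [("consistent", consistent), ("mixed", mixed), ("distant", distant)]:
    print(name, infer_taxon(hood, min_identity=0.90))
```

Returning no inference whenever either condition fails mirrors the trade-off the abstract reports: a higher proportion of correct inferences at the cost of fewer total inferences.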

Exploring protocol bias in airway microbiome studies: one versus two PCR steps and 16S rRNA gene region V3 V4 versus V4

BMC Genomics ◽

10.1186/s12864-020-07252-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Christine Drengenes ◽

Tomas M. L. Eagan ◽

Ingvild Haaland ◽

Harald G. Wiker ◽

Rune Nielsen

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Bacterial Load ◽

Marker Gene ◽

Gene Region ◽

Lower Airway ◽

Rrna Gene ◽

Wide Range ◽

Airway Microbiome ◽

The Impact

Abstract Background Studies on the airway microbiome have been performed using a wide range of laboratory protocols for high-throughput sequencing of the bacterial 16S ribosomal RNA (16S rRNA) gene. We sought to determine the impact of number of polymerase chain reaction (PCR) steps (1 or 2 steps) and choice of target marker gene region (V3 V4 and V4) on the presentation of the upper and lower airway microbiome. Our analyses included Illumina MiSeq sequencing following three setups: Setup 1 (2-step PCR; V3 V4 region), Setup 2 (2-step PCR; V4 region), Setup 3 (1-step PCR; V4 region). Samples included oral wash, protected specimen brushes and protected bronchoalveolar lavage (healthy and obstructive lung disease), and negative controls. Results The number of sequences and amplicon sequence variants (ASV) decreased in order setup 1 > setup 2 > setup 3. This trend appeared to be associated with an increased taxonomic resolution when sequencing the V3 V4 region (setup 1) and an increased number of small ASVs in setups 1 and 2. The latter was considered a result of contamination in the two-step PCR protocols as well as sequencing across multiple runs (setup 1). Although genera Streptococcus, Prevotella, Veillonella and Rothia dominated, differences in relative abundance were observed across all setups. Analyses of beta-diversity revealed that while oral wash samples (high biomass) clustered together regardless of number of PCR steps, samples from the lungs (low biomass) separated. The removal of contaminants identified using the Decontam package in R did not resolve differences in results between sequencing setups. Conclusions Differences in number of PCR steps will have an impact on final bacterial community descriptions, and more so for samples of low bacterial load.
Our findings could not be explained by differences in contamination levels alone, and more research is needed to understand how variations in PCR-setups and reagents may be contributing to the observed protocol bias.

tidyMicro: a pipeline for microbiome data analysis and visualization using the tidyverse in R

BMC Bioinformatics ◽

10.1186/s12859-021-03967-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Charlie M. Carpenter ◽

Daniel N. Frank ◽

Kayla Williamson ◽

Jaron Arbet ◽

Brandie D. Wagner ◽

...

Keyword(s):

Microbial Communities ◽

Open Source ◽

Data Structures ◽

Negative Binomial ◽

Rocky Mountain ◽

R Package ◽

Microbiome Analysis ◽

External Data ◽

Data Tables ◽

Microbiome Data

Abstract Background The drive to understand how microbial communities interact with their environments has inspired innovations across many fields. The data generated from sequence-based analyses of microbial communities typically are of high dimensionality and can involve multiple data tables consisting of taxonomic or functional gene/pathway counts. Merging multiple high dimensional tables with study-related metadata can be challenging. Existing microbiome pipelines available in R have created their own data structures to manage this problem. However, these data structures may be unfamiliar to analysts new to microbiome data or R and do not allow for deviations from internal workflows. Existing analysis tools also focus primarily on community-level analyses and exploratory visualizations, as opposed to analyses of individual taxa. Results We developed the R package “tidyMicro” to serve as a more complete microbiome analysis pipeline. This open source software provides all of the essential tools available in other popular packages (e.g., management of sequence count tables, standard exploratory visualizations, and diversity inference tools) supplemented with multiple options for regression modelling (e.g., negative binomial, beta binomial, and/or rank based testing) and novel visualizations to improve interpretability (e.g., Rocky Mountain plots, longitudinal ordination plots). This comprehensive pipeline for microbiome analysis also maintains data structures familiar to R users to improve analysts’ control over workflow. A complete vignette is provided to aid new users in analysis workflow. Conclusions tidyMicro provides a reliable alternative to popular microbiome analysis packages in R. We provide standard tools as well as novel extensions on standard analyses to improve the interpretability of results while maintaining object malleability to encourage open source collaboration.
The simple examples and full workflow from the package are reproducible and applicable to external data sets.

Utilizing the Microbiota and Machine Learning Algorithms to Assess Risk of Salmonella Contamination in Poultry Rinsate

Journal of Food Protection ◽

10.4315/jfp-20-367 ◽

2021 ◽

Author(s):

Hannah Bolinger ◽

David Tran ◽

Kenneth Harary ◽

George C. Paoli ◽

Giselle Guron ◽

...

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Diagnostic Tools ◽

Sequencing Data ◽

Testing Methods ◽

16S Sequencing ◽

Sequencing Technologies ◽

Microbiological Testing ◽

Microbiome Data ◽

Larger Sample

Traditional microbiological testing methods are slow, and many molecular-based techniques rely on culture-based enrichment to overcome low limits of detection. Recent advancements in sequencing technologies may make it possible to utilize machine learning (ML) to identify patterns in microbiome data to potentially predict the presence or absence of pathogens. In this study, 299 poultry rinsate samples from various points in the processing chain were analyzed to determine if microbiota could inform about a sample’s risk for containing Salmonella. Samples were culture confirmed as Salmonella-positive or -negative following modified USDA MLG protocols. The culture confirmation result was used as a reference to compare with 16S sequencing data. Pre-chill samples tested positive (71/82) at a higher frequency than post-chill samples (30/217) and contained greater microbial diversity. Due to their larger sample size, post-chill samples were analyzed more deeply. Analysis of variance (ANOVA) identified a significant effect of chilling on the number of genera (p<0.001), but analysis of similarities (ANOSIM) failed to provide evidence for microbial dissimilarity between pre- and post-chill samples (p=0.001, R=0.443). Various ML models were trained using post-chill samples to predict if a sample contained Salmonella based on the samples’ microbiota pre-enrichment. The optimal model was a Random Forest-based model with a performance as follows: accuracy (88%), sensitivity (85%), specificity (90%). While the algorithms described in this paper are prototypes, these risk-based algorithms demonstrate the potential and need for further studies to provide insight alongside diagnostic tests. Combining risk-based information with diagnostic tools can help poultry processors make informed decisions to help identify and prevent the spread of Salmonella. These data add to the growing body of literature exploring novel ways to utilize microbiome data for predictive food safety.
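
As a reminder of how the reported performance measures are defined, the Python sketch below computes accuracy, sensitivity, and specificity from confusion counts, using culture confirmation as the reference. It also computes the positive fractions from the counts stated in the abstract (71/82 and 30/217). The confusion counts in `toy` are hypothetical and are not the study's data. The class and function names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary classifier outcomes against a reference result (hypothetical type)."""
    tp: int  # predicted positive, reference positive
    fp: int  # predicted positive, reference negative
    tn: int  # predicted negative, reference negative
    fn: int  # predicted negative, reference positive


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all samples classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of reference-positive samples predicted positive."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of reference-negative samples predicted negative."""
    return c.tn / (c.tn + c.fp)


# Positive fractions from the counts given in the abstract.
pre_chill_positive = 71 / 82
post_chill_positive = 30 / 217

# Hypothetical confusion counts, for illustration only.
toy = ConfusionCounts(tp=8, fp=2, tn=18, fn=2)

print(f"pre-chill positive fraction:  {pre_chill_positive:.3f}")
print(f"post-chill positive fraction: {post_chill_positive:.3f}")
print(f"toy accuracy:    {accuracy(toy):.3f}")
print(f"toy sensitivity: {sensitivity(toy):.3f}")
print(f"toy specificity: {specificity(toy):.3f}")
```

Because post-chill positives are a small share of the samples, accuracy alone can look high even for a weak classifier, which is one reason to read it alongside sensitivity and specificity.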

Performance and Application of 16S rRNA Gene Cycle Sequencing for Routine Identification of Bacteria in the Clinical Microbiology Laboratory

Clinical Microbiology Reviews ◽

10.1128/cmr.00053-19 ◽

2020 ◽

Vol 33 (4) ◽

Cited By ~ 1

Author(s):

Deirdre L. Church ◽

Lorenzo Cerutti ◽

Antoine Gürtler ◽

Thomas Griener ◽

Adrian Zelazny ◽

...

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Clinical Microbiology ◽

Clinical Microbiology Laboratory ◽

Rrna Gene ◽

Microbiology Laboratory ◽

Cycle Sequencing ◽

16S Sequencing ◽

Routine Identification ◽

Identification Of Bacteria

SUMMARY This review provides a state-of-the-art description of the performance of Sanger cycle sequencing of the 16S rRNA gene for routine identification of bacteria in the clinical microbiology laboratory. A detailed description of the technology and current methodology is outlined with a major focus on proper data analyses and interpretation of sequences. The remainder of the article is focused on a comprehensive evaluation of the application of this method for identification of bacterial pathogens based on analyses of 16S multialignment sequences. In particular, the existing limitations of similarity within 16S for genus- and species-level differentiation of clinically relevant pathogens and the lack of sequence data currently available in public databases are highlighted. A multiyear experience is described of a large regional clinical microbiology service with direct 16S broad-range PCR followed by cycle sequencing for direct detection of pathogens in appropriate clinical samples. The ability of proteomics (matrix-assisted laser desorption ionization-time of flight) versus 16S sequencing for bacterial identification and genotyping is compared. Finally, the potential for whole-genome analysis by next-generation sequencing (NGS) to replace 16S sequencing for routine diagnostic use is presented for several applications, including the barriers that must be overcome to fully implement newer genomic methods in clinical microbiology. A future challenge for large clinical, reference, and research laboratories, as well as for industry, will be the translation of vast amounts of accrued NGS microbial data into convenient algorithm testing schemes for various applications (i.e., microbial identification, genotyping, and metagenomics and microbiome analyses) so that clinically relevant information can be reported to physicians in a format that is understood and actionable.
These challenges will not be faced by clinical microbiologists alone but by every scientist involved in a domain where natural diversity of genes and gene sequences plays a critical role in disease, health, pathogenicity, epidemiology, and other aspects of life-forms. Overcoming these challenges will require global multidisciplinary efforts across fields that do not normally interact with the clinical arena to make vast amounts of sequencing data clinically interpretable and actionable at the bedside.
