Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411 ◽

2014 ◽

Author(s):

Jai Ram Rideout ◽

Yan He ◽

Jose Antonio Navas-Molina ◽

William A Walters ◽

Luke K Ursell ◽

...

Keyword(s):

16S Rrna ◽

De Novo ◽

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Reference Database ◽

Data Sets ◽

Computationally Efficient ◽

Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that “classic” open-reference OTU picking would require. Through comparisons on three well-studied datasets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
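
The relationship between closed-reference, de novo and open-reference OTU picking described in the abstract above can be illustrated with a toy sketch. This is not QIIME's implementation: the position-wise `identity` function stands in for a real aligner such as uclust, the function names and example sequences are hypothetical, and the subsampling and parallel execution that distinguish the authors' algorithm are omitted.

```python
def identity(a, b):
    """Fraction of matching positions (toy stand-in for a real aligner)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_hit(seq, centroids, threshold):
    """Return the id of the most similar centroid at or above threshold, else None."""
    best_id, best_score = None, 0.0
    for cid, centroid in centroids.items():
        score = identity(seq, centroid)
        if score >= threshold and score > best_score:
            best_id, best_score = cid, score
    return best_id


def open_reference_otus(reads, reference, threshold=0.97):
    """Assign every read to an OTU: reference hits first, then de novo clusters."""
    otus = {}
    failures = []
    # Closed-reference step: each read is compared to the reference
    # independently, which is why this part can run in parallel.
    for rid, seq in reads.items():
        hit = best_hit(seq, reference, threshold)
        if hit is None:
            failures.append(rid)
        else:
            otus.setdefault(hit, []).append(rid)
    # De novo step: greedy clustering of the reads that failed to hit the
    # reference. Each read depends on the centroids found so far, so this
    # part is inherently serial.
    new_centroids = {}
    for rid in failures:
        hit = best_hit(reads[rid], new_centroids, threshold)
        if hit is None:
            hit = f"denovo{len(new_centroids)}"
            new_centroids[hit] = reads[rid]
        otus.setdefault(hit, []).append(rid)
    return otus


reference = {"ref1": "ACGTACGTAC"}
reads = {
    "r1": "ACGTACGTAC",
    "r2": "TTTTTTTTTT",
    "r3": "TTTTTTTTTA",
    "r4": "ACGTACGTAA",
}
print(open_reference_otus(reads, reference, threshold=0.9))
```

In this sketch, a closed-reference run would discard `r2` and `r3`, while the open-reference run keeps them in a new cluster, which is the benefit the abstract attributes to clustering all reads.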

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411v1 ◽

2014 ◽

Author(s):

Jai Ram Rideout ◽

Yan He ◽

Jose Antonio Navas-Molina ◽

William A Walters ◽

Luke K Ursell ◽

...

Keyword(s):

16S Rrna ◽

De Novo ◽

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Reference Database ◽

Computationally Efficient ◽

Highly Correlated ◽

Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because parts of our algorithm can be run in parallel, it makes open-reference OTU picking tractable on massive amplicon sequence data sets. We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “legacy” open-reference OTU picking, where less of the process can be parallelized, through comparisons on three well-studied datasets. We therefore recommend that subsampled open-reference OTU picking always be applied in favor of “legacy” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters.

Nomenclature Errors in Public 16S rRNA Gene Reference Databases

10.1101/441576 ◽

2018 ◽

Author(s):

Kyle Lesack ◽

Inanc Birol

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Standard Method ◽

Marker Gene ◽

Rrna Gene ◽

High Quality ◽

Quality Marker ◽

Reference Databases ◽

Reference Sequences ◽

The 16S Rrna Gene

Abstract Background: Targeted gene surveys of the 16S rRNA gene have become a standard method for profiling the membership and biodiversity of microbial communities. These studies rely upon specialized databases that provide reference sequences and their corresponding taxonomic classifications, but few independent evaluations of the nomenclature used in the taxonomic classifications have been performed. Results: Nomenclature data collected from the List of Prokaryotic names with Standing in Nomenclature, Prokaryotic Nomenclature Up-to-Date, and CyanoDB databases were used to validate the nomenclature contained in the taxonomic classifications in the Greengenes, RDP, and SILVA 16S rRNA gene reference databases. Between 82% and 97% of the genus annotations assigned to 16S rRNA gene reference sequences were deemed valid in the reference databases. Between 18% and 97% of the species annotations in Greengenes and SILVA were deemed valid. Misannotations included the use of metadata in place of taxonomic classifications, non-adherence to the binomial nomenclature, and sequences classified as eukaryote organelles or taxa. Conclusions: The misannotations identified in public 16S rRNA gene databases call into question the reliability of research conducted using these resources. As targeted gene surveys depend on high quality marker gene databases, improved nomenclature accuracy will be necessary.
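
The kind of check summarized in the abstract above, comparing database annotations against a list of validly published names, can be sketched as follows. This is an illustration only: the name list, the annotations and both function names are hypothetical, and the regular expression is a crude heuristic for the binomial form rather than the authors' validation procedure.

```python
import re


def fraction_valid(annotations, valid_names):
    """Share of annotations that appear in a set of validly published names."""
    if not annotations:
        raise ValueError("no annotations to check")
    return sum(name in valid_names for name in annotations) / len(annotations)


def looks_binomial(name):
    """Crude format check: capitalized genus followed by a lowercase epithet."""
    return re.fullmatch(r"[A-Z][a-z]+ [a-z]+", name) is not None


# Hypothetical example data, not taken from any of the databases named above.
valid_genera = {"Escherichia", "Bacillus"}
genus_annotations = ["Escherichia", "Bacillus", "uncultured bacterium", "Escherichia"]

print(fraction_valid(genus_annotations, valid_genera))
print(looks_binomial("Escherichia coli"), looks_binomial("soil sample 12"))
```

A format check of this kind only flags annotations that cannot be binomials, such as metadata strings; confirming that a well-formed name is valid still requires the lookup against a nomenclature source.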

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411v2 ◽

2014 ◽

Author(s):

Jai Ram Rideout ◽

Yan He ◽

Jose Antonio Navas-Molina ◽

William A Walters ◽

Luke K Ursell ◽

...

Keyword(s):

16S Rrna ◽

De Novo ◽

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Reference Database ◽

Data Sets ◽

Computationally Efficient ◽

Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that “classic” open-reference OTU picking would require. Through comparisons on three well-studied datasets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

Genome-based targeted sequencing as a reproducible microbial community profiling assay

10.1101/2020.08.07.241950 ◽

2020 ◽

Author(s):

Jacquelynn Benjamino ◽

Benjamin Leopold ◽

Daniel Phillips ◽

Mark D. Adams

Keyword(s):

16S Rrna ◽

Relative Abundance ◽

Marker Gene ◽

Cost Effective ◽

Reference Database ◽

New Approach ◽

Community Profiling ◽

Curtis Dissimilarity ◽

Stool Specimens ◽

Reference Genomes

Abstract Current sequencing-based methods for profiling microbial communities rely on marker gene (e.g. 16S rRNA) or metagenome shotgun sequencing (mWGS) analysis. We present a new approach based on highly multiplexed oligonucleotide probes designed from reference genomes in a pooled primer-extension reaction during library construction to derive relative abundance data. This approach, termed MA-GenTA: Microbial Abundances from Genome Tagged Analysis, enables quantitative, straightforward, cost-effective microbiome profiling that combines desirable features of both 16S rRNA and mWGS strategies. To test the utility of the MA-GenTA assay, probes were designed for 830 genome sequences representing bacteria present in mouse stool specimens. Comparison of the MA-GenTA data with mWGS data demonstrated excellent correlation down to 0.01% relative abundance and a similar number of organisms detected per sample. Despite the incompleteness of the reference database, NMDS clustering based on the Bray-Curtis dissimilarity metric of sample groups was consistent between MA-GenTA, mWGS and 16S rRNA datasets. MA-GenTA represents a potentially useful new method for microbiome community profiling based on reference genomes.
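
For readers unfamiliar with the Bray-Curtis dissimilarity mentioned in the abstract above, here is a minimal sketch of the standard formula, the sum of absolute abundance differences divided by the total abundance in both samples. The function name and the example counts are illustrative and are not taken from the study.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("samples must cover the same taxa")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both samples are empty")
    # 0 means identical composition, 1 means no shared taxa.
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical counts for three taxa in two samples.
sample_a = [10, 0, 5]
sample_b = [5, 5, 5]
print(bray_curtis(sample_a, sample_b))
```

Here the absolute differences sum to 10 and the combined abundance is 30, so the dissimilarity is 10/30, or about 0.33. An ordination such as NMDS takes a matrix of these pairwise values as its input.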

Metabolic pathways inferred from a bacterial marker gene illuminate ecological changes across South Pacific frontal boundaries

Nature Communications ◽

10.1038/s41467-021-22409-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Eric J. Raes ◽

Kristen Karsh ◽

Swan L. S. Sow ◽

Martin Ostrowski ◽

Mark V. Brown ◽

...

Keyword(s):

16S Rrna ◽

Metabolic Pathways ◽

Low Cost ◽

Marker Gene ◽

South Pacific ◽

Rrna Gene ◽

South Pacific Ocean ◽

Bacterial Marker ◽

Gene 16S Rrna ◽

Gene Data

Abstract Global oceanographic monitoring initiatives originally measured abiotic essential ocean variables but are currently incorporating biological and metagenomic sampling programs. There is, however, a large knowledge gap on how to infer bacterial functions, the information sought by biogeochemists, ecologists, and modelers, from the bacterial taxonomic information (produced by bacterial marker gene surveys). Here, we provide a correlative understanding of how a bacterial marker gene (16S rRNA) can be used to infer latitudinal trends for metabolic pathways in global monitoring campaigns. From a transect spanning 7000 km in the South Pacific Ocean we infer ten metabolic pathways from 16S rRNA gene sequences and 11 corresponding metagenome samples, which relate to metabolic processes of primary productivity, temperature-regulated thermodynamic effects, coping strategies for nutrient limitation, energy metabolism, and organic matter degradation. This study demonstrates that low-cost, high-throughput bacterial marker gene data can be used to infer shifts in the metabolic strategies at the community scale.

Exploring protocol bias in airway microbiome studies: one versus two PCR steps and 16S rRNA gene region V3 V4 versus V4

BMC Genomics ◽

10.1186/s12864-020-07252-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Christine Drengenes ◽

Tomas M. L. Eagan ◽

Ingvild Haaland ◽

Harald G. Wiker ◽

Rune Nielsen

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Bacterial Load ◽

Marker Gene ◽

Gene Region ◽

Lower Airway ◽

Rrna Gene ◽

Wide Range ◽

Airway Microbiome ◽

The Impact

Abstract Background: Studies on the airway microbiome have been performed using a wide range of laboratory protocols for high-throughput sequencing of the bacterial 16S ribosomal RNA (16S rRNA) gene. We sought to determine the impact of the number of polymerase chain reaction (PCR) steps (1 or 2 steps) and the choice of target marker gene region (V3 V4 and V4) on the presentation of the upper and lower airway microbiome. Our analyses included Illumina MiSeq sequencing following three setups: Setup 1 (2-step PCR; V3 V4 region), Setup 2 (2-step PCR; V4 region), Setup 3 (1-step PCR; V4 region). Samples included oral wash, protected specimen brushes and protected bronchoalveolar lavage (healthy and obstructive lung disease), and negative controls. Results: The number of sequences and amplicon sequence variants (ASVs) decreased in the order setup 1 > setup 2 > setup 3. This trend appeared to be associated with an increased taxonomic resolution when sequencing the V3 V4 region (setup 1) and an increased number of small ASVs in setups 1 and 2. The latter was considered a result of contamination in the two-step PCR protocols as well as sequencing across multiple runs (setup 1). Although the genera Streptococcus, Prevotella, Veillonella and Rothia dominated, differences in relative abundance were observed across all setups. Analyses of beta-diversity revealed that while oral wash samples (high biomass) clustered together regardless of the number of PCR steps, samples from the lungs (low biomass) separated. The removal of contaminants identified using the Decontam package in R did not resolve differences in results between sequencing setups. Conclusions: Differences in the number of PCR steps will have an impact on final bacterial community descriptions, and more so for samples of low bacterial load.
Our findings could not be explained by differences in contamination levels alone, and more research is needed to understand how variations in PCR-setups and reagents may be contributing to the observed protocol bias.

Study on Yang-Xu Using Body Constitution Questionnaire and Blood Variables in Healthy Volunteers

Evidence-based Complementary and Alternative Medicine ◽

10.1155/2016/9437382 ◽

2016 ◽

Vol 2016 ◽

pp. 1-7 ◽

Cited By ~ 7

Author(s):

Hong-Jhang Chen ◽

Yii-Jeng Lin ◽

Pei-Chen Wu ◽

Wei-Hsiang Hsu ◽

Wan-Chung Hu ◽

...

Keyword(s):

Healthy Subjects ◽

Logistic Regression Model ◽

Cross Validation ◽

Blood Biomarkers ◽

Metabolic Characteristics ◽

Body Constitution ◽

Leave One Out ◽

The Relationship ◽

Fold Cross Validation ◽

Blood Variables

Traditional Chinese medicine (TCM) formulates treatment according to body constitution (BC) differentiation. Different constitutions have specific metabolic characteristics and different susceptibility to certain diseases. This study aimed to assess the Yang-Xu constitution using a body constitution questionnaire (BCQ) and clinical blood variables. A BCQ was employed to assess the clinical manifestation of Yang-Xu. A logistic regression model was fitted to explore the relationship between BC scores and biomarkers. Leave-one-out cross-validation (LOOCV) and K-fold cross-validation were performed to evaluate the accuracy of a predictive model in practice. Decision trees (DTs) were constructed to determine the possible relationships between blood biomarkers and BC scores. According to the BCQ analysis, 49% of participants without any BC were classified as healthy subjects. Among them, 130 samples were selected for further analysis and divided into two groups. One group comprised healthy subjects without any BC (68%), while subjects of the other group, named the sub-healthy group, had three BCs (32%). Six biomarkers, CRE, TSH, HB, MONO, RBC, and LH, were found to have the greatest impact on BCQ outcomes in Yang-Xu subjects. This study indicated significant biochemical differences in Yang-Xu subjects, which may provide a connection between blood variables and the Yang-Xu BC.
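
The two validation schemes named in the abstract above differ only in how samples are held out, which a short sketch can make concrete. The splitting function below is a generic illustration with a hypothetical name; it does not reproduce the study's model, data or fold assignment.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation over n samples.

    Setting k equal to n gives leave-one-out cross-validation (LOOCV).
    """
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


# Two-fold split of five samples.
for train, test in k_fold_indices(5, 2):
    print("train", train, "test", test)

# LOOCV: every sample is the test set exactly once.
print(sum(1 for _ in k_fold_indices(5, 5)))
```

In each round a model would be fitted on the `train` indices and scored on the `test` indices, and the scores averaged across rounds. In practice the sample order is usually shuffled before splitting.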

PSXII-22 Genomic prediction accuracy for feed efficiency related traits using different pseudo-phenotypes, prediction and validation methods in Nellore cattle

Journal of Animal Science ◽

10.1093/jas/skaa278.446 ◽

2020 ◽

Vol 98 (Supplement_4) ◽

pp. 245-246

Author(s):

Cláudio U Magnabosco ◽

Fernando Lopes ◽

Valentina Magnabosco ◽

Raysildo Lobo ◽

Leticia Pereira ◽

...

Keyword(s):

Body Weight ◽

Weight Gain ◽

Genomic Prediction ◽

Feed Efficiency ◽

Prediction Accuracy ◽

Body Weight Gain ◽

Prediction Methods ◽

Genomic Breeding ◽

Validation Population ◽

Nellore Cattle

Abstract The aim of the study was to evaluate prediction methods, validation approaches and pseudo-phenotypes for the prediction of the genomic breeding values of feed efficiency related traits in Nellore cattle. It used the phenotypic and genotypic information of 4,329 and 3,594 animals, respectively, which were tested for residual feed intake (RFI), dry matter intake (DMI), feed efficiency (FE), feed conversion ratio (FCR), residual body weight gain (RG), and residual intake and body weight gain (RIG). Six prediction methods were used: ssGBLUP, BayesA, BayesB, BayesCπ, BLASSO, and BayesR. Three validation approaches were used: 1) random: the data were randomly divided into ten subsets and validation was done on one subset at a time; 2) age: the division into the training (2010 to 2016) and validation (2017) populations was based on the year of birth; 3) genetic breeding value (EBV) accuracy: the data were split so that the training population comprised animals with accuracy above 0.45 and the validation population those below 0.45. We checked the accuracy and bias of the genomic breeding value (GEBV). The results showed that the GEBV accuracy was the highest when the prediction was obtained with ssGBLUP (0.05 to 0.31) (Figure 1). The low heritability obtained, mainly for FE (0.07 ± 0.03) and FCR (0.09 ± 0.03), limited the GEBV accuracy, which ranged from low to moderate. The regression coefficient estimates were close to 1, and similar between the prediction methods, validation approaches, and pseudo-phenotypes. The cross-validation presented the most accurate predictions ranging from 0.07 to 0.037. The prediction accuracy was higher for phenotype adjusted for fixed effects than for EBV and EBV deregressed (30.0 and 34.3%, respectively). Genomic prediction can provide a reliable estimate of genomic breeding values for RFI, DMI, RG and RIG, and these traits may achieve higher genetic gain than FE and FCR.

PSIX-3 Comparison of 16S rRNA gene profiles of rumen microbiome from Raramuri Criollo, European and Criollo x European lineages

Journal of Animal Science ◽

10.1093/jas/skab235.789 ◽

2021 ◽

Vol 99 (Supplement_3) ◽

pp. 441-442

Author(s):

Adrian Maynez-Perez ◽

Francisco Jahuey-Martinez ◽

Jose A Martinez-Quintana ◽

Michael E Hume ◽

Robin C Anderson ◽

...

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Beta Diversity ◽

Chihuahuan Desert ◽

Reference Database ◽

Rrna Gene ◽

Linear Discriminant ◽

Microbiome Composition ◽

Northern Mexico ◽

Rumen Microbiome

Abstract Raramuri Criollo cattle from the Chihuahuan desert in northern Mexico have been described as an ecological ecotype due to their enormous advantage in land grass utilization and their capacity to diversify their diet with cacti, forbs and woody plants. This diversification in diet utilization could be reflected in their microbiome composition. The aim of this study was to characterize the rumen microbiome of Raramuri Criollo cattle and to compare it to other lineages that graze in the same area. A total of 28 cows representing three lineages [Criollo (n = 13), European (n = 9) and Criollo x European Crossbred (n = 6)] were grazed without supplementation for 45 days. DNA was extracted from ruminal samples and the V4 region of the 16S rRNA gene was sequenced on an Illumina platform. Data were analyzed with the QIIME2 software package and DADA2 plugin, and the amplicon sequence variants were taxonomically classified with a naïve Bayesian classifier using the SILVA 16S rRNA gene reference database (version 132). Statistical analysis was performed by ANOVA and PERMANOVA for alpha and beta diversity indexes, respectively, and the non-strict version of linear discriminant analysis effect size (LEfSe) was used to determine significantly different taxa among lineages. Differences in beta diversity indexes (P < 0.05) were found in ruminal microbiome composition between Criollo and European groups, whereas the Crossbred showed intermediate values when compared to the pure breeds (Table 1). LEfSe analysis identified a total of 20 bacterial groups that explained differences between lineages, including one for Crossbred, ten for European and nine for Criollo. These results show ruminal microbiome differences between Raramuri Criollo cattle and the mainstream European breeds used in the northern Mexico Chihuahuan desert and suggest that those differences could be a consequence of dissimilar grazing behavior.