Detecting significant components of microbiomes by random forest with forward variable selection and phylogenetics

2020 ◽  
Author(s):  
Tung Dang ◽  
Hirohisa Kishino

Abstract A central focus of microbiome studies is the characterization of differences in the microbiome composition across groups of samples. A major challenge is the high dimensionality of microbiome datasets, which significantly reduces the power of current approaches for identifying true differences and increases the chance of false discoveries. We have developed a new framework to address these issues by combining (i) identifying a few significant features by a massively parallel forward variable selection procedure, (ii) mapping the selected species on a phylogenetic tree, and (iii) predicting functional profiles by functional gene enrichment analysis from metagenomic 16S rRNA data. We demonstrated the performance of the proposed approach by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The proposed approach improved the accuracy from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. We identified a core set of 96 species that were significantly enriched in CDI and a core set of 75 species that were enriched in CRC. Moreover, although the quality of the data differed for the functional profiles predicted from the 16S rRNA dataset and functional metagenome profiling, our approach performed well for both databases and detected main functions that can be used to diagnose and further study the growth stage of diseases.

2021 ◽  
Author(s):  
Tung Dang ◽  
Hirohisa Kishino

Abstract Background: Random forest (RF) captures complex feature patterns that differentiate groups of samples and is rapidly being adopted in microbiome studies. However, a major challenge is the high dimensionality of microbiome datasets. They include thousands of species or molecular functions of particular biological interest. This high dimensionality significantly reduces the power of random forest approaches for identifying true differences. The widely used Boruta algorithm iteratively removes features that are shown by a statistical test to be less relevant than random probes. Result: We developed a massively parallel forward variable selection algorithm and coupled it with the RF classifier to maximize the predictive performance. The forward variable selection algorithm adds a new variable to the set of selected variables as long as the prespecified criterion of predictive power improves. At each step, the parameters of random forest are optimized. We demonstrated the performance of the proposed approach, which we named RF-FVS, by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The RF-FVS approach further screened the variables that the Boruta algorithm left and improved the accuracy of the random forest classifier from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. Conclusion: Valid variable selection is essential for the analysis of high-dimensional microbiota data. By adopting the Boruta algorithm for pre-screening of the variables, our proposed RF-FVS approach significantly improves the accuracy of random forest with a minimal increase in computational burden. The procedure can be used to identify the functional profiles that differentiate samples between different conditions.
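The forward selection loop described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' RF-FVS implementation: the names `forward_select`, `toy_score`, and `TOY_WEIGHTS` are hypothetical, and the scoring function is a toy stand-in for what would, in practice, be something like the cross-validated accuracy of a tuned random forest on the candidate feature subset.

```python
from typing import Callable, List, Sequence


def forward_select(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy forward selection: add the best feature while the score improves."""
    selected: List[str] = []
    remaining = list(features)
    best = score(selected)  # baseline score with no features selected
    while remaining:
        # Score every candidate extension of the current subset.
        # These evaluations are independent of one another.
        scored = [(score(selected + [f]), f) for f in remaining]
        cand_score, cand = max(scored)
        if cand_score - best <= min_gain:
            break  # no candidate improves the criterion; stop
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected


# Hypothetical toy criterion (an assumption for illustration only):
# each feature contributes a fixed amount, minus a small per-feature penalty.
TOY_WEIGHTS = {"taxon_a": 0.30, "taxon_b": 0.20, "taxon_c": 0.01, "taxon_d": 0.0}


def toy_score(subset: List[str]) -> float:
    return 0.5 + sum(TOY_WEIGHTS[f] for f in subset) - 0.05 * len(subset)


if __name__ == "__main__":
    print(forward_select(list(TOY_WEIGHTS), toy_score))
```

Because each candidate subset is scored independently within a step, that inner evaluation is the natural place to parallelize, which is presumably what makes a massively parallel variant feasible.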


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gongchao Jing ◽  
Yufeng Zhang ◽  
Wenzhi Cui ◽  
Lu Liu ◽  
Jian Xu ◽  
...  

Abstract Background Due to their much lower costs in experiment and computation than metagenomic whole-genome sequencing (WGS), 16S rRNA gene amplicons have been widely used for predicting the functional profiles of microbiomes, via software tools such as PICRUSt 2. However, due to the potential PCR bias and gene profile variation among phylogenetically related genomes, functional profiles predicted from 16S amplicons may deviate from WGS-derived ones, resulting in misleading results. Results Here we present Meta-Apo, which greatly reduces or even eliminates such deviation, thus deducing much more consistent diversity patterns between the two approaches. Tests of Meta-Apo on > 5000 16S-rRNA amplicon human microbiome samples from 4 body sites showed that the deviation between the two strategies is significantly reduced by using only 15 WGS-amplicon training sample pairs. Moreover, Meta-Apo enables cross-platform functional comparison between WGS and amplicon samples, thus greatly improving 16S-based microbiome diagnosis, e.g. the accuracy of gingivitis diagnosis via 16S-derived functional profiles was elevated from 65% to 95% by WGS-based classification. Therefore, with the low cost of 16S-amplicon sequencing, Meta-Apo can produce a reliable, high-resolution view of microbiome function equivalent to that offered by shotgun WGS. Conclusions This suggests that large-scale, function-oriented microbiome sequencing projects can probably benefit from the lower cost of the 16S-amplicon strategy, without sacrificing the precision in functional reconstruction that otherwise requires WGS. An optimized C++ implementation of Meta-Apo is available on GitHub (https://github.com/qibebt-bioinfo/meta-apo) under a GNU GPL license. It takes the functional profiles of a few paired WGS:16S-amplicon samples as training, and outputs the calibrated functional profiles for the much larger number of 16S-amplicon samples.


2013 ◽  
Vol 167 (4) ◽  
pp. 393-403 ◽  
Author(s):  
Jung Soh ◽  
Xiaoli Dong ◽  
Sean M. Caffrey ◽  
Gerrit Voordouw ◽  
Christoph W. Sensen

2018 ◽  
Vol 7 (14) ◽  
Author(s):  
Kyunghoi Kim

Deterioration of sediment quality has been found in the Nakdong River Estuary after large-scale reclamations. Here, I report microbial diversity in sediments of Nakdong River Estuary in the Republic of Korea based on 16S rRNA gene sequencing by next-generation sequencing (NGS) techniques.


Author(s):  
Yiqi Cao ◽  
Baiyu Zhang ◽  
Charles W. Greer ◽  
Kenneth Lee ◽  
Qinhong Cai ◽  
...  

The global increase in marine transportation of dilbit (diluted bitumen) can increase the risk of spills, and the application of chemical dispersants remains a common response practice in spill events. To reliably evaluate dispersant effects on dilbit biodegradation over time, we set up large-scale (1500 mL) microcosms without nutrient addition using a low dilbit concentration (30 ppm). Shotgun metagenomics and metatranscriptomics were deployed to investigate microbial community responses to naturally and chemically dispersed dilbit. We found that the large-scale microcosms could produce more reproducible community trajectories than small-scale (250 mL) ones based on the 16S rRNA gene amplicon sequencing. In the early-stage large-scale microcosms, multiple genera were involved in the biodegradation of dilbit, while dispersant addition enriched primarily Alteromonas and competed for the utilization of dilbit, causing depressed degradation of aromatics. The metatranscriptomic-based metagenome-assembled genomes (MAGs) further elucidated the early-stage microbial antioxidation mechanism, which showed that dispersant addition triggered the increased expression of the antioxidation process genes of Alteromonas species. In contrast, in the late stage, the microbial communities showed high diversity and richness and similar compositions and metabolic functions regardless of dispersant addition, indicating that the biotransformation of remaining compounds can occur within the post-oil communities. These findings can guide future microcosm studies and the application of chemical dispersants for responding to a marine dilbit spill. Importance In this study, we employed microcosms to study the effects of marine dilbit spill and dispersant application on microbial community dynamics over time. We evaluated the impacts of microcosm scale and found that increasing the scale is beneficial for reducing community stochasticity, especially in the late stage of biodegradation. We observed that dispersant application suppressed aromatics biodegradation in the early stage (6 days) while exerting insignificant effects in the late stage (50 days), from both substance removal and metagenomic/metatranscriptomic perspectives. We further found that Alteromonas species are vital for the early-stage chemically dispersed oil biodegradation, and clarified their degradation and antioxidation mechanisms. The findings would help to better understand microcosm studies and microbial roles for biodegrading dilbit and chemically dispersed dilbit, and suggest that dispersant evaluation in large-scale systems and even through field trials would be more realistic after marine oil spill response.


mSystems ◽  
2020 ◽  
Vol 5 (4) ◽  
Author(s):  
Ganesh Babu Malli Mohan ◽  
Ceth W. Parker ◽  
Camilla Urbaniak ◽  
Nitin K. Singh ◽  
Anthony Hood ◽  
...  

ABSTRACT Microbial contamination during long-term confinements of space exploration presents potential risks for both crew members and spacecraft life support systems. A novel swab kit was used to sample various surfaces from a submerged, closed, analog habitat to characterize the microbial populations. Samples were collected from various locations across the habitat which were constructed from various surface materials (linoleum, dry wall, particle board, glass, and metal), and microbial populations were examined by culture, quantitative PCR (qPCR), microbiome 16S rRNA gene sequencing, and shotgun metagenomics. Propidium monoazide (PMA)-treated samples identified the viable/intact microbial population of the habitat. The cultivable microbial population ranged from below the detection limit to 10^6 CFU/sample, and their identity was characterized using Sanger sequencing. Both 16S rRNA amplicon and shotgun sequencing were used to characterize the microbial dynamics, community profiles, and functional attributes (metabolism, virulence, and antimicrobial resistance). The 16S rRNA amplicon sequencing revealed an abundance of viable (after PMA treatment) Actinobacteria (Brevibacterium, Nesterenkonia, Mycobacterium, Pseudonocardia, and Corynebacterium), Firmicutes (Virgibacillus, Staphylococcus, and Oceanobacillus), and Proteobacteria (especially Acinetobacter) on linoleum, dry wall, and particle board (LDP) surfaces, while members of Firmicutes (Leuconostocaceae) and Proteobacteria (Enterobacteriaceae) were high on the glass/metal surfaces. Nonmetric multidimensional scaling determined from both 16S rRNA and metagenomic analyses revealed differential microbial species on LDP surfaces and glass/metal surfaces. The shotgun metagenomic sequencing of samples after PMA treatment showed bacterial predominance of viable Brevibacterium (53.6%), Brachybacterium (7.8%), Pseudonocardia (9.9%), Mycobacterium (3.7%), and Staphylococcus (2.1%), while fungal analyses revealed Aspergillus and Penicillium dominance. IMPORTANCE This study provides the first assessment of monitoring cultivable and viable microorganisms on surfaces within a submerged, closed, analog habitat. The results of the analyses presented herein suggest that the surface material plays a role in microbial community structure, as the microbial populations differed between LDP and metal/glass surfaces. The metal/glass surfaces had a less-complex community, lower bioburden, and more closely resembled the controls. These results indicated that material choice is crucial when building closed habitats, even if they are simply analogs. Finally, while a few species were associated with previously cultivated isolates from the International Space Station and MIR spacecraft, the majority of the microbial ecology of the submerged analog habitat differs greatly from that of previously studied analog habitats.


2015 ◽  
Vol 5 (1) ◽  
Author(s):  
Kirsten A. Ziesemer ◽  
Allison E. Mann ◽  
Krithivasan Sankaranarayanan ◽  
Hannes Schroeder ◽  
Andrew T. Ozga ◽  
...  

Abstract To date, characterization of ancient oral (dental calculus) and gut (coprolite) microbiota has been primarily accomplished through a metataxonomic approach involving targeted amplification of one or more variable regions in the 16S rRNA gene. Specifically, the V3 region (E. coli 341–534) of this gene has been suggested as an excellent candidate for ancient DNA amplification and microbial community reconstruction. However, in practice this metataxonomic approach often produces highly skewed taxonomic frequency data. In this study, we use non-targeted (shotgun metagenomics) sequencing methods to better understand skewed microbial profiles observed in four ancient dental calculus specimens previously analyzed by amplicon sequencing. Through comparisons of microbial taxonomic counts from paired amplicon (V3 U341F/534R) and shotgun sequencing datasets, we demonstrate that extensive length polymorphisms in the V3 region are a consistent and major cause of differential amplification leading to taxonomic bias in ancient microbiome reconstructions based on amplicon sequencing. We conclude that systematic amplification bias confounds attempts to accurately reconstruct microbiome taxonomic profiles from 16S rRNA V3 amplicon data generated using universal primers. Because in silico analysis indicates that alternative 16S rRNA hypervariable regions will present similar challenges, we advocate for the use of a shotgun metagenomics approach in ancient microbiome reconstructions.


2019 ◽  
Author(s):  
Till Robin Lesker ◽  
Abilash Chakravarthy ◽  
Eric. J.C. Gálvez ◽  
Ilias Lagkouvardos ◽  
John F. Baines ◽  
...  

Abstract The vast complexity of host-associated microbial ecosystems requires generation of host-specific gene catalogs to survey the functions and diversity of these communities. We generated a comprehensive resource, the integrated mouse gut metagenome catalog (iMGMC), comprising 4.6 million unique genes and 660 high-quality metagenome-assembled genomes (MAGs) linked to reconstructed full-length 16S rRNA gene sequences. iMGMC enables unprecedented coverage and taxonomic resolution, i.e. more than 89% of the identified taxa are not represented in any other databases. The tool (github.com/tillrobin/iMGMC) allowed characterizing the diversity and functions of prevalent and previously unknown microbial community members along the gastrointestinal tract. Moreover, we show that integration of MAGs and 16S rRNA gene data allows a more accurate prediction of functional profiles of communities than based on 16S rRNA amplicons alone. Integrated gene catalogs such as iMGMC are needed to enhance the resolution of numerous existing and future sequencing-based studies.


2021 ◽  
Author(s):  
Yingnan Gao ◽  
Martin Wu

Background: The 16S rRNA gene has been widely used in microbial diversity studies to determine the community composition and structure. 16S rRNA gene copy number (16S GCN) varies among microbial species and this variation introduces biases to the relative cell abundance estimated using 16S rRNA read counts. To correct the biases, methods (e.g., PICRUSt2) have been developed to predict 16S GCN. 16S GCN predictions come with inherent uncertainty, which is often ignored in the downstream analyses. However, a recent study suggests that the uncertainty can be so great that copy number correction is not justified in practice. Despite the significant implications in 16S rRNA based microbial diversity studies, the uncertainty associated with 16S GCN predictions has not been well characterized and its impact on microbial diversity studies needs to be investigated. Results: Here we develop RasperGade16S, a novel method and software to better model and capture the inherent uncertainty in 16S rRNA GCN prediction. RasperGade16S implements a maximum likelihood framework based on a pulsed evolution model and explicitly accounts for intraspecific GCN variation and heterogeneous GCN evolution rates among species. Using cross validation, we show that our method provides robust confidence estimates for the GCN predictions and outperforms PICRUSt2 in both precision and recall. We have predicted GCN for 592605 OTUs in the SILVA database and tested 113842 bacterial communities that represent an exhaustive and diverse list of engineered and natural environments. We found that the prediction uncertainty is small enough for 99% of the communities that 16S GCN correction should improve their compositional and functional profiles estimated using 16S rRNA reads. On the other hand, we found that GCN variation has limited impact on beta-diversity analyses such as PCoA, PERMANOVA and random forest test. Conclusion: We have developed a method to accurately account for uncertainty in 16S rRNA GCN predictions and the downstream analyses. For almost all 16S rRNA surveyed bacterial communities, correction of 16S GCN should improve the results when estimating their compositional and functional profiles. However, such correction is not necessary for beta-diversity analyses.
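The basic copy number correction that this abstract refers to can be illustrated with a short Python sketch. It shows only the standard point-estimate correction (divide each taxon's read count by its 16S GCN, then renormalize); it does not reproduce RasperGade16S or its uncertainty model. The function name `gcn_correct` and the example counts and copy numbers are hypothetical.

```python
from typing import Dict


def gcn_correct(read_counts: Dict[str, float],
                copy_numbers: Dict[str, float]) -> Dict[str, float]:
    """Estimate relative cell abundance from 16S read counts.

    Each read count is divided by the taxon's 16S gene copy number,
    and the results are renormalized to sum to 1.
    """
    cells = {}
    for taxon, reads in read_counts.items():
        gcn = copy_numbers[taxon]
        if gcn <= 0:
            raise ValueError(f"copy number for {taxon} must be positive")
        cells[taxon] = reads / gcn
    total = sum(cells.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {taxon: value / total for taxon, value in cells.items()}


if __name__ == "__main__":
    # Hypothetical community: otu_1 carries 7 copies of the 16S gene, otu_2 one copy.
    reads = {"otu_1": 700, "otu_2": 300}
    gcn = {"otu_1": 7, "otu_2": 1}
    print(gcn_correct(reads, gcn))
```

In this made-up example, otu_1 accounts for 70% of the reads but only 25% of the estimated cells, which is the kind of bias the correction is meant to remove. An error in the predicted copy number propagates directly into the corrected abundance, which is why the uncertainty of GCN predictions matters.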


2021 ◽  
Vol 8 ◽  
Author(s):  
Yuanfeng Liu ◽  
Xiang Li ◽  
Yudie Yang ◽  
Ye Liu ◽  
Shijun Wang ◽  
...  

The gastrointestinal tract, the largest human microbial reservoir, is highly dynamic. The gut microbes play essential roles in causing colorectal diseases. In the present study, we explored potential keystone taxa during the development of colorectal diseases in central China. Fecal samples were collected from patients and allocated to the adenoma (Group A), colorectal cancer (Group C), and hemorrhoid (Group H) groups. The 16S rRNA amplicon and shallow metagenomic sequencing (SMS) strategies were used to recover the gut microbiota. Microbial diversities obtained from 16S rRNA amplicon and SMS data were similar. Group C had the highest diversity, although no significant difference in diversity was observed among the groups. The most dominant phyla in the gut microbiota of patients with colorectal diseases were Bacteroidetes, Firmicutes, and Proteobacteria, accounting for >95% of microbes in the samples. The most abundant genera in the samples were Bacteroides, Prevotella, and Escherichia/Shigella, and further species-level and network analyses identified certain potential keystone taxa in each group. Some of the dominant species, such as Prevotella copri, Bacteroides dorei, and Bacteroides vulgatus, could be responsible for causing colorectal diseases. The SMS data recovered diverse antibiotic resistance genes of tetracycline, macrolide, and beta-lactam, which could be a result of antibiotic overuse. This study explored the gut microbiota of patients with three different types of colorectal diseases, and the microbial diversity results obtained from 16S rRNA amplicon sequencing and SMS data were found to be similar. However, the findings of this study are based on a limited sample size, which warrants further large-scale studies. The recovery of gut microbiota profiles in patients with colorectal diseases could be beneficial for future diagnosis and treatment with modulation of the gut microbiota. Moreover, SMS data can provide accurate species- and gene-level information, and it is economical. It can therefore be widely applied in future clinical metagenomic studies.

