Forward variable selection improves the power of random forest for high-dimensional microbiome data

Author(s):  
Tung Dang ◽  
Hirohisa Kishino

Abstract Background: Random forest (RF) captures complex feature patterns that differentiate groups of samples and is rapidly being adopted in microbiome studies. However, a major challenge is the high dimensionality of microbiome datasets, which include thousands of species or molecular functions of particular biological interest. This high dimensionality significantly reduces the power of random forest approaches for identifying true differences. The widely used Boruta algorithm iteratively removes features that a statistical test shows to be less relevant than random probes. Result: We developed a massively parallel forward variable selection algorithm and coupled it with the RF classifier to maximize predictive performance. The forward variable selection algorithm adds a new variable to the set of selected variables as long as doing so improves the prespecified criterion of predictive power. At each step, the parameters of the random forest are optimized. We demonstrated the performance of the proposed approach, which we named RF-FVS, by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The RF-FVS approach further screened the variables retained by the Boruta algorithm and improved the accuracy of the random forest classifier from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. Conclusion: Valid variable selection is essential for the analysis of high-dimensional microbiota data. By adopting the Boruta algorithm for pre-screening of the variables, our proposed RF-FVS approach significantly improves the accuracy of random forest with a minimal increase in computational burden. The procedure can be used to identify the functional profiles that differentiate samples between different conditions.
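The greedy loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' RF-FVS implementation: the function name `forward_select`, the `min_gain` stopping threshold, and the toy scoring function are all hypothetical. In a real analysis, `score` would presumably be something like the cross-validated accuracy of a tuned random forest on the candidate feature subset, which is an assumption here.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 1e-9,
) -> Tuple[List[str], float]:
    """Greedy forward selection: add one variable at a time while the score improves.

    `score` is a stand-in for the prespecified criterion of predictive power.
    """
    selected: List[str] = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Each trial is independent of the others, so this is the step
        # that could be evaluated in parallel.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break  # no remaining variable improves the criterion
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy criterion (hypothetical): each feature contributes a fixed gain.
TOY_GAIN = {"a": 0.3, "b": 0.2, "c": 0.0, "d": -0.1}


def toy_score(subset: List[str]) -> float:
    return 0.5 + sum(TOY_GAIN[f] for f in subset)


chosen, final_score = forward_select(list(TOY_GAIN), toy_score)
```

With the toy criterion, the loop picks the two informative features and stops once the best remaining candidate adds nothing.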

2020 ◽  
Author(s):  
Tung Dang ◽  
Hirohisa Kishino

Abstract A central focus of microbiome studies is the characterization of differences in the microbiome composition across groups of samples. A major challenge is the high dimensionality of microbiome datasets, which significantly reduces the power of current approaches for identifying true differences and increases the chance of false discoveries. We have developed a new framework to address these issues by combining (i) identifying a few significant features by a massively parallel forward variable selection procedure, (ii) mapping the selected species on a phylogenetic tree, and (iii) predicting functional profiles by functional gene enrichment analysis from metagenomic 16S rRNA data. We demonstrated the performance of the proposed approach by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The proposed approach improved the accuracy from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. We identified a core set of 96 species that were significantly enriched in CDI and a core set of 75 species that were enriched in CRC. Moreover, although the quality of the data differed for the functional profiles predicted from the 16S rRNA dataset and functional metagenome profiling, our approach performed well for both databases and detected main functions that can be used to diagnose and further study the growth stage of diseases.


2021 ◽  
pp. 1-15
Author(s):  
Zhaozhao Xu ◽  
Derong Shen ◽  
Yue Kou ◽  
Tiezheng Nie

Because medical data are high-dimensional and their features are strongly correlated, classification accuracy on such data is often lower than expected. Feature selection is a common approach to this problem: it selects effective features and thereby reduces the dimensionality of the data. However, traditional feature selection algorithms set their thresholds blindly, and their search algorithms are liable to fall into local optima. To address this, this paper proposes a hybrid feature selection algorithm combining ReliefF and Particle Swarm Optimization. The algorithm is divided into three parts. First, ReliefF is used to calculate the feature weights, and the features are ranked by weight. Then the ranked features are grouped by density equalization, so that the density of features in each group is the same. Finally, the Particle Swarm Optimization algorithm is used to search the ranked feature groups, and feature selection is performed according to a new fitness function. Experimental results show that random forest achieves the highest classification accuracy on the selected features and, more importantly, does so with the fewest features. In addition, experimental results on 2 medical datasets show that the average accuracy of random forest reaches 90.20%, which indicates that the hybrid algorithm has practical application value.
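The first stage above (weighting features and ranking them) can be illustrated with a small sketch. Note the assumptions: this is the basic Relief rule for two classes with a single nearest hit and nearest miss per sample, not the full ReliefF used in the paper, and the density-equalization grouping and Particle Swarm Optimization stages are not shown. The function name `relief_weights` and the toy data are hypothetical.

```python
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Basic Relief weights (simplified stand-in for ReliefF).

    A feature gains weight when it differs on the nearest sample of the
    other class (miss) and loses weight when it differs on the nearest
    sample of the same class (hit).
    """
    n, d = len(X), len(X[0])
    # Per-feature range, used to put all features on a comparable scale.
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(abs(a[j] - b[j]) / spans[j] for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            diff_miss = abs(X[i][j] - X[miss][j]) / spans[j]
            diff_hit = abs(X[i][j] - X[hit][j]) / spans[j]
            weights[j] += (diff_miss - diff_hit) / n
    return weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X_toy = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5], [0.9, 0.3], [1.0, 0.8], [0.8, 0.6]]
y_toy = [0, 0, 0, 1, 1, 1]

w = relief_weights(X_toy, y_toy)
ranking = sorted(range(len(w)), key=lambda j: w[j], reverse=True)
```

On the toy data the class-separating feature receives a positive weight and ranks first, while the noise feature receives a negative weight.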


Author(s):  
Yiqi Cao ◽  
Baiyu Zhang ◽  
Charles W. Greer ◽  
Kenneth Lee ◽  
Qinhong Cai ◽  
...  

The global increase in marine transportation of dilbit (diluted bitumen) can increase the risk of spills, and the application of chemical dispersants remains a common response practice in spill events. To reliably evaluate dispersant effects on dilbit biodegradation over time, we set up large-scale (1500 mL) microcosms without nutrient addition, using a low dilbit concentration (30 ppm). Shotgun metagenomics and metatranscriptomics were deployed to investigate microbial community responses to naturally and chemically dispersed dilbit. Based on 16S rRNA gene amplicon sequencing, we found that the large-scale microcosms produced more reproducible community trajectories than small-scale (250 mL) ones. In the early-stage large-scale microcosms, multiple genera were involved in the biodegradation of dilbit, while dispersant addition primarily enriched Alteromonas and competed for the utilization of dilbit, causing depressed degradation of aromatics. Metatranscriptomics-based metagenome-assembled genomes (MAGs) further elucidated the early-stage microbial antioxidation mechanism, showing that dispersant addition triggered increased expression of the antioxidation process genes of Alteromonas species. In contrast, in the late stage, the microbial communities showed high diversity and richness and similar compositions and metabolic functions regardless of dispersant addition, indicating that biotransformation of the remaining compounds can occur within the post-oil communities. These findings can guide future microcosm studies and the application of chemical dispersants in responding to a marine dilbit spill.
Importance: In this study, we employed microcosms to study the effects of a marine dilbit spill and dispersant application on microbial community dynamics over time. We evaluated the impacts of microcosm scale and found that increasing the scale is beneficial for reducing community stochasticity, especially in the late stage of biodegradation. We observed that dispersant application suppressed aromatics biodegradation in the early stage (6 days) while exerting insignificant effects in the late stage (50 days), from both substance-removal and metagenomic/metatranscriptomic perspectives. We further found that Alteromonas species are vital for the early-stage biodegradation of chemically dispersed oil, and clarified their degradation and antioxidation mechanisms. The findings would help to better understand microcosm studies and microbial roles in biodegrading dilbit and chemically dispersed dilbit, and suggest that dispersant evaluation in large-scale systems, and even through field trials, would be more realistic for marine oil spill response.


2019 ◽  
Author(s):  
Nooshin Shomal Zadeh ◽  
Sangdi Lin ◽  
George C Runger

Abstract Motivation Matched case–control analysis is widely used in biomedical studies to identify exposure variables associated with health conditions. Matching is used to improve efficiency. Existing variable selection methods for matched case–control studies are challenged in high-dimensional settings where interactions among variables are also important. We describe a quite different method for high-dimensional matched case–control data, based on the potential outcome model, which is not only flexible regarding the number of matching and exposure variables but also able to detect interaction effects. Results We present Matched Forest (MF), an algorithm for variable selection in matched case–control data. The method preserves the case and control values in each instance but transforms the matched case–control data with added counterfactuals. A modified variable importance score from a supervised learner is used to detect important variables. The method is conceptually simple and can be applied with widely available software tools. Simulation studies show the effectiveness of MF in identifying important variables. MF is also applied to data from the biomedical domain and its performance is compared with alternative approaches. Availability and implementation R code for implementing MF is available at https://github.com/NooshinSh/Matched_Forest. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Tung Dang ◽  
Kie Kumaishi ◽  
Erika Usui ◽  
Shungo Kobori ◽  
Takumi Sato ◽  
...  

Abstract Background: The rapid and accurate identification of a minimal-size core set of representative microbial species plays an important role in the clustering of microbial community data and the interpretation of the clustering results. However, the huge dimensionality of microbial metagenomics data sets is a major challenge for existing methods such as Dirichlet multinomial mixture (DMM) models. In the framework of the existing methods, the computational burden of identifying a small number of representative species from a huge number of observed species remains a challenge. Results: We proposed a novel framework to improve the performance of the widely used DMM approach by combining three ideas: (i) We extended the finite DMM model to the infinite case by considering Dirichlet process mixtures, and estimated the number of clusters as a random variable. (ii) We proposed an indicator variable to identify representative operational taxonomic units that substantially contribute to the differentiation among clusters. (iii) To address the computational burden of high-dimensional microbiome data, we proposed stochastic variational inference, which approximates the posterior distribution using a controllable distribution called the variational distribution, together with stochastic optimization algorithms for fast computation. With the proposed method, named stochastic variational variable selection (SVVS), we analyzed the root microbiome data collected in our soybean field experiment and the human gut microbiome data from three published data sets of large-scale case-control studies. Conclusions: SVVS demonstrated better performance and significantly faster computation than existing methods on all test data sets. In particular, SVVS is the only method that can analyze massive high-dimensional microbial data with more than 50,000 microbial species and 1,000 samples. Furthermore, it was suggested that the microbial species selected as a core set played important roles in recent microbiome studies.


2019 ◽  
Vol 30 (3) ◽  
pp. 697-719 ◽  
Author(s):  
Fan Wang ◽  
Sach Mukherjee ◽  
Sylvia Richardson ◽  
Steven M. Hill

Abstract Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a “no panacea” view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
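For orientation, three of the penalties named in this abstract can be written in a common form for linear regression. The notation is an assumption and does not come from the abstract: a response vector $y \in \mathbb{R}^n$, a design matrix $X \in \mathbb{R}^{n \times p}$, tuning parameters $\lambda, \lambda_1, \lambda_2 \ge 0$, and a $1/(2n)$ scaling of the loss, which is one of several conventions in use.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \;
  \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + P(\beta),
\qquad
P(\beta) =
\begin{cases}
  \lambda \lVert \beta \rVert_1 & \text{(Lasso)} \\
  \lambda \lVert \beta \rVert_2^2 & \text{(Ridge Regression)} \\
  \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 & \text{(Elastic Net)}
\end{cases}
```

The $\ell_1$ term can set coefficients exactly to zero, which is why the Lasso and Elastic Net perform variable selection, whereas the ridge penalty only shrinks coefficients.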


2018 ◽  
Author(s):  
Colleen Molloy Farrelly

This paper gives an overview of the variable selection problem in high-dimensional settings, with a particular focus on genetic psychiatry and genetic epidemiology in general. Genetic and quantum evolutionary algorithms, tree-based classification/regression models, random forest, and other approaches are detailed. The paper concludes with a roadmap for a new algorithm and a two-stage selection methodology.


2020 ◽  
Author(s):  
Nicholas D. Youngblut ◽  
Jacobo de la Cuesta-Zuluaga ◽  
Ruth E. Ley

Abstract Tree-based diversity measures incorporate phylogenetic or phenotypic relatedness into comparisons of microbial communities. This improves the identification of explanatory factors compared to tree-agnostic diversity measures. However, applying tree-based diversity measures to metagenome data is more challenging than for single-locus sequencing (e.g., 16S rRNA gene). The Genome Taxonomy Database (GTDB) provides a genome-based reference database that can be used for species-level metagenome profiling, and a multi-locus phylogeny of all genomes that can be employed for diversity calculations. Moreover, traits can be inferred from the genomic content of each representative, allowing for trait-based diversity measures. Still, it is unclear how metagenome-based assessments of microbiome diversity benefit from incorporating phylogeny or phenotype into measures of diversity. We assessed this by measuring phylogeny-based, trait-based, and tree-agnostic diversity measures from a large, global collection of human gut metagenomes composed of 33 studies and 3348 samples. We found phylogeny- and trait-based alpha diversity to better differentiate samples by westernization, age, and gender. PCoA ordinations of phylogeny- or trait-based weighted UniFrac explained more variance than tree-agnostic measures, which was largely a result of these measures emphasizing inter-phylum differences between Bacteroidaceae (Bacteroidota) and Enterobacteriaceae (Proteobacteria) versus just differences within Bacteroidaceae (Bacteroidota). The disease state of samples was better explained by tree-based weighted UniFrac, especially the presence of Shiga toxin-producing E. coli (STEC) and hypertension. Our findings show that metagenome diversity estimation benefits from incorporating a genome-derived phylogeny or traits.
Importance: Estimates of microbiome diversity are fundamental to understanding spatiotemporal changes of microbial communities and identifying which factors mediate such changes. Tree-based measures of diversity are widespread in amplicon-based microbiome studies due to their utility relative to tree-agnostic measures; however, tree-based measures are seldom applied to shotgun metagenomics data. We evaluated the utility of phylogeny-based, trait-based, and tree-agnostic diversity measures on a large-scale human gut metagenome dataset to help guide researchers with the complex task of evaluating microbiome diversity via metagenomics.
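The difference between a tree-agnostic and a tree-based alpha diversity measure can be illustrated with a toy sketch. The choice of measures is an assumption: the abstract does not name specific alpha diversity indices, so the Shannon index and Faith's phylogenetic diversity are used here only as common examples of each class. The tree encoding (a child-to-parent dictionary with branch lengths) and all sample data are hypothetical.

```python
import math
from typing import Dict, Tuple


def shannon(counts: Dict[str, int]) -> float:
    """Tree-agnostic alpha diversity: Shannon index from taxon counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def faith_pd(counts: Dict[str, int], parent: Dict[str, Tuple[str, float]]) -> float:
    """Tree-based alpha diversity: total branch length spanned by present taxa.

    `parent` maps each non-root node to (parent node, branch length).
    """
    seen = set()
    total = 0.0
    for taxon, count in counts.items():
        if count <= 0:
            continue
        node = taxon
        # Walk toward the root, counting each branch only once.
        while node in parent and node not in seen:
            seen.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total


# Toy tree (hypothetical): A and B are close relatives, C is distant.
toy_tree = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 2.0),
    "C": ("root", 3.0),
}
sample_close = {"A": 5, "B": 5, "C": 0}    # two closely related taxa
sample_distant = {"A": 5, "B": 0, "C": 5}  # two distantly related taxa
```

Both toy samples contain two equally abundant taxa, so the Shannon index cannot tell them apart, while the tree-based measure is larger for the sample whose taxa are more distantly related.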

