Diversity Forests: Using Split Sampling to Enable Innovative Complex Split Procedures in Random Forests

Abstract. The diversity forest algorithm is an alternative candidate node split sampling scheme that makes innovative complex split procedures in random forests possible. While conventional univariable, binary splitting suffices for obtaining strong predictive performance, new complex split procedures can help tackle practically important issues. For example, interactions between features can be exploited effectively by bivariable splitting. With diversity forests, each split is selected from a candidate split set that is sampled in the following way: for $$l = 1, \dots, \mathit{nsplits}$$: (1) sample one split problem; (2) sample a single or few splits from the split problem sampled in (1) and add this or these splits to the candidate split set. The split problems are specifically structured collections of splits that depend on the respective split procedure considered. This sampling scheme makes innovative complex split procedures computationally tangible while avoiding overfitting. Important general properties of the diversity forest algorithm are evaluated empirically using univariable, binary splitting. Based on 220 data sets with binary outcomes, diversity forests are compared with conventional random forests and random forests using extremely randomized trees. It is seen that the split sampling scheme of diversity forests does not impair the predictive performance of random forests and that the performance is quite robust with regard to the specified nsplits value. The recently developed interaction forests are the first diversity forest method that uses a complex split procedure. Interaction forests allow modeling and detecting interactions between features effectively. Further potential complex split procedures are discussed as an outlook.
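
The two-step sampling loop described in the abstract can be sketched for univariable, binary splitting. This is a minimal illustration under stated assumptions, not the authors' implementation: a "split problem" is taken to be a single feature, one split point is drawn per sampled problem, midpoints between adjacent observed values are used as thresholds, and weighted Gini impurity is the split criterion. All function names are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def sample_candidate_splits(X, nsplits, rng):
    """Draw up to nsplits candidate splits as (feature index, threshold) pairs."""
    n_features = len(X[0])
    candidates = set()
    for _ in range(nsplits):
        j = rng.randrange(n_features)        # (1) sample one split problem (assumed: a feature)
        values = sorted({row[j] for row in X})
        if len(values) < 2:
            continue                         # constant feature in this node: nothing to sample
        k = rng.randrange(len(values) - 1)   # (2) sample one split from that split problem
        candidates.add((j, (values[k] + values[k + 1]) / 2.0))
    return candidates


def best_split(X, y, nsplits, rng):
    """Return the sampled candidate with the lowest weighted Gini impurity."""
    best, best_score = None, float("inf")
    for j, t in sorted(sample_candidate_splits(X, nsplits, rng)):
        left = [yi for row, yi in zip(X, y) if row[j] <= t]
        right = [yi for row, yi in zip(X, y) if row[j] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best, best_score = (j, t), score
    return best, best_score
```

Because the candidate set is sampled rather than enumerated, the cost per node is governed by nsplits and not by the total number of possible splits, which is the property the abstract relies on for more complex split procedures.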

SUBiNN: a stacked uni- and bivariate kNN sparse ensemble

Advances in Data Analysis and Classification ◽

10.1007/s11634-021-00462-7 ◽

2021 ◽

Author(s):

Tiffany Elsten ◽

Mark de Rooij

Keyword(s):

Random Forests ◽

Nearest Neighbor ◽

Ensemble Methods ◽

Predictive Performance ◽

Ensemble Classifier ◽

Support Vector ◽

Data Sets ◽

Vector Machines ◽

Lasso Method ◽

Nearest Neighbor Classifiers

Abstract. Nearest Neighbor classification is an intuitive distance-based classification method. It has, however, two drawbacks: (1) it is sensitive to the number of features, and (2) it does not give information about the importance of single features or pairs of features. In stacking, a set of base-learners is combined in one overall ensemble classifier by means of a meta-learner. In this manuscript we combine univariate and bivariate nearest neighbor classifiers that are by themselves easily interpretable. Furthermore, we combine these classifiers by a Lasso method that results in a sparse ensemble of nonlinear main and pairwise interaction effects. We christened the new method SUBiNN: Stacked Uni- and Bivariate Nearest Neighbors. SUBiNN overcomes the two drawbacks of simple nearest neighbor methods. In extensive simulations and using benchmark data sets, we evaluate the predictive performance of SUBiNN and compare it to other nearest neighbor ensemble methods as well as Random Forests and Support Vector Machines. Results indicate that SUBiNN often outperforms other nearest neighbor methods, that SUBiNN is well capable of identifying noise features, but that Random Forests is often, but not always, the best classifier.
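
The base-learner stage of such a stacked ensemble can be sketched as follows. This is an illustration under assumptions, not the SUBiNN implementation: binary 0/1 labels, squared Euclidean distance, a fixed neighbor count k, and leave-one-out predictions as inputs to the meta-learner are all choices made here for brevity, and the function names are hypothetical. The Lasso meta-learner that would be fitted on the resulting matrix is not shown.

```python
from itertools import combinations


def knn_prob(train_X, train_y, query, features, k):
    """Share of class-1 labels among the k nearest training rows, using only `features`."""
    dists = sorted(
        (sum((row[f] - query[f]) ** 2 for f in features), idx)
        for idx, row in enumerate(train_X)
    )
    nearest = [train_y[idx] for _, idx in dists[:k]]
    return sum(nearest) / len(nearest)


def base_learner_subsets(n_features):
    """All single features followed by all feature pairs."""
    singles = [(j,) for j in range(n_features)]
    pairs = list(combinations(range(n_features), 2))
    return singles + pairs


def meta_features(X, y, k):
    """Leave-one-out prediction of every uni- and bivariate kNN learner for every row."""
    subsets = base_learner_subsets(len(X[0]))
    Z = []
    for i, query in enumerate(X):
        rest_X = X[:i] + X[i + 1:]   # hold out row i so it cannot be its own neighbor
        rest_y = y[:i] + y[i + 1:]
        Z.append([knn_prob(rest_X, rest_y, query, s, k) for s in subsets])
    return subsets, Z
```

Each column of Z corresponds to one feature or feature pair, so a sparse meta-learner that zeroes out a column effectively drops that main or interaction effect from the ensemble.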

Development of Machine Learning Models to Predict Probabilities and Types of Stroke at Prehospital Stage: the Japan Urgent Stroke Triage Score Using Machine Learning (JUST-ML)

Translational Stroke Research ◽

10.1007/s12975-021-00937-x ◽

2021 ◽

Author(s):

Kazutaka Uchida ◽

Junichi Kouno ◽

Shinichi Yoshimura ◽

Norito Kinjo ◽

Fumihiro Sakakibara ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forests ◽

Prediction Models ◽

Characteristic Curve ◽

Predictive Performance ◽

Vessel Occlusion ◽

Predictive Values ◽

Training Cohort ◽

Sensitivity Specificity

Abstract. In conjunction with recent advancements in machine learning (ML), such technologies have been applied in various fields owing to their high predictive performance. We aimed to develop a prehospital stroke scale with ML. We conducted a multi-center retrospective and prospective cohort study. The training cohort comprised eight centers in Japan from June 2015 to March 2018, and the test cohort comprised 13 centers from April 2019 to March 2020. We used three different ML algorithms (logistic regression, random forests, XGBoost) to develop models. Main outcomes were large vessel occlusion (LVO), intracranial hemorrhage (ICH), subarachnoid hemorrhage (SAH), and cerebral infarction (CI) other than LVO. The predictive abilities were validated in the test cohort with accuracy, positive predictive value, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and F score. The training cohort included 3178 patients with 337 LVO, 487 ICH, 131 SAH, and 676 CI cases, and the test cohort included 3127 patients with 183 LVO, 372 ICH, 90 SAH, and 577 CI cases. The overall accuracies were 0.65, and the positive predictive values, sensitivities, specificities, AUCs, and F scores were stable in the test cohort. The classification abilities were also fair for all ML models. The AUCs for LVO of logistic regression, random forests, and XGBoost were 0.89, 0.89, and 0.88, respectively, in the test cohort, and these values were higher than those of previously reported prediction models for LVO. The ML models developed to predict the probability and types of stroke at the prehospital stage had superior predictive abilities.
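
Most of the validation metrics named in the abstract follow directly from confusion counts for one outcome against the rest. The sketch below assumes binary 0/1 labels and that "F score" means the F1 score; the function name and return layout are hypothetical, and AUC is omitted because it needs predicted scores, not hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, PPV, sensitivity, specificity and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN, not silently as 0.
        return num / den if den else float("nan")

    ppv = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "ppv": ppv,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }
```

For a multi-class outcome such as LVO, ICH, SAH and CI, one would call this once per class with that class coded as 1 and all others as 0.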

A Methodological Framework to Discover Pharmacogenomic Interactions Based on Random Forests

Genes ◽

10.3390/genes12060933 ◽

2021 ◽

Vol 12 (6) ◽

pp. 933

Author(s):

Salvatore Fasola ◽

Giovanna Cilluffo ◽

Laura Montalbano ◽

Velia Malizia ◽

Giuliana Ferrante ◽

...

Keyword(s):

Random Forests ◽

Cancer Cell Line ◽

Predictive Performance ◽

Computational Time ◽

Gene Interactions ◽

Methodological Framework ◽

Concordance Correlation ◽

Driver Genes ◽

Tumor Tissues ◽

Gene Alterations

The identification of genomic alterations in tumor tissues, including somatic mutations, deletions, and gene amplifications, produces large amounts of data, which can be correlated with a diversity of therapeutic responses. We aimed to provide a methodological framework to discover pharmacogenomic interactions based on Random Forests. We matched two databases: one from the Cancer Cell Line Encyclopaedia (CCLE) project and one from the Genomics of Drug Sensitivity in Cancer (GDSC) project. For a total of 648 shared cell lines, we considered 48,270 gene alterations from CCLE as input features and the area under the dose-response curve (AUC) for 265 drugs from GDSC as the outcomes. A three-step reduction to 501 alterations was performed, selecting known driver genes and excluding very frequent/infrequent alterations and redundant ones. For each model, we used the concordance correlation coefficient (CCC) for assessing the predictive performance, and permutation importance for assessing the contribution of each alteration. In a reasonable computational time (56 min), we identified 12 compounds whose response was at least fairly sensitive (CCC > 20) to the alteration profiles. Some differences were found in the sets of influential alterations, providing clues to discover significant drug-gene interactions. The proposed methodological framework can be helpful for mining pharmacogenomic interactions.
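
The concordance correlation coefficient used for model assessment can be computed as below. This is a sketch assuming Lin's definition with population (1/n) moments; the abstract does not state which estimator variant or scale the authors used, and the function name is hypothetical.

```python
def concordance_correlation(observed, predicted):
    """Lin's concordance correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    # Unlike Pearson's r, the denominator penalises a shift between the two means.
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)
```

The coefficient equals 1 only when predictions match observations exactly, so a model that is perfectly correlated with the outcome but systematically offset still scores below 1.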

Machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District

10.31219/osf.io/csx6r ◽

2019 ◽

Author(s):

Zanya Reubenne D. Omadlao ◽

Nica Magdalena A. Tuguinay ◽

Ricarido Maglaqui Saturay

Keyword(s):

Machine Learning ◽

Daily Rainfall ◽

Predictive Performance ◽

Data Sets ◽

Prediction System ◽

True Positive ◽

Rainfall Thresholds ◽

Cumulative Rainfall ◽

Testing Data ◽

Positive Rate

A machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District is proposed to address the landslide risk due to the climate and topography of Benguet province. It is intended to improve the decision support system for road management with regard to landslides, as implemented by the Department of Public Works and Highways Benguet First District Engineering Office. Supervised classification was applied to daily rainfall and landslide data for the Benguet First Engineering District covering the years 2014 to 2018 using scikit-learn. Various forms of cumulative rainfall values were used to predict landslide occurrence for a given day. Following typical machine learning workflows, the rainfall-landslide data set was divided into training and testing data sets. Machine learning algorithms such as K-Nearest Neighbors, Gaussian Naïve Bayes, Support Vector Machine, Logistic Regression, Random Forest, Decision Tree, and AdaBoost were trained using the training data sets, and the trained models were used to make predictions based on the testing data sets. Predictive performance of the models on the testing data sets was compared using true positive rates, false positive rates, and the area under the Receiver Operating Characteristic Curve. Predictive performance of these models was then compared to that of 1-day cumulative rainfall thresholds commonly used for landslide predictions. Among the machine learning models evaluated, Gaussian Naïve Bayes has the best performance, with mean false positive rate, true positive rate and area under the curve scores of 7%, 76%, and 84% respectively. It also performs better than the 1-day cumulative rainfall thresholds. This research demonstrates the potential of machine learning for identifying temporal patterns in rainfall-induced landslides using minimal data input: daily rainfall from a single synoptic station and highway maintenance records. Such an approach may be tested and applied to similar problems in the field of disaster risk reduction and management.
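
The cumulative rainfall features and the threshold baseline mentioned above can be illustrated in plain Python. This is a sketch under assumptions: the window definition (current day plus preceding days), the threshold rule, and the function names are hypothetical, and the study itself used scikit-learn classifiers on features whose exact form the abstract does not give.

```python
def cumulative_rainfall(daily, window):
    """Rainfall summed over the current day and the preceding window - 1 days."""
    return [sum(daily[max(0, i - window + 1): i + 1]) for i in range(len(daily))]


def threshold_rule(cumulative, threshold):
    """Predict a landslide (1) on days whose cumulative rainfall reaches the threshold."""
    return [1 if c >= threshold else 0 for c in cumulative]


def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # Both classes are assumed to be present, so neither denominator is zero.
    return tp / (tp + fn), fp / (fp + tn)
```

With window set to 1 the feature reduces to the daily value, which corresponds to the 1-day threshold baseline; larger windows give the classifiers information about antecedent rainfall.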

Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets

Journal of Chemical Information and Modeling ◽

10.1021/acs.jcim.6b00753 ◽

2017 ◽

Vol 57 (8) ◽

pp. 1773-1792 ◽

Cited By ~ 27

Author(s):

Richard L. Marchese Robinson ◽

Anna Palczewska ◽

Jan Palczewski ◽

Nathan Kidley

Keyword(s):

Random Forest ◽

Linear Models ◽

Predictive Performance ◽

Data Sets ◽

Benchmark Data

Recommendations for Reporting Machine Learning Analyses in Clinical Research

Circulation Cardiovascular Quality and Outcomes ◽

10.1161/circoutcomes.120.006556 ◽

2020 ◽

Vol 13 (10) ◽

Cited By ~ 1

Author(s):

Laura M. Stevens ◽

Bobak J. Mortazavi ◽

Rahul C. Deo ◽

Lesley Curtis ◽

David P. Kao

Keyword(s):

Machine Learning ◽

Clinical Research ◽

Clinical Experience ◽

Clinical Data ◽

Critical Evaluation ◽

Predictive Performance ◽

Structured Reporting ◽

Data Sets ◽

Overwhelming Evidence ◽

Peer Reviewers

Use of machine learning (ML) in clinical research is growing steadily given the increasing availability of complex clinical data sets. ML presents important advantages in terms of predictive performance and identifying undiscovered subpopulations of patients with specific physiology and prognoses. Despite this popularity, many clinicians and researchers are not yet familiar with evaluating and interpreting ML analyses. Consequently, readers and peer reviewers alike may either overestimate or underestimate the validity and credibility of an ML-based model. Conversely, ML experts without clinical experience may present details of the analysis that are too granular for a clinical readership to assess. Overwhelming evidence has shown poor reproducibility and reporting of ML models in clinical research, suggesting the need for ML analyses to be presented in a clear, concise, and comprehensible manner to facilitate understanding and critical evaluation. We present a recommendation for transparent and structured reporting of ML analysis results specifically directed at clinical researchers. Furthermore, we provide a list of key reporting elements with examples that can be used as a template when preparing and submitting ML-based manuscripts for the same audience.

A weighted random forests approach to improve predictive performance

Statistical Analysis and Data Mining The ASA Data Science Journal ◽

10.1002/sam.11196 ◽

2013 ◽

Vol 6 (6) ◽

pp. 496-505 ◽

Cited By ~ 36

Author(s):

Stacey J. Winham ◽

Robert R. Freimuth ◽

Joanna M. Biernacka

Keyword(s):

Random Forests ◽

Predictive Performance

Dynamic interaction network inference from longitudinal microbiome data

10.1101/430462 ◽

2018 ◽

Cited By ~ 2

Author(s):

Jose Lugo-Martinez ◽

Daniel Ruiz-Perez ◽

Giri Narasimhan ◽

Ziv Bar-Joseph

Keyword(s):

Network Inference ◽

Dynamic Models ◽

Dynamic Bayesian Network ◽

Interaction Network ◽

Predictive Performance ◽

Dynamic Bayesian Networks ◽

Data Sets ◽

Computational Pipeline ◽

Novel Interactions ◽

Microbiome Data

Abstract. Background: Several studies have focused on the microbiota living in environmental niches including human body sites. In many of these studies researchers collect longitudinal data with the goal of understanding not just the composition of the microbiome but also the interactions between the different taxa. However, analysis of such data is challenging and very few methods have been developed to reconstruct dynamic models from time series microbiome data. Results: Here we present a computational pipeline that enables the integration of data across individuals for the reconstruction of such models. Our pipeline starts by aligning the data collected for all individuals. The aligned profiles are then used to learn a dynamic Bayesian network which represents causal relationships between taxa and clinical variables. Testing our methods on three longitudinal microbiome data sets, we show that our pipeline improves upon prior methods developed for this task. We also discuss the biological insights provided by the models, which include several known and novel interactions. Conclusions: We propose a computational pipeline for analyzing longitudinal microbiome data. Our results provide evidence that microbiome alignments coupled with dynamic Bayesian networks improve predictive performance over previous methods and enhance our ability to infer biological relationships within the microbiome and between taxa and clinical factors.

Combining multiple data sources in species distribution models while accounting for spatial dependence and overfitting with combined penalised likelihood maximisation

10.1101/615583 ◽

2019 ◽

Author(s):

Ian W. Renner ◽

Julie Louvrier ◽

Olivier Gimenez

Keyword(s):

Species Distribution ◽

Spatial Dependence ◽

Process Model ◽

Predictive Performance ◽

Species Distribution Modelling ◽

Data Sets ◽

Multiple Data ◽

Log Likelihood ◽

Likelihood Approach ◽

Penalised Likelihood

Summary. The increase in availability of species data sets means that approaches to species distribution modelling that incorporate multiple data sets are in greater demand. Recent methodological developments in this area have led to combined likelihood approaches, in which a log-likelihood comprised of the sum of the log-likelihood components of each data source is maximised. Often, these approaches make use of at least one presence-only data set and use the log-likelihood of an inhomogeneous Poisson point process model in the combined likelihood construction. While these advancements have been shown to improve predictive performance, they do not currently address challenges in presence-only modelling such as checking and correcting for violations of the independence assumption of a Poisson point process model or more general challenges in species distribution modelling such as overfitting. In this paper, we present an extension of the combined likelihood framework which accommodates alternative presence-only likelihoods in the presence of spatial dependence as well as lasso-type penalties to account for potential overfitting. We compare the proposed combined penalised likelihood approach to the standard combined likelihood approach via simulation and apply the method to modelling the distribution of the Eurasian lynx in the Jura Mountains in eastern France. The simulations show that the proposed combined penalised likelihood approach has better predictive performance than the standard approach when spatial dependence is present in the data. The lynx analysis shows that the predicted maps vary significantly between the model fitted with the proposed combined penalised approach accounting for spatial dependence and the model fitted with the standard combined likelihood. This work highlights the benefits of careful consideration of the presence-only components of the combined likelihood formulation, and allows greater flexibility and ability to accommodate real datasets.
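
In generic notation, the combined penalised likelihood described in the summary might be written as follows. This is a sketch under assumptions rather than the paper's formulation: $K$ data sources with log-likelihood components $\ell_k$, a coefficient vector $\boldsymbol{\beta}$ of length $p$ shared across sources, and a single lasso tuning parameter $\lambda \ge 0$; the actual parameterisation and penalty weights may differ.

$$\ell_{\mathrm{pen}}(\boldsymbol{\beta}) = \sum_{k=1}^{K} \ell_k(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{p} |\beta_j|$$

Setting $\lambda = 0$ recovers the standard combined likelihood, in which only the sum of the per-source log-likelihoods is maximised.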

Comparative Analysis of Independent Ex Vivo functional Drug Screens Identifies Predictive Biomarkers of BCL-2 Inhibitor Response in AML

Blood ◽

10.1182/blood-2018-99-111916 ◽

2018 ◽

Vol 132 (Supplement 1) ◽

pp. 2763-2763 ◽

Cited By ~ 1

Author(s):

Brian S. White ◽

Suleiman A. Khan ◽

Muhammad Ammad-ud-din ◽

Swapnil Potdar ◽

Mike J Mason ◽

...

Keyword(s):

Gene Expression ◽

Board Of Directors ◽

Research Funding ◽

Drug Response ◽

Ex Vivo ◽

Predictive Performance ◽

Data Sets ◽

Advisory Committees ◽

Data Set ◽

Equity Ownership

Abstract. Introduction: Therapeutic options for patients with AML were recently expanded with FDA approval of four drugs in 2017. As their efficacy is limited in some patient subpopulations and relapse ultimately ensues, there remains an urgent need for additional treatment options tailored to well-defined patient subpopulations to achieve durable responses. Two comprehensive profiling efforts were launched to address this need: the multi-center Beat AML initiative, led by the Oregon Health & Science University (OHSU), and the AML Individualized Systems Medicine program at the Institute for Molecular Medicine Finland (FIMM). Methods: We performed a comparative analysis of the two large-scale data sets in which patient samples were subjected to whole-exome sequencing, RNA-seq, and ex vivo functional drug sensitivity screens: OHSU (121 patients and 160 drugs) and FIMM (39 patients and 480 drugs). We predicted ex vivo drug response [quantified as area under the dose-response curve (AUC)] using gene expression signatures selected with standard regression and a novel Bayesian model designed to analyze multiple data sets simultaneously. We restricted analysis to the 95 drugs in common between the two data sets. Results: The ex vivo responses (AUCs) of most drugs were positively correlated (OHSU: median Pearson correlation r across all pairwise drug comparisons=0.27; FIMM: median r=0.33). Consistently, a sample's ex vivo response to an individual drug was often correlated with the patient's Average ex vivo Drug Sensitivity (ADS), i.e., the average response across the 95 drugs (OHSU: median r across 95 drugs=0.41; FIMM: median r=0.58). Patients with a complete response to standard induction therapy had a higher ADS than those that were refractory (p=0.01). Further, patients whose ADS was in the top quartile had improved overall survival relative to those having an ADS in the bottom quartile (p<0.05). 
Standard regression models (LASSO and Ridge) trained on ADS and gene expression in the OHSU data set had improved ex vivo response prediction performance as assessed in the independent FIMM validation data set relative to those trained on gene expression alone (LASSO: p=2.9×10⁻⁴; Ridge: p=4.4×10⁻³). Overall, ex vivo drug response was relatively well predicted (LASSO: mean r across 95 drugs=0.62; Ridge: mean r=0.62). The BCL-2 inhibitor venetoclax was the only drug whose response was negatively correlated with ADS in both data sets. We hypothesized that, whereas the predictive performance of many other drugs was likely dependent on ADS, the predictive performance of venetoclax (LASSO: r=0.53, p=0.01; Ridge: r=0.63, p=1.3×10⁻³) reflected specific gene expression biomarkers. To identify biomarkers associated with venetoclax sensitivity, we developed an integrative Bayesian machine learning method that jointly modeled both data sets, revealing several candidate biomarkers positively (BCL2 and FLT3) or negatively (CD14, MAFB, and LRP1) correlated with venetoclax response. We assessed these biomarkers in an independent data set that profiled ex vivo response to the BCL-2/BCL-XL inhibitor navitoclax in 29 AML patients (Lee et al.). All five biomarkers were validated in the Lee data set (Fig 1). Conclusions: The two independent ex vivo functional screens were highly concordant, demonstrating the reproducibility of the assays and the opportunity for their use in the clinic. Joint analysis of the two data sets robustly identified biomarkers of drug response for BCL-2 inhibitors. Two of these biomarkers, BCL2 and the previously-reported CD14, serve as positive controls credentialing our approach. CD14, MAFB, and LRP1 are involved in monocyte differentiation. The inverse correlation of their expression with venetoclax and navitoclax response is consistent with prior reports showing that monocytic cells are resistant to BCL-2 inhibition (Kuusanmäki et al.). 
These biomarker panels may enable better selection of patient populations likely to respond to BCL-2 inhibition than would any one biomarker in isolation. References: Kuusanmäki et al. (2017) Single-Cell Drug Profiling Reveals Maturation Stage-Dependent Drug Responses in AML, Blood 130:3821 Lee et al. (2018) A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia, Nat Commun 9:42 Disclosures Druker: Cepheid: Consultancy, Membership on an entity's Board of Directors or advisory committees; ALLCRON: Consultancy, Membership on an entity's Board of Directors or advisory committees; Fred Hutchinson Cancer Research Center: Research Funding; Celgene: Consultancy; Vivid Biosciences: Membership on an entity's Board of Directors or advisory committees; Aileron Therapeutics: Consultancy; Third Coast Therapeutics: Membership on an entity's Board of Directors or advisory committees; Oregon Health & Science University: Patents & Royalties; Patient True Talk: Consultancy; Millipore: Patents & Royalties; Monojul: Consultancy; Gilead Sciences: Consultancy, Membership on an entity's Board of Directors or advisory committees; Amgen: Membership on an entity's Board of Directors or advisory committees; Leukemia & Lymphoma Society: Membership on an entity's Board of Directors or advisory committees, Research Funding; GRAIL: Consultancy, Membership on an entity's Board of Directors or advisory committees; Beta Cat: Membership on an entity's Board of Directors or advisory committees; MolecularMD: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees; Henry Stewart Talks: Patents & Royalties; Bristol-Meyers Squibb: Research Funding; Blueprint Medicines: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees; Aptose Therapeutics: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees; McGraw Hill: Patents & Royalties; 
ARIAD: Research Funding; Novartis Pharmaceuticals: Research Funding. Heckman: Orion Pharma: Research Funding; Novartis: Research Funding; Celgene: Research Funding. Porkka: Novartis: Honoraria, Research Funding; Celgene: Honoraria, Research Funding. Tyner: AstraZeneca: Research Funding; Incyte: Research Funding; Janssen: Research Funding; Leap Oncology: Equity Ownership; Seattle Genetics: Research Funding; Syros: Research Funding; Takeda: Research Funding; Gilead: Research Funding; Genentech: Research Funding; Aptose: Research Funding; Agios: Research Funding. Aittokallio: Novartis: Research Funding. Wennerberg: Novartis: Research Funding.
