Explainable Boosting Machines for Slope Failure Spatial Predictive Modeling

2021 ◽  
Vol 13 (24) ◽  
pp. 4991
Author(s):  
Aaron E. Maxwell ◽  
Maneesh Sharma ◽  
Kurt A. Donaldson

Machine learning (ML) methods, such as artificial neural networks (ANN), k-nearest neighbors (kNN), random forests (RF), support vector machines (SVM), and boosted decision trees (DTs), may offer stronger predictive performance than more traditional, parametric methods, such as linear regression, multiple linear regression, and logistic regression (LR), for specific mapping and modeling tasks. However, this increased performance is often accompanied by increased model complexity and decreased interpretability, resulting in critiques of their “black box” nature, which highlights the need for algorithms that can offer both strong predictive performance and interpretability. This is especially true when the global model and predictions for specific data points need to be explainable in order for the model to be of use. Explainable boosting machines (EBMs), an augmentation and refinement of generalized additive models (GAMs), have been proposed as an empirical modeling method that offers both interpretable results and strong predictive performance. The trained model can be graphically summarized as a set of functions relating each predictor variable to the dependent variable, along with heat maps representing interactions between selected pairs of predictor variables. In this study, we assess EBMs for predicting the likelihood or probability of slope failure occurrence based on digital terrain characteristics in four separate Major Land Resource Areas (MLRAs) in the state of West Virginia, USA, and compare the results to those obtained with LR, kNN, RF, and SVM. EBM provided predictive accuracies comparable to RF and SVM and better than LR and kNN. The generated functions and visualizations for each predictor variable and for the included interactions between pairs of predictor variables, the estimation of variable importance based on mean absolute scores, and the per-variable scores provided for each new prediction add interpretability, but additional work is needed to quantify how these outputs may be impacted by variable correlation, inclusion of interaction terms, and large feature spaces. Further exploration of EBM is merited for geohazard mapping and modeling in particular and spatial predictive mapping and modeling in general, especially when the value or use of the resulting predictions would be greatly enhanced by improved global interpretability and by the availability of prediction explanations at each cell or aggregating unit within the mapped or modeled extent.
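A minimal sketch of how such a model can be fit with the interpret library's ExplainableBoostingClassifier is shown below; the terrain predictor names and input file are hypothetical illustrations, not the study's actual data.

```python
# Sketch only: fitting an EBM on hypothetical terrain predictors.
# The column names and CSV file are illustrative, not the study's data.
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("terrain_samples.csv")  # hypothetical file
predictors = ["slope", "plan_curvature", "profile_curvature",
              "topographic_wetness_index", "relative_relief"]
X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df["failure"], test_size=0.3, stratify=df["failure"])

# Pairwise interactions between predictors are learned automatically.
ebm = ExplainableBoostingClassifier(interactions=10)
ebm.fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, ebm.predict_proba(X_test)[:, 1]))

# Global explanation: one shape function per predictor plus interaction heat maps.
global_expl = ebm.explain_global()
# Local explanation: per-variable scores for individual predictions.
local_expl = ebm.explain_local(X_test.iloc[:5], y_test.iloc[:5])
```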

mBio ◽  
2020 ◽  
Vol 11 (3) ◽  
Author(s):  
Begüm D. Topçuoğlu ◽  
Nicholas A. Lesniak ◽  
Mack T. Ruffin ◽  
Jenna Wiens ◽  
Patrick D. Schloss

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, there appears to be a tendency among many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability. IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more reproducible ML practices when applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
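The comparison between an interpretable penalized linear model and a tree ensemble can be sketched as follows. This is a schematic Python illustration only, not the authors' published pipeline, and the OTU abundance and outcome files are hypothetical.

```python
# Schematic sketch: comparing an interpretable linear model with a tree ensemble
# on a hypothetical OTU relative-abundance table; not the authors' pipeline.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

otus = pd.read_csv("otu_abundances.csv", index_col=0)        # hypothetical file
labels = pd.read_csv("srn_status.csv", index_col=0)["srn"]   # 1 = case, 0 = control

models = {
    "L2 logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(penalty="l2", C=0.1, max_iter=5000)),
    "random forest": RandomForestClassifier(n_estimators=500, n_jobs=-1),
}

for name, model in models.items():
    aucs = cross_val_score(model, otus, labels, cv=5, scoring="roc_auc")
    print(f"{name}: median AUROC = {pd.Series(aucs).median():.3f}")
```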


2019 ◽  
Author(s):  
Begüm D. Topçuoğlu ◽  
Nicholas A. Lesniak ◽  
Mack Ruffin ◽  
Jenna Wiens ◽  
Patrick D. Schloss

Abstract Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made towards developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, there appears to be a tendency among many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs; n=490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, decision trees, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an AUROC of 0.695 [IQR 0.651-0.739] but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 [IQR 0.625-0.735], trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability. Importance Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely over-optimistic. Moreover, there is a trend towards using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step towards developing more reproducible ML practices when applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.


Author(s):  
Anudeep P P ◽  
Suchitra Kumari ◽  
Aishvarya S Rajasimman ◽  
Saurav Nayak ◽  
Pooja Priyadarsini

Background LDL-C is a strong risk factor for cardiovascular disorders. The formulas used to calculate LDL-C show varying performance in different populations. Machine learning models can capture complex interactions between variables and can be used to predict outcomes more accurately. The current study evaluated the predictive performance of three machine learning models—random forests, XGBoost, and support vector regression (SVR)—in predicting LDL-C from total cholesterol, triglycerides, and HDL-C, in comparison to a linear regression model and existing formulas for LDL-C calculation, in an eastern Indian population. Methods Lipid profiles performed in the clinical biochemistry laboratory of AIIMS Bhubaneswar during 2019–2021, a total of 13,391 samples, were included in the study. Laboratory results were collected from the laboratory database. 70% of the data were assigned to the training set and used to develop the three machine learning models and the linear regression formula. These models were then tested on the remaining 30% of the data (test set) for validation. Model performance was evaluated against six of the best existing LDL-C calculation formulas. Results LDL-C predicted by the XGBoost and random forests models showed a strong correlation with directly estimated LDL-C (r = 0.98). These two machine learning models outperformed the six existing, commonly used LDL-C calculation formulas, such as the Friedewald formula, in the study population. When compared across different triglyceride strata as well, these two models outperformed the other methods. Conclusion Machine learning models such as XGBoost and random forests can be used to predict LDL-C more accurately than conventional linear-regression-based LDL-C formulas.
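For context, the Friedewald formula estimates LDL-C (in mg/dL) as total cholesterol minus HDL-C minus triglycerides/5. The sketch below contrasts that formula with a gradient-boosted regressor trained on the same three inputs; the data file and column names are hypothetical.

```python
# Sketch: Friedewald estimate vs. an XGBoost regressor trained on TC, TG and HDL-C.
# All values in mg/dL; the data file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

lipids = pd.read_csv("lipid_profiles.csv")  # hypothetical file
X = lipids[["total_cholesterol", "triglycerides", "hdl_c"]]
y = lipids["direct_ldl_c"]                  # directly measured LDL-C
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Friedewald formula: LDL-C = TC - HDL-C - TG/5 (intended for TG < 400 mg/dL)
friedewald = X_test["total_cholesterol"] - X_test["hdl_c"] - X_test["triglycerides"] / 5

model = XGBRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)

print("Friedewald R2:", r2_score(y_test, friedewald))
print("XGBoost    R2:", r2_score(y_test, model.predict(X_test)))
```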


2018 ◽  
Vol 14 (13) ◽  
pp. 59
Author(s):  
Fran Calvo ◽  
Xavier Carbonell ◽  
Marc Badia

Although the research suggests that the main causes of homelessness are classified into individual and structural factors, there are few scientific articles that evaluate the impact of structural factors such as unemployment during periods of economic recession. The objective of this study is to compare the evolution of the total rate of homelessness with the total rate of unemployment in the city of Girona (Catalonia) during the economic recession (2006–2016) and to determine whether unemployment is a predictive factor of homelessness. This is the first study with a Catalan sample comparing unemployment and homelessness. The design was longitudinal, retrospective and observational. The correlation tests between unemployment and homelessness indicated strong associations for the combined sample (r = .914, p < .001), men (r = .924, p < .001), and women (r = .716, p = .013). The results of the different simple linear regression models used to determine the predictor variables of homelessness indicate that the rise in overall unemployment is a predictor of the rise in overall homelessness (β = 2.17, p = .002) and of male homelessness (β = .82, p < .001). However, it does not predict female homelessness specifically (β = .88, p = .68).
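A minimal sketch of the kind of correlation test and simple linear regression reported here is shown below; the yearly rates are hypothetical placeholders, not the Girona data.

```python
# Sketch: Pearson correlation and simple linear regression of a yearly homelessness
# rate on the unemployment rate. The yearly values are hypothetical placeholders.
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr

unemployment = np.array([6.5, 8.9, 12.3, 15.1, 17.8, 19.2, 20.1, 19.5, 17.0, 14.8, 12.9])
homelessness = np.array([0.8, 1.0, 1.4, 1.9, 2.3, 2.6, 2.8, 2.7, 2.3, 2.0, 1.7])

r, p = pearsonr(unemployment, homelessness)
print(f"r = {r:.3f}, p = {p:.3f}")

ols = sm.OLS(homelessness, sm.add_constant(unemployment)).fit()
print(ols.summary())  # the slope plays the role of the reported beta coefficient
```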


Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2133
Author(s):  
Francisco O. Cortés-Ibañez ◽  
Sunil Belur Nagaraj ◽  
Ludo Cornelissen ◽  
Gerjan J. Navis ◽  
Bert van der Vegt ◽  
...  

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models that could predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) < 0.75 was obtained; the highest AUC values were for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machine AUC = 0.83 (95% CI 0.78–0.88)); age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance for incident cancers in the Lifelines cohort.
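A schematic sketch of the modelling setup described (elastic-net-based variable selection followed by linear and non-linear classifiers evaluated by AUC) follows; the cohort file, outcome column, and hyperparameters are assumptions for illustration.

```python
# Schematic sketch: elastic-net feature selection followed by three classifiers,
# each evaluated by cross-validated AUC. File and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cohort = pd.read_csv("cohort_baseline.csv")  # hypothetical baseline variables
X, y = cohort.drop(columns="incident_cancer"), cohort["incident_cancer"]

selector = SelectFromModel(
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                       C=0.1, max_iter=5000))

classifiers = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "random forest": RandomForestClassifier(n_estimators=500),
    "SVM": SVC(probability=True),
}

for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), selector, clf)
    aucs = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {aucs.mean():.2f}")
```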


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Narayan Sharma ◽  
René Schwendimann ◽  
Olga Endrich ◽  
Dietmar Ausserhofer ◽  
Michael Simon

Abstract Background Understanding how comorbidity measures contribute to patient mortality is essential both to describe patient health status and to adjust for risks and potential confounding. The Charlson and Elixhauser comorbidity indices are well established for risk adjustment and mortality prediction. Still, a different set of comorbidity weights might improve the prediction of in-hospital mortality. The present study, therefore, aimed to derive a set of new Swiss Elixhauser comorbidity weightings and to validate and compare them against the Charlson and the Elixhauser-based van Walraven weights in an adult in-patient population-based cohort of general hospitals. Methods A retrospective analysis was conducted with routine data of 102 Swiss general hospitals (2012–2017) covering 6.09 million inpatient cases. To derive the Swiss weightings for the Elixhauser comorbidity index, we randomly halved the inpatient data and validated the results of part 1 alongside the established weighting systems in part 2, to predict in-hospital mortality. Charlson and van Walraven weights were applied to the Charlson and Elixhauser comorbidity indices. Derivation and validation of the weightings were conducted with generalized additive models adjusted for age, gender and hospital type. Results Overall, the Elixhauser indices with Swiss weights (c-statistic 0.867, 95% CI, 0.865–0.868) and with van Walraven's weights (0.863, 95% CI, 0.862–0.864) had a substantial advantage over Charlson's weights (0.850, 95% CI, 0.849–0.851) in both the derivation and validation groups. The net reclassification improvement of the new Swiss weights was 1.6% over the Elixhauser-van Walraven weights and 4.9% over the Charlson weights. Conclusions All weightings confirmed previous results with the national dataset. The new Swiss weighting model slightly improved the prediction of in-hospital mortality in Swiss hospitals. The newly derived weights support patient population-based analysis of in-hospital mortality and the pursuit of country- or cohort-specific weightings.
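A hedged sketch of how comorbidity weights can be derived from a logistic model with a smooth age term is shown below, using statsmodels with a spline basis as a stand-in for the GAM and a van Walraven-style scaling of the fitted coefficients; the column names and comorbidity list are hypothetical.

```python
# Sketch: deriving comorbidity weights from a logistic model with a smooth age
# term and categorical adjustments; column names are hypothetical. Weights are
# obtained van Walraven-style by scaling and rounding the fitted log-odds.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cases = pd.read_csv("inpatient_cases.csv")  # hypothetical routine data
comorbidities = ["chf", "arrhythmia", "renal_failure", "metastatic_cancer"]

formula = ("died ~ bs(age, df=4) + C(sex) + C(hospital_type) + "
           + " + ".join(comorbidities))
fit = smf.glm(formula, data=cases, family=sm.families.Binomial()).fit()

# Scale coefficients so the smallest absolute comorbidity effect is about 1, then round.
coefs = fit.params[comorbidities]
weights = np.round(coefs / coefs.abs().min()).astype(int)
print(weights)
```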


Mathematics ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 299
Author(s):  
Jaime Pinilla ◽  
Miguel Negrín

Interrupted time series analysis is a quasi-experimental design used to evaluate the effectiveness of an intervention. Segmented linear regression models have been the most commonly used models to carry out this analysis. However, they assume a linear trend that may not be appropriate in many situations. In this paper, we show how generalized additive models (GAMs), a non-parametric regression-based method, can be useful for accommodating nonlinear trends. An analysis with simulated data is carried out to assess the performance of both models. Data were simulated from linear and non-linear (quadratic and cubic) functions. The results of this analysis show how GAMs improve on segmented linear regression models when the trend is non-linear, but they also show good performance when the trend is linear. A real-life application, analyzing the impact of the 2012 Spanish cost-sharing reforms on pharmaceutical prescription, is also presented. Seasonality and an indicator variable for the stockpiling effect are included as explanatory variables. The segmented linear regression model shows a good fit to the data. However, the GAM rejects the hypothesis of a linear trend. The estimated level shift is similar for both models, but the cumulative absolute effect on the number of prescriptions is lower in the GAM.
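A minimal sketch of both specifications on a simulated monthly series follows: the segmented model includes level and slope change terms, while the GAM replaces the linear trend with a smooth. The data are simulated and pygam is used for the GAM as an assumption, not necessarily the authors' software.

```python
# Sketch: segmented linear regression vs. a GAM for an interrupted time series.
# The monthly series is simulated; pygam is an assumed choice of GAM library.
import numpy as np
import statsmodels.api as sm
from pygam import LinearGAM, l, s

rng = np.random.default_rng(0)
t = np.arange(48)                      # 48 months
post = (t >= 24).astype(float)         # intervention at month 24
y = 100 + 0.5 * t - 8 * post - 0.02 * (t - 24) ** 2 * post + rng.normal(0, 2, 48)

# Segmented regression: intercept, trend, level change, slope change.
X = np.column_stack([t, post, post * (t - 24)])
seg = sm.OLS(y, sm.add_constant(X)).fit()
print(seg.params)                      # [level, trend, level shift, slope change]

# GAM: smooth trend plus linear terms for level and slope change.
gam = LinearGAM(s(0) + l(1) + l(2)).fit(X, y)
gam.summary()
```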


Languages ◽  
2021 ◽  
Vol 6 (3) ◽  
pp. 114
Author(s):  
Ulrich Reubold ◽  
Sanne Ditewig ◽  
Robert Mayr ◽  
Ineke Mennen

The present study sought to examine the effect of dual language activation on L1 speech in late English–Austrian German sequential bilinguals, and to identify relevant predictor variables. To this end, we compared the English speech patterns of adult migrants to Austria in a code-switched and monolingual condition alongside those of monolingual native speakers in England in a monolingual condition. In the code-switched materials, German words containing target segments known to trigger cross-linguistic interaction in the two languages (i.e., [v–w], [ʃt(ʁ)-st(ɹ)] and [l-ɫ]) were inserted into an English frame; monolingual materials comprised English words with the same segments. To examine whether the position of the German item affects L1 speech, the segments occurred either before the switch (“He wants a Wienerschnitzel”) or after (“I like Würstel with mustard”). Critical acoustic measures of these segments revealed no differences between the groups in the monolingual condition, but significant L2-induced shifts in the bilinguals’ L1 speech production in the code-switched condition for some sounds. These were found to occur both before and after a code-switch, and exhibited a fair amount of individual variation. Only the amount of L2 use was found to be a significant predictor variable for shift size in code-switched compared with monolingual utterances, and only for [w]. These results have important implications for the role of dual activation in the speech of late sequential bilinguals.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Prasanna Date ◽  
Davis Arthur ◽  
Lauren Pusey-Nazzaro

Abstract Training machine learning models on classical computers is usually a time- and compute-intensive process. With Moore's law nearing its inevitable end and an ever-increasing demand for large-scale data analysis using machine learning, we must leverage non-conventional computing paradigms like quantum computing to train machine learning models efficiently. Adiabatic quantum computers can approximately solve NP-hard problems, such as quadratic unconstrained binary optimization (QUBO), faster than classical computers. Since many machine learning problems are also NP-hard, we believe adiabatic quantum computers might be instrumental in training machine learning models efficiently in the post-Moore's law era. In order to solve problems on adiabatic quantum computers, they must be formulated as QUBO problems, which is very challenging. In this paper, we formulate the training problems of three machine learning models—linear regression, support vector machine (SVM) and balanced k-means clustering—as QUBO problems, making them amenable to training on adiabatic quantum computers. We also analyze the computational complexities of our formulations and compare them to those of the corresponding state-of-the-art classical approaches. We show that the time and space complexities of our formulations are better than (in the case of SVM and balanced k-means clustering) or equivalent to (in the case of linear regression) those of their classical counterparts.
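As an illustration of the kind of formulation involved, the least-squares objective ||Xw - y||^2 can be turned into a QUBO by encoding each weight with a fixed-point binary expansion w = Pb. The sketch below builds the resulting QUBO matrix with NumPy; it is a generic construction under stated assumptions, not necessarily the paper's exact formulation.

```python
# Sketch: encoding least-squares linear regression as a QUBO.
# Each weight w_j is written as a signed fixed-point sum of binary variables,
# w = P b, so that ||X P b - y||^2 = b^T (P^T X^T X P) b - 2 (P^T X^T y)^T b + const;
# linear terms fold onto the diagonal because b_i^2 = b_i for binary b_i.
# Generic construction for illustration, not the paper's exact formulation.
import numpy as np

def regression_qubo(X, y, bits=(2.0, 1.0, 0.5, -4.0)):
    """Build Q and P so that b^T Q b equals ||X P b - y||^2 up to a constant."""
    n_features = X.shape[1]
    # Block precision matrix: each weight uses the same fixed-point bit values.
    P = np.kron(np.eye(n_features), np.array(bits).reshape(1, -1))
    Q = P.T @ X.T @ X @ P                      # quadratic coefficients
    lin = -2.0 * (P.T @ X.T @ y)               # linear coefficients
    Q = Q.copy()
    Q[np.diag_indices_from(Q)] += lin          # fold linear terms onto the diagonal
    return Q, P

# Tiny example with simulated data; true weights are representable by the bits.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(0, 0.1, 20)
Q, P = regression_qubo(X, y)

# Brute-force the best bit string (feasible only for tiny problems; a quantum
# annealer would minimize b^T Q b instead).
n_bits = Q.shape[0]
energies = []
for i in range(2 ** n_bits):
    b = np.array(list(np.binary_repr(i, n_bits)), dtype=float)
    energies.append((b @ Q @ b, b))
best_energy, best_b = min(energies, key=lambda e: e[0])
print("recovered weights:", P @ best_b)
```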


2021 ◽  
Vol 13 (4) ◽  
pp. 581 ◽  
Author(s):  
Yuanyuan Fu ◽  
Guijun Yang ◽  
Xiaoyu Song ◽  
Zhenhong Li ◽  
Xingang Xu ◽  
...  

Rapid and accurate crop aboveground biomass estimation is beneficial for high-throughput phenotyping and site-specific field management. This study explored the utility of high-definition digital images acquired by a low-flying unmanned aerial vehicle (UAV) and ground-based hyperspectral data for improved estimates of winter wheat biomass. To extract fine textures for characterizing the variations in winter wheat canopy structure during growing seasons, we proposed a multiscale texture extraction method (Multiscale_Gabor_GLCM) that took advantage of multiscale Gabor transformation and gray-level co-occurrence matrix (GLCM) analysis. Narrowband normalized difference vegetation indices (NDVIs) involving all possible two-band combinations and continuum removal of red-edge spectra (SpeCR) were also extracted for biomass estimation. Subsequently, non-parametric linear (i.e., partial least squares regression, PLSR) and nonlinear regression (i.e., least squares support vector machine, LSSVM) analyses were conducted using the extracted spectral features, the multiscale textural features and combinations thereof. For the first time, the visualization technique of LSSVM was utilized to select the multiscale textures that contributed most to the biomass estimation. Compared with the best-performing NDVI (1193, 1222 nm), SpeCR yielded a higher coefficient of determination (R2), lower root mean square error (RMSE), and lower mean absolute error (MAE) for winter wheat biomass estimation, and significantly alleviated the saturation problem after biomass exceeded 800 g/m2. The predictive performance of the PLSR and LSSVM regression models based on SpeCR decreased with increasing bandwidths, especially at bandwidths larger than 11 nm. Both the PLSR and LSSVM regression models based on the multiscale textures produced higher accuracies than those based on the single-scale GLCM-based textures. According to the evaluation of variable importance, the texture metric “Mean” at different scales was determined to be the most influential for winter wheat biomass. Using just 10 multiscale textures largely improved predictive performance over using all textures and achieved an accuracy comparable to that of SpeCR. The LSSVM regression model based on the combination of the selected multiscale textures and SpeCR with a bandwidth of 9 nm produced the highest estimation accuracy, with R2val = 0.87, RMSEval = 119.76 g/m2, and MAEval = 91.61 g/m2. However, the combination did not significantly improve the estimation accuracy compared to the use of SpeCR or the multiscale textures alone. The accuracy of the biomass predicted by the LSSVM regression models was higher than that of the PLSR models, demonstrating that LSSVM is a promising candidate for characterizing winter wheat biomass across multiple growth stages. The study suggests that multiscale textures derived from high-definition UAV-based digital images are competitive with hyperspectral features in predicting winter wheat biomass.
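A hedged sketch of the general texture-plus-regression idea follows (not the study's exact Multiscale_Gabor_GLCM implementation): Gabor responses at several scales are summarized with GLCM statistics and regressed on biomass with PLSR. The image files and biomass measurements are hypothetical, and the use of scikit-image and scikit-learn is an assumption.

```python
# Sketch of the general idea only: multiscale Gabor responses summarized with
# GLCM statistics and regressed on biomass via PLSR. Files, measurements and
# the choice of scikit-image/scikit-learn are assumptions, not the study's code.
import numpy as np
from skimage import io, img_as_ubyte
from skimage.color import rgb2gray
from skimage.feature import graycomatrix, graycoprops
from skimage.filters import gabor
from sklearn.cross_decomposition import PLSRegression

def multiscale_texture_features(image, frequencies=(0.05, 0.1, 0.2, 0.4)):
    """GLCM contrast/correlation/homogeneity of Gabor magnitude at several scales."""
    gray = rgb2gray(image)
    feats = []
    for freq in frequencies:
        real, imag = gabor(gray, frequency=freq)
        magnitude = np.hypot(real, imag)
        q = img_as_ubyte(magnitude / (magnitude.max() + 1e-9))  # quantize to 8 bits
        glcm = graycomatrix(q, distances=[1], angles=[0], levels=256,
                            symmetric=True, normed=True)
        for prop in ("contrast", "correlation", "homogeneity"):
            feats.append(graycoprops(glcm, prop)[0, 0])
    return feats

# Hypothetical plot images and field-measured biomass (g/m^2).
plots = [io.imread(f"plot_{i:03d}.png") for i in range(60)]
biomass = np.loadtxt("biomass_g_m2.txt")

X = np.array([multiscale_texture_features(img) for img in plots])
pls = PLSRegression(n_components=5).fit(X, biomass)
print("R2 (training):", pls.score(X, biomass))
```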

