A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems

mBio ◽  
2020 ◽  
Vol 11 (3) ◽  
Author(s):  
Begüm D. Topçuoğlu ◽  
Nicholas A. Lesniak ◽  
Mack T. Ruffin ◽  
Jenna Wiens ◽  
Patrick D. Schloss

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability. IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
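
The modeling workflow described above can be illustrated with a short sketch: repeated held-out splits, hyperparameter tuning confined to the training data, and AUROC measured on each held-out split, here comparing L2-regularized logistic regression with random forest. This is a minimal illustration under stated assumptions, not the authors' pipeline; the OTU abundance table and SRN labels below are simulated placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data standing in for a 16S OTU abundance table and binary SRN labels.
X, y = make_classification(n_samples=490, n_features=1000, n_informative=25, random_state=0)

models = {
    "L2 logistic regression": GridSearchCV(
        LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000),
        {"C": [0.001, 0.01, 0.1, 1, 10]}, scoring="roc_auc", cv=5),
    "random forest": GridSearchCV(
        RandomForestClassifier(n_estimators=500, random_state=0),
        {"max_features": ["sqrt", 0.1, 0.2]}, scoring="roc_auc", cv=5),
}

aurocs = {name: [] for name in models}
for seed in range(10):  # the paper used many 80/20 splits; 10 keeps this sketch fast
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)
    for name, model in models.items():
        model.fit(X_tr, y_tr)                    # hyperparameters tuned on training data only
        probs = model.predict_proba(X_te)[:, 1]  # score the held-out split
        aurocs[name].append(roc_auc_score(y_te, probs))

for name, scores in aurocs.items():
    print(f"{name}: median AUROC {np.median(scores):.3f}")
```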


2022 ◽  
Vol 12 ◽  
Author(s):  
Shaowu Lin ◽  
Yafei Wu ◽  
Ya Fang

Background: Depression is highly prevalent and considered the most common psychiatric disorder among home-based elderly people, yet studies on forecasting depression risk in the elderly are still limited. In an endeavor to improve the accuracy of depression forecasting, machine learning (ML) approaches have been recommended in addition to the application of more traditional regression approaches. Methods: A prospective study was employed in home-based elderly Chinese, using baseline (2011) and follow-up (2013) data of the China Health and Retirement Longitudinal Study (CHARLS), a nationally representative cohort study. We compared four algorithms, including regression-based models (logistic regression, lasso, ridge) and an ML method (random forest). Model performance was assessed using repeated nested 10-fold cross-validation. As the main measure of predictive performance, we used the area under the receiver operating characteristic curve (AUC). Results: The mean AUCs of the four predictive models, logistic regression, lasso, ridge, and random forest, were 0.795, 0.794, 0.794, and 0.769, respectively. The main determinants were life satisfaction, self-reported memory, cognitive ability, ADL (activities of daily living) impairment, and CESD-10 score. Life satisfaction increased the odds ratio of a future depression by 128.6% (logistic), 13.8% (lasso), and 13.2% (ridge), and cognitive ability was the most important predictor in the random forest. Conclusions: The three regression-based models and the one ML algorithm performed equally well in differentiating between a future depression case and a non-depression case in home-based elderly people. When choosing a model, however, other considerations, such as ease of use, might in some instances lead to one model being prioritized over another.
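
As a rough illustration of the repeated nested 10-fold cross-validation used to compare these models, the sketch below tunes each candidate on inner folds and scores it by AUC on the outer folds. The CHARLS predictors and depression labels are simulated placeholders, not the study data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score

# Placeholder baseline predictors and follow-up depression labels.
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.7], random_state=1)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "lasso":    GridSearchCV(LogisticRegression(penalty="l1", solver="liblinear"),
                             {"C": [0.01, 0.1, 1, 10]}, scoring="roc_auc", cv=5),
    "ridge":    GridSearchCV(LogisticRegression(penalty="l2", solver="liblinear"),
                             {"C": [0.01, 0.1, 1, 10]}, scoring="roc_auc", cv=5),
    "random forest": GridSearchCV(RandomForestClassifier(random_state=1),
                                  {"max_depth": [3, 5, None]}, scoring="roc_auc", cv=5),
}

# Outer loop: repeated 10-fold CV; inner loop: the GridSearchCV objects above.
outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
for name, model in candidates.items():
    aucs = cross_val_score(model, X, y, cv=outer, scoring="roc_auc")
    print(f"{name}: mean AUC {aucs.mean():.3f}")
```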


Author(s):  
Kazutaka Uchida ◽  
Junichi Kouno ◽  
Shinichi Yoshimura ◽  
Norito Kinjo ◽  
Fumihiro Sakakibara ◽  
...  

Abstract In conjunction with recent advancements in machine learning (ML), such technologies have been applied in various fields owing to their high predictive performance. We aimed to develop a prehospital stroke scale with ML. We conducted a multicenter retrospective and prospective cohort study. The training cohort comprised eight centers in Japan from June 2015 to March 2018, and the test cohort comprised 13 centers from April 2019 to March 2020. We used three different ML algorithms (logistic regression, random forests, XGBoost) to develop the models. The main outcomes were large vessel occlusion (LVO), intracranial hemorrhage (ICH), subarachnoid hemorrhage (SAH), and cerebral infarction (CI) other than LVO. The predictive abilities were validated in the test cohort with accuracy, positive predictive value, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and F score. The training cohort included 3178 patients with 337 LVO, 487 ICH, 131 SAH, and 676 CI cases, and the test cohort included 3127 patients with 183 LVO, 372 ICH, 90 SAH, and 577 CI cases. The overall accuracies were 0.65, and the positive predictive values, sensitivities, specificities, AUCs, and F scores were stable in the test cohort. The classification abilities were also fair for all ML models. The AUCs for LVO of logistic regression, random forests, and XGBoost were 0.89, 0.89, and 0.88, respectively, in the test cohort, and these values were higher than those of previously reported prediction models for LVO. The ML models developed to predict the probability and type of stroke at the prehospital stage had superior predictive abilities.
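
A hedged sketch of this setup is shown below: three classifiers fitted on a training cohort and evaluated on a separate test cohort with a one-vs-rest AUC for a single class (LVO). The feature matrix, class index, and cohort sizes are illustrative placeholders rather than the study's prehospital items.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Placeholder five-class stroke-type data standing in for the two cohorts.
X, y = make_classification(n_samples=6000, n_features=20, n_classes=5,
                           n_informative=10, random_state=0)
X_train, y_train, X_test, y_test = X[:3000], y[:3000], X[3000:], y[3000:]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
}

LVO = 0  # hypothetical index of the LVO class
for name, model in models.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)
    # One-vs-rest AUC for detecting LVO in the held-out test cohort.
    auc_lvo = roc_auc_score((y_test == LVO).astype(int), probs[:, LVO])
    print(f"{name}: LVO AUC {auc_lvo:.3f}")
```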


2018 ◽  
Vol 26 (1) ◽  
pp. 141-155 ◽  
Author(s):  
Li Luo ◽  
Fengyi Zhang ◽  
Yao Yao ◽  
RenRong Gong ◽  
Martina Fu ◽  
...  

Surgery cancellations waste scarce operative resources and hinder patients' access to operative services. In this study, the Wilcoxon and chi-square tests were used for predictor selection, and three machine learning models (random forest, support vector machine, and XGBoost) were used to identify surgeries with a high risk of cancellation. The optimal performance of the identification models was as follows: sensitivity, 0.615; specificity, 0.957; positive predictive value, 0.454; negative predictive value, 0.904; accuracy, 0.647; and area under the receiver operating characteristic curve, 0.682. Of the three models, the random forest model achieved the best performance. Thus, effective identification of surgeries with a high risk of cancellation is feasible with stable performance. Models and sampling methods significantly affect identification performance. This study is a new application of machine learning for identifying surgeries with a high risk of cancellation and facilitating surgical resource management.
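
The two-stage design (univariate predictor screening followed by classification) can be sketched as below, with SelectKBest standing in for the Wilcoxon/chi-square screening and a random forest as the classifier; the simulated data and the number of retained predictors are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Placeholder scheduling data with a rare cancellation outcome.
X, y = make_classification(n_samples=5000, n_features=40, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# chi2 screening requires non-negative inputs, hence the MinMaxScaler.
clf = make_pipeline(
    MinMaxScaler(),
    SelectKBest(chi2, k=15),
    RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0),
)
clf.fit(X_tr, y_tr)

probs = clf.predict_proba(X_te)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_te, probs > 0.5).ravel()
print(f"sensitivity {tp/(tp+fn):.3f}  specificity {tn/(tn+fp):.3f}  "
      f"PPV {tp/(tp+fp):.3f}  NPV {tn/(tn+fn):.3f}  AUC {roc_auc_score(y_te, probs):.3f}")
```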


2020 ◽  
Author(s):  
Jun Ke ◽  
Yiwei Chen ◽  
Xiaoping Wang ◽  
Zhiyong Wu ◽  
Qiongyao Zhang ◽  
...  

Abstract Background: The purpose of this study was to identify the risk factors for in-hospital mortality in patients with acute coronary syndrome (ACS) and to evaluate the performance of traditional regression and machine learning prediction models. Methods: The data of ACS patients who presented to the emergency department of Fujian Provincial Hospital with chest pain from January 1, 2017 to March 31, 2020 were retrospectively collected. The study used univariate and multivariate logistic regression analysis to identify risk factors for in-hospital mortality of ACS patients. Traditional regression and machine learning algorithms were used to develop predictive models, and the sensitivity, specificity, and receiver operating characteristic curve were used to evaluate the performance of each model. Results: A total of 7810 ACS patients were included in the study, and the in-hospital mortality rate was 1.75%. Multivariate logistic regression analysis found that age and levels of D-dimer, cardiac troponin I, N-terminal pro-B-type natriuretic peptide (NT-proBNP), lactate dehydrogenase (LDH), high-density lipoprotein (HDL) cholesterol, and calcium channel blockers were independent predictors of in-hospital mortality. The areas under the receiver operating characteristic curve of the models developed by logistic regression, gradient boosting decision tree (GBDT), random forest, and support vector machine (SVM) for predicting the risk of in-hospital mortality were 0.963, 0.960, 0.963, and 0.959, respectively. Feature importance evaluation found that NT-proBNP, LDH, and HDL cholesterol were the three variables that contributed most to the prediction performance of the GBDT and random forest models. Conclusions: The predictive models developed using logistic regression, GBDT, random forest, and SVM algorithms can be used to predict the risk of in-hospital death of ACS patients. Based on our findings, we recommend that clinicians focus on monitoring changes in NT-proBNP, LDH, and HDL cholesterol, as this may improve the clinical outcomes of ACS patients.
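
The modeling and feature-importance steps might look roughly like the sketch below, which fits the four model families on simulated data and ranks features by the GBDT model's importances. The feature names (age, NT-proBNP, and so on) are illustrative stand-ins, not the study's variable definitions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical admission variables; the outcome is rare in-hospital mortality.
features = ["age", "d_dimer", "troponin_i", "nt_probnp", "ldh", "hdl_c", "ccb_use"]
X, y = make_classification(n_samples=7810, n_features=len(features), weights=[0.98], random_state=0)
X = pd.DataFrame(X, columns=features)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "SVM": SVC(probability=True),  # probability=True enables predict_proba for AUC
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC {auc:.3f}")

# Rank the (placeholder) features by their contribution to the GBDT model.
gbdt = models["GBDT"]
ranking = pd.Series(gbdt.feature_importances_, index=features).sort_values(ascending=False)
print(ranking.head(3))
```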


2021 ◽  
Author(s):  
Chen Bai ◽  
Yu-Peng Chen ◽  
Adam Wolach ◽  
Lisa Anthony ◽  
Mamoun Mardini

BACKGROUND Frequent spontaneous facial self-touches, predominantly during outbreaks, have the theoretical potential to be a mechanism of contracting and transmitting diseases. Despite the recent advent of vaccines, behavioral approaches remain an integral part of reducing the spread of COVID-19 and other respiratory illnesses. Real-time biofeedback of face touching can potentially mitigate the spread of respiratory diseases. The gap addressed in this study is the lack of an on-demand platform that utilizes motion data from smartwatches to accurately detect face touching. OBJECTIVE The aim of this study was to utilize the functionality and widespread adoption of smartwatches to develop a smartwatch application that identifies motion signatures mapped accurately to face touching. METHODS Participants (n=10, 50% women, aged 20-83) performed 10 physical activities classified into face-touching (FT) and non-face-touching (NFT) categories in a standardized laboratory setting. We developed a smartwatch application for the Samsung Galaxy Watch to collect raw accelerometer data from participants. Data features were then extracted from consecutive non-overlapping windows varying from 2 to 16 seconds. We examined the performance of state-of-the-art machine learning methods on face-touching movement recognition (FT vs NFT) and individual activity recognition (IAR): logistic regression, support vector machine, decision trees, and random forest. RESULTS Machine learning models were accurate in recognizing face-touching categories; logistic regression achieved the best performance across all metrics (Accuracy: 0.93 +/- 0.08, Recall: 0.89 +/- 0.16, Precision: 0.93 +/- 0.08, F1-score: 0.90 +/- 0.11, AUC: 0.95 +/- 0.07) at a window size of 5 seconds. IAR models resulted in lower performance; the random forest classifier achieved the best performance across all metrics (Accuracy: 0.70 +/- 0.14, Recall: 0.70 +/- 0.14, Precision: 0.70 +/- 0.16, F1-score: 0.67 +/- 0.15) at a window size of 9 seconds. CONCLUSIONS Wearable devices, powered by machine learning, are effective in detecting facial touches. This is highly significant during respiratory infection outbreaks, as it has great potential to deter people from touching their faces and thus mitigate the transmission of COVID-19 and future respiratory diseases.
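
A minimal sketch of the window-based recognition pipeline is given below: raw triaxial accelerometer samples are summarized into non-overlapping windows, each window is labeled FT or NFT, and two of the classifiers are compared. The sampling rate, window labels, and simulated signal are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

FS = 25        # assumed smartwatch sampling rate (Hz)
WINDOW_S = 5   # window size in seconds (the best-performing size reported above)

rng = np.random.default_rng(0)
acc = rng.normal(size=(FS * 600, 3))        # 10 min of simulated x/y/z accelerometer data
labels = rng.integers(0, 2, size=FS * 600)  # per-sample FT (1) / NFT (0) placeholder labels

win = FS * WINDOW_S
feats, y = [], []
for i in range(len(acc) // win):
    seg = acc[i * win:(i + 1) * win]
    # Simple per-axis summary statistics as window features.
    feats.append(np.concatenate([seg.mean(0), seg.std(0), seg.min(0), seg.max(0)]))
    y.append(int(labels[i * win:(i + 1) * win].mean() > 0.5))  # majority label per window
X, y = np.array(feats), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    print(name, "F1:", round(f1_score(y_te, model.predict(X_te)), 2))
```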


2021 ◽  
Vol 42 (Supplement_1) ◽  
Author(s):  
M J Espinosa Pascual ◽  
P Vaquero Martinez ◽  
V Vaquero Martinez ◽  
J Lopez Pais ◽  
B Izquierdo Coronel ◽  
...  

Abstract Introduction Out of all patients admitted with Myocardial Infarction, 10 to 15% have Myocardial Infarction with Non-Obstructive Coronary Arteries (MINOCA). Classification algorithms based on deep learning substantially outperform traditional diagnostic algorithms. Numerous machine learning models have therefore been proposed as useful tools for the detection of various pathologies, but to date no study has proposed a diagnostic algorithm for MINOCA. Purpose The aim of this study was to estimate the diagnostic accuracy of several automated learning algorithms (Support-Vector Machine [SVM], Random Forest [RF] and Logistic Regression [LR]) in discriminating patients suffering from MINOCA from those with Myocardial Infarction with Obstructive Coronary Artery Disease (MICAD) at the time of admission, before performing a coronary angiography, whether invasive or not. Methods A diagnostic test evaluation study was carried out applying the proposed algorithms to a database of 553 consecutive patients admitted to our hospital with Myocardial Infarction. According to the definitions of the 2016 ESC Position Paper on MINOCA, patients were classified into two groups: MICAD and MINOCA. Of the 553 patients, 214 were discarded due to the lack of complete data. The set of machine learning algorithms was trained on 244 patients (training sample: 75%) and tested on 80 patients (test sample: 25%). A total of 64 variables were available for each patient, including demographic, clinical, and laboratory features before the angiographic procedure. Finally, the diagnostic precision of each architecture was assessed. Results The most accurate classification model was the Random Forest algorithm (Specificity [Sp] 0.88, Sensitivity [Se] 0.57, Negative Predictive Value [NPV] 0.93, Area Under the Curve [AUC] 0.85 [CI 0.83–0.88]), followed by standard Logistic Regression (Sp 0.76, Se 0.57, NPV 0.92, AUC 0.74) and Support-Vector Machine (Sp 0.84, Se 0.38, NPV 0.90, AUC 0.78). The variables that contributed the most to discriminating MINOCA from MICAD were the traditional cardiovascular risk factors, biomarkers of myocardial injury, hemoglobin, and gender. Results were similar when the 19 patients with Takotsubo syndrome were excluded from the analysis. Conclusion A prediction system for diagnosing MINOCA before performing coronary angiography was developed using machine learning algorithms. The results show higher accuracy in diagnosing MINOCA than conventional statistical methods. This study supports the potential of machine learning algorithms in clinical cardiology; however, further studies are required to validate our results. Funding Acknowledgement Type of funding sources: None. Figure: ROC curves of the different algorithms.
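
The discrimination step could be sketched as follows: a 75/25 split, a random forest classifier, test-set sensitivity, specificity, NPV, and AUC, and permutation importance (a swapped-in technique, not necessarily the authors' method) to see which admission variables drive the MINOCA vs. MICAD separation. The variable names and data are placeholders, not the study's 64 variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical admission variables; y = 1 for MINOCA, 0 for MICAD (placeholder labels).
cols = ["age", "hypertension", "diabetes", "troponin", "hemoglobin", "sex"]
X, y = make_classification(n_samples=339, n_features=len(cols), weights=[0.8], random_state=0)
X = pd.DataFrame(X, columns=cols)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
probs = rf.predict_proba(X_te)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_te, probs > 0.5).ravel()
print(f"Se {tp/(tp+fn):.2f}  Sp {tn/(tn+fp):.2f}  NPV {tn/(tn+fn):.2f}  "
      f"AUC {roc_auc_score(y_te, probs):.2f}")

# Which (placeholder) variables contribute most to the discrimination on the test set.
imp = permutation_importance(rf, X_te, y_te, scoring="roc_auc", n_repeats=20, random_state=0)
print(pd.Series(imp.importances_mean, index=cols).sort_values(ascending=False))
```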


Author(s):  
Elizabeth Ford ◽  
Philip Rooney ◽  
Seb Oliver ◽  
Richard Hoile ◽  
Peter Hurley ◽  
...  

Abstract Background Identifying dementia early, using real-world data, is a public health challenge. As only two-thirds of people with dementia currently receive a formal diagnosis in United Kingdom health systems, and many receive it late in the disease process, there is ample room for improvement. The policy of the UK government and National Health Service (NHS) is to increase rates of timely dementia diagnosis. We used data from general practice (GP) patient records to create a machine-learning model to identify patients who have or who are developing dementia but are currently undetected as having the condition by the GP. Methods We used electronic patient records from the Clinical Practice Research Datalink (CPRD). Using a case-control design, we selected patients aged >65 y with a diagnosis of dementia (cases) and matched them 1:1 by sex and age to patients with no evidence of dementia (controls). We developed a list of 70 clinical entities related to the onset of dementia and recorded in the 5 years before diagnosis. After creating binary features, we trialled machine learning classifiers to discriminate between cases and controls (logistic regression, naïve Bayes, support vector machines, random forest and neural networks). We examined the most important features contributing to discrimination. Results The final analysis included data on 93,120 patients, with a median age of 82.6 years; 64.8% were female. The naïve Bayes model performed least well. The logistic regression, support vector machine, neural network and random forest performed very similarly, with an AUROC of 0.74. The top features retained in the logistic regression model were disorientation and wandering, behaviour change, schizophrenia, self-neglect, and difficulty managing. Conclusions Our model could aid GPs or health service planners with the early detection of dementia. Future work could improve the model by exploring the longitudinal nature of patient data and modelling decline in function over time.
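
A compact sketch of the final modelling step is shown below: binary clinical-code features for matched cases and controls, a logistic regression classifier, and the most influential features read from the model coefficients. The feature names and simulated records are illustrative, not the study's CPRD code lists.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-ins for a few of the 70 clinical entities, recorded as binary features.
features = ["disorientation_wandering", "behaviour_change", "self_neglect",
            "difficulty_managing", "depression", "falls"]
X = pd.DataFrame(rng.integers(0, 2, size=(10000, len(features))), columns=features)
y = rng.integers(0, 2, size=10000)  # 1 = dementia case, 0 = matched control (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", round(roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]), 3))

# Largest positive coefficients = features most associated with being a case.
coefs = pd.Series(lr.coef_[0], index=features).sort_values(ascending=False)
print(coefs.head(5))
```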


Author(s):  
Marina Azer ◽  
Mohamed Taha ◽  
Hala H. Zayed ◽  
Mahmoud Gadallah

Social media is a crucial part of our lives and is now considered a more important source of information than traditional sources. Twitter has become one of the most prevalent social sites for exchanging viewpoints and feelings. This work proposes a supervised machine learning system for discovering false news. One of the challenges in credibility detection is finding new features that are the most predictive and yield better-performing classifiers. Both content-based features and user-based features are used. The importance of the features and their impact on performance are examined, and the reasons for choosing the final feature set using the k-best method are explained. Seven supervised machine learning classifiers are used: Naïve Bayes (NB), Support Vector Machine (SVM), K-nearest neighbors (KNN), Logistic Regression (LR), Random Forest (RF), Maximum Entropy (ME), and conditional random forest (CRF). Training and testing were conducted using the Pheme dataset. A feature analysis is presented, comparing user-based features with content-based features as the decisive factors in determining validity. Random Forest shows the highest performance when using user-based features only (accuracy 82.2%) and when using a mixture of both feature types; the best overall result was achieved with the Random Forest classifier using both types of features (accuracy 83.4%). In contrast, Logistic Regression was the best classifier when using content-based features only. Performance is measured with accuracy, precision, recall, and F1-score. We compared our feature set with those of other studies and examined the impact of our new features, finding a substantial improvement in discovering and verifying false news compared with existing results.
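
The k-best feature-selection step can be sketched as below: content-based and user-based features are combined, the most predictive ones are kept with SelectKBest (using an ANOVA F-score, an assumption, since the abstract does not name the scoring function), and Random Forest is compared with Logistic Regression. The feature names and data are illustrative placeholders, not the Pheme features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
content_cols = ["word_count", "exclamation_marks", "has_url", "sentiment"]
user_cols = ["followers", "account_age_days", "is_verified", "tweet_count"]
X = pd.DataFrame(rng.normal(size=(5000, 8)), columns=content_cols + user_cols)
y = rng.integers(0, 2, size=5000)  # 1 = false news, 0 = true news (placeholder labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    # k-best screening before the classifier, keeping the 5 strongest features.
    model = make_pipeline(SelectKBest(f_classif, k=5), clf)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"{name}: accuracy {accuracy_score(y_te, pred):.3f}, F1 {f1_score(y_te, pred):.3f}")
```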


Author(s):  
Nelson Yego ◽  
Juma Kasozi ◽  
Joseph Nkrunziza

The role of insurance in financial inclusion, as well as in economic growth, is immense. However, low uptake seems to impede the growth of the sector, hence the need for a model that robustly predicts uptake of insurance among potential clients. In this research, we compared the performance of eight (8) machine learning models in predicting the uptake of insurance. The classifiers considered were Logistic Regression, Gaussian Naive Bayes, Support Vector Machines, K Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machines, and Extreme Gradient Boosting. The data used in the classification were from the 2016 Kenya FinAccess Household Survey. Comparison of performance was done for both upsampled and downsampled data due to data imbalance. For upsampled data, the Random Forest classifier showed the highest accuracy and precision compared to the other classifiers, but for downsampled data, gradient boosting was optimal. It is noteworthy that for both upsampled and downsampled data, tree-based classifiers were more robust than the others in insurance uptake prediction. However, in spite of hyper-parameter optimization, the area under the receiver operating characteristic curve remained highest for Random Forest compared to the other tree-based models. Also, the confusion matrix for Random Forest showed the fewest false positives and the most true positives; hence it could be construed as the most robust model for predicting insurance uptake. Finally, the most important feature in predicting uptake was having a bank product; hence bancassurance could be said to be a plausible channel for the distribution of insurance products.
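
The imbalance-handling comparison might be sketched as follows: the training set is either upsampled (minority class) or downsampled (majority class) with sklearn.utils.resample, and two tree-based classifiers are compared by ROC AUC on an untouched test set. The survey features and uptake labels are simulated placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Placeholder household-survey features with a rare insurance-uptake outcome.
X, y = make_classification(n_samples=8000, n_features=25, weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

def rebalance(X, y, upsample=True):
    """Resample the minority (upsample) or majority (downsample) class in the training set."""
    X_maj, y_maj = X[y == 0], y[y == 0]
    X_min, y_min = X[y == 1], y[y == 1]
    if upsample:
        X_min, y_min = resample(X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0)
    else:
        X_maj, y_maj = resample(X_maj, y_maj, replace=False, n_samples=len(y_min), random_state=0)
    return np.vstack([X_maj, X_min]), np.concatenate([y_maj, y_min])

for upsample in (True, False):
    X_bal, y_bal = rebalance(X_tr, y_tr, upsample=upsample)
    for name, model in [("random forest", RandomForestClassifier(n_estimators=300, random_state=0)),
                        ("gradient boosting", GradientBoostingClassifier(random_state=0))]:
        model.fit(X_bal, y_bal)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(f"{'upsampled' if upsample else 'downsampled'} {name}: AUC {auc:.3f}")
```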

