A framework for effective application of machine learning to microbiome-based classification problems

2019 ◽  
Author(s):  
Begüm D. Topçuoğlu ◽  
Nicholas A. Lesniak ◽  
Mack Ruffin ◽  
Jenna Wiens ◽  
Patrick D. Schloss

Abstract Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made towards developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs; n=490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, decision trees, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an AUROC of 0.695 [IQR 0.651-0.739] but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 [IQR 0.625-0.735], trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability. Importance Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely over-optimistic. Moreover, there is a trend towards using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step towards developing more reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.

mBio ◽  
2020 ◽  
Vol 11 (3) ◽  
Author(s):  
Begüm D. Topçuoğlu ◽  
Nicholas A. Lesniak ◽  
Mack T. Ruffin ◽  
Jenna Wiens ◽  
Patrick D. Schloss

ABSTRACT Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, there appears to be a preference by many researchers to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability. IMPORTANCE Diagnosing diseases using machine learning (ML) is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible ML practices in applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
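
A minimal sketch, not the authors' published pipeline, of the kind of comparison this abstract describes: cross-validated AUROC for L2-regularized logistic regression versus random forest, assuming scikit-learn and a hypothetical OTU relative-abundance matrix X with a binary SRN label y.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((490, 200))          # placeholder: 490 samples x 200 OTU abundances
y = rng.integers(0, 2, size=490)    # placeholder: 0 = control, 1 = SRN case

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "L2 logistic regression": LogisticRegression(penalty="l2", C=1.0, max_iter=5000),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in models.items():
    # AUROC per held-out fold; the median mirrors how the abstract reports results
    aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: median AUROC = {np.median(aucs):.3f}")
```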


2021 ◽  
Author(s):  
Chen Bai ◽  
Yu-Peng Chen ◽  
Adam Wolach ◽  
Lisa Anthony ◽  
Mamoun Mardini

BACKGROUND Frequent spontaneous facial self-touches, predominantly during outbreaks, have the theoretical potential to be a mechanism of contracting and transmitting diseases. Despite the recent advent of vaccines, behavioral approaches remain an integral part of reducing the spread of COVID-19 and other respiratory illnesses. Real-time biofeedback of face touching can potentially mitigate the spread of respiratory diseases. The gap addressed in this study is the lack of an on-demand platform that utilizes motion data from smartwatches to accurately detect face touching. OBJECTIVE The aim of this study was to leverage the functionality and widespread adoption of smartwatches to develop a smartwatch application that identifies motion signatures accurately mapped to face touching. METHODS Participants (n=10, 50% women, aged 20-83 years) performed 10 physical activities classified into two categories, face touching (FT) and non-face touching (NFT), in a standardized laboratory setting. We developed a smartwatch application for the Samsung Galaxy Watch to collect raw accelerometer data from participants. Data features were then extracted from consecutive non-overlapping windows varying from 2 to 16 seconds. We examined the performance of state-of-the-art machine learning methods for face-touching movement recognition (FT vs NFT) and individual activity recognition (IAR): logistic regression, support vector machine, decision trees and random forest. RESULTS Machine learning models were accurate in recognizing face-touching categories; logistic regression achieved the best performance across all metrics (Accuracy: 0.93 +/- 0.08, Recall: 0.89 +/- 0.16, Precision: 0.93 +/- 0.08, F1-score: 0.90 +/- 0.11, AUC: 0.95 +/- 0.07) at a window size of 5 seconds. IAR models resulted in lower performance; the random forest classifier achieved the best performance across all metrics (Accuracy: 0.70 +/- 0.14, Recall: 0.70 +/- 0.14, Precision: 0.70 +/- 0.16, F1-score: 0.67 +/- 0.15) at a window size of 9 seconds. CONCLUSIONS Wearable devices, powered by machine learning, are effective in detecting face touches. This is highly significant during respiratory infection outbreaks, as it has great potential to discourage people from touching their faces and thereby mitigate the transmission of COVID-19 and future respiratory diseases.
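
A minimal sketch of the windowing step described above, under the assumption of 3-axis accelerometer samples at a hypothetical 50 Hz rate: the signal is split into non-overlapping windows (here 5 s) and simple per-axis summary features are extracted, as one might do before fitting a classifier. The sampling rate and feature choices are illustrative, not taken from the study.

```python
import numpy as np

def window_features(acc, fs=50, window_s=5):
    """acc: array of shape (n_samples, 3); returns one feature row per window."""
    win = fs * window_s
    n_windows = len(acc) // win
    feats = []
    for i in range(n_windows):
        seg = acc[i * win:(i + 1) * win]
        # mean, std, min, max per axis -> 12 features per window
        feats.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0),
                                     seg.min(axis=0), seg.max(axis=0)]))
    return np.asarray(feats)

acc = np.random.default_rng(1).normal(size=(50 * 60, 3))  # placeholder: 1 min of data
print(window_features(acc).shape)  # (12, 12): 12 windows x 12 features
```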


2021 ◽  
Vol 42 (Supplement_1) ◽  
Author(s):  
M J Espinosa Pascual ◽  
P Vaquero Martinez ◽  
V Vaquero Martinez ◽  
J Lopez Pais ◽  
B Izquierdo Coronel ◽  
...  

Abstract Introduction Out of all patients admitted with Myocardial Infarction, 10 to 15% have Myocardial Infarction with Non-Obstructive Coronary Arteries (MINOCA). Classification algorithms based on deep learning substantially exceed traditional diagnostic algorithms. Therefore, numerous machine learning models have been proposed as useful tools for the detection of various pathologies, but to date no study has proposed a diagnostic algorithm for MINOCA. Purpose The aim of this study was to estimate the diagnostic accuracy of several automated learning algorithms (Support-Vector Machine [SVM], Random Forest [RF] and Logistic Regression [LR]) in discriminating patients suffering from MINOCA from those with Myocardial Infarction with Obstructive Coronary Artery Disease (MICAD) at the time of admission and before performing a coronary angiography, whether invasive or not. Methods A Diagnostic Test Evaluation study was carried out applying the proposed algorithms to a database of 553 consecutive patients admitted to our Hospital with Myocardial Infarction. According to the definitions of the 2016 ESC Position Paper on MINOCA, patients were classified into two groups: MICAD and MINOCA. Out of the total 553 patients, 214 were discarded due to the lack of complete data. The set of machine learning algorithms was trained on 244 patients (training sample: 75%) and tested on 80 patients (test sample: 25%). A total of 64 variables were available for each patient, including demographic, clinical and laboratory features before the angiographic procedure. Finally, the diagnostic precision of each architecture was assessed. Results The most accurate classification model was the Random Forest algorithm (Specificity [Sp] 0.88, Sensitivity [Se] 0.57, Negative Predictive Value [NPV] 0.93, Area Under the Curve [AUC] 0.85 [CI 0.83–0.88]), followed by standard Logistic Regression (Sp 0.76, Se 0.57, NPV 0.92, AUC 0.74) and the Support-Vector Machine (Sp 0.84, Se 0.38, NPV 0.90, AUC 0.78) (see figure). The variables that contributed the most to discriminating MINOCA from MICAD were the traditional cardiovascular risk factors, biomarkers of myocardial injury, hemoglobin and gender. Results were similar when the 19 patients with Takotsubo syndrome were excluded from the analysis. Conclusion A prediction system for diagnosing MINOCA before performing coronary angiographies was developed using machine learning algorithms. The results show higher accuracy in diagnosing MINOCA than conventional statistical methods. This study supports the potential of machine learning algorithms in clinical cardiology. However, further studies are required in order to validate our results. Funding Acknowledgement Type of funding sources: None. Figure: ROC curves of different algorithms.
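
A minimal sketch, assuming scikit-learn and placeholder tabular data, of how the metrics reported above (specificity, sensitivity, NPV, AUC) could be computed for a random forest on a 75/25 train/test split with a binary MINOCA (1) vs MICAD (0) label; it is not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((339, 64))             # placeholder: 64 pre-angiography variables
y = rng.integers(0, 2, size=339)      # placeholder: 1 = MINOCA, 0 = MICAD

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))
print("NPV:        ", tn / (tn + fn))
print("AUC:        ", roc_auc_score(y_te, proba))
```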


Author(s):  
Elizabeth Ford ◽  
Philip Rooney ◽  
Seb Oliver ◽  
Richard Hoile ◽  
Peter Hurley ◽  
...  

Abstract Background Identifying dementia early, using real-world data, is a public health challenge. As only two-thirds of people with dementia currently receive a formal diagnosis in United Kingdom health systems, and many receive it late in the disease process, there is ample room for improvement. The policy of the UK government and National Health Service (NHS) is to increase rates of timely dementia diagnosis. We used data from general practice (GP) patient records to create a machine-learning model to identify patients who have or are developing dementia but are currently undetected as having the condition by the GP. Methods We used electronic patient records from the Clinical Practice Research Datalink (CPRD). Using a case-control design, we selected patients aged >65 y with a diagnosis of dementia (cases) and matched them 1:1 by sex and age to patients with no evidence of dementia (controls). We developed a list of 70 clinical entities related to the onset of dementia and recorded in the 5 years before diagnosis. After creating binary features, we trialled machine learning classifiers to discriminate between cases and controls (logistic regression, naïve Bayes, support vector machines, random forest and neural networks). We examined the most important features contributing to discrimination. Results The final analysis included data on 93,120 patients, with a median age of 82.6 years; 64.8% were female. The naïve Bayes model performed least well. The logistic regression, support vector machine, neural network and random forest performed very similarly, with an AUROC of 0.74. The top features retained in the logistic regression model were disorientation and wandering, behaviour change, schizophrenia, self-neglect, and difficulty managing. Conclusions Our model could aid GPs or health service planners with the early detection of dementia. Future work could improve the model by exploring the longitudinal nature of patient data and modelling decline in function over time.
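
A minimal sketch, assuming scikit-learn, of the kind of feature inspection the abstract describes: fit a logistic regression on binary clinical-code features and list the features with the largest positive coefficients. The feature names and data are placeholders, not the CPRD variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = [f"code_{i}" for i in range(70)]   # placeholder: 70 clinical entities
X = rng.integers(0, 2, size=(1000, 70))            # placeholder binary features
y = rng.integers(0, 2, size=1000)                  # placeholder case/control label

model = LogisticRegression(max_iter=5000).fit(X, y)
top = np.argsort(model.coef_[0])[::-1][:5]         # 5 largest positive coefficients
for idx in top:
    print(feature_names[idx], round(model.coef_[0][idx], 3))
```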


Author(s):  
Marina Azer ◽  
◽  
Mohamed Taha ◽  
Hala H. Zayed ◽  
Mahmoud Gadallah

Social media presence is a crucial part of our life, and social media are considered a more important source of information than traditional sources. Twitter has become one of the prevalent social sites for exchanging viewpoints and feelings. This work proposes a supervised machine learning system for discovering false news. One of the credibility detection problems is finding new features that are most predictive and lead to better-performing classifiers. Both features based on news content and features based on the user are used. The importance of the features and their impact on performance are examined, and the reasons for choosing the final feature set using the k-best method are explained. Seven supervised machine learning classifiers are used: Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest (RF), Maximum Entropy (ME), and conditional random forest (CRF). Training and testing were conducted using the Pheme dataset. An analysis of the features is introduced, comparing them to the content-based features as the decisive factors in determining validity. Random forest shows the highest performance when using user-based features only and when using a mixture of both feature types, with an accuracy of 82.2% using user-based features only. We achieved the highest results by using both types of features with the random forest classifier (accuracy 83.4%). In contrast, logistic regression was the best when using content-based features only. Performance is measured by accuracy, precision, recall, and F1-score. We compared our feature set with those of other studies and assessed the impact of our new features. We found that our feature set yields a substantial improvement in the discovery and verification of false news compared to previously reported results.
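
A minimal sketch, assuming scikit-learn, of the k-best feature-selection step mentioned above: keep the k features with the strongest univariate association with the label before training a classifier. The feature matrix, label, and k value are placeholders.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((500, 40))            # placeholder content- and user-based features (non-negative, as chi2 requires)
y = rng.integers(0, 2, size=500)     # placeholder: 1 = false news, 0 = true news

pipe = make_pipeline(SelectKBest(chi2, k=15),
                     RandomForestClassifier(n_estimators=200, random_state=0))
pipe.fit(X, y)
# indices of the 15 retained features
print("selected feature indices:", pipe[0].get_support(indices=True))
```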


2022 ◽  
Vol 12 ◽  
Author(s):  
Shaowu Lin ◽  
Yafei Wu ◽  
Ya Fang

Background Depression is highly prevalent and considered the most common psychiatric disorder among home-based elderly people, yet studies on forecasting depression risk in the elderly are still limited. In an endeavor to improve the accuracy of depression forecasting, machine learning (ML) approaches have been recommended in addition to more traditional regression approaches. Methods A prospective study was employed in home-based elderly Chinese, using baseline (2011) and follow-up (2013) data of the China Health and Retirement Longitudinal Study (CHARLS), a nationally representative cohort study. We compared four algorithms: three regression-based models (logistic regression, lasso, ridge) and one ML method (random forest). Model performance was assessed using repeated nested 10-fold cross-validation. As the main measure of predictive performance, we used the area under the receiver operating characteristic curve (AUC). Results The mean AUCs of the four predictive models, logistic regression, lasso, ridge, and random forest, were 0.795, 0.794, 0.794, and 0.769, respectively. The main determinants were life satisfaction, self-reported memory, cognitive ability, ADL (activities of daily living) impairment, and CESD-10 score. Life satisfaction increased the odds ratio of a future depression by 128.6% (logistic), 13.8% (lasso), and 13.2% (ridge), and cognitive ability was the most important predictor in random forest. Conclusions The three regression-based models and the one ML algorithm performed equally well in differentiating between a future depression case and a non-depression case in home-based elderly people. When choosing a model, however, different considerations, such as ease of use, might in some instances lead to one model being prioritized over another.
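
A minimal sketch, assuming scikit-learn and placeholder data, of repeated nested 10-fold cross-validation with AUC as the outer metric: the inner loop tunes the regularization strength (here for a lasso-penalized logistic regression), and the outer loop estimates performance. It illustrates the validation scheme named above, not the study's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_score)

rng = np.random.default_rng(0)
X = rng.random((800, 30))            # placeholder predictors
y = rng.integers(0, 2, size=800)     # placeholder depression outcome

inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)

lasso = GridSearchCV(LogisticRegression(penalty="l1", solver="liblinear"),
                     {"C": [0.01, 0.1, 1, 10]}, cv=inner, scoring="roc_auc")
aucs = cross_val_score(lasso, X, y, cv=outer, scoring="roc_auc")
print("mean AUC over repeated outer folds:", round(aucs.mean(), 3))
```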


MATICS ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 21-27
Author(s):  
Via Ardianto Nugroho ◽  
Derry Pramono Adi ◽  
Achmad Teguh Wibowo ◽  
MY Teguh Sulistyono ◽  
Agustinus Bimo Gumelar

In the container terminal services industry, Terminal Nilam is a customer of PT. BIMA, which specializes in heavy-equipment repair and maintenance services. The terminal is the central site for domestic container loading and unloading and has four container cranes serving two ships. The maintenance process for heavy equipment such as the container cranes in operation has so far paid little attention to data on grouping or classifying the type of maintenance the equipment needs. Over time, the equipment may perform below its potential and may even lead to workplace accidents. In addition, neglecting container crane maintenance can inflate the cost of subsequent repairs. Loading and unloading production targets can fall short, and delays to ship berthing schedules become very likely. Machine Learning (ML) can readily eliminate these possibilities. In this study, we designed the ML workflow to identify and then classify the appropriate type of container crane maintenance, namely light or heavy. The ML methods chosen for this study are Random Forest, Support Vector Machine, k-Nearest Neighbor, Naïve Bayes, Logistic Regression, J48, and Decision Tree. This study demonstrates the success of tree-based ML models in learning from container crane maintenance data (numeric and categorical), with J48 showing the best performance, reaching accuracy and ROC-AUC values of 99.1%. The classification is based on the date of the last maintenance, hour meter, breakdowns, shutdowns, and spare parts.
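
A minimal sketch, using scikit-learn rather than the Weka J48 implementation named above, of a decision tree trained on mixed numeric and categorical maintenance records, with the categorical columns one-hot encoded. All column names and rows are placeholders, not the study's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({
    "hour_meter": [1200, 3400, 560, 4100],               # placeholder numeric column
    "breakdown":  ["no", "yes", "no", "yes"],             # placeholder categorical columns
    "shutdown":   ["no", "no", "yes", "yes"],
    "label":      ["light", "heavy", "light", "heavy"],   # maintenance class
})
X, y = df.drop(columns="label"), df["label"]

pre = ColumnTransformer([("cat", OneHotEncoder(), ["breakdown", "shutdown"])],
                        remainder="passthrough")
model = make_pipeline(pre, DecisionTreeClassifier(random_state=0)).fit(X, y)
print(model.predict(X))
```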


2021 ◽  
Vol 2096 (1) ◽  
pp. 012190
Author(s):  
E V Bunyaeva ◽  
I V Kuznetsov ◽  
Y V Ponomarchuk ◽  
P S Timosh

Abstract The paper presents the results of a comparative analysis of machine learning methods used for gesture recognition based on single-channel surface electromyography (sEMG) data. The data were processed using a multilayer perceptron, a support vector machine, a decision tree ensemble (Random Forest) and logistic regression for the four chosen gesture types. Conclusions on the efficiency of these methods were drawn using commonly recommended accuracy metrics.
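
A minimal sketch, assuming scikit-learn, of the kind of comparison described above: the four classifier families evaluated with cross-validated accuracy on placeholder features extracted from single-channel sEMG windows, for four gesture classes. The feature dimensions and data are illustrative only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((400, 20))            # placeholder sEMG window features
y = rng.integers(0, 4, size=400)     # placeholder: four gesture classes

models = {
    "Multilayer perceptron": MLPClassifier(max_iter=2000, random_state=0),
    "Support vector machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=2000),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: accuracy = {acc:.3f}")
```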


Author(s):  
Kristiawan Kristiawan ◽  
Andreas Widjaja

Abstract — The application of machine learning technology in various industrial fields is currently developing rapidly, including in the retail industry. This study aims to find the most accurate algorithmic model that can help retailers choose a store location more precisely. Several methods, such as Pearson Correlation, Chi-Square Features, Recursive Feature Elimination and Tree-based selection, are used to select features (predictive variables). These features are then used to train and build models using six different classification algorithms, namely Logistic Regression, K Nearest Neighbor (KNN), Decision Tree, Random Forest, Support Vector Machine (SVM) and Neural Network, to classify whether or not a location is recommended as a new store location. Keywords — Application of Machine Learning, Pearson Correlation, Random Forest, Neural Network, Logistic Regression.
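
A minimal sketch, assuming scikit-learn, of one of the feature-selection steps named above (Recursive Feature Elimination) feeding a downstream classifier. The location attributes, labels, and number of retained features are placeholders, not the study's variables.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((300, 25))            # placeholder location attributes
y = rng.integers(0, 2, size=300)     # placeholder: 1 = recommended location

# RFE drops the weakest features iteratively, here down to 10, before the classifier
selector = RFE(LogisticRegression(max_iter=2000), n_features_to_select=10)
model = make_pipeline(selector, RandomForestClassifier(random_state=0)).fit(X, y)
print("kept feature indices:", model[0].get_support(indices=True))
```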


2020 ◽  
Vol 8 (5) ◽  
pp. 5353-5362

Background/Aim: Prostate cancer is regarded as the most prevalent cancer in the world and the main cause of deaths worldwide. Early strategies for estimating prostate cancer risk help inform decisions about the progression expected in high-risk patients, which results in a reduction of their risks. Methods: In the proposed research, we considered a data set from Kaggle and performed pre-processing for missing values. Three missing values in the compactness attribute and two missing values in the fractal dimension attribute were replaced by the means of their column values. The performance of the diagnosis model is evaluated using classification accuracy, sensitivity and specificity analysis. This paper proposes a prediction model to predict whether a person has prostate cancer or not and to provide awareness or a diagnosis based on that. This is done by comparing the accuracies obtained by applying Support Vector Machine, Random Forest, Naive Bayes and logistic regression individually to the dataset taken from one region, in order to present an accurate model for predicting prostate cancer. Results: The machine learning algorithms under study were able to predict prostate cancer in patients with accuracy between 70% and 90%. Conclusions: Logistic Regression and Random Forest both achieved better accuracy (90%) when compared to the other machine-learning algorithms.
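
A minimal sketch, assuming scikit-learn and pandas, of the mean-imputation step described above: missing values in the compactness and fractal-dimension columns are replaced by their column means. The data frame and column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "compactness":       [0.12, np.nan, 0.15, np.nan, 0.11, 0.14],
    "fractal_dimension": [0.06, 0.07, np.nan, 0.05, np.nan, 0.06],
})
# replace NaNs in the selected columns with the column means
imputer = SimpleImputer(strategy="mean")
df[["compactness", "fractal_dimension"]] = imputer.fit_transform(
    df[["compactness", "fractal_dimension"]])
print(df)
```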

