Development and Validation of Interpretable Machine Learning Approaches for Early Identification of Stroke in Older, Community Dwellers (Preprint)

BACKGROUND Prediction of stroke based on individuals’ risk factors, especially for a first stroke event, is of great significance for primary prevention in high-risk populations. OBJECTIVE This study aimed to investigate the applicability of machine learning for predicting stroke onset in older adults compared with a statistical model. METHODS A total of 5960 participants consecutively surveyed from 2011 to 2013 in the China Health and Retirement Longitudinal Study were included for analysis. We constructed a traditional logistic regression (LR) model and two machine learning models, namely random forest (RF) and extreme gradient boosting (XGBoost), to identify stroke onset using epidemiological and clinical variables. Grid search and 10-fold cross-validation were used to tune hyperparameters. Model performance was assessed by discrimination, calibration, decision curve, and predictiveness curve analysis. RESULTS Among the 5960 participants, 131 (2.20%) developed stroke after an average of 2 years of follow-up. Our prediction models distinguished stroke from non-stroke with excellent performance. The AUCs of the machine learning models (RF, 0.823 [95% CI, 0.759-0.886]; XGBoost, 0.808 [95% CI, 0.730-0.886]) were significantly higher than that of LR (0.718 [95% CI, 0.649-0.787], p<0.05). No significant difference was observed between RF and XGBoost (p>0.05). All prediction models had good calibration results, with Brier scores of approximately 0.020. XGBoost had much higher net benefits within a wider threshold range and was more capable of recognizing high-risk individuals in terms of decision curve and predictiveness curve analysis. Biomarker information was more useful for stroke prediction than epidemiological data. CONCLUSIONS Machine learning, especially XGBoost, had the potential to predict stroke onset among the elderly in this population-based study.
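
For readers less familiar with the calibration metric in the abstract above: the Brier score is the mean squared difference between a predicted probability and the observed 0/1 outcome, so lower is better. The sketch below is illustrative only and is not the authors' code; the function name `brier_score` is made up here. It also computes the score of a constant prediction equal to the cohort's event rate (131 of 5960), which gives a reference point when reading scores in a low-prevalence cohort.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Cohort size and event count as stated in the abstract.
n_total, n_events = 5960, 131
prevalence = n_events / n_total

# Outcome vector with the stated number of events (order does not matter here).
outcomes = [1] * n_events + [0] * (n_total - n_events)

# Reference model: predict the event rate for everyone.
baseline = brier_score([prevalence] * n_total, outcomes)
print(f"event rate = {prevalence:.4f}, constant-prediction Brier score = {baseline:.4f}")
```

A constant prediction at event rate p scores p(1 - p), so Brier scores in rare-outcome cohorts are small in absolute terms and are usually read together with discrimination and decision-curve results, as the abstract does.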


Development of an absolute assignment predictor for triple-negative breast cancer subtyping using machine learning approaches

10.1101/2020.06.02.129544 ◽

2020 ◽

Author(s):

Fadoua Ben Azzouz ◽

Bertrand Michel ◽

Hamza Lasla ◽

Wilfried Gouraud ◽

Anne-Flore François ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Triple Negative Breast Cancer ◽

Cross Validation ◽

Triple Negative ◽

Prediction Models ◽

Gradient Boosting ◽

Learning Approaches ◽

Extreme Gradient Boosting ◽

Tnbc Subtype

Abstract Triple-negative breast cancer (TNBC) heterogeneity represents one of the main impediments to precision medicine for this disease. Recent concordant transcriptomics studies have shown that TNBC could be split into at least three subtypes with potential therapeutic implications. Although a few studies have attempted to predict TNBC subtype by means of transcriptomics data, subtyping was only partially sensitive and was limited by batch effects and dependence on a given dataset, which may hinder the switch to routine diagnostic testing. Therefore, we sought to build an absolute predictor (i.e., intra-patient diagnosis) based on a machine learning algorithm with a limited number of probes. To this end, we started by introducing probe binary comparisons for each patient (indicators). We based the predictive analysis on these transformed data. Probe selection was first performed by combining both filter and wrapper methods for variable selection using cross-validation. We then tested three prediction models (random forest, gradient boosting [GB], and extreme gradient boosting) using this optimal subset of indicators as inputs. Nested cross-validation allowed us to consistently choose the best model. Results showed that the 50 selected indicators highlighted biological characteristics associated with each TNBC subtype. The GB model based on this subset of indicators performed better than the other models.
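
The "probe binary comparison" idea in the abstract above can be pictured as follows. This is a minimal sketch under an assumption, not the authors' implementation: it assumes an indicator is 1 when one probe is expressed above another within the same patient. The probe names, values, and the helper `pairwise_indicators` are hypothetical.

```python
from itertools import combinations


def pairwise_indicators(expression):
    """Return 0/1 indicators for every probe pair within one patient.

    `expression` maps probe name -> expression value for a single sample.
    The indicator for pair (a, b) is 1 when a is expressed above b.
    """
    probes = sorted(expression)
    return {
        f"{a}>{b}": int(expression[a] > expression[b])
        for a, b in combinations(probes, 2)
    }


# Hypothetical probe names and values, for illustration only.
patient = {"probe_A": 5.0, "probe_B": 3.0, "probe_C": 4.0}
indicators = pairwise_indicators(patient)

# A batch effect modeled as a positive rescaling plus a shift keeps the ordering.
shifted = {name: 1.5 * value + 2.0 for name, value in patient.items()}
same_after_shift = pairwise_indicators(shifted) == indicators
print(indicators, same_after_shift)
```

Because the indicators depend only on the ordering of probes inside one sample, they are unchanged by any monotone increasing transformation applied to that sample, which is one way such features can reduce dependence on a particular dataset.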


Development and Validation of Interpretable Machine Learning for Stroke Occurrence in Older, Community Chinese Dwellers

10.21203/rs.3.rs-604690/v1 ◽

2021 ◽

Author(s):

Yafei Wu ◽

Zhongquan Jiang ◽

Shaowu Lin ◽

Ya Fang

Keyword(s):

Machine Learning ◽

Older Adults ◽

Logistic Regression ◽

High Risk ◽

Prediction Models ◽

Curve Analysis ◽

Learning Methods ◽

Machine Learning Methods ◽

Interpretable Machine Learning ◽

Extreme Gradient Boosting

Abstract Background: Prediction of stroke based on individuals’ risk factors, especially for a first stroke event, is of great significance for primary prevention in high-risk populations. Our study aimed to investigate the applicability of interpretable machine learning for predicting 2-year stroke occurrence in older adults compared with logistic regression. Methods: A total of 5960 participants consecutively surveyed from July 2011 to August 2013 in the China Health and Retirement Longitudinal Study were included for analysis. We constructed a traditional logistic regression (LR) model and two machine learning models, namely random forest (RF) and extreme gradient boosting (XGBoost), to distinguish stroke occurrence from non-stroke occurrence using data on demographics, lifestyle, disease history, and clinical variables. Grid search and 10-fold cross-validation were used to tune the hyperparameters. Model performance was assessed by discrimination, calibration, decision curve, and predictiveness curve analysis. Results: Among the 5960 participants, 131 (2.20%) developed stroke after an average of 2 years of follow-up. Our prediction models distinguished stroke occurrence from non-stroke occurrence with excellent performance. The AUCs of the machine learning methods (RF, 0.823 [95% CI, 0.759-0.886]; XGBoost, 0.808 [95% CI, 0.730-0.886]) were significantly higher than that of LR (0.718 [95% CI, 0.649-0.787], p<0.05). No significant difference was observed between RF and XGBoost (p>0.05). All prediction models had good calibration results, and the Brier scores were 0.022 (95% CI, 0.015-0.028) in LR, 0.019 (95% CI, 0.014-0.025) in RF, and 0.020 (95% CI, 0.015-0.026) in XGBoost. XGBoost had much higher net benefits within a wider threshold range in terms of decision curve analysis, and was more capable of recognizing high-risk individuals in terms of predictiveness curve analysis. A total of eight predictors, namely gender, waist-to-height ratio, dyslipidemia, glycated hemoglobin, white blood cell count, blood glucose, triglycerides, and low-density lipoprotein cholesterol, ranked in the top 5 across the three prediction models. Conclusions: Machine learning methods, especially XGBoost, had the potential to predict stroke occurrence in older adults compared with traditional logistic regression.
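
The "net benefit" that decision curve analysis plots can be written as TP/n - (FP/n) * pt / (1 - pt), where pt is the risk threshold at which a person would be treated as high risk. The sketch below shows that arithmetic on a tiny made-up sample; the data and the function name `net_benefit` are hypothetical and the code is not taken from the study.

```python
def net_benefit(probs, outcomes, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(outcomes)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical predicted risks and observed outcomes (1 = event).
probs = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05, 0.6, 0.4, 0.15, 0.02]
outcomes = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]

model_nb = net_benefit(probs, outcomes, threshold=0.5)
# "Treat all" reference strategy: everyone is flagged as high risk.
treat_all_nb = net_benefit([1.0] * len(outcomes), outcomes, threshold=0.5)
print(model_nb, treat_all_nb)
```

A decision curve repeats this calculation over a range of thresholds and compares the model against the "treat all" and "treat none" (net benefit 0) strategies.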


Predictive modeling for 14-day unplanned hospital readmission risk by using machine learning algorithms

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01639-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Yu-Tai Lo ◽

Jay Chie-hen Liao ◽

Mei-Hua Chen ◽

Chia-Ming Chang ◽

Cheng-Te Li

Keyword(s):

Machine Learning ◽

High Risk ◽

Prediction Models ◽

Transitional Care ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Unplanned Readmission ◽

Extreme Gradient Boosting ◽

Readmission Risk

Abstract Background Early unplanned hospital readmissions are associated with increased harm to patients, increased medical costs, and negative hospital reputation. With the identification of at-risk patients, a crucial step toward improving care, appropriate interventions can be adopted to prevent readmission. This study aimed to build machine learning models to predict 14-day unplanned readmissions. Methods We conducted a retrospective cohort study on 37,091 consecutive hospitalized adult patients with 55,933 discharges between September 1, 2018, and August 31, 2019, in a 1193-bed university hospital. Patients who were aged < 20 years, were admitted for cancer-related treatment, participated in a clinical trial, were discharged against medical advice, died during admission, or lived abroad were excluded. Predictors for analysis included 7 categories of variables extracted from the hospital’s medical record dataset. In total, four machine learning algorithms, namely logistic regression, random forest, extreme gradient boosting, and categorical boosting (CatBoost), were used to build classifiers for prediction. The performance of prediction models for 14-day unplanned readmission risk was evaluated using precision, recall, F1-score, area under the receiver operating characteristic curve (AUROC), and area under the precision–recall curve (AUPRC). Results In total, 24,722 patients were included for the analysis. The mean age of the cohort was 57.34 ± 18.13 years. The 14-day unplanned readmission rate was 1.22%. Among the 4 machine learning algorithms selected, CatBoost had the best average performance in fivefold cross-validation (precision: 0.9377, recall: 0.5333, F1-score: 0.6780, AUROC: 0.9903, and AUPRC: 0.7515). After incorporating the 21 most influential features in the CatBoost model, its performance improved (precision: 0.9470, recall: 0.5600, F1-score: 0.7010, AUROC: 0.9909, and AUPRC: 0.7711). Conclusions Our models reliably predicted 14-day unplanned readmissions and were explainable. They can be used to identify patients with a high risk of unplanned readmission based on influential features, particularly features related to diagnoses. The operation of the models with physiological indicators also corresponded to clinical experience and the literature. Identifying patients at high risk with these models can enable early discharge planning and transitional care to prevent readmissions. Further studies should include additional features that may enable greater sensitivity in identifying patients at risk of early unplanned readmissions.
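
As a reminder of how the threshold-based metrics in the abstract above relate to each other, the sketch below computes precision, recall, and F1 from confusion-matrix counts, and also recomputes F1 from the reported average precision and recall. The counts and function names are hypothetical. Recomputing from the averages gives roughly 0.680 rather than the reported 0.6780; one plausible reason, which the abstract does not confirm, is that F1 was averaged per fold rather than derived from the averaged precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a rare positive class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)

# F1 recomputed from the averaged precision and recall quoted in the abstract.
f1_of_means = f1_from(0.9377, 0.5333)
print(round(f1, 4), round(f1_of_means, 4))
```

With a readmission rate near 1%, precision and recall describe the positive class directly, which is why they and AUPRC are reported alongside AUROC.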


Importance of GWAS risk loci and clinical data in predicting asthma using machine-learning approaches

10.21203/rs.3.rs-21271/v1 ◽

2020 ◽

Author(s):

Si-Qiao Liang ◽

Jian-Xiong Long ◽

Jingmin Deng ◽

Xuan Wei ◽

Mei-Ling Yang ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Clinical Data ◽

Genome Wide Association Study ◽

Prediction Models ◽

Area Under The Curve ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Extreme Gradient Boosting

Abstract Asthma is a serious immune-mediated respiratory airway disease. Its pathological processes involve genetics and the environment but remain unclear. To understand the risk factors of asthma, we combined genome-wide association study (GWAS) risk loci and clinical data in predicting asthma using machine-learning approaches. A case–control study with 123 asthma patients and 100 healthy controls was conducted in the Zhuang population in Guangxi. GWAS risk loci were detected using polymerase chain reaction, and clinical data were collected. Machine-learning approaches (e.g., extreme gradient boosting [XGBoost], decision tree, support vector machine, and random forest algorithms) were used to identify the major factors that contributed to asthma. A total of 14 GWAS risk loci with clinical data were analyzed on the basis of 10 repetitions of 10-fold cross-validation for all machine-learning models. Using GWAS risk loci or clinical data alone, the best performances were area under the curve (AUC) values of 64.3% and 71.4%, respectively. Combining GWAS risk loci and clinical data, XGBoost established the best model with an AUC of 79.7%, indicating that the combination of genetics and clinical data can enable improved performance. We then ranked the features by importance and found that the top six risk factors for predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and body mass index. Asthma-prediction models based on GWAS risk loci and clinical data can accurately predict asthma and thus provide insights into the disease pathogenesis of asthma. Further research is required to evaluate more genetic markers and clinical data and predict asthma risk.
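
Repeating 10-fold cross-validation 10 times, as described in the abstract above, yields 100 train/test splits, with a fresh shuffle for each repetition. The generator below is a minimal sketch of that bookkeeping using only index lists; it is not the authors' pipeline, it does not stratify by case/control status (the abstract does not say whether the study did), and the name `repeated_kfold` and the seed are assumptions.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for each repetition
        for fold in range(n_folds):
            test = indices[fold::n_folds]  # every n_folds-th shuffled index
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


# 123 cases + 100 controls, as stated in the abstract.
splits = list(repeated_kfold(n_samples=223))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging a metric over all 100 splits reduces the variance that a single random partition would introduce, which matters for a sample of only 223 participants.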


Understanding Multi-Vehicle Collision Patterns on Freeways—A Machine Learning Approach

Infrastructures ◽

10.3390/infrastructures5080062 ◽

2020 ◽

Vol 5 (8) ◽

pp. 62

Author(s):

Clint Morris ◽

Jidong J. Yang

Keyword(s):

Machine Learning ◽

Statistical Methods ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Learning Approaches ◽

Crash Analysis ◽

Suitable Alternative ◽

Crash Data ◽

Extreme Gradient Boosting ◽

Modern Machine

Generating meaningful inferences from crash data is vital to improving highway safety. Classic statistical methods are fundamental to crash data analysis and are often valued for their interpretability. However, given the complexity of crash mechanisms and associated heterogeneity, classic statistical methods, which lack versatility, might not be sufficient for granular crash analysis because of the high-dimensional features involved in crash-related data. In contrast, machine learning approaches, which are more flexible in structure and capable of harnessing richer data sources available today, emerge as a suitable alternative. With the aid of new methods for model interpretation, complex machine learning models, previously considered enigmatic, can be properly interpreted. In this study, two modern machine learning techniques, Linear Discriminant Analysis and eXtreme Gradient Boosting, were explored to classify three major types of multi-vehicle crashes (i.e., rear-end, same-direction sideswipe, and angle) that occurred on Interstate 285 in Georgia. The study demonstrated the utility and versatility of modern machine learning methods in the context of crash analysis, particularly in understanding the potential features underlying different crash patterns on freeways.


Machine Learning-Based Three-Month Outcome Prediction in Acute Ischemic Stroke: A Single Cerebrovascular-Specialty Hospital Study in South Korea

Diagnostics ◽

10.3390/diagnostics11101909 ◽

2021 ◽

Vol 11 (10) ◽

pp. 1909

Author(s):

Dougho Park ◽

Eunhwan Jeong ◽

Haejong Kim ◽

Hae Wook Pyun ◽

Haemin Kim ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Ischemic Stroke ◽

Acute Ischemic Stroke ◽

Functional Outcome ◽

Outcome Prediction ◽

Prediction Models ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting

Background: Functional outcomes after acute ischemic stroke are of great concern to patients and their families, as well as physicians and surgeons who make the clinical decisions. We developed machine learning (ML)-based functional outcome prediction models in acute ischemic stroke. Methods: This retrospective study used a prospective cohort database. A total of 1066 patients with acute ischemic stroke between January 2019 and March 2021 were included. Variables such as demographic factors, stroke-related factors, laboratory findings, and comorbidities were utilized at the time of admission. Five ML algorithms were applied to predict a favorable functional outcome (modified Rankin Scale 0 or 1) at 3 months after stroke onset. Results: Regularized logistic regression showed the best performance with an area under the receiver operating characteristic curve (AUC) of 0.86. Support vector machines achieved the second-highest AUC of 0.85 with the highest F1-score of 0.86, and all ML models applied achieved an AUC > 0.8. The National Institutes of Health Stroke Scale at admission and age were consistently the top two important variables for generalized logistic regression, random forest, and extreme gradient boosting models. Conclusions: ML-based functional outcome prediction models for acute ischemic stroke were validated and proven to be readily applicable and useful.
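
The AUC values quoted above have a direct interpretation: the probability that a randomly chosen patient with the outcome receives a higher predicted score than a randomly chosen patient without it. The sketch below computes AUC that way by comparing every positive/negative pair. The scores, labels, and the function name `auc_pairwise` are hypothetical, and the quadratic loop is meant for clarity rather than speed.

```python
def auc_pairwise(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores; 1 = favorable outcome, 0 = otherwise.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]
auc = auc_pairwise(scores, labels)
print(round(auc, 4))
```

Here 11 of the 12 positive/negative pairs are ordered correctly; the one miss is the positive scored 0.4 against the negative scored 0.7.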


Gradient Boosting Machine Learning to Improve Satellite-Derived Column Water Vapor Measurement Error

10.5194/amt-2019-308 ◽

2019 ◽

Cited By ~ 1

Author(s):

Allan C. Just ◽

Yang Liu ◽

Meytar Sorek-Hamer ◽

Johnathan Rush ◽

Michael Dorman ◽

...

Keyword(s):

Machine Learning ◽

Water Vapor ◽

Measurement Error ◽

Earth Science ◽

Atmospheric Correction ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Learning Approaches ◽

Sensing Applications ◽

Extreme Gradient Boosting

Abstract. The atmospheric products of the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm include column water vapor (CWV) at 1 km resolution, derived from daily overpasses of NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra satellites. We have recently shown that machine learning using extreme gradient boosting (XGBoost) can improve the estimation of MAIAC aerosol optical depth (AOD). Although MAIAC CWV is generally well validated (Pearson’s R > 0.97 versus CWV from AERONET sun photometers), it has not yet been assessed whether machine-learning approaches can further improve CWV. Using a novel spatiotemporal cross-validation approach to avoid overfitting, our XGBoost model with nine features derived from land use terms, date, and ancillary variables from the MAIAC retrieval, quantifies and can correct a substantial portion of measurement error relative to collocated measures at AERONET sites (26.9 % and 16.5 % decrease in Root Mean Square Error (RMSE) for Terra and Aqua datasets, respectively) in the Northeastern USA, 2000–2015. We use machine-learning interpretation tools to illustrate complex patterns of measurement error and describe a positive bias in MAIAC Terra CWV worsening in recent summertime conditions. We validate our predictive model on MAIAC CWV estimates at independent stations from the SuomiNet GPS network where our corrections decrease the RMSE by 19.7 % and 9.5 % for Terra and Aqua MAIAC CWV. Empirically correcting for measurement error with machine-learning algorithms is a post-processing opportunity to improve satellite-derived CWV data for Earth science and remote sensing applications.
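
The percentage decreases in RMSE quoted above compare the error of the uncorrected retrieval with the error after the model's correction, both measured against a reference instrument. The sketch below shows that calculation on made-up numbers; the values and function names are hypothetical and do not come from the MAIAC or AERONET data.

```python
import math


def rmse(estimates, reference):
    """Root mean square error of estimates against reference measurements."""
    if len(estimates) != len(reference):
        raise ValueError("inputs must have the same length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return math.sqrt(sum(squared) / len(squared))


def percent_decrease(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical column water vapor values (cm) at collocated sites.
reference = [1.0, 2.0, 3.0, 4.0]
raw = [1.5, 2.5, 2.5, 4.5]            # uncorrected satellite estimates
corrected = [1.25, 2.25, 2.75, 4.25]  # estimates after an error correction

rmse_raw = rmse(raw, reference)
rmse_corrected = rmse(corrected, reference)
improvement = percent_decrease(rmse_raw, rmse_corrected)
print(rmse_raw, rmse_corrected, improvement)
```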


Gradient boosting machine learning to improve satellite-derived column water vapor measurement error

Atmospheric Measurement Techniques ◽

10.5194/amt-13-4669-2020 ◽

2020 ◽

Vol 13 (9) ◽

pp. 4669-4681

Author(s):

Allan C. Just ◽

Yang Liu ◽

Meytar Sorek-Hamer ◽

Johnathan Rush ◽

Michael Dorman ◽

...

Keyword(s):

Machine Learning ◽

Water Vapor ◽

Measurement Error ◽

Earth Science ◽

Atmospheric Correction ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Learning Approaches ◽

Sensing Applications ◽

Extreme Gradient Boosting

Abstract. The atmospheric products of the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm include column water vapor (CWV) at a 1 km resolution, derived from daily overpasses of NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra satellites. We have recently shown that machine learning using extreme gradient boosting (XGBoost) can improve the estimation of MAIAC aerosol optical depth (AOD). Although MAIAC CWV is generally well validated (Pearson's R > 0.97 versus CWV from AERONET sun photometers), it has not yet been assessed whether machine-learning approaches can further improve CWV. Using a novel spatiotemporal cross-validation approach to avoid overfitting, our XGBoost model, with nine features derived from land use terms, date, and ancillary variables from the MAIAC retrieval, quantifies and can correct a substantial portion of measurement error relative to collocated measurements at AERONET sites (26.9 % and 16.5 % decrease in root mean square error (RMSE) for Terra and Aqua datasets, respectively) in the Northeastern USA, 2000–2015. We use machine-learning interpretation tools to illustrate complex patterns of measurement error and describe a positive bias in MAIAC Terra CWV worsening in recent summertime conditions. We validate our predictive model on MAIAC CWV estimates at independent stations from the SuomiNet GPS network where our corrections decrease the RMSE by 19.7 % and 9.5 % for Terra and Aqua MAIAC CWV. Empirically correcting for measurement error with machine-learning algorithms is a postprocessing opportunity to improve satellite-derived CWV data for Earth science and remote sensing applications.
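
The abstract does not spell out its spatiotemporal cross-validation scheme. One common way to keep spatially clustered observations from leaking between training and test data is to hold out whole stations at a time, and the sketch below illustrates only that general idea. The station labels and the function name `leave_one_group_out` are hypothetical, and this should not be read as the authors' procedure.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one split per group."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


# Hypothetical station label for each observation row.
stations = ["site_1", "site_1", "site_2", "site_3", "site_2", "site_3", "site_1"]
splits = list(leave_one_group_out(stations))

for held_out, train, test in splits:
    print(held_out, "train rows:", train, "test rows:", test)
```

Because every observation from the held-out station is excluded from training, the test error reflects performance at locations the model has not seen, which is closer to how a correction would be used in practice.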


Prediction of Masked Hypertension and Masked Uncontrolled Hypertension Using Machine Learning

Frontiers in Cardiovascular Medicine ◽

10.3389/fcvm.2021.778306 ◽

2021 ◽

Vol 8 ◽

Author(s):

Ming-Hui Hung ◽

Ling-Chieh Shih ◽

Yu-Ching Wang ◽

Hsin-Bang Leu ◽

Po-Hsun Huang ◽

...

Keyword(s):

Machine Learning ◽

Clinical Characteristics ◽

Prediction Models ◽

External Validation ◽

Uncontrolled Hypertension ◽

Gradient Boosting ◽

Masked Hypertension ◽

Internal Validation ◽

Hypertensive Patients ◽

Extreme Gradient Boosting

Objective: This study aimed to develop machine learning-based prediction models to predict masked hypertension and masked uncontrolled hypertension using the clinical characteristics of patients at a single outpatient visit. Methods: Data were derived from two cohorts in Taiwan. The first cohort included 970 hypertensive patients recruited from six medical centers between 2004 and 2005, who were split into a training set (n = 679), a validation set (n = 146), and a test set (n = 145) for model development and internal validation. The second cohort included 416 hypertensive patients recruited from a single medical center between 2012 and 2020 and was used for external validation. We used 33 clinical characteristics as candidate variables to develop models based on logistic regression (LR), random forest (RF), eXtreme Gradient Boosting (XGboost), and artificial neural network (ANN). Results: The four models featured high sensitivity and high negative predictive value (NPV) in internal validation (sensitivity = 0.914–1.000; NPV = 0.853–1.000) and external validation (sensitivity = 0.950–1.000; NPV = 0.875–1.000). The RF, XGboost, and ANN models showed much higher area under the receiver operating characteristic curve (AUC) (0.799–0.851 in internal validation, 0.672–0.837 in external validation) than the LR model. Among the models, the RF model, composed of 6 predictor variables, had the best overall performance in both internal and external validation (AUC = 0.851 and 0.837; sensitivity = 1.000 and 1.000; specificity = 0.609 and 0.580; NPV = 1.000 and 1.000; accuracy = 0.766 and 0.721, respectively). Conclusion: An effective machine learning-based predictive model that requires data from a single clinic visit may help to identify masked hypertension and masked uncontrolled hypertension.
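
Sensitivity, specificity, NPV, and accuracy all derive from the same four confusion-matrix counts. The sketch below shows the relationships. The counts are illustrative values chosen here so that the derived metrics round to the RF model's reported internal-validation figures on a 145-patient set; the abstract itself does not report any counts, so they should not be read as the study's data. The function name `screening_metrics` is likewise made up.

```python
def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, negative predictive value and accuracy from counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are cleared
        "npv": tn / (tn + fn),          # share of negative calls that are correct
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Illustrative counts only (sum = 145); not taken from the paper.
metrics = screening_metrics(tp=58, fp=34, tn=53, fn=0)
rounded = {name: round(value, 3) for name, value in metrics.items()}
print(rounded)
```

A model tuned this way misses no cases (sensitivity and NPV of 1.000) at the cost of many false alarms, which suits a rule-out screening step before confirmatory out-of-office blood pressure measurement.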


Applying Deep Neural Networks and Ensemble Machine Learning Methods to Forecast Airborne Ambrosia Pollen

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph16111992 ◽

2019 ◽

Vol 16 (11) ◽

pp. 1992 ◽

Cited By ~ 6

Author(s):

Gebreab K. Zewdie ◽

David J. Lary ◽

Estelle Levetin ◽

Gemechu F. Garuma

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Land Surface ◽

Deep Neural Networks ◽

Airborne Pollen ◽

Training Data ◽

Gradient Boosting ◽

Learning Approaches ◽

Ambrosia Pollen ◽

Extreme Gradient Boosting

Allergies to airborne pollen are a significant issue affecting millions of Americans. Consequently, accurately predicting the daily concentration of airborne pollen is of significant public benefit in providing timely alerts. This study presents a method for the robust estimation of the concentration of airborne Ambrosia pollen using a suite of machine learning approaches including deep learning and ensemble learners. Each of these machine learning approaches utilizes data from the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric weather and land surface reanalysis. The machine learning approaches used for developing a suite of empirical models are deep neural networks, extreme gradient boosting, random forests, and Bayesian ridge regression. The training data, which comprise twenty-four years of daily pollen concentration measurements together with ECMWF weather and land surface reanalysis data from 1987 to 2011, are used to develop the machine learning predictive models. The last six years of the dataset, from 2012 to 2017, are used to independently test the performance of the machine learning models. The correlation coefficients between the estimated and actual pollen abundance for the independent validation datasets for the deep neural networks, random forest, extreme gradient boosting, and Bayesian ridge were 0.82, 0.81, 0.81, and 0.75, respectively, showing that machine learning can be used to effectively forecast the concentrations of airborne pollen.
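
The evaluation design described above has two parts: split the record by time so that the test years follow the training years, then report the correlation between predicted and observed values on the held-out years. The sketch below illustrates both steps on a handful of invented records. The field names, values, predictions, and function names are all hypothetical; no model is trained here.

```python
import math


def split_by_year(records, last_train_year):
    """Split records into training (<= last_train_year) and later test years."""
    train = [r for r in records if r["year"] <= last_train_year]
    test = [r for r in records if r["year"] > last_train_year]
    return train, test


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Invented pollen records; real inputs would be daily values.
records = [
    {"year": 2010, "pollen": 12.0},
    {"year": 2011, "pollen": 30.0},
    {"year": 2012, "pollen": 18.0},
    {"year": 2013, "pollen": 25.0},
    {"year": 2014, "pollen": 40.0},
]
train, test = split_by_year(records, last_train_year=2011)

observed = [r["pollen"] for r in test]
predicted = [20.0, 24.0, 37.0]  # stand-in for a fitted model's output
r_value = pearson_r(predicted, observed)
print(len(train), len(test), round(r_value, 3))
```

Splitting by year rather than at random keeps future observations out of the training set, so the reported correlation reflects forecasting skill on unseen seasons.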
