Development and Validation of Interpretable Machine Learning Approaches for Early Identification of Stroke in Older, Community Dwellers (Preprint)
BACKGROUND Predicting stroke from an individual's risk factors, especially a first stroke event, is of great significance for primary prevention in high-risk populations. OBJECTIVE This study aimed to investigate the applicability of machine learning for predicting stroke onset in older adults, compared with a conventional statistical model. METHODS A total of 5960 participants consecutively surveyed from 2011 to 2013 in the China Health and Retirement Longitudinal Study were included in the analysis. We built a traditional logistic regression (LR) model and two machine learning models, namely random forest (RF) and extreme gradient boosting (XGBoost), to predict stroke onset from epidemiological and clinical variables. Grid search with 10-fold cross-validation was used to tune hyperparameters. Model performance was assessed by discrimination, calibration, decision curve analysis, and predictiveness curve analysis. RESULTS Among the 5960 participants, 131 (2.20%) developed stroke over an average follow-up of 2 years. All three models discriminated between participants who did and did not develop stroke. The AUCs of the machine learning models (RF, 0.823 [95% CI, 0.759-0.886]; XGBoost, 0.808 [95% CI, 0.730-0.886]) were significantly higher than that of LR (0.718 [95% CI, 0.649-0.787]; p<0.05). No significant difference was observed between RF and XGBoost (p>0.05). All prediction models were well calibrated, with Brier scores of approximately 0.020. In the decision curve and predictiveness curve analyses, XGBoost yielded higher net benefit across a wider range of threshold probabilities and was better able to identify high-risk individuals. Biomarker information was more informative for stroke prediction than epidemiological data. CONCLUSIONS Machine learning, especially XGBoost, showed potential for predicting stroke onset among older adults in this population-based study.
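As a rough illustration of two of the performance measures named in the abstract (the Brier score for calibration and net benefit for decision curve analysis), the following minimal Python sketch applies their standard textbook definitions. The function names and the toy data are hypothetical and are not taken from the study.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of flagging everyone whose predicted risk >= threshold.

    Standard decision-curve definition:
        TP/n - (FP/n) * threshold / (1 - threshold)
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))


# Toy data, invented for illustration only (not from the study).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]

print("Brier score:", brier_score(y_true, y_prob))
print("Net benefit at threshold 0.5:", net_benefit(y_true, y_prob, 0.5))
```

A decision curve is obtained by evaluating `net_benefit` over a grid of threshold probabilities and comparing each model against the "flag everyone" and "flag no one" strategies.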