Abstract
Background: Predicting stroke from an individual's risk factors, especially a first stroke event, is of great significance for primary prevention in high-risk populations. This study investigated the applicability of interpretable machine learning, compared with logistic regression, for predicting 2-year stroke occurrence in older adults.
Methods: A total of 5960 participants consecutively surveyed from July 2011 to August 2013 in the China Health and Retirement Longitudinal Study were included in the analysis. We constructed a traditional logistic regression (LR) model and two machine learning models, random forest (RF) and extreme gradient boosting (XGBoost), to distinguish stroke occurrence from non-stroke occurrence using data on demographics, lifestyle, disease history, and clinical variables. Grid search and 10-fold cross-validation were used to tune the hyperparameters. Model performance was assessed by discrimination, calibration, decision curve analysis, and predictiveness curve analysis.
Results: Among the 5960 participants, 131 (2.20%) developed stroke over an average follow-up of 2 years. Our prediction models distinguished stroke occurrence from non-stroke occurrence with excellent performance. The AUCs of the machine learning models (RF, 0.823 [95% CI, 0.759-0.886]; XGBoost, 0.808 [95% CI, 0.730-0.886]) were significantly higher than that of LR (0.718 [95% CI, 0.649-0.787]; p<0.05). No significant difference was observed between RF and XGBoost (p>0.05). All prediction models were well calibrated, with Brier scores of 0.022 (95% CI, 0.015-0.028) for LR, 0.019 (95% CI, 0.014-0.025) for RF, and 0.020 (95% CI, 0.015-0.026) for XGBoost. In decision curve analysis, XGBoost had much higher net benefit over a wider threshold range, and in predictiveness curve analysis it was more capable of identifying high-risk individuals.
A total of eight predictors, namely gender, waist-to-height ratio, dyslipidemia, glycated hemoglobin, white blood cell count, blood glucose, triglycerides, and low-density lipoprotein cholesterol, ranked among the top five in the three prediction models.
Conclusions: Compared with traditional logistic regression, machine learning methods, especially XGBoost, showed potential for predicting stroke occurrence in older adults.
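For readers unfamiliar with the evaluation measures named above, the following is a minimal sketch of how discrimination (AUC), calibration error (Brier score), and the net benefit used in decision curve analysis can be computed from predicted risks. It is not the study's code: the abstract does not state which software was used, the function names are illustrative, and the four-person toy data set is hypothetical. The net benefit follows the standard definition, true positives per person minus false positives per person weighted by the odds of the risk threshold.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random case gets a higher risk than a random non-case.

    Ties between a case and a non-case count as half a concordant pair.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in cases
        for q in controls
    )
    return concordant / (len(cases) * len(controls))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


if __name__ == "__main__":
    # Hypothetical outcomes (1 = stroke) and predicted risks, for illustration only.
    outcomes = [0, 0, 1, 1]
    risks = [0.10, 0.40, 0.35, 0.80]
    print("AUC:", auc(outcomes, risks))
    print("Brier score:", brier_score(outcomes, risks))
    for t in (0.3, 0.5):
        print(f"Net benefit at threshold {t}:", net_benefit(outcomes, risks, t))
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing each model against the "treat all" and "treat none" strategies. With an outcome as rare as the 2.20% reported here, the Brier score is small even for uninformative models, so it is best read alongside the AUC and the decision curve rather than on its own.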