Abstract TP307: Validation of a Machine Learning Approach to Determine Stroke Severity of Patients Diagnosed With Stroke in Claims Data
Introduction: National Institutes of Health Stroke Scale (NIHSS) scores are often not readily available in structured claims databases. We have previously demonstrated that a machine learning model can be used to determine proxies for NIHSS scores. Our current work focuses on creating a model applicable across different databases to validate our approach and enable further outcome studies.

Methods: We identified 1,415 eligible hospital-admitted patients in the Optum® de-identified Integrated Claims-EMR database who were diagnosed with ischemic or hemorrhagic stroke or a transient ischemic attack and had NIHSS scores recorded in medical notes. These patients were split into a training set (N=1,192) for model development and a hold-out test set (N=223) to evaluate model performance. Model performance was then externally validated on the 286 eligible stroke patients in IBM’s Claims-EMR database (CED). Potential predictors of stroke severity included relevant procedures, diagnoses, patient demographics, and information about the patient's hospital stay.

Results: The optimal model, a random forest, achieved a coefficient of determination (R²) between actual and predicted NIHSS scores of 0.48 in the hold-out Optum dataset and 0.42 in the secondary CED dataset. The final model incorporated a total of 47 predictors. The strongest predictors included transient ischemic attack diagnosis, length of hospital stay, critical care procedures, patient age, and hemiplegia diagnosis.

Conclusion: This study shows that machine learning can be used to determine proxies for NIHSS scores across different real-world databases. Ultimately, this will enable large claims-based outcome studies involving stroke severity, improving our understanding of how stroke severity affects healthcare utilization, total cost of care, and the financial impact on the larger community.
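To illustrate the general shape of the modeling workflow described in the Methods and Results, the sketch below shows how a random-forest NIHSS-proxy model with a hold-out split and an R² evaluation might be set up in Python with scikit-learn. This is a minimal, hypothetical example: the file names, feature columns, and hyperparameters are assumptions for illustration, and the abstract does not describe the authors' actual implementation.

```python
# Illustrative sketch only: feature names, file names, and hyperparameters are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical claims-derived feature table: one row per eligible stroke patient,
# with the chart-abstracted NIHSS score as the regression target.
df = pd.read_csv("optum_stroke_cohort.csv")  # hypothetical file name
target = "nihss_score"
features = [c for c in df.columns if c != target]

# Hold out a test set for evaluation (the study used 1,192 training / 223 test patients).
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=223, random_state=0
)

# Fit a random forest regressor, the model family reported as optimal in the study.
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# Coefficient of determination between actual and predicted NIHSS scores on the hold-out set.
print("hold-out R^2:", r2_score(y_test, model.predict(X_test)))

# External validation on a second database would reuse the fitted model, e.g.:
# ced = pd.read_csv("ced_stroke_cohort.csv")  # hypothetical file name
# print("external R^2:", r2_score(ced[target], model.predict(ced[features])))
```

The key design point this sketch reflects is that the model is tuned only on the training split and then scored once on the hold-out split and once on an entirely separate database, which is what supports the claim of cross-database applicability.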