Rule-based Natural Language Processing for Automation of Stroke Data Extraction: A Validation Study (Preprint)
BACKGROUND Manual extraction of data from free-text radiology reports is time-consuming. Automated extraction methods based on natural language processing (NLP) have recently been proposed. A previously developed rule-based NLP algorithm showed promise in extracting stroke-related data from radiology reports.

OBJECTIVE We aimed to externally validate the accuracy of CHARTextract, a rule-based NLP algorithm, in extracting stroke-related data from free-text radiology reports.

METHODS We analyzed free-text reports of CT angiography (CTA) and CT perfusion (CTP) studies of consecutive patients with acute ischemic stroke admitted to a regional stroke center for endovascular thrombectomy from January 2015 to 2021. Stroke-related variables were manually extracted from the reports as the reference standard: proximal and distal anterior circulation occlusion, posterior circulation occlusion, presence of ischemia, hemorrhage, Alberta Stroke Program Early CT Score (ASPECTS), and collateral status. The same variables were extracted in parallel using the rule-based NLP algorithm. The algorithm's accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were assessed against the reference standard.

RESULTS The NLP algorithm's accuracy was >90% for identifying distal anterior circulation occlusion, posterior circulation occlusion, hemorrhage, and ASPECTS. Accuracy was 85%, 74%, and 79% for proximal anterior circulation occlusion, presence of ischemia, and collateral status, respectively. For detecting that a variable was not reported in the radiology report, the algorithm's accuracy ranged from 87% to 100%.

CONCLUSIONS Rule-based NLP shows moderate to good performance for stroke-related data extraction from free-text imaging reports. The algorithm's accuracy was affected by inconsistent report styles and lexicon among reporting radiologists.
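
The evaluation described in the methods can be illustrated with a minimal Python sketch. It is not CHARTextract and does not reproduce the study's rules or data: the regular expressions, the function names `rule_hemorrhage` and `diagnostic_metrics`, and the four example report snippets are all hypothetical, chosen only to show how rule-based output is scored against a manual reference standard.

```python
import re
from typing import Dict, List, Optional

# Hypothetical toy rule (NOT a CHARTextract rule): a report is flagged
# hemorrhage-positive when the word "hemorrhage" appears and is not
# preceded by a negation cue within the same sentence.
MENTION = re.compile(r"\bhemorrhage\b", re.IGNORECASE)
NEGATED = re.compile(
    r"\b(no|without|negative for)\b[^.]*\bhemorrhage\b", re.IGNORECASE
)


def rule_hemorrhage(report: str) -> bool:
    """Apply the toy rule to one free-text report."""
    return bool(MENTION.search(report)) and not NEGATED.search(report)


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when the metric is undefined."""
    return numerator / denominator if denominator else None


def diagnostic_metrics(
    predicted: List[bool], reference: List[bool]
) -> Dict[str, Optional[float]]:
    """Score binary predictions against the manual reference standard."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # true positives
    tn = sum(1 for p, r in pairs if not p and not r)  # true negatives
    fp = sum(1 for p, r in pairs if p and not r)      # false positives
    fn = sum(1 for p, r in pairs if not p and r)      # false negatives
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Invented example snippets with invented manual labels (illustration only).
reports = [
    "No acute intracranial hemorrhage. Occlusion of the left M1 segment.",
    "Small subarachnoid hemorrhage in the right sylvian fissure.",
    "Right M2 occlusion. No hemorrhage.",
    "Hemorrhagic transformation of the left MCA infarct.",
]
reference = [False, True, False, True]

predicted = [rule_hemorrhage(text) for text in reports]
metrics = diagnostic_metrics(predicted, reference)
print(metrics)
```

In this toy example the rule misses the fourth snippet because it is phrased as "Hemorrhagic transformation" rather than "hemorrhage", which lowers sensitivity. This is the kind of lexicon variation the conclusions identify as a source of error for rule-based extraction.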