Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals

Objective: Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls and provide a portable module for implementation at other sites. Materials and Methods: We reviewed the EHRs of 631 individuals followed at Vanderbilt for hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10 of the best-performing algorithms at the Marshfield Clinic. Results: Random forests using billing codes, medications, vitals, and concepts had the best performance with a median area under the receiver operator characteristic curve (AUC) of 0.976. Normalized sums of all 4 categories also performed well (0.959 AUC). The best non-NLP algorithm combined normalized ICD9 codes, medications, and blood pressure readings with a median AUC of 0.948. Blood pressure cutoffs or ICD9 code counts alone had AUCs of 0.854 and 0.908, respectively. Marshfield Clinic results were similar. Conclusion: This work shows that billing codes or blood pressure readings alone yield good hypertension classification performance. However, even simple combinations of input categories improve performance. The most complex algorithms classified hypertension with excellent recall and precision.

Download Full-text

Detection of Bleeding Events in Electronic Health Record Notes Using Convolutional Neural Network Models Enhanced With Recurrent Neural Network Autoencoders: Deep Learning Approach (Preprint)

10.2196/preprints.10788 ◽

2018 ◽

Author(s):

Rumeng Li ◽

Baotian Hu ◽

Feifan Liu ◽

Weisong Liu ◽

Francesca Cunningham ◽

...

Keyword(s):

Neural Network ◽

Electronic Health Record ◽

Convolutional Neural Network ◽

Language Processing ◽

Characteristic Curve ◽

Bleeding Event ◽

Support Vector ◽

Health Record ◽

Bleeding Events ◽

Electronic Health

BACKGROUND Bleeding events are common and critical and may cause significant morbidity and mortality. High incidences of bleeding events are associated with cardiovascular disease in patients on anticoagulant therapy. Prompt and accurate detection of bleeding events is essential to prevent serious consequences. As bleeding events are often described in clinical notes, automatic detection of bleeding events from electronic health record (EHR) notes may improve drug-safety surveillance and pharmacovigilance. OBJECTIVE We aimed to develop a natural language processing (NLP) system to automatically classify whether an EHR note sentence contains a bleeding event. METHODS We expert annotated 878 EHR notes (76,577 sentences and 562,630 word-tokens) to identify bleeding events at the sentence level. This annotated corpus was used to train and validate our NLP systems. We developed an innovative hybrid convolutional neural network (CNN) and long short-term memory (LSTM) autoencoder (HCLA) model that integrates a CNN architecture with a bidirectional LSTM (BiLSTM) autoencoder model to leverage large unlabeled EHR data. RESULTS HCLA achieved the best area under the receiver operating characteristic curve (0.957) and F1 score (0.938) to identify whether a sentence contains a bleeding event, thereby surpassing the strong baseline support vector machines and other CNN and autoencoder models. CONCLUSIONS By incorporating a supervised CNN model and a pretrained unsupervised BiLSTM autoencoder, the HCLA achieved high performance in detecting bleeding events.

Download Full-text

Developing a scalable FHIR-based clinical data normalization pipeline for standardizing and integrating unstructured and structured electronic health record data

JAMIA Open ◽

10.1093/jamiaopen/ooz056 ◽

2019 ◽

Vol 2 (4) ◽

pp. 570-579 ◽

Cited By ~ 5

Author(s):

Na Hong ◽

Andrew Wen ◽

Feichen Shen ◽

Sunghwan Sohn ◽

Chen Wang ◽

...

Keyword(s):

Electronic Health Record ◽

Language Processing ◽

Clinical Data ◽

Large Scale ◽

Structured Data ◽

Health Record ◽

Data Normalization ◽

Electronic Health Record Data ◽

Electronic Health ◽

Clinical Resource

Abstract Objective To design, develop, and evaluate a scalable clinical data normalization pipeline for standardizing unstructured electronic health record (EHR) data leveraging the HL7 Fast Healthcare Interoperability Resources (FHIR) specification. Methods We established an FHIR-based clinical data normalization pipeline known as NLP2FHIR that mainly comprises: (1) a module for a core natural language processing (NLP) engine with an FHIR-based type system; (2) a module for integrating structured data; and (3) a module for content normalization. We evaluated the FHIR modeling capability focusing on core clinical resources such as Condition, Procedure, MedicationStatement (including Medication), and FamilyMemberHistory using Mayo Clinic’s unstructured EHR data. We constructed a gold standard reusing annotation corpora from previous NLP projects. Results A total of 30 mapping rules, 62 normalization rules, and 11 NLP-specific FHIR extensions were created and implemented in the NLP2FHIR pipeline. The elements that need to integrate structured data from each clinical resource were identified. The performance of unstructured data modeling achieved F scores ranging from 0.69 to 0.99 for various FHIR element representations (0.69–0.99 for Condition; 0.75–0.84 for Procedure; 0.71–0.99 for MedicationStatement; and 0.75–0.95 for FamilyMemberHistory). Conclusion We demonstrated that the NLP2FHIR pipeline is feasible for modeling unstructured EHR data and integrating structured elements into the model. The outcomes of this work provide standards-based tools of clinical data normalization that is indispensable for enabling portable EHR-driven phenotyping and large-scale data analytics, as well as useful insights for future developments of the FHIR specifications with regard to handling unstructured clinical data.

Download Full-text

The PCORnet Blood Pressure Control Laboratory

Circulation Cardiovascular Quality and Outcomes ◽

10.1161/circoutcomes.119.006115 ◽

2020 ◽

Vol 13 (3) ◽

Cited By ~ 1

Author(s):

Mark J. Pletcher ◽

Valy Fontil ◽

Thomas Carton ◽

Kathryn M. Shaw ◽

Myra Smith ◽

...

Keyword(s):

Blood Pressure ◽

Electronic Health Record ◽

Blood Pressure Control ◽

Comparative Effectiveness ◽

Pressure Control ◽

Health Record ◽

Control Laboratory ◽

Electronic Health Record Data ◽

Record Data ◽

Electronic Health

Background: Uncontrolled blood pressure (BP) is a leading preventable cause of death that remains common in the US population despite the availability of effective medications. New technology and program innovation has high potential to improve BP but may be expensive and burdensome for patients, clinicians, health systems, and payers and may not produce desired results or reduce existing disparities in BP control. Methods and Results: The PCORnet Blood Pressure Control Laboratory is a platform designed to enable national surveillance and facilitate quality improvement and comparative effectiveness research. The platform uses PCORnet, the National Patient-Centered Clinical Research Network, for engagement of health systems and collection of electronic health record data, and the Eureka Research Platform for eConsent and collection of patient-reported outcomes and mHealth data from wearable devices and smartphones. Three demonstration projects are underway: BP track will conduct national surveillance of BP control and related clinical processes by measuring theory-derived pragmatic BP control metrics using electronic health record data, with a focus on tracking disparities over time; BP MAP will conduct a cluster-randomized trial comparing effectiveness of 2 versions of a BP control quality improvement program; BP Home will conduct an individual patient-level randomized trial comparing effectiveness of smartphone-linked versus standard home BP monitoring. Thus far, BP Track has collected electronic health record data from over 826 000 eligible patients with hypertension who completed ≈3.1 million ambulatory visits. Preliminary results demonstrate substantial room for improvement in BP control (<140/90 mm Hg), which was 58% overall, and in the clinical processes relevant for BP control. For example, only 12% of patients with hypertension with a high BP measurement during an ambulatory visit received an order for a new antihypertensive medication. Conclusions: The PCORnet Blood Pressure Control Laboratory is designed to be a reusable platform for efficient surveillance and comparative effectiveness research; results from demonstration projects are forthcoming.

Download Full-text

Natural Language Processing of the Unstructured Electronic Health Record Data Using Regular Expressions and SAS Hash Objects

Journal of Patient-Centered Research and Reviews ◽

10.17294/2330-0698.1147 ◽

2015 ◽

Vol 2 (2) ◽

pp. 118-119

Author(s):

Paul J Hitz ◽

Mitch Juusola ◽

Stephen C Waring ◽

Irina V Haller

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Electronic Health Record ◽

Language Processing ◽

Health Record ◽

Regular Expressions ◽

Electronic Health Record Data ◽

Record Data ◽

Electronic Health

Download Full-text

A novel method of adverse event detection can accurately identify venous thromboembolisms (VTEs) from narrative electronic health record data

Journal of the American Medical Informatics Association ◽

10.1136/amiajnl-2014-002768 ◽

2014 ◽

Vol 22 (1) ◽

pp. 155-165 ◽

Cited By ~ 15

Author(s):

Christian M Rochefort ◽

Aman D Verma ◽

Tewodros Eguale ◽

Todd C Lee ◽

David L Buckeridge

Keyword(s):

Electronic Health Record ◽

Language Processing ◽

Area Under The Curve ◽

Support Vector ◽

Health Record ◽

Electronic Health Record Data ◽

Vein Thrombosis ◽

Radiology Reports ◽

Record Data ◽

Electronic Health

Abstract Background Venous thromboembolisms (VTEs), which include deep vein thrombosis (DVT) and pulmonary embolism (PE), are associated with significant mortality, morbidity, and cost in hospitalized patients. To evaluate the success of preventive measures, accurate and efficient methods for monitoring VTE rates are needed. Therefore, we sought to determine the accuracy of statistical natural language processing (NLP) for identifying DVT and PE from electronic health record data. Methods We randomly sampled 2000 narrative radiology reports from patients with a suspected DVT/PE in Montreal (Canada) between 2008 and 2012. We manually identified DVT/PE within each report, which served as our reference standard. Using a bag-of-words approach, we trained 10 alternative support vector machine (SVM) models predicting DVT, and 10 predicting PE. SVM training and testing was performed with nested 10-fold cross-validation, and the average accuracy of each model was measured and compared. Results On manual review, 324 (16.2%) reports were DVT-positive and 154 (7.7%) were PE-positive. The best DVT model achieved an average sensitivity of 0.80 (95% CI 0.76 to 0.85), specificity of 0.98 (98% CI 0.97 to 0.99), positive predictive value (PPV) of 0.89 (95% CI 0.85 to 0.93), and an area under the curve (AUC) of 0.98 (95% CI 0.97 to 0.99). The best PE model achieved sensitivity of 0.79 (95% CI 0.73 to 0.85), specificity of 0.99 (95% CI 0.98 to 0.99), PPV of 0.84 (95% CI 0.75 to 0.92), and AUC of 0.99 (95% CI 0.98 to 1.00). Conclusions Statistical NLP can accurately identify VTE from narrative radiology reports.

Download Full-text