Development of a Post-Acute Sequelae of COVID-19 (PASC) Symptom Lexicon Using Electronic Health Record Clinical Notes

Author(s):  
Liqin Wang ◽  
Dinah Foer ◽  
Erin MacPhaul ◽  
Ying-Chih Lo ◽  
David W. Bates ◽  
...  

Objective: To develop a comprehensive post-acute sequelae of COVID-19 (PASC) symptom lexicon from clinical notes to support PASC symptom identification and research. Methods: We identified 26,117 COVID-19-positive patients from Mass General Brigham's electronic health record (EHR) and extracted 328,879 clinical notes from their post-acute infection period (days 51-110 from the first positive COVID-19 test). The PASC symptom lexicon incorporated Unified Medical Language System (UMLS) Metathesaurus concepts and synonyms based on selected semantic types. The MTERMS natural language processing (NLP) tool was used to automatically extract symptoms from a development dataset. The lexicon was iteratively revised through manual chart review, keyword search, concept consolidation, and evaluation of NLP output. We assessed the comprehensiveness of the lexicon and the NLP performance on a validation dataset and report symptom prevalence across the entire corpus. Results: The PASC symptom lexicon included 355 symptoms consolidated from 1,520 UMLS concepts. NLP achieved an average precision of 0.94 and an estimated recall of 0.84. The most frequent symptoms included pain (43.1%), anxiety (25.8%), depression (24.0%), fatigue (23.4%), joint pain (21.0%), shortness of breath (20.8%), headache (20.0%), nausea and/or vomiting (19.9%), myalgia (19.0%), and gastroesophageal reflux (18.6%). Discussion and Conclusion: PASC symptoms are diverse. A comprehensive PASC symptom lexicon can be derived using a data-driven, ontology-driven, NLP-assisted approach. By using unstructured data, this approach may improve the identification and analysis of patient symptoms in the EHR and inform prospective study design, preventive care strategies, and therapeutic interventions for patient care.
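The lexicon-driven extraction step described above amounts to matching consolidated symptoms' synonyms against note text. A minimal sketch, using a hypothetical three-entry lexicon rather than the paper's 355-symptom, 1,520-concept lexicon:

```python
import re

# Hypothetical miniature symptom lexicon: each entry maps a consolidated
# symptom name to surface synonyms that count as a mention of it
# (the paper's lexicon consolidates 1,520 UMLS concepts into 355 symptoms).
LEXICON = {
    "fatigue": ["fatigue", "tiredness", "exhaustion"],
    "shortness of breath": ["shortness of breath", "dyspnea", "sob"],
    "headache": ["headache", "cephalgia"],
}

def extract_symptoms(note: str) -> set:
    """Return the set of consolidated symptom names mentioned in a note."""
    found = set()
    text = note.lower()
    for symptom, synonyms in LEXICON.items():
        for syn in synonyms:
            # word-boundary match so "sob" does not fire inside "sober"
            if re.search(r"\b" + re.escape(syn) + r"\b", text):
                found.add(symptom)
                break
    return found

note = "Patient reports persistent dyspnea and mild headache since infection."
print(sorted(extract_symptoms(note)))  # ['headache', 'shortness of breath']
```

A production pipeline such as MTERMS additionally handles negation, context, and morphological variants, which this sketch omits.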

BMJ Open ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. e047356
Author(s):  
Carlton R Moore ◽  
Saumya Jain ◽  
Stephanie Haas ◽  
Harish Yadav ◽  
Eric Whitsel ◽  
...  

Objectives: Using free-text clinical notes and reports from hospitalised patients, determine the performance of natural language processing (NLP) ascertainment of Framingham heart failure (HF) criteria and phenotype. Study design: A retrospective observational study of patients hospitalised in 2015 at four hospitals participating in the Atherosclerosis Risk in Communities (ARIC) study was used to determine NLP performance in the ascertainment of Framingham HF criteria and phenotype. Setting: Four ARIC study hospitals, each representing an ARIC study region in the USA. Participants: A stratified random sample of hospitalisations occurring during 2015 and identified using a broad range of International Classification of Diseases, ninth revision, diagnostic codes indicative of an HF event was drawn for this study. A randomly selected set of 394 hospitalisations was used as the derivation dataset and 406 hospitalisations as the validation dataset. Intervention: Use of NLP on free-text clinical notes and reports to ascertain Framingham HF criteria and phenotype. Primary and secondary outcome measures: NLP performance as measured by sensitivity, specificity, positive predictive value (PPV), and agreement in ascertainment of Framingham HF criteria and phenotype. Manual medical record review by trained ARIC abstractors served as the reference standard. Results: Overall, performance of NLP ascertainment of the Framingham HF phenotype in the validation dataset was good: 78.8%, 81.7%, 84.4% and 80.0% for sensitivity, specificity, PPV and agreement, respectively. Conclusions: By decreasing the need for manual chart review, our results on the use of NLP to ascertain the Framingham HF phenotype from free-text electronic health record data suggest that validated NLP technology holds the potential to significantly improve the feasibility and efficiency of large-scale epidemiologic surveillance of HF prevalence and incidence.
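The reported sensitivity, specificity, PPV, and agreement all derive from a 2x2 confusion matrix against the abstractor reference standard. A minimal sketch with illustrative cell counts (the study's per-cell counts are not given in the abstract):

```python
def nlp_performance(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and raw agreement from 2x2 counts
    (NLP ascertainment vs. manual-abstraction reference standard)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    agreement = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, ppv, agreement

# Illustrative counts only, roughly on the scale of the 406-hospitalisation
# validation dataset; not the study's actual cells
sens, spec, ppv, agree = nlp_performance(tp=160, fp=30, fn=43, tn=173)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"ppv={ppv:.3f} agreement={agree:.3f}")
```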


2021 ◽  
Author(s):  
Ye Seul Bae ◽  
Kyung Hwan Kim ◽  
Han Kyul Kim ◽  
Sae Won Choi ◽  
Taehoon Ko ◽  
...  

BACKGROUND Smoking is a major risk factor and an important variable for clinical research, but there are few studies on automatically obtaining smoking classifications from unstructured bilingual electronic health records (EHRs). OBJECTIVE We aim to develop an algorithm to classify smoking status from unstructured EHRs using natural language processing (NLP). METHODS With acronym replacement and the Python package Soynlp, we normalized 4,711 bilingual clinical notes. Each note was classified into one of 4 categories: current smoker, past smoker, never smoker, and unknown. Subsequently, shifted positive pointwise mutual information (SPPMI) was used to vectorize words in the notes. By calculating cosine similarity between these word vectors, keywords denoting the same smoking status were identified. RESULTS Compared with other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. The extracted keywords were used to classify the 4 smoking statuses in our bilingual clinical notes. Given an identical SVM classifier, the extracted keywords improve the F1 score by as much as 1.8% over unigram and bigram bag-of-words features. CONCLUSIONS Our study shows the potential of SPPMI for classifying smoking status from bilingual, unstructured EHRs. Our current findings show how smoking information can be easily acquired and used for clinical practice and research.
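The SPPMI-plus-cosine-similarity step can be sketched on a toy token corpus as below; the data are illustrative stand-ins, and the actual pipeline first normalizes bilingual notes with acronym replacement and Soynlp:

```python
from collections import Counter
from math import log, sqrt

# Toy corpus standing in for normalized clinical-note tokens (illustrative)
docs = [
    ["patient", "current", "smoker", "pack", "daily"],
    ["denies", "smoking", "never", "smoker"],
    ["quit", "smoking", "past", "smoker"],
    ["patient", "denies", "alcohol"],
]

# Co-occurrence counts within a document (a crude context window)
pair_counts, word_counts = Counter(), Counter()
for doc in docs:
    for i, w in enumerate(doc):
        word_counts[w] += 1
        for c in doc[:i] + doc[i + 1:]:
            pair_counts[(w, c)] += 1
total = sum(pair_counts.values())

vocab = sorted(word_counts)

def sppmi_vector(w, k=1):
    """Shifted positive PMI row for word w: max(PMI - log k, 0) per context."""
    vec = []
    for c in vocab:
        n_wc = pair_counts[(w, c)]
        if n_wc == 0:
            vec.append(0.0)
            continue
        pmi = log(n_wc * total / (word_counts[w] * word_counts[c]))
        vec.append(max(pmi - log(k), 0.0))
    return vec

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Words that share smoking-status contexts should score higher than
# unrelated words
print(cosine(sppmi_vector("smoker"), sppmi_vector("smoking")))
```

On this toy corpus, "smoker" is closer to "smoking" than to "alcohol", which is the property the keyword-expansion step exploits.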


Circulation ◽  
2020 ◽  
Vol 141 (Suppl_1) ◽  
Author(s):  
Yiqing Zhao ◽  
Sunyang Fu ◽  
Suzette J Bielinski ◽  
Paul Decker ◽  
Alanna M Chamberlain ◽  
...  

Background: The focus of most existing phenotyping algorithms based on electronic health record (EHR) data has been to accurately identify cases and non-cases of specific diseases. A more challenging task, however, is to accurately identify disease incidence, as identifying the first occurrence of disease is more important for efficient and valid clinical and epidemiological research. Moreover, stroke is a challenging phenotype because of diagnostic difficulty and common miscoding. This task generally requires multiple types of EHR data (e.g., diagnosis and procedure codes, unstructured clinical notes) and a more robust algorithm integrating both natural language processing and machine learning. In this study, we developed and validated an EHR-based classifier to accurately identify stroke incidence in a cohort of atrial fibrillation (AF) patients. Methods: We developed a stroke phenotyping algorithm using International Classification of Diseases, Ninth Revision (ICD-9) codes, Current Procedural Terminology (CPT) codes, and expert-provided keywords as model features. Structured data were extracted from the Rochester Epidemiology Project (REP) database. Natural language processing (NLP) was used to extract and validate keyword occurrences in clinical notes. A ±30-day window was applied when including or excluding keywords/codes in the input vector, and keyword/code frequencies were used as input feature sets for model training. Multiple competing models were trained using various combinations of feature sets and two machine learning algorithms: logistic regression and random forest. Training data were provided by two nurse abstractors and included validated stroke incidences from a previously established atrial fibrillation cohort. Precision, recall, and F-score were calculated to assess and compare model performance. Results: Among 4,914 patients with atrial fibrillation, 1,773 patients were screened; the remaining 3,141 patients had no stroke-related codes or keywords and were presumed free of stroke during follow-up. Among the screened patients, 740 had validated strokes and 1,033 did not have a stroke based on review of the EHR by trained nurse abstractors. The best performing stroke incidence phenotyping classifier used keyword, ICD-9, and CPT features with a random forest classifier, achieving a precision of 0.942, recall of 0.943, and F-score of 0.943. Conclusion: We developed and validated a stroke algorithm that performed well for identifying stroke incidence in an enriched population (an AF cohort), which extends beyond the typical binary case/non-case stroke identification problem. Future work will test the generalizability of this algorithm in a general population.
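The windowed frequency features feeding the classifier can be sketched as follows, with a hypothetical feature list of keywords and ICD-9/CPT codes (not the study's actual feature set):

```python
from datetime import date, timedelta

# Hypothetical feature set: expert keywords plus ICD-9/CPT codes (illustrative)
FEATURES = ["stroke", "hemiparesis", "433.11", "434.91", "37195"]

def feature_vector(events, index_date, window_days=30):
    """Frequency of each keyword/code within +/- window_days of the index
    date, mirroring the paper's windowed feature construction."""
    lo = index_date - timedelta(days=window_days)
    hi = index_date + timedelta(days=window_days)
    vec = dict.fromkeys(FEATURES, 0)
    for day, item in events:
        if lo <= day <= hi and item in FEATURES:
            vec[item] += 1
    return [vec[f] for f in FEATURES]

# Hypothetical (date, keyword-or-code) events for one patient
events = [
    (date(2015, 3, 1), "434.91"),   # ICD-9 code near the index date
    (date(2015, 3, 3), "stroke"),   # NLP-validated keyword in a note
    (date(2015, 7, 1), "433.11"),   # outside the +/-30 day window
]
print(feature_vector(events, date(2015, 3, 10)))  # [1, 0, 0, 1, 0]
```

Vectors like this would then be passed to the logistic regression or random forest model for training.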


2020 ◽  
Author(s):  
Nicholas B. Link ◽  
Selena Huang ◽  
Tianrun Cai ◽  
Zeling He ◽  
Jiehuan Sun ◽  
...  

Objective: The use of electronic health record (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings), and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study we introduce an unsupervised method for acronym disambiguation, the task of classifying the correct sense of an acronym in clinical EHR notes. Methods: We developed an unsupervised ensemble machine learning algorithm (CASEml) to automatically classify acronyms by leveraging semantic embeddings, visit-level text, and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard unsupervised method and a baseline metric selecting the most frequent acronym sense. We additionally evaluated the effects of RA disambiguation on NLP-driven phenotyping of rheumatoid arthritis. Results: CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art unsupervised method. We also demonstrated that applying CASEml to medical notes improves the AUC of a phenotype algorithm for rheumatoid arthritis. Conclusion: CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and unsupervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.
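The core idea of picking an acronym sense by embedding similarity can be sketched as below. The vectors are toy illustrative values, not the semantic embeddings CASEml learns, and CASEml additionally uses visit-level text and billing codes:

```python
from math import sqrt

# Toy 3-d "semantic embeddings" for two candidate senses of the acronym RA
# (illustrative values only)
SENSE_VECTORS = {
    "rheumatoid arthritis": [0.9, 0.1, 0.0],
    "right atrium":         [0.0, 0.2, 0.9],
}
# Toy embeddings for context words appearing near the acronym
WORD_VECTORS = {
    "joint":    [0.8, 0.2, 0.1],
    "swelling": [0.7, 0.3, 0.0],
    "cardiac":  [0.1, 0.1, 0.9],
}

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def disambiguate(context_words):
    """Pick the sense whose embedding is closest to the mean context vector."""
    ctx = [w for w in context_words if w in WORD_VECTORS]
    mean = [sum(WORD_VECTORS[w][i] for w in ctx) / len(ctx) for i in range(3)]
    return max(SENSE_VECTORS, key=lambda s: cosine(mean, SENSE_VECTORS[s]))

print(disambiguate(["joint", "swelling", "noted"]))  # rheumatoid arthritis
```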


Leprosy is a major public health problem and is listed among the neglected tropical diseases in India. Also called Hansen's disease (HD), it is a long-term infection by the bacteria Mycobacterium leprae or Mycobacterium lepromatosis. Untreated, leprosy can cause progressive and permanent damage to the skin, nerves, limbs, and eyes. This paper aims to describe the classification of leprosy cases from the first indication of symptoms. Electronic health records (EHRs) of leprosy patients from verified sources were generated, and the clinical notes included in the EHRs were processed with natural language processing tools. To predict the type of leprosy, a rule-based classification method is proposed. Our approach is further compared with machine learning (ML) algorithms such as support vector machine (SVM) and logistic regression (LR), and performance parameters are compared.
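A rule-based classifier of the kind the paper proposes can be sketched as below. The rules shown follow the standard WHO paucibacillary/multibacillary convention and are illustrative assumptions, not the paper's actual rule set:

```python
def classify_leprosy(num_lesions: int, nerves_involved: int) -> str:
    """Illustrative WHO-style rule: more than 5 skin lesions or more than
    one nerve involved suggests multibacillary leprosy, else paucibacillary.
    (Not the paper's exact rules.)"""
    if num_lesions > 5 or nerves_involved > 1:
        return "multibacillary"
    return "paucibacillary"

# Hypothetical structured fields extracted from an EHR note by NLP
print(classify_leprosy(num_lesions=3, nerves_involved=1))  # paucibacillary
print(classify_leprosy(num_lesions=8, nerves_involved=0))  # multibacillary
```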


2018 ◽  
Vol 25 (4) ◽  
pp. 1846-1862 ◽  
Author(s):  
Yaoyun Zhang ◽  
Olivia R Zhang ◽  
Rui Li ◽  
Aaron Flores ◽  
Salih Selek ◽  
...  

Suicide takes the lives of nearly a million people each year and imposes a tremendous economic burden globally. One important type of suicide risk factor is psychiatric stress. Prior studies have mainly used survey data to investigate the association between suicide and stressors; very few have investigated stressor data in electronic health records, mostly because the data are recorded in narrative text. This study takes the initiative to automatically extract and classify psychiatric stressors from clinical text using natural language processing (NLP)-based methods. Suicidal behaviors were also identified by keywords. Then, a statistical association analysis between suicide ideations/attempts and stressors extracted from a clinical corpus was conducted. Experimental results show that our NLP method could recognize stressor entities with an F-measure of 89.01 percent. Mentions of suicidal behaviors were identified with an F-measure of 97.3 percent. The top three significant stressors associated with suicide are health, pressure, and death, similar to the findings of previous studies. This study demonstrates the feasibility of using NLP approaches to unlock information from psychiatric notes in electronic health records and to facilitate large-scale studies of the associations between suicide and psychiatric stressors.
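A statistical association analysis of this kind is commonly summarized as an odds ratio over a 2x2 table of extracted stressor mentions versus identified suicidal behaviors. A sketch with made-up counts (the abstract reports F-measures and significant stressors, not these numbers):

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d):
    """Odds ratio with a Woolf 95% CI from a 2x2 table:
    a = stressor present & suicidal behavior, b = stressor present & none,
    c = stressor absent & suicidal behavior,  d = stressor absent & none."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = exp(log(or_) - 1.96 * se)
    hi = exp(log(or_) + 1.96 * se)
    return or_, lo, hi

# Illustrative counts for one hypothetical stressor (not the paper's data)
or_, lo, hi = odds_ratio_ci(a=120, b=380, c=60, d=440)
print(f"OR={or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

A CI excluding 1.0 would mark the stressor as significantly associated with suicidal behavior, which is the criterion behind "significant stressors" such as health, pressure, and death.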


2021 ◽  
Author(s):  
Fagen Xie ◽  
Deborah S Ling-Grant ◽  
John Chang ◽  
Britta I Amundsen ◽  
Rulin C Hechter

Purpose: Identifying risk factors for suicide using progress notes and administrative data is time consuming and usually requires manual case review. In this study, a natural language processing (NLP) computerized algorithm was developed and implemented to automatically ascertain suicide ideation/attempt from clinical notes in a large integrated healthcare system, Kaiser Permanente Southern California. Methods: Clinical notes containing prespecified keywords and phrases related to suicidal ideation/attempt between 2010 and 2018 were extracted from our organization's electronic health record system. A random sample of 864 clinical notes was selected and equally divided into four subsets. Experienced research chart abstractors reviewed each note and classified it into one of three suicide ideation/attempt categories: "Current", "Historical", and "No". The first three subsets were used as training datasets to sequentially develop the rule-based computerized algorithm, and the fourth subset served as a validation dataset to evaluate the algorithm's performance. The validated algorithm was then applied to the entire study sample of clinical notes. Results: The computerized algorithm ascertained 23 of the 26 confirmed "Current" suicide ideation/attempt events and all 10 confirmed "Historical" events in the validation dataset, yielding an 88.5% sensitivity and 100.0% positive predictive value (PPV) for "Current" suicide ideation/attempt, and a 100.0% sensitivity and 100.0% PPV for "Historical" suicide ideation/attempt. Applying the computerized process to the entire study sample identified a total of 1,050,289 "Current" ideation/attempt events and 293,038 "Historical" ideation/attempt events during the study period. Among the 400,436 individuals identified as having a "Current" suicide ideation/attempt event, 115,197 (28.8%) were 15-24 years old at the first event, 234,924 (58.7%) were female, 165,084 (41.7%) were Hispanic, and 150,645 (37.6%) had two or more events in the study period. Conclusions: Our study demonstrated that an NLP computerized algorithm can effectively ascertain suicide ideation/attempt from free-text clinical notes in the electronic health records of a diverse patient population. This algorithm can support suicide prevention programs and patient care management.
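A toy version of such a rule-based "Current"/"Historical"/"No" pass is sketched below, with illustrative keyword, negation, and history cues rather than the study's prespecified lists:

```python
import re

# Illustrative cue lists (not the study's prespecified keywords/phrases)
KEYWORDS = ["suicidal ideation", "suicide attempt", "si"]
NEGATIONS = ["denies", "no evidence of", "negative for"]
HISTORICAL = ["history of", "hx of", "prior", "past"]

def classify_note(note: str) -> str:
    """Classify one note as 'Current', 'Historical', or 'No' by checking a
    small window of text before each keyword for negation or history cues."""
    text = note.lower()
    for kw in KEYWORDS:
        for m in re.finditer(r"\b" + re.escape(kw) + r"\b", text):
            window = text[max(0, m.start() - 30):m.start()]
            if any(neg in window for neg in NEGATIONS):
                continue          # negated mention, e.g. "denies SI"
            if any(h in window for h in HISTORICAL):
                return "Historical"
            return "Current"
    return "No"

print(classify_note("Patient endorses suicidal ideation with plan."))  # Current
print(classify_note("History of suicide attempt in 2012."))            # Historical
```

A real implementation would need richer context handling (hypotheticals, family history, templated screeners), which is why iterative development against abstractor-labeled subsets was needed.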


2021 ◽  
Vol 3 ◽  
Author(s):  
Matthew Deady ◽  
Hussein Ezzeldin ◽  
Kerry Cook ◽  
Douglas Billings ◽  
Jeno Pizarro ◽  
...  

Introduction: The Food and Drug Administration Center for Biologics Evaluation and Research conducts post-market surveillance of biologic products to ensure their safety and effectiveness. Studies have found that common vaccine exposures may be missing from the structured data elements of electronic health records (EHRs) and instead captured in clinical notes, which impacts the monitoring of adverse events following immunization (AEFIs). For example, COVID-19 vaccines have regularly been administered outside of traditional medical settings. We developed a natural language processing (NLP) algorithm to mine unstructured clinical notes for vaccinations not captured in structured EHR data. Methods: A random sample of 1,000 influenza vaccine administrations, representing 995 unique patients, was extracted from a large U.S. EHR database. NLP techniques were used to detect administrations from the clinical notes in the training dataset [80% (N = 797) of patients]. The algorithm was applied to the validation dataset [20% (N = 198) of patients] to assess performance. Full medical charts for 28 randomly selected administration events in the validation dataset were reviewed by clinicians. The NLP algorithm was then applied across the entire dataset (N = 995) to quantify the number of additional events identified. Results: A total of 3,199 administrations were identified in the structured data and clinical notes combined. Of these, 2,740 (85.7%) were identified in the structured data, while the NLP algorithm identified 1,183 (37.0%) administrations in the clinical notes, 459 of which were not captured in the structured data. This represents a 16.8% increase in the identification of vaccine administrations compared with using structured data alone. The validation of 28 vaccine administrations confirmed 27 (96.4%) as "definite" vaccine administrations; 18 (64.3%) had evidence of a vaccination event in the structured data, while 10 (35.7%) were found solely in the unstructured notes. Discussion: We demonstrated the utility of an NLP algorithm for identifying vaccine administrations not captured in structured EHR data. NLP techniques have the potential to improve detection of vaccine administrations not otherwise reported, without increasing the analysis burden on physicians or practitioners. Future applications could include refining estimates of vaccine coverage and detecting other exposures, population characteristics, and outcomes not reliably captured in structured EHR data.
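The note-level detection step can be sketched with regular expressions; the patterns below are illustrative assumptions, not the study's algorithm:

```python
import re

# Illustrative administration patterns (not the study's full pattern set):
# a vaccine mention and an administration verb within a short span,
# in either order
ADMIN_PATTERN = re.compile(
    r"\b(?:influenza|flu)\s+vaccine\b.{0,40}?\b(?:administered|given|injected)\b"
    r"|\b(?:administered|given)\b.{0,40}?\b(?:influenza|flu)\s+vaccine\b",
    re.IGNORECASE,
)
REFUSAL_PATTERN = re.compile(r"\b(?:declined|refused|deferred)\b", re.IGNORECASE)

def note_has_administration(note: str) -> bool:
    """Flag a note as documenting a vaccine administration, skipping refusals."""
    return bool(ADMIN_PATTERN.search(note)) and not REFUSAL_PATTERN.search(note)

print(note_has_administration("Influenza vaccine administered in left deltoid."))  # True
print(note_has_administration("Patient declined flu vaccine this visit."))         # False
```

Events flagged this way would then be deduplicated against the structured immunization records to count only the additional administrations found in notes.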


2018 ◽  
Author(s):  
Tao Chen ◽  
Mark Dredze ◽  
Jonathan P Weiner ◽  
Leilani Hernandez ◽  
Joe Kimura ◽  
...  

BACKGROUND Geriatric syndromes in older adults are associated with adverse outcomes. However, despite being reported in clinical notes, these syndromes are often poorly captured by diagnostic codes in the structured fields of electronic health records (EHRs) or administrative records. OBJECTIVE We aim to automatically determine whether a patient has any geriatric syndromes by mining the free text of associated EHR clinical notes, and to assess which statistical natural language processing (NLP) techniques are most effective. METHODS We applied conditional random fields (CRFs), a widely used machine learning algorithm, to identify each of 10 geriatric syndrome constructs in a clinical note. We assessed three feature sets for the CRF: base features, enhanced token features, and contextual features. We trained the CRF on 3901 manually annotated notes from 85 patients, tuned it on a validation set of 50 patients, and evaluated it on 50 held-out test patients. These notes were from a group of US Medicare patients over 65 years of age enrolled in a Medicare Advantage Health Maintenance Organization and cared for by a large group practice in Massachusetts. RESULTS A final feature set was formed through comprehensive feature ablation experiments. The final CRF model performed well at patient-level determination (macroaverage F1=0.834, microaverage F1=0.851); however, performance varied by construct. For example, at phrase-partial evaluation, the CRF model worked well on constructs such as absence of fecal control (F1=0.857) and vision impairment (F1=0.798) but poorly on malnutrition (F1=0.155), weight loss (F1=0.394), and severe urinary control issues (F1=0.532). Errors were primarily due to previously unobserved words (ie, out-of-vocabulary) and a lack of context. CONCLUSIONS This study shows that statistical NLP can be used to identify geriatric syndromes in EHR-extracted clinical notes, creating new opportunities to identify patients with geriatric syndromes and study their health outcomes.
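The three CRF feature sets (base, enhanced token, contextual) can be illustrated with a per-token feature function of the kind CRF taggers consume; the specific features below are illustrative, not the paper's final ablated set:

```python
def token_features(tokens, i):
    """Illustrative per-token features for a CRF tagger: a base surface
    feature, enhanced token features, and contextual (neighbor) features."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),                                   # base
        "word.isdigit": w.isdigit(),                               # enhanced
        "word.suffix3": w.lower()[-3:],                            # enhanced
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",  # context
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Patient", "reports", "unintentional", "weight", "loss"]
print(token_features(tokens, 3))
```

Out-of-vocabulary errors of the kind the paper reports arise exactly when `word.lower` is unseen at training time and the remaining suffix/context features carry too little signal.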

