The Food and Drug Administration Biologics Effectiveness and Safety Initiative Facilitates Detection of Vaccine Administrations From Unstructured Data in Medical Records Through Natural Language Processing

Introduction: The Food and Drug Administration Center for Biologics Evaluation and Research conducts post-market surveillance of biologic products to ensure their safety and effectiveness. Studies have found that common vaccine exposures may be missing from structured data elements of electronic health records (EHRs), instead being captured in clinical notes. This impacts monitoring of adverse events following immunizations (AEFIs). For example, COVID-19 vaccines have been regularly administered outside of traditional medical settings. We developed a natural language processing (NLP) algorithm to mine unstructured clinical notes for vaccinations not captured in structured EHR data.Methods: A random sample of 1,000 influenza vaccine administrations, representing 995 unique patients, was extracted from a large U.S. EHR database. NLP techniques were used to detect administrations from the clinical notes in the training dataset [80% (N = 797) of patients]. The algorithm was applied to the validation dataset [20% (N = 198) of patients] to assess performance. Full medical charts for 28 randomly selected administration events in the validation dataset were reviewed by clinicians. The NLP algorithm was then applied across the entire dataset (N = 995) to quantify the number of additional events identified.Results: A total of 3,199 administrations were identified in the structured data and clinical notes combined. Of these, 2,740 (85.7%) were identified in the structured data, while the NLP algorithm identified 1,183 (37.0%) administrations in clinical notes; 459 were not also captured in the structured data. This represents a 16.8% increase in the identification of vaccine administrations compared to using structured data alone. The validation of 28 vaccine administrations confirmed 27 (96.4%) as “definite” vaccine administrations; 18 (64.3%) had evidence of a vaccination event in the structured data, while 10 (35.7%) were found solely in the unstructured notes.Discussion: We demonstrated the utility of an NLP algorithm to identify vaccine administrations not captured in structured EHR data. NLP techniques have the potential to improve detection of vaccine administrations not otherwise reported without increasing the analysis burden on physicians or practitioners. Future applications could include refining estimates of vaccine coverage and detecting other exposures, population characteristics, and outcomes not reliably captured in structured EHR data.

Download Full-text

Natural Language Processing to Identify Cancer Treatments With Electronic Medical Records

JCO Clinical Cancer Informatics ◽

10.1200/cci.20.00173 ◽

2021 ◽

pp. 379-393

Author(s):

Jiaming Zeng ◽

Imon Banerjee ◽

A. Solomon Henry ◽

Douglas J. Wood ◽

Ross D. Shachter ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Electronic Medical Records ◽

Language Processing ◽

Medical Records ◽

Structured Data ◽

Free Text ◽

Esophagus Cancer ◽

Clinical Notes ◽

Patients With Cancer

PURPOSE Knowing the treatments administered to patients with cancer is important for treatment planning and correlating treatment patterns with outcomes for personalized medicine study. However, existing methods to identify treatments are often lacking. We develop a natural language processing approach with structured electronic medical records and unstructured clinical notes to identify the initial treatment administered to patients with cancer. METHODS We used a total number of 4,412 patients with 483,782 clinical notes from the Stanford Cancer Institute Research Database containing patients with nonmetastatic prostate, oropharynx, and esophagus cancer. We trained treatment identification models for each cancer type separately and compared performance of using only structured, only unstructured ( bag-of-words, doc2vec, fasttext), and combinations of both ( structured + bow, structured + doc2vec, structured + fasttext). We optimized the identification model among five machine learning methods (logistic regression, multilayer perceptrons, random forest, support vector machines, and stochastic gradient boosting). The treatment information recorded in the cancer registry is the gold standard and compares our methods to an identification baseline with billing codes. RESULTS For prostate cancer, we achieved an f1-score of 0.99 (95% CI, 0.97 to 1.00) for radiation and 1.00 (95% CI, 0.99 to 1.00) for surgery using structured + doc2vec. For oropharynx cancer, we achieved an f1-score of 0.78 (95% CI, 0.58 to 0.93) for chemoradiation and 0.83 (95% CI, 0.69 to 0.95) for surgery using doc2vec. For esophagus cancer, we achieved an f1-score of 1.0 (95% CI, 1.0 to 1.0) for both chemoradiation and surgery using all combinations of structured and unstructured data. We found that employing the free-text clinical notes outperforms using the billing codes or only structured data for all three cancer types. CONCLUSION Our results show that treatment identification using free-text clinical notes greatly improves upon the performance using billing codes and simple structured data. The approach can be used for treatment cohort identification and adapted for longitudinal cancer treatment identification.

Download Full-text

The Use of Natural Language Processing to Ascertain Suicide Ideation/Attempt from Clinical Notes Within a Large Integrated Healthcare System (Preprint)

10.2196/preprints.28060 ◽

2021 ◽

Author(s):

Fagen Xie ◽

Deborah S Ling-Grant ◽

John Chang ◽

Britta I Amundsen ◽

Rulin C Hechter

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Electronic Health Record ◽

Language Processing ◽

Suicide Ideation ◽

Validation Dataset ◽

Health Record ◽

Integrated Healthcare ◽

Entire Study ◽

Clinical Notes

UNSTRUCTURED Purpose: Identifying risk factors for suicide using progress notes and administrative data is time consuming and usually requires manual case review. In this study, a natural language processing computerized algorithm was developed and implemented to automatically ascertain suicide ideation/attempt from clinical notes in a large integrated healthcare system, Kaiser Permanente Southern California. Methods: Clinical notes containing prespecified relevant keywords and phrases related to suicidal ideation/attempt between 2010 and 2018 were extracted from our organization’s electronic health record system. A random sample of 864 clinical notes was selected and equally divided into four subsets. These subsets were reviewed and classified as one of the following three suicide ideation/attempt categories: “Current”, “Historical” and “No” for each note by experienced research chart abstractors. The first three training datasets were used to develop the rule-based computerized algorithm sequentially and the fourth validation dataset was used to evaluate the algorithm performance. The validated algorithm was then applied to the entire study sample of clinical notes. Results: The computerized algorithm ascertained 23 of the 26 confirmed “Current” suicide ideation/attempt events and all 10 confirmed “Historical” suicide ideation/attempt events in the validation dataset. This algorithm produced a 88.5% sensitivity and 100.0% positive predictive value (PPV) for “Current” suicide ideation/attempt, and a 100.0% sensitivity and 100.0% PPV for “Historical” suicide ideation/attempt. After applying the computerized process to the entire study population sample, we identified a total of 1,050,289 “Current” ideation/attempt events and 293,038 “Historical” ideation/attempt events during the study period. Among the 400,436 individuals who were identified as having a “Current” suicide ideation/attempt event, 115,197 (28.8%) were 15-24 years old at the first event, 234,924 (58.7%) were female, 165,084 (41.7%) were Hispanic, and 150,645 (37.6%) had two or more events in the study period. Conclusions: Our study demonstrated that a natural language processing computerized algorithm can effectively ascertain suicide ideation/attempt from the free-text clinical notes in the electronic health record of a diverse patient population. This algorithm can be utilized in support of suicide prevention programs and patient care management.

Download Full-text

PMH52 Use of a Natural Language Processing-Based Approach to Extract Suicide Ideation and Behavior from Clinical Notes to Support Depression Research

Value in Health ◽

10.1016/j.jval.2021.04.674 ◽

2021 ◽

Vol 24 ◽

pp. S137

Author(s):

N. Palmon ◽

S. Momen ◽

M. Leavy ◽

G. Curhan ◽

C. Boussios ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Suicide Ideation ◽

Clinical Notes ◽

And Behavior

Download Full-text

Systematic review of current natural language processing methods and applications in cardiology

Heart ◽

10.1136/heartjnl-2021-319769 ◽

2021 ◽

pp. heartjnl-2021-319769

Author(s):

Meghan Reading Turchioe ◽

Alexander Volodarskiy ◽

Jyotishman Pathak ◽

Drew N Wright ◽

James Enlou Tcheng ◽

...

Keyword(s):

Systematic Review ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Care ◽

Real World Data ◽

Clinical Text ◽

Clinical Notes ◽

Artery Disease ◽

Automated Methods

Natural language processing (NLP) is a set of automated methods to organise and evaluate the information contained in unstructured clinical notes, which are a rich source of real-world data from clinical care that may be used to improve outcomes and understanding of disease in cardiology. The purpose of this systematic review is to provide an understanding of NLP, review how it has been used to date within cardiology and illustrate the opportunities that this approach provides for both research and clinical care. We systematically searched six scholarly databases (ACM Digital Library, Arxiv, Embase, IEEE Explore, PubMed and Scopus) for studies published in 2015–2020 describing the development or application of NLP methods for clinical text focused on cardiac disease. Studies not published in English, lacking a description of NLP methods, non-cardiac focused and duplicates were excluded. Two independent reviewers extracted general study information, clinical details and NLP details and appraised quality using a checklist of quality indicators for NLP studies. We identified 37 studies developing and applying NLP in heart failure, imaging, coronary artery disease, electrophysiology, general cardiology and valvular heart disease. Most studies used NLP to identify patients with a specific diagnosis and extract disease severity using rule-based NLP methods. Some used NLP algorithms to predict clinical outcomes. A major limitation is the inability to aggregate findings across studies due to vastly different NLP methods, evaluation and reporting. This review reveals numerous opportunities for future NLP work in cardiology with more diverse patient samples, cardiac diseases, datasets, methods and applications.

Download Full-text

Development of algorithm for classification smoking status from unstructured bilingual electronic health records based on natural language processing (Preprint)

10.2196/preprints.26978 ◽

2021 ◽

Author(s):

Ye Seul Bae ◽

Kyung Hwan Kim ◽

Han Kyul Kim ◽

Sae Won Choi ◽

Taehoon Ko ◽

...

Keyword(s):

Natural Language Processing ◽

Electronic Health Records ◽

Natural Language ◽

Language Processing ◽

Smoking Status ◽

Svm Classifier ◽

Keyword Extraction ◽

Health Records ◽

Clinical Notes ◽

Electronic Health

BACKGROUND Smoking is a major risk factor and important variable for clinical research, but there are few studies regarding automatic obtainment of smoking classification from unstructured bilingual electronic health records (EHR). OBJECTIVE We aim to develop an algorithm to classify smoking status based on unstructured EHRs using natural language processing (NLP). METHODS With acronym replacement and Python package Soynlp, we normalize 4,711 bilingual clinical notes. Each EHR notes was classified into 4 categories: current smokers, past smokers, never smokers, and unknown. Subsequently, SPPMI (Shifted Positive Point Mutual Information) is used to vectorize words in the notes. By calculating cosine similarity between these word vectors, keywords denoting the same smoking status are identified. RESULTS Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. These extracted keywords are used in classifying 4 smoking statuses from our bilingual clinical notes. Given an identical SVM classifier, the extracted keywords improve the F1 score by as much as 1.8% compared to those of the unigram and bigram Bag of Words. CONCLUSIONS Our study shows the potential of SPPMI in classifying smoking status from bilingual, unstructured EHRs. Our current findings show how smoking information can be easily acquired and used for clinical practice and research.

Download Full-text

Natural Language Processing for Rapid Response to Emergent Diseases: Case Study of Calcium Channel Blockers and Hypertension in the COVID-19 Pandemic

Journal of Medical Internet Research ◽

10.2196/20773 ◽

2020 ◽

Vol 22 (8) ◽

pp. e20773 ◽

Cited By ~ 1

Author(s):

Antoine Neuraz ◽

Ivan Lerner ◽

William Digan ◽

Nicolas Paris ◽

Rosy Tsopra ◽

...

Keyword(s):

Natural Language Processing ◽

Text Mining ◽

Natural Language ◽

Calcium Channel ◽

Language Processing ◽

Calcium Channel Blockers ◽

Structured Data ◽

Channel Blockers ◽

Knowledge Model ◽

Long Term Treatment

Background A novel disease poses special challenges for informatics solutions. Biomedical informatics relies for the most part on structured data, which require a preexisting data or knowledge model; however, novel diseases do not have preexisting knowledge models. In an emergent epidemic, language processing can enable rapid conversion of unstructured text to a novel knowledge model. However, although this idea has often been suggested, no opportunity has arisen to actually test it in real time. The current coronavirus disease (COVID-19) pandemic presents such an opportunity. Objective The aim of this study was to evaluate the added value of information from clinical text in response to emergent diseases using natural language processing (NLP). Methods We explored the effects of long-term treatment by calcium channel blockers on the outcomes of COVID-19 infection in patients with high blood pressure during in-patient hospital stays using two sources of information: data available strictly from structured electronic health records (EHRs) and data available through structured EHRs and text mining. Results In this multicenter study involving 39 hospitals, text mining increased the statistical power sufficiently to change a negative result for an adjusted hazard ratio to a positive one. Compared to the baseline structured data, the number of patients available for inclusion in the study increased by 2.95 times, the amount of available information on medications increased by 7.2 times, and the amount of additional phenotypic information increased by 11.9 times. Conclusions In our study, use of calcium channel blockers was associated with decreased in-hospital mortality in patients with COVID-19 infection. This finding was obtained by quickly adapting an NLP pipeline to the domain of the novel disease; the adapted pipeline still performed sufficiently to extract useful information. When that information was used to supplement existing structured data, the sample size could be increased sufficiently to see treatment effects that were not previously statistically detectable.

Download Full-text

Implementation of a Cohort Retrieval System for Clinical Data Repositories Using the Observational Medical Outcomes Partnership Common Data Model: Proof-of-Concept System Validation (Preprint)

10.2196/preprints.17376 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sijia Liu ◽

Yanshan Wang ◽

Andrew Wen ◽

Liwei Wang ◽

Na Hong ◽

...

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Data Model ◽

Structured Data ◽

Common Data Model ◽

Concept System ◽

Unstructured Text ◽

Electronic Health

BACKGROUND Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records. OBJECTIVE In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text—Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE). METHODS CREATE is a proof-of-concept system that leverages a combination of structured queries and information retrieval techniques on natural language processing results to improve cohort retrieval performance using the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component was used to extract common data model concepts from textual queries. We designed a hierarchical index to support the common data model concept search utilizing information retrieval techniques and frameworks. RESULTS Our case study on 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient-level and document-level, demonstrates that CREATE achieves a mean precision at 5 of 0.90, which outperforms systems using only structured data or only unstructured text with mean precision at 5 values of 0.54 and 0.74, respectively. CONCLUSIONS The implementation and evaluation of Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that only use one of either structured data or unstructured text in complex textual cohort queries.

Download Full-text

Task definition, annotated dataset, and supervised natural language processing models for symptom extraction from unstructured clinical notes

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2019.103354 ◽

2020 ◽

Vol 102 ◽

pp. 103354

Author(s):

Jackson M. Steinkamp ◽

Wasif Bala ◽

Abhinav Sharma ◽

Jacob J. Kantrowitz

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Notes ◽

Task Definition

Download Full-text

Identifying Heart Failure Symptoms and Poor Self-Management in Home Healthcare: A Natural Language Processing Study

10.3233/shti210653 ◽

2021 ◽

Author(s):

Sena Chae ◽

Jiyoun Song ◽

Marietta Ojo ◽

Maxim Topaz

Keyword(s):

Heart Failure ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Symptom Management ◽

Home Healthcare ◽

Self Management ◽

Clinical Notes ◽

Patients With Heart Failure

The goal of this natural language processing (NLP) study was to identify patients in home healthcare with heart failure symptoms and poor self-management (SM). The preliminary lists of symptoms and poor SM status were identified, NLP algorithms were used to refine the lists, and NLP performance was evaluated using 2.3 million home healthcare clinical notes. The overall precision to identify patients with heart failure symptoms and poor SM status was 0.86. The feasibility of methods was demonstrated to identify patients with heart failure symptoms and poor SM documented in home healthcare notes. This study facilitates utilizing key symptom information and patients’ SM status from unstructured data in electronic health records. The results of this study can be applied to better individualize symptom management to support heart failure patients’ quality-of-life.

Download Full-text

A Natural Language Processing Approach to Automated Highlighting of New Information in Clinical Notes

Applied Sciences ◽

10.3390/app10082824 ◽

2020 ◽

Vol 10 (8) ◽

pp. 2824

Author(s):

Yu-Hsiang Su ◽

Ching-Ping Chao ◽

Ling-Chien Hung ◽

Sheng-Feng Sung ◽

Pei-Ju Lee

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Task Performance ◽

Language Processing ◽

Automated Identification ◽

Clinical Notes ◽

New Information ◽

Perceived Workload ◽

The Impact ◽

User Experiment

Electronic medical records (EMRs) have been used extensively in most medical institutions for more than a decade in Taiwan. However, information overload associated with rapid accumulation of large amounts of clinical narratives has threatened the effective use of EMRs. This situation is further worsened by the use of “copying and pasting”, leading to lots of redundant information in clinical notes. This study aimed to apply natural language processing techniques to address this problem. New information in longitudinal clinical notes was identified based on a bigram language model. The accuracy of automated identification of new information was evaluated using expert annotations as the reference standard. A two-stage cross-over user experiment was conducted to evaluate the impact of highlighting of new information on task demands, task performance, and perceived workload. The automated method identified new information with an F1 score of 0.833. The user experiment found a significant decrease in perceived workload associated with a significantly higher task performance. In conclusion, automated identification of new information in clinical notes is feasible and practical. Highlighting of new information enables healthcare professionals to grasp key information from clinical notes with less perceived workload.

Download Full-text