Identification of Atherosclerotic and Cardiovascular Clinical Phenotypes in Spanish Electronic Health Records: Assessment of an Automated Information Extraction System. (Preprint)

2020 ◽  
Author(s):  
Ignacio Hernández-Medrano ◽  
Marisa Serrano ◽  
Sergio Collazo ◽  
Ana López-Ballesteros ◽  
Blai Coll ◽  
...  

BACKGROUND Research efforts to develop strategies to effectively identify patients and to reduce the burden of cardiovascular diseases are essential for the future of the health system. Most research studies have used only the coded parts of electronic health records (EHRs) for case detection, which misses cases and reduces study quality. Incorporating information from free text into case detection through Natural Language Processing (NLP) techniques improves research quality. SAVANA was developed as an innovative data-driven system, based on NLP and big data techniques, designed to retrieve prominent biomedical information from narrative clinical notes and to make the most of the large amount of information contained in Spanish EHRs. OBJECTIVE The aim of this work is to assess the performance of SAVANA when identifying concepts within the cardiovascular domain in Spanish EHRs. METHODS SAVANA is a platform for accelerating clinical research, based on real-time dynamic exploitation of all the information contained in EHR corpora. It uses its own technology (EHRead) to allow unstructured information contained in EHRs to be analysed and expressed by means of medical concepts that capture the most significant information in the text. RESULTS The evaluation corpus consisted of a stratified random sample of patients from 3 Spanish sites. For site 01, the corpus contained a total of 280 mentions of cardiovascular clinical entities, of which 249 were correctly identified, obtaining a P=0.93. In site 02, SAVANA correctly detected 53 mentions of cardiovascular entities among 57 annotations, achieving a P=0.98; and in site 03, among 165 manual annotations, 75 were correctly identified, yielding a P=0.99. CONCLUSIONS This research clearly demonstrates the ability of SAVANA to identify mentions of atherosclerotic/cardiovascular clinical phenotypes in Spanish EHRs, as well as to retrieve patients and records related to this pathology.

2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
N Cruz ◽  
M Serrano ◽  
A Lopez ◽  
I H Medrano ◽  
J Lozano ◽  
...  

Abstract Background Research efforts to develop strategies to effectively identify patients and reduce the burden of cardiovascular diseases are essential for the future of the health system. Most research studies have used only the coded parts of electronic health records (EHRs) for case detection, which misses cases, reduces study quality and in some cases biases findings. Incorporating information from free text into case detection through Big Data and Artificial Intelligence techniques improves research quality. Savana has developed EHRead, a powerful technology that applies Natural Language Processing, Machine Learning and Deep Learning to analyse and automatically extract highly valuable medical information from the unstructured free text contained in the EHR to support research and practice. Purpose We aimed to validate the linguistic accuracy of Savana, in terms of Precision (P), Recall (R) and overall performance (F-score), in the cardiovascular domain, since cardiovascular disease is one of the most prevalent diseases in the general population. This means validating the extent to which the Savana system identifies mentions of atherosclerotic/cardiovascular clinical phenotypes in EHRs. Methods The project was conducted in 3 Spanish sites and the system was validated using a corpus that consisted of 739 EHRs, including emergency, medical and discharge records, written in free text. These EHRs were randomly selected from the total number of clinical documents generated during the period 2012–2017 and were fully anonymized to comply with legal and ethical requirements. Two physicians per site reviewed the randomly selected records and annotated all direct references to atherosclerotic/cardiovascular clinical phenotypes, following previously developed annotation guidelines. A third physician adjudicated discordant annotations. Savana's performance was automatically calculated using the gold standard created by the experts as the validation resource. Results We found good levels of performance achieved by Savana in the identification of mentions of atherosclerotic/cardiovascular clinical phenotypes, yielding an overall P, R, and F-score of 0.97, 0.92, and 0.94, respectively. We also found that going through all the EHRs and identifying the mentions of atherosclerotic/cardiovascular clinical phenotypes took the expert about 60 h and Savana about 36 min. Conclusion(s) Innovative techniques to identify atherosclerotic/cardiovascular clinical phenotypes could be used to support real-world data research and clinical practice. Overall, Savana showed high performance, comparable with that obtained by an expert physician annotator doing the same task. Additionally, the automatic information extraction system considerably reduced the time required.
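As a reading aid, the relationship between the three reported metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the F-score is assumed to be the balanced F1 (harmonic mean of P and R), and the abstract does not publish the underlying true-positive, false-positive, and false-negative counts, so the count-based function is shown without the study's data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from mention-level counts.

    tp: system mentions that match the gold standard
    fp: system mentions absent from the gold standard
    fn: gold-standard mentions the system missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from_pr(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Consistency check on the reported overall figures (P = 0.97, R = 0.92),
# assuming the reported F-score is F1.
print(round(f1_from_pr(0.97, 0.92), 2))  # prints 0.94
```

Under the F1 assumption, the reported P = 0.97 and R = 0.92 reproduce the reported F-score of 0.94.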


2019 ◽  
Vol 26 (4) ◽  
pp. 364-379 ◽  
Author(s):  
Theresa A Koleck ◽  
Caitlin Dreisbach ◽  
Philip E Bourne ◽  
Suzanne Bakken

Abstract Objective Natural language processing (NLP) of symptoms from electronic health records (EHRs) could contribute to the advancement of symptom science. We aim to synthesize the literature on the use of NLP to process or analyze symptom information documented in EHR free-text narratives. Materials and Methods Our search of 1964 records from PubMed and EMBASE was narrowed to 27 eligible articles. Data related to the purpose, free-text corpus, patients, symptoms, NLP methodology, evaluation metrics, and quality indicators were extracted for each study. Results Symptom-related information was presented as a primary outcome in 14 studies. EHR narratives represented various inpatient and outpatient clinical specialties, with general, cardiology, and mental health occurring most frequently. Studies encompassed a wide variety of symptoms, including shortness of breath, pain, nausea, dizziness, disturbed sleep, constipation, and depressed mood. NLP approaches included previously developed NLP tools, classification methods, and manually curated rule-based processing. Only one-third (n = 9) of studies reported patient demographic characteristics. Discussion NLP is used to extract information from EHR free-text narratives written by a variety of healthcare providers on an expansive range of symptoms across diverse clinical specialties. The current focus of this field is on the development of methods to extract symptom information and the use of symptom information for disease classification tasks rather than the examination of symptoms themselves. Conclusion Future NLP studies should concentrate on the investigation of symptoms and symptom documentation in EHR free-text narratives. Efforts should be undertaken to examine patient characteristics and make symptom-related NLP algorithms or pipelines and vocabularies openly available.


2021 ◽  
Vol 12 (04) ◽  
pp. 816-825
Author(s):  
Yingcheng Sun ◽  
Alex Butler ◽  
Ibrahim Diallo ◽  
Jae Hyun Kim ◽  
Casey Ta ◽  
...  

Abstract Background Clinical trials are the gold standard for generating robust medical evidence, but clinical trial results often raise generalizability concerns, which can be attributed to a lack of population representativeness. Electronic health record (EHR) data are useful for estimating the population representativeness of clinical trial study populations. Objectives This research aims to estimate the population representativeness of clinical trials systematically using EHR data during the early design stage. Methods We present an end-to-end analytical framework for transforming free-text clinical trial eligibility criteria into executable database queries conformant with the Observational Medical Outcomes Partnership Common Data Model and for systematically quantifying the population representativeness of each clinical trial. Results Using this framework, we calculated the population representativeness of 782 novel coronavirus disease 2019 (COVID-19) trials and 3,827 type 2 diabetes mellitus (T2DM) trials in the United States. Owing to overly restrictive eligibility criteria, 85.7% of the COVID-19 trials and 30.1% of the T2DM trials had poor population representativeness. Conclusion This research demonstrates the potential of using EHR data to assess the population representativeness of clinical trials, providing data-driven metrics to inform the selection and optimization of eligibility criteria.


2021 ◽  
Author(s):  
Ye Seul Bae ◽  
Kyung Hwan Kim ◽  
Han Kyul Kim ◽  
Sae Won Choi ◽  
Taehoon Ko ◽  
...  

BACKGROUND Smoking is a major risk factor and an important variable for clinical research, but few studies have addressed automatic classification of smoking status from unstructured bilingual electronic health records (EHRs). OBJECTIVE We aim to develop an algorithm to classify smoking status based on unstructured EHRs using natural language processing (NLP). METHODS With acronym replacement and the Python package Soynlp, we normalize 4,711 bilingual clinical notes. Each EHR note was classified into 4 categories: current smokers, past smokers, never smokers, and unknown. Subsequently, SPPMI (Shifted Positive Pointwise Mutual Information) is used to vectorize words in the notes. By calculating cosine similarity between these word vectors, keywords denoting the same smoking status are identified. RESULTS Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. These extracted keywords are used in classifying the 4 smoking statuses from our bilingual clinical notes. Given an identical SVM classifier, the extracted keywords improve the F1 score by as much as 1.8% compared to unigram and bigram Bag of Words features. CONCLUSIONS Our study shows the potential of SPPMI in classifying smoking status from bilingual, unstructured EHRs. Our current findings show how smoking information can be easily acquired and used for clinical practice and research.
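The vectorization step can be illustrated with a minimal pure-Python sketch of Shifted Positive Pointwise Mutual Information, SPPMI(w, c) = max(PMI(w, c) − log k, 0), followed by cosine similarity. Everything here is an assumption for illustration only: the function names, the note-level co-occurrence window, the shift value `shift_k`, and the toy English notes are hypothetical, and the abstract does not state the authors' actual settings or tokenization.

```python
import math
from collections import Counter
from itertools import combinations


def sppmi_vectors(docs, shift_k=1.0):
    """Build SPPMI word vectors from co-occurrence within each note.

    Assumption: two words co-occur if they appear in the same note.
    """
    pair_counts = Counter()
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1  # keep the matrix symmetric
    total = sum(pair_counts.values())
    word_counts = Counter()
    for (w, _c), n in pair_counts.items():
        word_counts[w] += n
    vocab = sorted(word_counts)
    vectors = {}
    for w in vocab:
        row = []
        for c in vocab:
            n_wc = pair_counts.get((w, c), 0)
            if n_wc == 0:
                row.append(0.0)  # unseen pair: PMI is -inf, clipped to 0
                continue
            pmi = math.log(n_wc * total / (word_counts[w] * word_counts[c]))
            row.append(max(pmi - math.log(shift_k), 0.0))
        vectors[w] = row
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical, already-normalized toy notes (not from the study).
notes = [
    ["current", "smoker", "1ppd"],
    ["smoker", "1ppd"],
    ["ex-smoker", "quit"],
    ["quit", "ex-smoker", "2010"],
    ["never", "smoked"],
]
vocab, vectors = sppmi_vectors(notes)
same_status = cosine(vectors["smoker"], vectors["1ppd"])
other_status = cosine(vectors["smoker"], vectors["quit"])
```

In this toy corpus, "smoker" and "1ppd" share a context word and so have positive similarity, while "smoker" and "quit" share none and score zero, which is the intuition behind grouping keywords by smoking status.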


2015 ◽  
Vol 22 (6) ◽  
pp. 1220-1230 ◽  
Author(s):  
Huan Mo ◽  
William K Thompson ◽  
Luke V Rasmussen ◽  
Jennifer A Pacheco ◽  
Guoqian Jiang ◽  
...  

Abstract Background Electronic health records (EHRs) are increasingly used for clinical and translational research through the creation of phenotype algorithms. Currently, phenotype algorithms are most commonly represented as noncomputable descriptive documents and knowledge artifacts that detail the protocols for querying diagnoses, symptoms, procedures, medications, and/or text-driven medical concepts, and are primarily meant for human comprehension. We present desiderata for developing a computable phenotype representation model (PheRM). Methods A team of clinicians and informaticians reviewed common features for multisite phenotype algorithms published in PheKB.org and existing phenotype representation platforms. We also evaluated well-known diagnostic criteria and clinical decision-making guidelines to encompass a broader category of algorithms. Results We propose 10 desired characteristics for a flexible, computable PheRM: (1) structure clinical data into queryable forms; (2) recommend use of a common data model, but also support customization for the variability and availability of EHR data among sites; (3) support both human-readable and computable representations of phenotype algorithms; (4) implement set operations and relational algebra for modeling phenotype algorithms; (5) represent phenotype criteria with structured rules; (6) support defining temporal relations between events; (7) use standardized terminologies and ontologies, and facilitate reuse of value sets; (8) define representations for text searching and natural language processing; (9) provide interfaces for external software algorithms; and (10) maintain backward compatibility. Conclusion A computable PheRM is needed for true phenotype portability and reliability across different EHR products and healthcare systems. These desiderata are a guide to inform the establishment and evolution of EHR phenotype algorithm authoring platforms and languages.
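To make desideratum (4) concrete, the sketch below models a phenotype as set operations over patient identifiers. It is a hypothetical illustration, not a PheKB algorithm or a PheRM implementation: the criteria names, the rule itself, and the patient IDs are invented for the example.

```python
# Each criterion is the set of patient IDs that satisfy it (hypothetical data).
coded_t2dm_dx = {"p01", "p02", "p03", "p05"}   # structured diagnosis codes
on_metformin = {"p02", "p03", "p04"}           # medication records
nlp_t2dm_mention = {"p03", "p06"}              # text-derived concept
coded_t1dm_dx = {"p05", "p06"}                 # exclusion criterion


def phenotype_cases(coded_dx, medication, text_mention, exclusion):
    """Hypothetical rule: (coded diagnosis AND medication) OR text mention,
    minus any patient who meets the exclusion criterion."""
    return ((coded_dx & medication) | text_mention) - exclusion


cases = phenotype_cases(coded_t2dm_dx, on_metformin, nlp_t2dm_mention, coded_t1dm_dx)
print(sorted(cases))  # prints ['p02', 'p03']
```

Expressing each criterion as a set keeps the rule both human-readable and directly executable, which is the dual requirement that desideratum (3) describes.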


BMJ Open ◽  
2019 ◽  
Vol 9 (10) ◽  
pp. e031373 ◽  
Author(s):  
Jennifer Anne Davidson ◽  
Amitava Banerjee ◽  
Rutendo Muzambi ◽  
Liam Smeeth ◽  
Charlotte Warren-Gash

Introduction Cardiovascular diseases (CVDs) are among the leading causes of death globally. Electronic health records (EHRs) provide a rich data source for research on CVD risk factors, treatments and outcomes. Researchers must be confident in the validity of diagnoses in EHRs, particularly when diagnosis definitions and use of EHRs change over time. Our systematic review provides an up-to-date appraisal of the validity of stroke, acute coronary syndrome (ACS) and heart failure (HF) diagnoses in European primary and secondary care EHRs. Methods and analysis We will systematically review the published and grey literature to identify studies validating diagnoses of stroke, ACS and HF in European EHRs. MEDLINE, EMBASE, SCOPUS, Web of Science, Cochrane Library, OpenGrey and EThOS will be searched from the dates of inception to April 2019. A prespecified search strategy of subject headings and free-text terms in the title and abstract will be used. Two reviewers will independently screen titles and abstracts to identify eligible studies, followed by full-text review. We require studies to compare clinical codes with a suitable reference standard. Additionally, at least one validation measure (sensitivity, specificity, positive predictive value or negative predictive value) or raw data, for the calculation of a validation measure, is necessary. We will then extract data from the eligible studies using standardised tables and assess risk of bias in individual studies using the Quality Assessment of Diagnostic Accuracy Studies 2 tool. Data will be synthesised into a narrative format and heterogeneity assessed. Meta-analysis will be considered when a sufficient number of homogeneous studies are available. The overall quality of evidence will be assessed using the Grading of Recommendations, Assessment, Development and Evaluation tool. Ethics and dissemination This is a systematic review, so it does not require ethical approval. Our results will be submitted for peer-review publication. PROSPERO registration number CRD42019123898
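The four validation measures named in the inclusion criteria can all be derived from a 2×2 table of clinical codes against the reference standard. The following is a minimal sketch under stated assumptions: the class and function names are hypothetical, and the counts in the usage lines are invented for illustration rather than taken from any reviewed study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 table of EHR code (test) versus reference standard (truth)."""
    tp: int  # coded and confirmed by the reference standard
    fp: int  # coded but not confirmed
    fn: int  # not coded but present in the reference standard
    tn: int  # not coded and absent in the reference standard


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN when a measure is undefined (empty denominator).
    return numerator / denominator if denominator else float("nan")


def validation_measures(c: ConfusionCounts) -> dict[str, float]:
    """Compute sensitivity, specificity, PPV and NPV from raw counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Hypothetical counts for a coded diagnosis checked against chart review.
measures = validation_measures(ConfusionCounts(tp=80, fp=20, fn=10, tn=890))
```

This is why raw data alone can satisfy the inclusion criterion: any study that reports the four cell counts allows each measure to be recalculated.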


2018 ◽  
Author(s):  
Kohei Kajiyama ◽  
Hiromasa Horiguchi ◽  
Takashi Okumura ◽  
Mizuki Morita ◽  
Yoshinobu Kano
