Using topic modelling for unsupervised annotation of electronic health records to identify an outbreak of disease in UK dogs

A key goal of disease surveillance is to identify outbreaks of known or novel diseases in a timely manner. Such an outbreak occurred in the UK associated with acute vomiting in dogs between December 2019 and March 2020. We tracked this outbreak using the clinical free text component of anonymised electronic health records (EHRs) collected from a sentinel network of participating veterinary practices. We sourced the free text (narrative) component of each EHR supplemented with one of 10 practitioner-derived main presenting complaints (MPCs), with the ‘gastroenteric’ MPC identifying cases involved in the disease outbreak. Such clinician-derived annotation systems can suffer from poor compliance requiring retrospective, often manual, coding, thereby limiting real-time usability, especially where an outbreak of a novel disease might not present clinically as a currently recognised syndrome or MPC. Here, we investigate the use of an unsupervised method of EHR annotation using latent Dirichlet allocation topic-modelling to identify topics inherent within the clinical narrative component of EHRs. The model comprised 30 topics which were used to annotate EHRs spanning the natural disease outbreak and investigate whether any given topic might mirror the outbreak time-course. Narratives were annotated using the Gensim Library LdaModel module for the topic best representing the text within them. Counts for narratives labelled with one of the topics significantly matched the disease outbreak based on the practitioner-derived ‘gastroenteric’ MPC (Spearman correlation 0.978); no other topics showed a similar time course. Using artificially injected outbreaks, it was possible to see other topics that would match other MPCs including respiratory disease. The underlying topics were readily evaluated using simple word-cloud representations and using a freely available package (LDAVis) providing rapid insight into the clinical basis of each topic. 
This work clearly shows that unsupervised record annotation using topic modelling linked to simple text visualisations can provide an easily interrogable method to identify and characterise outbreaks and other anomalies of known and previously uncharacterised diseases based on changes in clinical narratives.
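
One way to check whether a topic's time course mirrors that of an MPC is a rank correlation between the two weekly count series. The sketch below computes a Spearman correlation in plain Python. The weekly counts are invented for illustration and the helper names are hypothetical; nothing here is taken from the study's own code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical weekly counts: narratives labelled with one topic, and
# consultations carrying the practitioner-assigned MPC.
topic_counts = [12, 15, 30, 55, 80, 60, 35, 20]
mpc_counts = [40, 42, 70, 120, 160, 130, 90, 50]
rho = spearman(topic_counts, mpc_counts)
```

Running the same comparison for every topic against every MPC would give a table from which the best-matching topic can be read off.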

A Framework for Systematic Assessment of Clinical Trial Population Representativeness Using Electronic Health Records Data

Applied Clinical Informatics ◽

10.1055/s-0041-1733846 ◽

2021 ◽

Vol 12 (04) ◽

pp. 816-825

Author(s):

Yingcheng Sun ◽

Alex Butler ◽

Ibrahim Diallo ◽

Jae Hyun Kim ◽

Casey Ta ◽

...

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Electronic Health Records ◽

The United States ◽

Design Stage ◽

Common Data Model ◽

Free Text ◽

Eligibility Criteria ◽

Health Records ◽

Electronic Health

Background: Clinical trials are the gold standard for generating robust medical evidence, but their results often raise generalizability concerns, which can be attributed to a lack of population representativeness. Electronic health record (EHR) data are useful for estimating the population representativeness of a clinical trial's study population. Objectives: This research aims to estimate the population representativeness of clinical trials systematically using EHR data during the early design stage. Methods: We present an end-to-end analytical framework for transforming free-text clinical trial eligibility criteria into executable database queries conformant with the Observational Medical Outcomes Partnership Common Data Model and for systematically quantifying the population representativeness of each clinical trial. Results: Using this framework, we calculated the population representativeness of 782 novel coronavirus disease 2019 (COVID-19) trials and 3,827 type 2 diabetes mellitus (T2DM) trials in the United States. With the use of overly restrictive eligibility criteria, 85.7% of the COVID-19 trials and 30.1% of the T2DM trials had poor population representativeness. Conclusion: This research demonstrates the potential of using EHR data to assess the population representativeness of clinical trials, providing data-driven metrics to inform the selection and optimization of eligibility criteria.
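
To make the idea of criteria-driven representativeness concrete, the sketch below computes the share of a patient cohort that satisfies a set of structured eligibility criteria. This is only a simple proxy chosen for illustration: it is not the metric used in the paper, it does not query an OMOP database, and the `Patient` and `Criteria` fields and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: int
    hba1c_percent: float
    on_insulin: bool


@dataclass
class Criteria:
    min_age: int
    max_age: int
    max_hba1c_percent: float
    exclude_insulin_users: bool


def is_eligible(patient, criteria):
    """Apply each structured criterion in turn; any failure excludes the patient."""
    if not criteria.min_age <= patient.age <= criteria.max_age:
        return False
    if patient.hba1c_percent > criteria.max_hba1c_percent:
        return False
    if criteria.exclude_insulin_users and patient.on_insulin:
        return False
    return True


def eligible_fraction(patients, criteria):
    """Share of the cohort that would pass the criteria."""
    if not patients:
        raise ValueError("cohort is empty")
    return sum(is_eligible(p, criteria) for p in patients) / len(patients)


# Hypothetical cohort and two hypothetical sets of criteria.
cohort = [
    Patient(45, 7.2, False),
    Patient(82, 7.0, False),
    Patient(60, 10.5, False),
    Patient(55, 8.0, True),
]
strict = Criteria(18, 75, 9.0, True)
relaxed = Criteria(18, 90, 11.0, False)
strict_share = eligible_fraction(cohort, strict)
relaxed_share = eligible_fraction(cohort, relaxed)
```

Comparing the two shares shows how tightening individual criteria shrinks the eligible population.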

Validity of acute cardiovascular outcome diagnoses in European electronic health records: a systematic review protocol

BMJ Open ◽

10.1136/bmjopen-2019-031373 ◽

2019 ◽

Vol 9 (10) ◽

pp. e031373 ◽

Cited By ~ 1

Author(s):

Jennifer Anne Davidson ◽

Amitava Banerjee ◽

Rutendo Muzambi ◽

Liam Smeeth ◽

Charlotte Warren-Gash

Keyword(s):

Systematic Review ◽

Electronic Health Records ◽

Predictive Value ◽

Grey Literature ◽

Cochrane Library ◽

Free Text ◽

Health Records ◽

Coronary Syndrome ◽

Validation Measure ◽

Electronic Health

Introduction: Cardiovascular diseases (CVDs) are among the leading causes of death globally. Electronic health records (EHRs) provide a rich data source for research on CVD risk factors, treatments and outcomes. Researchers must be confident in the validity of diagnoses in EHRs, particularly when diagnosis definitions and use of EHRs change over time. Our systematic review provides an up-to-date appraisal of the validity of stroke, acute coronary syndrome (ACS) and heart failure (HF) diagnoses in European primary and secondary care EHRs. Methods and analysis: We will systematically review the published and grey literature to identify studies validating diagnoses of stroke, ACS and HF in European EHRs. MEDLINE, EMBASE, SCOPUS, Web of Science, Cochrane Library, OpenGrey and EThOS will be searched from the dates of inception to April 2019. A prespecified search strategy of subject headings and free-text terms in the title and abstract will be used. Two reviewers will independently screen titles and abstracts to identify eligible studies, followed by full-text review. We require studies to compare clinical codes with a suitable reference standard. Additionally, at least one validation measure (sensitivity, specificity, positive predictive value or negative predictive value) or raw data, for the calculation of a validation measure, is necessary. We will then extract data from the eligible studies using standardised tables and assess risk of bias in individual studies using the Quality Assessment of Diagnostic Accuracy Studies 2 tool. Data will be synthesised into a narrative format and heterogeneity assessed. Meta-analysis will be considered when a sufficient number of homogeneous studies are available. The overall quality of evidence will be assessed using the Grading of Recommendations, Assessment, Development and Evaluation tool. Ethics and dissemination: This is a systematic review, so it does not require ethical approval. Our results will be submitted for peer-review publication. PROSPERO registration number: CRD42019123898

De-identifying Free Text of Japanese Dummy Electronic Health Records

10.18653/v1/w18-5608 ◽

2018 ◽

Author(s):

Kohei Kajiyama ◽

Hiromasa Horiguchi ◽

Takashi Okumura ◽

Mizuki Morita ◽

Yoshinobu Kano

Keyword(s):

Electronic Health Records ◽

Free Text ◽

Health Records ◽

Electronic Health

Abstract MP21: Feasibility of Electronic Health Records-based community surveillance of cardiovascular disease: Findings from the Atherosclerosis Risk in Communities Study.

Circulation ◽

10.1161/circ.137.suppl_1.mp21 ◽

2018 ◽

Vol 137 (suppl_1) ◽

Author(s):

Brittany M Bogle ◽

Wayne D Rosamond ◽

Aaron R Folsom ◽

Paul Sorlie ◽

Elsayed Z Soliman ◽

...

Keyword(s):

Cardiovascular Disease ◽

Electronic Health Records ◽

Cardiac Biomarkers ◽

Free Text ◽

Health Records ◽

Efficient System ◽

Atherosclerosis Risk In Communities ◽

Atherosclerosis Risk ◽

Electronic Health ◽

Aric Study

Background: Accurate community surveillance of cardiovascular disease requires hospital record abstraction, which is typically a manual process. The costly and time-intensive nature of manual abstraction precludes its use on a regional or national scale in the US. Whether an efficient system can accurately reproduce traditional community surveillance methods by processing electronic health records (EHRs) has not been established. Objective: We sought to develop and test an EHR-based system to reproduce abstraction and classification procedures for acute myocardial infarction (MI) as defined by the Atherosclerosis Risk in Communities (ARIC) Study. Methods: Records from hospitalizations in 2014 within ARIC community surveillance areas were sampled using a broad set of ICD discharge codes likely to harbor MI. These records were manually abstracted by ARIC study personnel and used to classify MI according to ARIC protocols. We requested EHRs in a unified data structure for the same hospitalizations at 6 hospitals and built programs to convert free text and structured data into the ARIC criteria elements necessary for MI classification. Per ARIC protocol, MI was classified based on cardiac biomarkers, cardiac pain, and Minnesota-coded electrocardiogram abnormalities. We compared MI classified from manually abstracted data to (1) EHR-based classification and (2) final ICD-9 coded discharge diagnoses (410-414). Results: These preliminary results are based on hospitalizations from 1 hospital. Of 684 hospitalizations, 355 qualified for full manual abstraction; 83 (23%) of these were classified as definite MI and 78 (22%) as probable MI. Our EHR-based abstraction is sensitive (>75%) and highly specific (>83%) in classifying ARIC-defined definite MI and definite or probable MI (Table). 
Conclusions: Our results support the potential of a process to extract comprehensive sets of data elements from EHR from different hospitals, with completeness and accuracy sufficient for a standardized definition of hospitalized MI.
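
The comparison against ICD-9 discharge diagnoses in the range 410-414 amounts to a range check on the three-digit category of each code. The following is a minimal sketch that assumes codes are stored as strings with a decimal point (for example `"410.71"`). The function names are hypothetical and this is not the ARIC software.

```python
def in_icd9_range(code, low=410, high=414):
    """Return True if the ICD-9 category (the part before the dot) is in range.

    Codes whose category is not purely numeric (such as V or E codes) are
    treated as out of range.
    """
    category = code.strip().split(".")[0]
    if not (category.isascii() and category.isdigit()):
        return False
    return low <= int(category) <= high


def has_qualifying_discharge_code(codes):
    """Return True if any discharge code on a hospitalization is in 410-414."""
    return any(in_icd9_range(code) for code in codes)


# Hypothetical discharge code lists for two hospitalizations.
stay_a = ["410.71", "401.9"]
stay_b = ["428.0", "V45.81"]
flag_a = has_qualifying_discharge_code(stay_a)
flag_b = has_qualifying_discharge_code(stay_b)
```

Codes stored without the decimal point would need to be normalised first, since this sketch reads everything before the dot as the category.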

Incorporating natural language processing to improve classification of axial spondyloarthritis using electronic health records

Rheumatology ◽

10.1093/rheumatology/kez375 ◽

2019 ◽

Vol 59 (5) ◽

pp. 1059-1065 ◽

Cited By ~ 1

Author(s):

Sizheng Steven Zhao ◽

Chuan Hong ◽

Tianrun Cai ◽

Chang Xu ◽

Jie Huang ◽

...

Keyword(s):

Electronic Health Records ◽

Predictive Value ◽

Area Under The Curve ◽

Free Text ◽

Text Data ◽

Health Records ◽

Disease Concepts ◽

Icd Codes ◽

Electronic Health

Objectives: To develop classification algorithms that accurately identify axial SpA (axSpA) patients in electronic health records, and compare the performance of algorithms incorporating free-text data against approaches using only International Classification of Diseases (ICD) codes. Methods: An enriched cohort of 7853 eligible patients was created from electronic health records of two large hospitals using automated searches (⩾1 ICD codes combined with simple text searches). Key disease concepts from free-text data were extracted using NLP and combined with ICD codes to develop algorithms. We created both supervised regression-based algorithms (on a training set of 127 axSpA cases and 423 non-cases) and unsupervised algorithms to identify patients with high probability of having axSpA from the enriched cohort. Their performance was compared against classifications using ICD codes only. Results: NLP extracted four disease concepts of high predictive value: ankylosing spondylitis, sacroiliitis, HLA-B27 and spondylitis. The unsupervised algorithm, incorporating both the NLP concept and ICD code for AS, identified the greatest number of patients. By setting the probability threshold to attain 80% positive predictive value, it identified 1509 axSpA patients (mean age 53 years, 71% male). Sensitivity was 0.78, specificity 0.94 and area under the curve 0.93. The two supervised algorithms performed similarly but identified fewer patients. All three outperformed traditional approaches using ICD codes alone (area under the curve 0.80–0.87). Conclusion: Algorithms incorporating free-text data can accurately identify axSpA patients in electronic health records. Large cohorts identified using these novel methods offer exciting opportunities for future clinical research.
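
Setting a probability threshold to attain a target positive predictive value can be illustrated on a small labelled sample. The sketch below scans candidate thresholds from the highest score downwards and keeps the lowest one at which the flagged set (scores at or above the threshold) still meets the target. The scores and labels are invented, and this is not the authors' procedure.

```python
def threshold_for_ppv(scores, labels, target_ppv=0.8):
    """Return the lowest threshold whose flagged set reaches target_ppv.

    scores: predicted probabilities; labels: 1 for a true case, 0 otherwise.
    A patient is flagged when score >= threshold. Returns None if no
    threshold attains the target.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_pos = 0
    for n_flagged, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Evaluate only at the end of a run of tied scores, because a
        # threshold cannot separate patients who share the same score.
        end_of_tie = n_flagged == len(ranked) or ranked[n_flagged][0] != score
        if end_of_tie and true_pos / n_flagged >= target_ppv:
            best = score
    return best


# Hypothetical validation sample.
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
sample_labels = [1, 1, 0, 1, 0]
chosen = threshold_for_ppv(sample_scores, sample_labels, target_ppv=0.8)
```

Choosing the lowest qualifying threshold maximises the number of patients identified while keeping the positive predictive value at or above the target.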

Documentation of social determinants in electronic health records with and without standardized terminologies: A comparative study

Proceedings of Singapore Healthcare ◽

10.1177/2010105818785641 ◽

2018 ◽

Vol 28 (1) ◽

pp. 39-47 ◽

Cited By ~ 1

Author(s):

Karen A Monsen ◽

Joyce M Rudenick ◽

Nicole Kapinos ◽

Kathryn Warmbold ◽

Siobhan K McMahon ◽

...

Keyword(s):

Electronic Health Records ◽

Free Text ◽

Snomed Ct ◽

Health Records ◽

Behavioral Determinants ◽

Omaha System ◽

Standardized Terminology ◽

Electronic Health ◽

Data Elements ◽

Improve Health

Background: Electronic health records (EHRs) are a promising new source of population health data that may improve health outcomes. However, little is known about the extent to which social and behavioral determinants of health (SBDH) are currently documented in EHRs, including how SBDH are documented, and by whom. Standardized nursing terminologies have been developed to assess and document SBDH. Objective: We examined the documentation of SBDH in EHRs with and without standardized nursing terminologies. Methods: We carried out a review of the literature for SBDH phrases organized by topic, which were used for analyses. Key informant interviews were conducted regarding SBDH phrases. Results: In nine EHRs (six acute care, three community care) 107 SBDH phrases were documented using free text, structured text, and standardized terminologies in diverse screens and by multiple clinicians, admitting personnel, and other staff. SBDH phrases were documented using one of three standardized terminologies (N = average number of phrases per terminology per EHR): ICD-9/10 (N = 1); SNOMED CT (N = 1); Omaha System (N = 79). Most often, standardized terminology data were documented by nurses or other clinical staff versus receptionists or other non-clinical personnel. Documentation ‘unknown’ differed significantly between EHRs with and without the Omaha System (mean = 26.0 (standard deviation (SD) = 8.7) versus mean = 74.5 (SD = 16.5)) (p = .005). SBDH documentation in EHRs differed based on the presence of a nursing terminology. Conclusions: The Omaha System enabled a more comprehensive, holistic assessment and documentation of interoperable SBDH data. Further research is needed to determine SBDH data elements that are needed across settings, the uses of SBDH data in practice, and to examine patient perspectives related to SBDH assessments.

Unlocking the Potential of Electronic Health Records for Health Research

International Journal for Population Data Science ◽

10.23889/ijpds.v5i1.1123 ◽

2020 ◽

Vol 5 (1) ◽

Cited By ~ 1

Author(s):

Seungwon Lee ◽

Yuan Xu ◽

Adam G D'Souza ◽

Elliot A Martin ◽

Chelsea Doktorchik ◽

...

Keyword(s):

Electronic Health Records ◽

Health Research ◽

Care Delivery ◽

Free Text ◽

Imaging Data ◽

Health Records ◽

Data Source ◽

Electronic Health ◽

Data Elements ◽

The City

Electronic health records (EHRs), originally designed to facilitate health care delivery, are becoming a valuable data source for health research. EHR systems have two components: the front end, where the data is entered by healthcare workers including physicians and nurses, and the back-end electronic data warehouse where the data is stored in a relational database. EHR data elements can be of many types, which can be categorized as structured, unstructured free-text, and imaging data. The Sunrise Clinical Manager (SCM) EHR is one example of an inpatient EHR system, which covers the city of Calgary (Alberta, Canada). This system, under the management of Alberta Health Services, is now being explored for research use. The purpose of the present paper is to describe the SCM EHR for research purposes, showing how this generalizes to EHRs in general. We further discuss advantages, challenges (e.g. potential bias and data quality issues), and analytical capacities and requirements associated with using EHRs.

Real-time clinician text feeds from electronic health records

10.1101/2020.10.02.20205617 ◽

2020 ◽

Author(s):

James Teo ◽

Vlad Dinu ◽

William Bernal ◽

Phil Davidson ◽

Vitaliy Oliynyk ◽

...

Keyword(s):

Social Media ◽

Electronic Health Records ◽

Real Time ◽

Capacity Planning ◽

Low Cost ◽

Free Text ◽

Record System ◽

Health Records ◽

Keywords And Phrases ◽

Electronic Health

Analyses of search engine and social media feeds have been attempted for infectious disease outbreaks [1], but have been found to be susceptible to artefactual distortions from health scares or keyword spamming in social media or the public internet [2–4]. We describe an approach using real-time aggregation of keywords and phrases of free text from real-time clinician-generated documentation in electronic health records to produce a customisable real-time viral pneumonia signal providing up to 2 days' warning for secondary care capacity planning. This low-cost approach is open-source, is locally customisable, is not dependent on any specific electronic health record system and can be deployed at multiple organisational scales.
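
Keyword aggregation of this kind can be sketched as a per-day count of clinical notes that mention at least one phrase from a watch list. The example below is an illustration only: the phrases, note texts and function name are hypothetical, and it does not reproduce the authors' open-source pipeline or handle negated mentions.

```python
import re
from collections import Counter
from datetime import date


def daily_keyword_counts(notes, phrases):
    """Count, for each day, the notes that mention at least one phrase.

    notes: iterable of (day, text) pairs; phrases: list of keyword strings.
    Matching is case-insensitive and respects word boundaries.
    """
    if not phrases:
        raise ValueError("at least one phrase is required")
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for day, text in notes:
        if pattern.search(text):
            counts[day] += 1
    return counts


# Hypothetical watch list and clinician notes.
watch_list = ["viral pneumonia", "ground glass"]
example_notes = [
    (date(2020, 3, 1), "Likely viral pneumonia on CXR."),
    (date(2020, 3, 1), "No acute findings."),
    (date(2020, 3, 2), "Ground glass changes; ?Viral Pneumonia."),
]
signal = daily_keyword_counts(example_notes, watch_list)
```

Because each note is counted at most once per day, a note that repeats a phrase several times does not inflate the signal.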

Natural language processing for disease phenotyping in UK primary care records for research: a pilot study in myocardial infarction and death

Journal of Biomedical Semantics ◽

10.1186/s13326-019-0214-4 ◽

2019 ◽

Vol 10 (S1) ◽

Cited By ~ 1

Author(s):

Anoop D. Shah ◽

Emily Bailey ◽

Tim Williams ◽

Spiros Denaxas ◽

Richard Dobson ◽

...

Keyword(s):

Myocardial Infarction ◽

Primary Care ◽

Electronic Health Records ◽

Natural Language ◽

Cause Of Death ◽

Free Text ◽

Health Records ◽

Death Registry ◽

Primary Care Record ◽

Electronic Health

Background: Free text in electronic health records (EHR) may contain additional phenotypic information beyond structured (coded) information. For major health events (heart attack and death) there is a lack of studies evaluating the extent to which free text in the primary care record might add information. Our objectives were to describe the contribution of free text in primary care to the recording of information about myocardial infarction (MI), including subtype, left ventricular function, laboratory results and symptoms; and recording of cause of death. We used the CALIBER EHR research platform which contains primary care data from the Clinical Practice Research Datalink (CPRD) linked to hospital admission data, the MINAP registry of acute coronary syndromes and the death registry. In CALIBER we randomly selected 2000 patients with MI and 1800 deaths. We implemented a rule-based natural language engine, the Freetext Matching Algorithm, on site at CPRD to analyse free text in the primary care record without raw data being released to researchers. We analysed text recorded within 90 days before or 90 days after the MI, and on or after the date of death. Results: We extracted 10,927 diagnoses, 3658 test results, 3313 statements of negation, and 850 suspected diagnoses from the myocardial infarction patients. Inclusion of free text increased the recorded proportion of patients with chest pain in the week prior to MI from 19 to 27%, and differentiated between MI subtypes in a quarter more patients than structured data alone. Cause of death was incompletely recorded in primary care; in 36% the cause was in coded data and in 21% it was in free text. Only 47% of patients had exactly the same cause of death in primary care and the death registry, but this did not differ between coded and free text causes of death. Conclusions: Among patients who suffer MI or die, unstructured free text in primary care records contains much information that is potentially useful for research such as symptoms, investigation results and specific diagnoses. Access to large scale unstructured data in electronic health records (millions of patients) might yield important insights.
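
Rule-based engines typically separate affirmed mentions of a finding from negated ones. The toy sketch below labels each mention of a term by looking for a negation cue in the three words before it within the same sentence. It is an assumption-laden illustration and not the Freetext Matching Algorithm: the cue list, window size and function name are all hypothetical.

```python
import re

NEGATION_CUES = ("no", "not", "denies", "without")


def classify_mentions(text, term, window=3):
    """Label each mention of term as 'negated' or 'affirmed'.

    A mention is 'negated' when a cue word appears among the `window`
    words that precede it in the same sentence.
    """
    term_words = term.lower().split()
    if not term_words:
        raise ValueError("term must contain at least one word")
    n = len(term_words)
    results = []
    # Treat full stops, semicolons and newlines as sentence boundaries.
    for sentence in re.split(r"[.;\n]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == term_words:
                preceding = words[max(0, i - window):i]
                negated = any(w in NEGATION_CUES for w in preceding)
                results.append("negated" if negated else "affirmed")
    return results


# Hypothetical free-text entry from a primary care record.
entry = "Complains of chest pain. No chest pain today; denies any chest pain."
labels = classify_mentions(entry, "chest pain")
```

A fixed window is a crude heuristic; it misses negations placed after the term and may misfire on long sentences, which is why production engines use richer rule sets.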
