Cohort profile: St. Michael's Hospital Tuberculosis Database (SMH-TB), a retrospective cohort of electronic health record data and variables extracted using natural language processing

Author(s):  
David Landsman ◽  
Ahmed Abdelbasit ◽  
Christine Wang ◽  
Michael Guerzhoy ◽  
Ujash Joshi ◽  
...  

Background Tuberculosis (TB) is a major cause of death worldwide. TB research draws heavily on clinical cohorts which can be generated using electronic health records (EHR), but granular information extracted from unstructured EHR data is limited. The St. Michael's Hospital TB database (SMH-TB) was established to address gaps in EHR-derived TB clinical cohorts and provide researchers and clinicians with detailed, granular data related to TB management and treatment. Methods We collected and validated multiple layers of EHR data from the TB outpatient clinic at St. Michael's Hospital, Toronto, Ontario, Canada to generate the SMH-TB database. SMH-TB contains structured data directly from the EHR, and variables generated using natural language processing (NLP) by extracting relevant information from free-text within clinic, radiology, and other notes. NLP performance was assessed using recall, precision and F1 score averaged across variable labels. We present characteristics of the cohort population using binomial proportions and 95% confidence intervals (CI), with and without adjusting for NLP misclassification errors. Results SMH-TB currently contains retrospective patient data spanning 2011 to 2018, for a total of 3298 patients (N=3237 with at least 1 associated dictation). Performance of TB diagnosis and medication NLP rulesets surpasses 93% in recall, precision and F1 metrics, indicating good generalizability. We estimated 20% (95% CI: 18.4-21.2%) were diagnosed with active TB and 46% (95% CI: 43.8-47.2%) were diagnosed with latent TB. After adjusting for potential misclassification, the proportion of patients diagnosed with active and latent TB was 18% (95% CI: 16.8-19.7%) and 40% (95% CI: 37.8-41.6%), respectively. Conclusion SMH-TB is a unique database that includes a breadth of structured data derived from structured and unstructured EHR data by using NLP rulesets.
The data are available for a variety of research applications, such as clinical epidemiology, quality improvement and mathematical modelling studies.
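The abstract reports cohort proportions with and without adjustment for NLP misclassification. The paper's exact adjustment method isn't given here; one standard correction for classifier error in a prevalence estimate is the Rogan-Gladen estimator, sketched below with illustrative sensitivity and specificity values (not SMH-TB's actual operating characteristics).

```python
def rogan_gladen(p_obs, sensitivity, specificity):
    """Adjust an observed proportion for classifier misclassification.

    p_true = (p_obs + specificity - 1) / (sensitivity + specificity - 1)
    Valid when sensitivity + specificity > 1; the result is clamped to [0, 1].
    """
    p_true = (p_obs + specificity - 1) / (sensitivity + specificity - 1)
    return min(max(p_true, 0.0), 1.0)

# Illustrative only: an observed 20% active-TB proportion corrected with
# hypothetical ruleset sensitivity/specificity.
adjusted = rogan_gladen(p_obs=0.20, sensitivity=0.95, specificity=0.97)
print(round(adjusted, 3))
```

As in the abstract, correcting for imperfect specificity pulls the observed proportion downward when false positives outnumber false negatives.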

PLoS ONE ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. e0247872


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. In this context, we present our work on applying Natural Language Processing (NLP) techniques to analyze the sentiment of users who answered two questions from the CSQ-8 questionnaire in raw Spanish free text. Their responses relate to mindfulness, a technique used to control the stress and anxiety caused by different factors in daily life. We proposed an online course applying this method to improve the quality of life of health care professionals during the COVID-19 pandemic. We also evaluated the satisfaction level of the participants, with a view to establishing strategies to improve future experiences. To perform this task automatically, we used NLP models such as Swivel embeddings, neural networks, and transfer learning to classify the inputs into three categories: negative, neutral, and positive. Because of the limited amount of data available (86 records for the first question and 68 for the second), transfer learning techniques were required. The length of the text was unrestricted from the user's standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis using a graphical text representation based on word frequency to help researchers identify relevant information in the opinions with an objective approach to sentiment. The main conclusion of this work is that applying NLP techniques with transfer learning to small amounts of data can achieve sufficient accuracy in the sentiment analysis and text classification stages.
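The ground truth here was labeled by three experts. A common way to consolidate multiple annotators into a single label is majority voting; a minimal sketch with hypothetical annotations, using a neutral fallback for three-way ties (one possible convention, not necessarily the authors'):

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a strict majority of annotators;
    three-way ties fall back to 'neutral'."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    if count > len(labels) // 2:
        return label
    return "neutral"

# Hypothetical annotations from three experts for four free-text responses.
annotations = [
    ["positive", "positive", "neutral"],
    ["negative", "negative", "negative"],
    ["positive", "neutral", "negative"],   # no majority -> neutral
    ["neutral", "neutral", "positive"],
]
print([majority_label(a) for a in annotations])
```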


2019 ◽  
Author(s):  
Daniel M. Bean ◽  
James Teo ◽  
Honghan Wu ◽  
Ricardo Oliveira ◽  
Raj Patel ◽  
...  

Abstract Atrial fibrillation (AF) is the most common arrhythmia and significantly increases stroke risk. This risk is effectively managed by oral anticoagulation. Recent studies using national registry data indicate increased use of anticoagulation resulting from changes in guidelines and the availability of newer drugs. The aim of this study is to develop and validate an open source risk scoring pipeline for free-text electronic health record data using natural language processing. AF patients discharged from 1st January 2011 to 1st October 2017 were identified from discharge summaries (N=10,030, 64.6% male, average age 75.3 ± 12.3 years). A natural language processing pipeline was developed to identify risk factors in clinical text and calculate risk for ischaemic stroke (CHA2DS2-VASc) and bleeding (HAS-BLED). Scores were validated against two independent experts for 40 patients. Automatic risk scores were in strong agreement with the two independent experts for CHA2DS2-VASc (average kappa 0.78 vs experts, compared to 0.85 between experts). Agreement was lower for HAS-BLED (average kappa 0.54 vs experts, compared to 0.74 between experts). In high-risk patients (CHA2DS2-VASc ≥2) OAC use has increased significantly over the last 7 years, driven by the availability of DOACs and the transitioning of patients from AP medication alone to OAC. Factors independently associated with OAC use included components of the CHA2DS2-VASc and HAS-BLED scores as well as discharging specialty and frailty. OAC use was highest in patients discharged under cardiology (69%). Electronic health record text can be used for automatic calculation of clinical risk scores at scale. Open source tools are available today for this task but require further validation. Analysis of routinely-collected EHR data can replicate findings from large-scale curated registries.
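The CHA2DS2-VASc score the pipeline computes from NLP-extracted risk factors follows a standard published scoring rule: 2 points each for age ≥75 and prior stroke/TIA, 1 point each for congestive heart failure, hypertension, diabetes, vascular disease, age 65-74, and female sex. A minimal sketch of the scoring arithmetic itself, separate from the NLP extraction step:

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_or_tia, vascular_disease):
    """Compute the CHA2DS2-VASc ischaemic stroke risk score (range 0-9)."""
    score = 0
    if age >= 75:
        score += 2          # A2: age >= 75
    elif age >= 65:
        score += 1          # A: age 65-74
    score += 1 if female else 0            # Sc: sex category
    score += 1 if chf else 0               # C: congestive heart failure
    score += 1 if hypertension else 0      # H: hypertension
    score += 1 if diabetes else 0          # D: diabetes
    score += 2 if stroke_or_tia else 0     # S2: prior stroke/TIA
    score += 1 if vascular_disease else 0  # V: vascular disease
    return score

# 76-year-old woman with hypertension and diabetes: 2 + 1 + 1 + 1 = 5
print(cha2ds2_vasc(76, True, False, True, True, False, False))
```

A score ≥2 marks the high-risk group analysed for anticoagulation use in the abstract.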


Author(s):  
Sijia Liu ◽  
Yanshan Wang ◽  
Andrew Wen ◽  
Liwei Wang ◽  
Na Hong ◽  
...  

BACKGROUND Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records. OBJECTIVE In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text—Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE). METHODS CREATE is a proof-of-concept system that leverages a combination of structured queries and information retrieval techniques on natural language processing results to improve cohort retrieval performance, using the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component was used to extract common data model concepts from textual queries. We designed a hierarchical index to support common data model concept search utilizing information retrieval techniques and frameworks. RESULTS Our case study on 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient and document levels, demonstrates that CREATE achieves a mean precision at 5 of 0.90, which outperforms systems using only structured data or only unstructured text with mean precision at 5 values of 0.54 and 0.74, respectively. CONCLUSIONS The implementation and evaluation of CREATE on Mayo Clinic Biobank data demonstrated that it outperforms cohort retrieval systems that use only structured data or only unstructured text for complex textual cohort queries.
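Precision at 5, the evaluation metric used in the case study, is simply the fraction of the top five retrieved records that are truly relevant; the reported mean is this value averaged over queries. A minimal sketch with hypothetical patient IDs:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved records that appear in the relevant set."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for record in top_k if record in relevant) / len(top_k)

# Hypothetical ranked patient IDs for one cohort query.
retrieved = ["p1", "p7", "p3", "p9", "p4", "p2"]
relevant = {"p1", "p3", "p4", "p5"}
print(precision_at_k(retrieved, relevant))   # 3 of the top 5 are relevant -> 0.6
```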


2021 ◽  
Vol 42 (Supplement_1) ◽  
Author(s):  
C M Maciejewski ◽  
M K Krajsman ◽  
K O Ozieranski ◽  
M B Basza ◽  
M G Gawalko ◽  
...  

Abstract Background An estimated 80% of data gathered in electronic health records is unstructured, textual information that cannot be utilized for research purposes until it is manually coded into a database. Manual coding is both a cost- and time-consuming process. Natural language processing (NLP) techniques may be utilized for extraction of structured data from text. However, little is known about the accuracy of data obtained through these methods. Purpose To evaluate the possibility of employing NLP techniques to obtain data regarding risk factors needed for CHA2DS2VASc scale calculation and detection of antithrombotic medication prescribed in the population of atrial fibrillation (AF) patients of a cardiology ward. Methods An automatic tool for disease and drug recognition based on regular expression rules was designed through cooperation of physicians and IT specialists. Records of 194 AF patients discharged from a cardiology ward were manually reviewed by a physician-annotator as a comparator for the automatic approach. Results Median CHA2DS2VASc score calculated by the automatic method was 3 (IQR 2–4) versus 3 points (IQR 2–4) for the manual method (p=0.66). Agreement between CHA2DS2VASc scores calculated by both methods was high (Kendall's W=0.979; p<0.001). In terms of anticoagulant recognition, the automatic tool misidentified the drug prescribed in 4 cases. Conclusion NLP-based techniques are a promising tool for obtaining structured data for research purposes from electronic health records in Polish. Tight cooperation of physicians and IT specialists is crucial for establishing accurate recognition patterns. Funding Acknowledgement Type of funding sources: None.
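The recognition tool is described as regular-expression rules designed jointly by physicians and IT specialists. A minimal illustration of the idea (the patterns below are hypothetical; real rules for Polish clinical text would need to cover inflected word forms and local brand names):

```python
import re

# Hypothetical recognition rules: each anticoagulant maps to a regex
# covering generic and brand spellings, with \w* to absorb inflections.
ANTICOAGULANT_PATTERNS = {
    "warfarin": re.compile(r"\bwarfarin\w*", re.IGNORECASE),
    "rivaroxaban": re.compile(r"\b(rivaroxaban\w*|xarelto\w*)", re.IGNORECASE),
    "dabigatran": re.compile(r"\b(dabigatran\w*|pradaxa\w*)", re.IGNORECASE),
}

def find_anticoagulants(discharge_text):
    """Return the set of anticoagulants whose pattern matches the text."""
    return {drug for drug, pattern in ANTICOAGULANT_PATTERNS.items()
            if pattern.search(discharge_text)}

note = "Discharged on Xarelto 20 mg once daily; warfarin discontinued."
print(sorted(find_anticoagulants(note)))
```

Note the false-positive risk visible even here: the rule matches "warfarin" although the note says it was discontinued, which is the kind of context error that manual review by a physician-annotator catches.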


2021 ◽  
pp. 379-393
Author(s):  
Jiaming Zeng ◽  
Imon Banerjee ◽  
A. Solomon Henry ◽  
Douglas J. Wood ◽  
Ross D. Shachter ◽  
...  

PURPOSE Knowing the treatments administered to patients with cancer is important for treatment planning and for correlating treatment patterns with outcomes in personalized medicine studies. However, existing methods to identify treatments are often lacking. We develop a natural language processing approach with structured electronic medical records and unstructured clinical notes to identify the initial treatment administered to patients with cancer. METHODS We used a total of 4,412 patients with 483,782 clinical notes from the Stanford Cancer Institute Research Database containing patients with nonmetastatic prostate, oropharynx, and esophagus cancer. We trained treatment identification models for each cancer type separately and compared performance using only structured data, only unstructured data (bag-of-words, doc2vec, fastText), and combinations of both (structured + BOW, structured + doc2vec, structured + fastText). We optimized the identification model among five machine learning methods (logistic regression, multilayer perceptrons, random forest, support vector machines, and stochastic gradient boosting). We used the treatment information recorded in the cancer registry as the gold standard and compared our methods against an identification baseline using billing codes. RESULTS For prostate cancer, we achieved an f1-score of 0.99 (95% CI, 0.97 to 1.00) for radiation and 1.00 (95% CI, 0.99 to 1.00) for surgery using structured + doc2vec. For oropharynx cancer, we achieved an f1-score of 0.78 (95% CI, 0.58 to 0.93) for chemoradiation and 0.83 (95% CI, 0.69 to 0.95) for surgery using doc2vec. For esophagus cancer, we achieved an f1-score of 1.0 (95% CI, 1.0 to 1.0) for both chemoradiation and surgery using all combinations of structured and unstructured data. We found that employing the free-text clinical notes outperforms using the billing codes or only structured data for all three cancer types. 
CONCLUSION Our results show that treatment identification using free-text clinical notes greatly improves upon the performance using billing codes and simple structured data. The approach can be used for treatment cohort identification and adapted for longitudinal cancer treatment identification.
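The structured + text configurations concatenate structured EMR fields with a vector representation of the notes before classification. A minimal sketch of that feature-combination step, using simple term counts as a stand-in for doc2vec/fastText embeddings (the vocabulary and structured fields below are hypothetical):

```python
def bow_vector(text, vocabulary):
    """Term counts over a fixed vocabulary (a stand-in for doc2vec/fastText)."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

def combined_features(structured, note_text, vocabulary):
    """Concatenate structured EMR fields with a text representation,
    mirroring the paper's structured + text configurations."""
    return list(structured) + bow_vector(note_text, vocabulary)

# Hypothetical example: [age, PSA level] plus counts over a toy vocabulary.
vocab = ["radiation", "surgery", "prostatectomy"]
features = combined_features(
    [64, 5.1], "underwent radiation therapy radiation boost", vocab)
print(features)
```

The resulting fixed-length vector can then be fed to any of the classifiers compared in the paper (logistic regression, random forest, and so on).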


2019 ◽  
Vol 26 (4) ◽  
pp. 364-379 ◽  
Author(s):  
Theresa A Koleck ◽  
Caitlin Dreisbach ◽  
Philip E Bourne ◽  
Suzanne Bakken

Abstract Objective Natural language processing (NLP) of symptoms from electronic health records (EHRs) could contribute to the advancement of symptom science. We aim to synthesize the literature on the use of NLP to process or analyze symptom information documented in EHR free-text narratives. Materials and Methods Our search of 1964 records from PubMed and EMBASE was narrowed to 27 eligible articles. Data related to the purpose, free-text corpus, patients, symptoms, NLP methodology, evaluation metrics, and quality indicators were extracted for each study. Results Symptom-related information was presented as a primary outcome in 14 studies. EHR narratives represented various inpatient and outpatient clinical specialties, with general, cardiology, and mental health occurring most frequently. Studies encompassed a wide variety of symptoms, including shortness of breath, pain, nausea, dizziness, disturbed sleep, constipation, and depressed mood. NLP approaches included previously developed NLP tools, classification methods, and manually curated rule-based processing. Only one-third (n = 9) of studies reported patient demographic characteristics. Discussion NLP is used to extract information from EHR free-text narratives written by a variety of healthcare providers on an expansive range of symptoms across diverse clinical specialties. The current focus of this field is on the development of methods to extract symptom information and the use of symptom information for disease classification tasks rather than the examination of symptoms themselves. Conclusion Future NLP studies should concentrate on the investigation of symptoms and symptom documentation in EHR free-text narratives. Efforts should be undertaken to examine patient characteristics and make symptom-related NLP algorithms or pipelines and vocabularies openly available.


2018 ◽  
Author(s):  
Tao Chen ◽  
Mark Dredze ◽  
Jonathan P Weiner ◽  
Leilani Hernandez ◽  
Joe Kimura ◽  
...  

BACKGROUND Geriatric syndromes in older adults are associated with adverse outcomes. However, despite being reported in clinical notes, these syndromes are often poorly captured by diagnostic codes in the structured fields of electronic health records (EHRs) or administrative records. OBJECTIVE We aim to automatically determine if a patient has any geriatric syndromes by mining the free text of associated EHR clinical notes. We assessed which statistical natural language processing (NLP) techniques are most effective. METHODS We applied conditional random fields (CRFs), a widely used machine learning algorithm, to identify each of 10 geriatric syndrome constructs in a clinical note. We assessed three sets of features and attributes for CRF operations: a base set, enhanced token, and contextual features. We trained the CRF on 3901 manually annotated notes from 85 patients, tuned the CRF on a validation set of 50 patients, and evaluated it on 50 held-out test patients. These notes were from a group of US Medicare patients over 65 years of age enrolled in a Medicare Advantage Health Maintenance Organization and cared for by a large group practice in Massachusetts. RESULTS A final feature set was formed through comprehensive feature ablation experiments. The final CRF model performed well at patient-level determination (macroaverage F1=0.834, microaverage F1=0.851); however, performance varied by construct. For example, at phrase-partial evaluation, the CRF model worked well on constructs such as absence of fecal control (F1=0.857) and vision impairment (F1=0.798) but poorly on malnutrition (F1=0.155), weight loss (F1=0.394), and severe urinary control issues (F1=0.532). Errors were primarily due to previously unobserved words (ie, out-of-vocabulary) and a lack of context. CONCLUSIONS This study shows that statistical NLP can be used to identify geriatric syndromes from EHR-extracted clinical notes. 
This creates new opportunities to identify patients with geriatric syndromes and study their health outcomes.
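The patient-level results above are reported as macroaveraged and microaveraged F1, which weight each construct equally and each individual decision equally, respectively; the gap between them reflects the uneven per-construct performance the abstract describes. A minimal sketch with hypothetical per-construct counts:

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(per_label_counts):
    """per_label_counts: list of (tp, fp, fn) tuples, one per construct.

    Macro-average: mean of per-label F1 (each construct weighted equally).
    Micro-average: F1 over pooled counts (each decision weighted equally).
    """
    macro = sum(f1(*counts) for counts in per_label_counts) / len(per_label_counts)
    tp = sum(c[0] for c in per_label_counts)
    fp = sum(c[1] for c in per_label_counts)
    fn = sum(c[2] for c in per_label_counts)
    return macro, f1(tp, fp, fn)

# Hypothetical counts for three constructs: one strong, one middling, one weak.
counts = [(90, 10, 10), (40, 20, 20), (5, 15, 15)]
macro, micro = macro_micro_f1(counts)
print(round(macro, 3), round(micro, 3))
```

The rare, poorly-recognized construct drags the macro-average well below the micro-average, exactly the pattern seen with constructs like malnutrition in the study.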


2021 ◽  
Author(s):  
Verena Schoening ◽  
Evangelia Liakoni ◽  
Juergen Drewe ◽  
Felix Hammann

Objectives: Several risk factors have been identified for severe clinical outcomes of COVID-19 caused by SARS-CoV-2. Some can be found in the structured data of patients' Electronic Health Records. Others are included as unstructured free text, and thus cannot be easily detected automatically. We propose an automated real-time detection of risk factors using a combination of data mining and Natural Language Processing (NLP). Material and methods: Patients were categorized as negative or positive for SARS-CoV-2, and according to disease severity (severe or non-severe COVID-19). Comorbidities were identified in the unstructured free text using NLP. Further risk factors were taken from the structured data. Results: 6250 patients were analysed (5664 negative and 586 positive; 461 non-severe and 125 severe). Using NLP, comorbidities, i.e., cardiovascular and pulmonary conditions, diabetes, dementia and cancer, were automatically detected (error rate ≤2%). Old age, male sex, higher BMI, arterial hypertension, chronic heart failure, coronary heart disease, COPD, diabetes, insulin-only treatment of diabetic patients, and reduced kidney and liver function were risk factors for severe COVID-19. Interestingly, the proportion of diabetic patients using metformin but not insulin was significantly higher in the non-severe COVID-19 cohort (p<0.05). Discussion and conclusion: Our findings were in line with previously reported risk factors for severe COVID-19. NLP in combination with other data mining approaches appears to be a suitable tool for the automated real-time detection of risk factors, which can be a time-saving support for risk assessment and triage, especially in patients with long medical histories and multiple comorbidities.
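The metformin finding (p<0.05) is a comparison of proportions between the severe and non-severe cohorts. The paper's exact test isn't stated here; a pooled two-proportion z-test is one standard choice, sketched below with hypothetical counts rather than the study's actual data:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled two-proportion z-test; returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value via the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2)))/2.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: metformin-only users among diabetic patients
# in the non-severe vs severe cohorts.
z, p = two_proportion_z_test(x1=60, n1=100, x2=10, n2=30)
print(round(z, 2), p < 0.05)
```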

