Measuring Adoption of Patient Priorities-Aligned Care Using Natural Language Processing of Electronic Health Records (Preprint)

2020
Author(s):  
Javad Razjouyan ◽  
Jennifer Freytag ◽  
Lilian Dindo ◽  
Lea Kiefer ◽  
Edward Odom ◽  
...  

BACKGROUND Patient Priorities Care (PPC) is a model of care that aligns health care recommendations with the priorities of older adults with multiple chronic conditions. Once patient priorities are identified, this information is documented in the patient’s electronic health record (EHR). OBJECTIVE Our goal was to develop and validate a natural language processing (NLP) model that reliably detects when clinicians document patient priorities (i.e., values, outcome goals, and care preferences) within the EHR, as a measure of PPC adoption. METHODS Design: Retrospective analysis of unstructured EHR free-text notes using an NLP model. Setting: National Veterans Health Administration (VHA) EHR. Participants: 778 patient notes of 658 patients from encounters with 144 social workers in the primary care setting. Measurements: Each patient’s free-text clinical note was reviewed by two independent reviewers for the presence of PPC language such as priorities, values, and goals. We developed an NLP model that utilized statistical machine learning approaches. The performance of the NLP model in training and validation with 10-fold cross-validation is reported via accuracy, recall, and precision in comparison with the chart review. RESULTS Of 778 notes, 589 (76%) were identified as containing PPC language (kappa = 0.82, p < 0.001). In the training stage, the NLP model had an accuracy of 0.98 (0.98, 0.99), a recall of 0.98 (0.98, 0.99), and a precision of 0.98 (0.97, 1.00). In the validation stage, it had an accuracy of 0.92 (0.90, 0.94), a recall of 0.84 (0.79, 0.89), and a precision of 0.84 (0.77, 0.91). In contrast, an approach using simple search terms for PPC had a precision of only 0.757. CONCLUSIONS An automated NLP model can reliably measure, with high precision, recall, and accuracy, when clinicians document patient priorities as a key step in the adoption of PPC.
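The abstract's core measurements, reviewer agreement (Cohen's kappa) and model precision/recall against chart review, can be sketched as below. This is an illustrative computation only, not the authors' pipeline; the toy label arrays in the usage are hypothetical.

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Inter-rater agreement between two reviewers' labels."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum((c1[k] / n) * (c2[k] / n) for k in c1 | c2)    # chance agreement
    return (po - pe) / (1 - pe)

def precision_recall(gold, pred):
    """Precision and recall of predicted PPC labels vs. chart review (1 = PPC present)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return tp / (tp + fp), tp / (tp + fn)
```

For example, `cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])` yields 0.5: observed agreement of 0.75 against chance agreement of 0.5.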

2020
Vol 4 (Supplement_1)
Author(s):  
Lina Sulieman ◽  
Jing He ◽  
Robert Carroll ◽  
Lisa Bastarache ◽  
Andrea Ramirez

Abstract Electronic health records (EHR) contain rich data to identify and study diabetes. Many phenotype algorithms have been developed to identify research subjects with type 2 diabetes (T2D), but very few accurately identify type 1 diabetes (T1D) cases or rarer monogenic and atypical metabolic presentations. Polygenic risk scores (PRS) quantify the risk of a disease using common genomic variants and perform well for both T1D and T2D. In this study, we applied validated phenotyping algorithms to EHRs linked to a genomic biobank to understand the independent contribution of PRS to classification of diabetes etiology, and generated additional novel markers to distinguish subtypes of diabetes in EHR data. Using a de-identified mirror of the medical center's electronic health record, we applied published algorithms for T1D and T2D to identify cases, and used natural language processing and chart review strategies to identify cases of maturity-onset diabetes of the young (MODY) and other rarer presentations. This novel approach included additional data types such as medication sequencing, ratio and temporality of insulin and non-insulin agents, clinical genetic testing, and ratios of diagnostic codes. Chart review was performed to validate etiology. To calculate PRS, we used genome-wide genotyping from our biobank, which links de-identified EHR to genomic data, applying coefficients of 65 published T1D SNPs and 76,996 T2D SNPs with PLINK in Caucasian subjects. In the dataset, we identified 82,238 cases of T2D but only 130 cases of T1D using the most-cited published algorithms. Adding novel structured elements and natural language processing identified an additional 138 cases of T1D and distinguished 354 cases as MODY. Among over 90,000 subjects with genotyping data available, we included 72,624 Caucasian subjects, since the PRS coefficients were generated in Caucasian cohorts. Among those subjects, 248, 6,488, and 21 subjects were identified as T1D, T2D, and MODY cases, respectively, in our final PRS cohort. The T1D PRS discriminated well between cases and controls (Mann-Whitney p = 3.4e-17). The T2D PRS did not significantly discriminate between cases and controls using published algorithms. The atypical case count was too low to calculate PRS discrimination. Calculation of the PRS was limited by the quality and inclusion of available variants, and discrimination may improve in larger datasets. Additionally, blinded physician case review is ongoing to validate the novel classification scheme and provide a gold standard for machine learning approaches that can be applied in validation sets.
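At its core, a PRS of the kind described above is a weighted sum of risk-allele dosages, which is what PLINK's scoring computes (PLINK additionally averages over the number of non-missing alleles by default; that averaging is omitted here). A minimal sketch with hypothetical weights and dosages:

```python
def polygenic_risk_score(dosages, weights):
    """PRS as the weighted sum of risk-allele dosages.

    dosages: per-SNP count of risk alleles (0, 1, or 2) for one subject.
    weights: per-SNP effect-size coefficients from a published GWAS.
    """
    return sum(w * d for d, w in zip(dosages, weights))
```

A subject carrying one copy of a SNP weighted 0.2, two copies of a SNP weighted -0.1, and zero copies of a SNP weighted 0.4 would score 0.2 - 0.2 + 0.0 = 0.0.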


2020
Vol 4 (Supplement_1)
pp. 183-183
Author(s):  
Javad Razjouyan ◽  
Jennifer Freytag ◽  
Edward Odom ◽  
Lilian Dindo ◽  
Aanand Naik

Abstract Patient Priorities Care (PPC) is a model of care that aligns health care recommendations with the priorities of older adults with multiple chronic conditions. Social workers (SWs), after online training, document PPC in the patient’s electronic health record (EHR). Our goal was to identify free-text notes with PPC language using a natural language processing (NLP) model and to measure PPC adoption and its effect on long-term services and supports (LTSS) use. Free-text notes from the EHR produced by trained SWs were passed through a hybrid NLP model that utilized rule-based and statistical machine learning. NLP accuracy was validated against chart review. Patients who received PPC were propensity-matched with patients not receiving PPC (controls) on age, gender, BMI, Charlson comorbidity index, facility, and SW. Changes in LTSS utilization over 6-month intervals were compared between groups with univariate analysis. Chart review indicated that 491 of 689 notes had PPC language, and the NLP model achieved a precision of 0.85, a recall of 0.90, an F1 of 0.87, and an accuracy of 0.91. Within-group analysis showed that the intervention group used LTSS 1.8 times more in the 6 months after the encounter than in the 6 months prior. Between-group analysis showed that the intervention group had significantly higher LTSS utilization (p=0.012). An automated NLP model can be used to reliably measure the adoption of PPC by SWs. PPC appears to encourage use of LTSS, which may delay time to long-term care placement.
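The matching step above was done on several covariates; a common implementation collapses the covariates into a single propensity score and then performs greedy 1:1 nearest-neighbour matching within a caliper. The sketch below assumes propensity scores have already been estimated (e.g., by logistic regression on age, gender, BMI, etc.); the score values are hypothetical, and the study's actual matching procedure may differ.

```python
def greedy_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    treated, controls: dicts mapping subject id -> propensity score.
    Returns a dict of treated id -> matched control id.
    """
    available = dict(controls)
    pairs = {}
    for tid, ts in sorted(treated.items()):   # fixed order for reproducibility
        if not available:
            break
        cid = min(available, key=lambda c: abs(available[c] - ts))
        if abs(available[cid] - ts) <= caliper:   # enforce the caliper
            pairs[tid] = cid
            del available[cid]                    # match without replacement
    return pairs
```

Greedy matching is order-dependent; optimal (global) matching is an alternative when sample sizes permit.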


2019
Author(s):  
Daniel M. Bean ◽  
James Teo ◽  
Honghan Wu ◽  
Ricardo Oliveira ◽  
Raj Patel ◽  
...  

Abstract Atrial fibrillation (AF) is the most common arrhythmia and significantly increases stroke risk. This risk is effectively managed by oral anticoagulation (OAC). Recent studies using national registry data indicate increased use of anticoagulation resulting from changes in guidelines and the availability of newer drugs. The aim of this study is to develop and validate an open-source risk-scoring pipeline for free-text electronic health record data using natural language processing. AF patients discharged from 1st January 2011 to 1st October 2017 were identified from discharge summaries (N=10,030, 64.6% male, average age 75.3 ± 12.3 years). A natural language processing pipeline was developed to identify risk factors in clinical text and calculate risk for ischaemic stroke (CHA2DS2-VASc) and bleeding (HAS-BLED). Scores were validated against two independent experts for 40 patients. Automatic risk scores were in strong agreement with the two independent experts for CHA2DS2-VASc (average kappa 0.78 vs experts, compared to 0.85 between experts). Agreement was lower for HAS-BLED (average kappa 0.54 vs experts, compared to 0.74 between experts). In high-risk patients (CHA2DS2-VASc ≥2), OAC use has increased significantly over the last 7 years, driven by the availability of direct oral anticoagulants (DOACs) and the transitioning of patients from antiplatelet (AP) medication alone to OAC. Factors independently associated with OAC use included components of the CHA2DS2-VASc and HAS-BLED scores as well as discharging specialty and frailty. OAC use was highest in patients discharged under cardiology (69%). Electronic health record text can be used for automatic calculation of clinical risk scores at scale. Open-source tools are available today for this task but require further validation. Analysis of routinely collected EHR data can replicate findings from large-scale curated registries.
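Once the NLP pipeline has extracted the risk factors from clinical text, the CHA2DS2-VASc score itself is a simple tally. The weights below follow the published score definition (1 point each for heart failure, hypertension, diabetes, vascular disease, female sex, and age 65-74; 2 points each for age ≥75 and prior stroke/TIA); the function signature is our own, not taken from the paper's pipeline:

```python
def cha2ds2_vasc(age, female, chf=False, hypertension=False, diabetes=False,
                 stroke_tia=False, vascular_disease=False):
    """CHA2DS2-VASc ischaemic-stroke risk score from extracted risk factors."""
    score = 0
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # age bands
    score += 1 if female else 0
    score += 1 if chf else 0                              # congestive heart failure
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if stroke_tia else 0                       # prior stroke/TIA
    score += 1 if vascular_disease else 0
    return score
```

A 76-year-old woman with hypertension and no other risk factors scores 2 + 1 + 1 = 4, placing her in the high-risk (≥2) group discussed above.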


Author(s):  
Mina Owlia ◽  
John A Dodson ◽  
Scott L DuVall ◽  
Joanne LaFleur ◽  
Olga Patterson ◽  
...  

Background: Stable angina is estimated to affect more than 10 million Americans and is the presenting symptom in half of patients diagnosed with coronary disease. Documentation of angina severity resides in unstructured data and is often unavailable in large datasets. We used natural language processing (NLP) to identify Canadian Cardiovascular Society (CCS) angina class and determine its association with all-cause mortality in an integrated health system’s electronic health records (EHR). Methods: We performed a historical cohort study using national Veterans Health Administration data between 1/1/06 and 12/31/13. Veterans with incident stable angina were identified by ICD-9-CM codes. We developed an NLP tool to extract CCS class from free-text notes. Risk ratios (RR) for all-cause mortality at one year associated with CCS class were calculated using Poisson regression. Results: There were 299,577 Veterans with angina, of whom 14,216 had at least one CCS class extracted via NLP. Mean age was 66.6 years, 98% were male, and 82% were white. Diabetes prevalence increased with CCS class, but other comorbidities were stable (Table). There were 719 deaths at one year of follow-up. The adjusted RRs for one-year all-cause mortality comparing Class III and Class IV to Class I were 1.40 (95% CI 1.16 - 1.68) and 1.52 (95% CI 1.13 - 2.04), respectively. Conclusion: NLP-derived CCS class was independently associated with one-year all-cause mortality. Its application may be limited by inadequate EHR documentation of angina severity.
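The RRs above are covariate-adjusted estimates from Poisson regression; the unadjusted analogue, a risk ratio with a log-scale confidence interval, is easy to show directly. A sketch with hypothetical counts (not the study's data):

```python
import math

def risk_ratio(events_exposed, n_exposed, events_ref, n_ref, z=1.96):
    """Unadjusted risk ratio with a Wald 95% CI computed on the log scale."""
    rr = (events_exposed / n_exposed) / (events_ref / n_ref)
    # standard error of log(RR)
    se = math.sqrt(1 / events_exposed - 1 / n_exposed
                   + 1 / events_ref - 1 / n_ref)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi
```

With 50 deaths among 1000 exposed versus 25 among 1000 referents, this gives RR = 2.0 (95% CI roughly 1.25 to 3.21).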


2019
Vol 26 (4)
pp. 364-379
Author(s):  
Theresa A Koleck ◽  
Caitlin Dreisbach ◽  
Philip E Bourne ◽  
Suzanne Bakken

Abstract Objective Natural language processing (NLP) of symptoms from electronic health records (EHRs) could contribute to the advancement of symptom science. We aim to synthesize the literature on the use of NLP to process or analyze symptom information documented in EHR free-text narratives. Materials and Methods Our search of 1964 records from PubMed and EMBASE was narrowed to 27 eligible articles. Data related to the purpose, free-text corpus, patients, symptoms, NLP methodology, evaluation metrics, and quality indicators were extracted for each study. Results Symptom-related information was presented as a primary outcome in 14 studies. EHR narratives represented various inpatient and outpatient clinical specialties, with general, cardiology, and mental health occurring most frequently. Studies encompassed a wide variety of symptoms, including shortness of breath, pain, nausea, dizziness, disturbed sleep, constipation, and depressed mood. NLP approaches included previously developed NLP tools, classification methods, and manually curated rule-based processing. Only one-third (n = 9) of studies reported patient demographic characteristics. Discussion NLP is used to extract information from EHR free-text narratives written by a variety of healthcare providers on an expansive range of symptoms across diverse clinical specialties. The current focus of this field is on the development of methods to extract symptom information and the use of symptom information for disease classification tasks rather than the examination of symptoms themselves. Conclusion Future NLP studies should concentrate on the investigation of symptoms and symptom documentation in EHR free-text narratives. Efforts should be undertaken to examine patient characteristics and make symptom-related NLP algorithms or pipelines and vocabularies openly available.


2020
Author(s):  
David Landsman ◽  
Ahmed Abdelbasit ◽  
Christine Wang ◽  
Michael Guerzhoy ◽  
Ujash Joshi ◽  
...  

Background Tuberculosis (TB) is a major cause of death worldwide. TB research draws heavily on clinical cohorts, which can be generated using electronic health records (EHR), but granular information extracted from unstructured EHR data is limited. The St. Michael's Hospital TB database (SMH-TB) was established to address gaps in EHR-derived TB clinical cohorts and provide researchers and clinicians with detailed, granular data related to TB management and treatment. Methods We collected and validated multiple layers of EHR data from the TB outpatient clinic at St. Michael's Hospital, Toronto, Ontario, Canada to generate the SMH-TB database. SMH-TB contains structured data directly from the EHR, and variables generated using natural language processing (NLP) by extracting relevant information from free text within clinic, radiology, and other notes. NLP performance was assessed using recall, precision and F1 score averaged across variable labels. We present characteristics of the cohort population using binomial proportions and 95% confidence intervals (CI), with and without adjusting for NLP misclassification errors. Results SMH-TB currently contains retrospective patient data spanning 2011 to 2018, for a total of 3298 patients (N=3237 with at least 1 associated dictation). Performance of the TB diagnosis and medication NLP rulesets surpasses 93% in recall, precision and F1 metrics, indicating good generalizability. We estimated that 20% (95% CI: 18.4-21.2%) of patients were diagnosed with active TB and 46% (95% CI: 43.8-47.2%) with latent TB. After adjusting for potential misclassification, the proportions of patients diagnosed with active and latent TB were 18% (95% CI: 16.8-19.7%) and 40% (95% CI: 37.8-41.6%), respectively. Conclusion SMH-TB is a unique database that includes a breadth of data derived from structured and unstructured EHR data. 
The data are available for a variety of research applications, such as clinical epidemiology, quality improvement and mathematical modelling studies.
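One standard way to adjust an observed proportion for classifier misclassification, as the abstract describes, is the Rogan-Gladen estimator, which rescales the observed prevalence by the classifier's sensitivity and specificity. A sketch (the sensitivity/specificity values in the example are illustrative, not the paper's):

```python
def adjusted_prevalence(p_observed, sensitivity, specificity):
    """Rogan-Gladen correction of an observed proportion for
    misclassification by an imperfect classifier (e.g., an NLP ruleset)."""
    p = (p_observed + specificity - 1) / (sensitivity + specificity - 1)
    return min(max(p, 0.0), 1.0)   # clamp to a valid proportion
```

For instance, an observed 20% prevalence under a classifier with 93% sensitivity and 97% specificity adjusts downward to about 19%, the same direction of correction reported above.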


PLoS ONE
2021
Vol 16 (3)
pp. e0247872
Author(s):  
David Landsman ◽  
Ahmed Abdelbasit ◽  
Christine Wang ◽  
Michael Guerzhoy ◽  
Ujash Joshi ◽  
...  

Background Tuberculosis (TB) is a major cause of death worldwide. TB research draws heavily on clinical cohorts, which can be generated using electronic health records (EHR), but granular information extracted from unstructured EHR data is limited. The St. Michael’s Hospital TB database (SMH-TB) was established to address gaps in EHR-derived TB clinical cohorts and provide researchers and clinicians with detailed, granular data related to TB management and treatment. Methods We collected and validated multiple layers of EHR data from the TB outpatient clinic at St. Michael’s Hospital, Toronto, Ontario, Canada to generate the SMH-TB database. SMH-TB contains structured data directly from the EHR, and variables generated using natural language processing (NLP) by extracting relevant information from free text within clinic, radiology, and other notes. NLP performance was assessed using recall, precision and F1 score averaged across variable labels. We present characteristics of the cohort population using binomial proportions and 95% confidence intervals (CI), with and without adjusting for NLP misclassification errors. Results SMH-TB currently contains retrospective patient data spanning 2011 to 2018, for a total of 3298 patients (N = 3237 with at least 1 associated dictation). Performance of the TB diagnosis and medication NLP rulesets surpasses 93% in recall, precision and F1 metrics, indicating good generalizability. We estimated that 20% (95% CI: 18.4–21.2%) of patients were diagnosed with active TB and 46% (95% CI: 43.8–47.2%) with latent TB. After adjusting for potential misclassification, the proportions of patients diagnosed with active and latent TB were 18% (95% CI: 16.8–19.7%) and 40% (95% CI: 37.8–41.6%), respectively. Conclusion SMH-TB is a unique database that includes a breadth of data derived from structured and unstructured EHR data by using NLP rulesets. 
The data are available for a variety of research applications, such as clinical epidemiology, quality improvement and mathematical modeling studies.


2018
Author(s):  
Tao Chen ◽  
Mark Dredze ◽  
Jonathan P Weiner ◽  
Leilani Hernandez ◽  
Joe Kimura ◽  
...  

BACKGROUND Geriatric syndromes in older adults are associated with adverse outcomes. However, despite being reported in clinical notes, these syndromes are often poorly captured by diagnostic codes in the structured fields of electronic health records (EHRs) or administrative records. OBJECTIVE We aim to automatically determine whether a patient has any geriatric syndromes by mining the free text of associated EHR clinical notes, and to assess which statistical natural language processing (NLP) techniques are most effective. METHODS We applied conditional random fields (CRFs), a widely used machine learning algorithm, to identify each of 10 geriatric syndrome constructs in a clinical note. We assessed three sets of features for the CRF: a base set, enhanced token features, and contextual features. We trained the CRF on 3901 manually annotated notes from 85 patients, tuned the CRF on a validation set of 50 patients, and evaluated it on 50 held-out test patients. These notes were from a group of US Medicare patients over 65 years of age enrolled in a Medicare Advantage Health Maintenance Organization and cared for by a large group practice in Massachusetts. RESULTS A final feature set was formed through comprehensive feature ablation experiments. The final CRF model performed well at patient-level determination (macroaverage F1=0.834, microaverage F1=0.851); however, performance varied by construct. For example, at phrase-partial evaluation, the CRF model worked well on constructs such as absence of fecal control (F1=0.857) and vision impairment (F1=0.798) but poorly on malnutrition (F1=0.155), weight loss (F1=0.394), and severe urinary control issues (F1=0.532). Errors were primarily due to previously unobserved words (ie, out-of-vocabulary) and a lack of context. CONCLUSIONS This study shows that statistical NLP can be used to identify geriatric syndromes from EHR-extracted clinical notes. 
This creates new opportunities to identify patients with geriatric syndromes and study their health outcomes.
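The distinction between the macroaverage and microaverage F1 reported above comes down to whether per-construct F1 scores are averaged (macro, which weights rare constructs like malnutrition equally) or whether the true/false positive and false negative counts are pooled first (micro, which weights each mention equally). A sketch with hypothetical counts:

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per syndrome construct.

    Macro-F1 averages the per-construct F1 scores;
    micro-F1 pools the counts across constructs before computing F1.
    """
    macro = sum(f1(*c) for c in counts) / len(counts)
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return macro, micro
```

With one well-detected construct (tp=8, fp=2, fn=2, F1=0.8) and one poorly detected rare construct (tp=1, fp=3, fn=3, F1=0.25), macro-F1 is 0.525 while micro-F1 is about 0.643, illustrating why the two can diverge when performance varies by construct.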


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. Here we present our work on applying natural language processing (NLP) techniques to analyze the sentiment of users who answered two questions from the CSQ-8 questionnaire with raw Spanish free text. Their responses relate to mindfulness, a technique used to control stress and anxiety caused by different factors in daily life. We proposed an online course in which this method was applied in order to improve the quality of life of health care professionals during the COVID-19 pandemic. We also evaluated the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To automate this task, we used NLP models such as Swivel embeddings, neural networks, and transfer learning to classify the inputs into three categories: negative, neutral, and positive. Because of the limited amount of data available (86 registers for the first question and 68 for the second), transfer learning techniques were required. The length of the text was unrestricted from the user’s standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis using graphical text representations based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that applying NLP techniques with transfer learning on small amounts of data can achieve sufficient accuracyy in sentiment analysis and text classification.


2021
Vol 28 (1)
pp. e100262
Author(s):  
Mustafa Khanbhai ◽  
Patrick Anyadi ◽  
Joshua Symons ◽  
Kelsey Flott ◽  
Ara Darzi ◽  
...  

Objectives Unstructured free-text patient feedback contains rich information, and analysing these data manually would require personnel resources that are not available in most healthcare organisations. We undertook a systematic review of the literature on the use of natural language processing (NLP) and machine learning (ML) to process and analyse free-text patient experience data. Methods Databases were systematically searched to identify articles published between January 2000 and December 2019 examining NLP to analyse free-text patient feedback. Due to the heterogeneous nature of the studies, a narrative synthesis was deemed most appropriate. Data related to the study purpose, corpus, methodology, performance metrics and indicators of quality were recorded. Results Nineteen articles were included. The majority (80%) of studies applied language analysis techniques to patient feedback from social media sites (unsolicited), followed by structured surveys (solicited). Supervised learning was most frequently used (n=9), followed by unsupervised (n=6) and semi-supervised (n=3) approaches. Comments extracted from social media were analysed using an unsupervised approach, and free-text comments held within structured surveys were analysed using a supervised approach. Reported performance metrics included precision, recall and F-measure, with support vector machine and Naïve Bayes being the best-performing ML classifiers. Conclusion NLP and ML have emerged as important tools for processing unstructured free text. Both supervised and unsupervised approaches have their role depending on the data source. With the advancement of data analysis tools, these techniques may help healthcare organisations generate insight from volumes of unstructured free-text data.
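The review identifies Naïve Bayes among the best-performing classifiers for supervised analysis of patient feedback. A minimal multinomial Naïve Bayes with Laplace smoothing over a bag-of-words representation can be written from scratch; the toy feedback comments below are invented for illustration and are not drawn from the review.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (text, label). Returns a multinomial NB model
    as (label counts, per-label word counts, vocabulary)."""
    labels = Counter(lab for _, lab in docs)
    words = {lab: Counter() for lab in labels}
    for text, lab in docs:
        words[lab].update(text.split())
    vocab = {w for c in words.values() for w in c}
    return labels, words, vocab

def predict_nb(model, text):
    """Pick the label maximising log prior + smoothed log likelihoods."""
    labels, words, vocab = model
    n = sum(labels.values())
    def log_post(lab):
        total = sum(words[lab].values())
        lp = math.log(labels[lab] / n)                    # log prior
        for w in text.split():
            if w in vocab:                                # ignore unseen words
                lp += math.log((words[lab][w] + 1) / (total + len(vocab)))
        return lp
    return max(labels, key=log_post)
```

Trained on a handful of positive and negative comments, the classifier sides with whichever label makes the observed words more probable, which is the same mechanism the reviewed studies scale up to full feedback corpora.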

