MixEHR-Guided: A guided multi-modal topic modeling approach for large-scale automatic phenotyping using the electronic health record

2021 ◽  
Author(s):  
Yuri Ahuja ◽  
Yuesong Zou ◽  
Aman Verma ◽  
David Buckeridge ◽  
Yue Li

Electronic Health Records (EHRs) contain rich clinical data collected at the point of care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms, in which the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA, and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.
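The idea of using surrogate features as prior information to tie topics to known phenotypes can be illustrated with a small sketch. The abstract does not describe how MixEHR-G actually constructs its priors, so everything below is an assumption made for illustration: the function names `guided_topic_prior` and `expected_mixture`, the input format, and the saturating count-to-weight map are all hypothetical. Each topic is pinned to one phenotype, and a patient's Dirichlet hyperparameter for that topic is large only when the phenotype's surrogate code appears in the record.

```python
import math


def guided_topic_prior(surrogate_counts, base_alpha=1.0, floor=1e-3):
    """Per-topic Dirichlet hyperparameters for one patient.

    surrogate_counts[k] is how often the surrogate code for phenotype k
    appears in the patient's record (hypothetical input format).
    Topic k is tied to phenotype k, which is what makes it identifiable.
    """
    alphas = []
    for count in surrogate_counts:
        # Assumed saturating map from a count to a weight in [0, 1).
        weight = 1.0 - math.exp(-count)
        # The floor keeps every hyperparameter strictly positive.
        alphas.append(max(floor, base_alpha * weight))
    return alphas


def expected_mixture(alphas):
    """Prior mean of the patient's topic mixture under Dirichlet(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]


# Toy patient: only the surrogate code for phenotype 1 is present.
alphas = guided_topic_prior([0, 2, 0])
print(expected_mixture(alphas))
```

With this kind of prior, a topic cannot absorb patients who show no surrogate evidence for its phenotype, which is one simple way to keep topic labels fixed across runs.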

2018 ◽  
Vol 9 (1) ◽  
pp. 204589401881477 ◽  
Author(s):  
Simon Teal ◽  
William R. Auger ◽  
Rodney J. Hughes ◽  
Dena Rosen Ramey ◽  
Kelly S. Lewis ◽  
...  

This study aimed to validate an algorithm developed to identify chronic thromboembolic pulmonary hypertension (CTEPH) among patients with a history of pulmonary embolism. Validation was halted because too few patients had gold-standard evidence of CTEPH in the administrative claims/electronic health records database, suggesting that CTEPH is underdiagnosed.


2018 ◽  
pp. 1-12 ◽  
Author(s):  
Ashley Earles ◽  
Lin Liu ◽  
Ranier Bustamante ◽  
Pat Coke ◽  
Julie Lynch ◽  
...  

Purpose: Cancer ascertainment using large-scale electronic health records is a challenge. Our aim was to propose and apply a structured approach for evaluating multiple candidate approaches for cancer ascertainment using colorectal cancer (CRC) ascertainment within the US Department of Veterans Affairs (VA) as a use case. Methods: The proposed approach for evaluating cancer ascertainment strategies includes assessment of individual strategy performance, comparison of agreement across strategies, and review of discordant diagnoses. We applied this approach to compare three strategies for CRC ascertainment within the VA: administrative claims data consisting of International Classification of Diseases, Ninth Revision (ICD9) diagnosis codes; the VA Central Cancer Registry (VACCR); and the newly accessible Oncology Domain, consisting of cases abstracted by local cancer registrars. The study sample consisted of 1,839,043 veterans with index colonoscopy performed from 1999 to 2014. Strategy-specific performance was estimated based on manual record review of 100 candidate CRC cases and 100 colonoscopy controls. Strategies were further compared using Cohen’s κ and focused review of discordant CRC diagnoses. Results: A total of 92,197 individuals met at least one CRC definition. All three strategies had high sensitivity and specificity for incident CRC. However, the ICD9-based strategy demonstrated poor positive predictive value (58%). VACCR and Oncology Domain had almost perfect agreement with each other (κ, 0.87) but only moderate agreement with ICD9-based diagnoses (κ, 0.51 and 0.57, respectively). Among discordant cases reviewed, 15% of ICD9-positive but VACCR- or Oncology Domain–negative cases had incident CRC. Conclusion: Evaluating novel strategies for identifying cancer requires a structured approach, including validation against manual record review, agreement among candidate strategies, and focused review of discordant findings. Without careful assessment of ascertainment methods, analyses may be subject to bias and limited in clinical impact.
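The two statistics this abstract relies on, Cohen's κ for agreement between strategies and positive predictive value against record review, can be computed as in the sketch below. The function names and the toy label vectors are illustrative assumptions and are not the study's data; labels are assumed to be binary (1 = CRC, 0 = no CRC).

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary labels (0/1)."""
    n = len(a)
    # Observed agreement: fraction of cases where both strategies agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each strategy's marginal positive rate.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    # Undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)


def positive_predictive_value(predicted, truth):
    """Share of predicted positives that are true positives."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    return tp / (tp + fp)


# Toy example with made-up labels for four patients.
strategy_a = [1, 1, 0, 0]
strategy_b = [1, 0, 0, 0]
print(cohens_kappa(strategy_a, strategy_b))
print(positive_predictive_value(strategy_a, strategy_b))
```

κ corrects raw agreement for the agreement expected by chance, which is why two strategies can each show high sensitivity and specificity yet agree only moderately with each other when positives are rare.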


AERA Open ◽  
2021 ◽  
Vol 7 ◽  
pp. 233285842110456
Author(s):  
Joshua Littenberg-Tobias ◽  
Elizabeth Borneman ◽  
Justin Reich

Diversity, equity, and inclusion (DEI) issues are urgent in education. We developed and evaluated a massive open online course (N = 963) with embedded equity simulations that attempted to equip educators with equity teaching practices. Applying a structural topic model (STM), a type of natural language processing (NLP), we examined how participants with different equity attitudes responded in simulations. Over a sequence of four simulations, the simulation behavior of participants with less equitable beliefs converged to be more similar to the simulated behavior of participants with more equitable beliefs (ES [effect size] = 1.08 SD). This finding was corroborated by overall changes in equity mindsets (ES = 0.88 SD) and changes in self-reported equity-promoting practices (ES = 0.32 SD). Digital simulations, when combined with NLP, offer a compelling approach to both teaching about DEI topics and formatively assessing learner behavior in large-scale learning environments.


PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0252502
Author(s):  
Qiqing Wang ◽  
Cunbin Li

This study investigates the evolution of provincial new energy policies and industries of China using a topic modeling approach. To this end, six out of 31 provinces in China are first selected as research samples, and central and provincial new energy policies from 2010 to 2019 are collected to establish a text corpus of 23,674 documents. Then, the policy corpus is fed to two different topic models: Latent Dirichlet Allocation for modeling static policy topics and the Dynamic Topic Model for extracting topics over time. Finally, the obtained topics are mapped into policy tools for comparisons. The dynamic policy topics are further analyzed with the panel data from provincial new energy industries. The results show that the provincial new energy policies moved onto different tracks after about 2014 due to regional conditions such as the economy and CO2 emission intensity. Underdeveloped provinces tend to use environment-oriented tools to regulate and control CO2 emissions, while developed regions employ a more balanced policy mix for improving new energy vehicles and other industries. Widespread hysteretic effects are revealed during the correlation analysis of the policy topics and new energy capacity.
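For readers unfamiliar with Latent Dirichlet Allocation, the following is a toy collapsed Gibbs sampler, one standard way to fit the model. The abstract does not say which implementation or inference method the authors used, so this is only a sketch under assumed settings: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the tiny integer-coded corpus are all illustrative.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of non-empty documents, each a list of word ids in
    range(vocab_size). Returns (theta, topic_word): per-document topic
    proportions and the topic-by-word count matrix.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic proportions for each document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Toy corpus: word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3]]
theta, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=4, n_iter=50)
print(theta)
```

A static model of this kind treats the corpus as exchangeable in time; the Dynamic Topic Model mentioned in the abstract additionally lets the topic-word distributions drift between time slices, which this sketch does not attempt.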


2021 ◽  
Vol 11 (12) ◽  
pp. 1296
Author(s):  
Roseann S. Gammal ◽  
Lucas A. Berenbrok ◽  
Philip E. Empey ◽  
Mylynda B. Massart

With increasing patient interest in and access to pharmacogenomic testing, clinicians practicing in primary care are more likely than ever to encounter a patient seeking or presenting with pharmacogenomic test results. Gene-based prescribing recommendations are available to healthcare providers through Food and Drug Administration-approved drug labeling and Clinical Pharmacogenetics Implementation Consortium guidelines. Given the lifelong utility of pharmacogenomic test results to optimize pharmacotherapy for commonly prescribed medications, appropriate documentation of these results in a patient’s electronic health record (EHR) is essential. The current “gold standard” for pharmacogenomics implementation includes entering pharmacogenomic test results into EHRs as discrete results with associated clinical decision support (CDS) alerts that will fire at the point of prescribing, similar to drug allergy alerts. However, such infrastructure is limited to the few institutions that have invested in the resources and personnel to develop and maintain it. For the majority of clinicians who do not practice at an institution with a dedicated clinical pharmacogenomics team and integrated pharmacogenomics CDS in the EHR, this report provides practical tips for documenting pharmacogenomic test results in the problem list and allergy field to maximize the visibility and utility of results over time, especially when such results could prevent the occurrence of serious adverse drug reactions or predict therapeutic failure.


2021 ◽  
Vol 5 (Supplement_1) ◽  
pp. 275-275
Author(s):  
Igor Akushevich ◽  
Carl V Hill ◽  
Konstantin Arbeev

The objective of the Symposium is to expand familiarity with the application of advanced methods of modern statistical modeling and data management to administrative health data by combining methodological innovations with practical hands-on demonstrations. Presentations will cover a range of methodological and substantive topics, including: i) decomposition and partitioning approaches in analysis of disparities and time trends in AD/ADRD; ii) new artificial intelligence technologies that allow us to enrich electronic health record datasets with self-report scores in geriatrics; iii) using administrative data to model adherence to disease management and health-related behavior; iv) the use of longitudinal extension of the average attributable fraction to study health disparities and multimorbidity; and v) the geographic and racial disparities in total and remaining life expectancies after diagnoses of AD/ADRD and other chronic conditions. The increasing availability of large-scale datasets based on electronic health records and administrative claims records provides an unprecedented opportunity for obtaining nationally representative results based on individual-level measures that reflect the real care-related and epidemiological processes. This makes reducing barriers to entry for the use of such data vitally important to the community of geriatrics and health researchers.


Author(s):  
Yuri Ahuja ◽  
Doudou Zhou ◽  
Zeling He ◽  
Jiehuan Sun ◽  
Victor M. Castro ◽  
...  

Objective: A major bottleneck hindering utilization of electronic health record (EHR) data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though ICD codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes. Methods: sureLDA is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on two surrogate features for each target phenotype, and then leverages these probabilities to constrain the Latent Dirichlet Allocation (LDA) topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via clustering ensemble to yield final phenotype probabilities. Results: sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and relative informativeness of surrogate versus non-surrogate features. It also exhibits powerful feature selection properties. Discussion: sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA’s feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes. Conclusion: sureLDA is well suited toward large-scale EHR phenotyping for highly multi-phenotype applications such as PheWAS.


2020 ◽  
Author(s):  
Amir Karami ◽  
Brandon Bookstaver ◽  
Melissa Nolan

BACKGROUND: The COVID-19 pandemic has impacted nearly all aspects of life and has posed significant threats to international health and the economy. Given the rapidly unfolding nature of the current pandemic, there is an urgent need to streamline literature synthesis of the growing scientific research to elucidate targeted solutions. While traditional systematic literature review studies provide valuable insights, these studies have restrictions, including analyzing a limited number of papers, having various biases, being time-consuming and labor-intensive, focusing on a few topics, being incapable of trend analysis, and lacking data-driven tools. OBJECTIVE: This study addresses these restrictions in the literature and practice by analyzing two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, with text mining methods in a corpus of COVID-19 research papers and by finding associations between the two concepts. METHODS: We collected papers representing COVID-19 pre-prints and peer-reviewed research published in 2020. We used frequency analysis to find highly frequent manifestations and therapeutic chemicals, representing the importance of the two biomedical concepts. We also applied topic modeling to find the relationship between the two biomedical concepts. RESULTS: We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent clinical manifestations of disease terminology included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terminology included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling provided 25 categories showing relationships between our two overarching categories. These categories represent statistically significant associations between multiple aspects of each category, some of which were novel and not previously identified by the scientific community. CONCLUSIONS: Appreciation of this context is vital due to the lack of a systematic large-scale literature review survey and the importance of fast literature review during the current COVID-19 pandemic for developing treatments. This study is beneficial to researchers for obtaining a macro-level picture of literature, to educators for knowing the scope of literature, to journals for exploring the most discussed disease symptoms and pharmaceutical targets, and to policymakers and funding agencies for creating scientific strategic plans regarding COVID-19.
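The frequency analysis step described in this abstract can be sketched as a document-frequency count over a term lexicon. The abstract does not specify the tooling, tokenization, or lexicons used, so the function name `document_frequency`, the plain substring matching, and the three example texts are assumptions made for illustration only.

```python
from collections import Counter


def document_frequency(papers, lexicon):
    """Count how many papers mention each lexicon term at least once.

    Matching is a case-insensitive substring test, which is a
    simplification: it would also match a term inside a longer word.
    """
    counts = Counter()
    for text in papers:
        lowered = text.lower()
        for term in lexicon:
            if term.lower() in lowered:
                counts[term] += 1
    return counts


# Made-up snippets standing in for paper abstracts.
papers = [
    "Fever and cough were common.",
    "Remdesivir reduced fever duration.",
    "No symptoms reported.",
]
lexicon = ["fever", "cough", "remdesivir"]
print(document_frequency(papers, lexicon).most_common())
```

Counting each term at most once per paper keeps a single long article from dominating the ranking, which suits a corpus-level comparison of how widely each manifestation or compound is discussed.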

