Applying active learning to high-throughput phenotyping algorithms for electronic health records data

2013 ◽  
Vol 20 (e2) ◽  
pp. e253-e259 ◽  
Author(s):  
Yukun Chen ◽  
Robert J Carroll ◽  
Eugenia R McPeek Hinz ◽  
Anushi Shah ◽  
Anne E Eyler ◽  
...  
2013 ◽  
Vol 20 (e2) ◽  
pp. e341-e348 ◽  
Author(s):  
Jyotishman Pathak ◽  
Kent R Bailey ◽  
Calvin E Beebe ◽  
Steven Bethard ◽  
David S Carrell ◽  
...  

2018 ◽  
Vol 83 (12) ◽  
pp. 997-1004 ◽  
Author(s):  
Thomas H. McCoy ◽  
Sheng Yu ◽  
Kamber L. Hart ◽  
Victor M. Castro ◽  
Hannah E. Brown ◽  
...  

2017 ◽  
Author(s):  
Chia-Yen Chen ◽  
Phil H. Lee ◽  
Victor M. Castro ◽  
Jessica Minnier ◽  
Alexander W. Charney ◽  
...  

Abstract: Bipolar disorder (BD) is a heritable mood disorder characterized by episodes of mania and depression. Although genomewide association studies (GWAS) have successfully identified genetic loci contributing to BD risk, sample size has become a rate-limiting obstacle to genetic discovery. Electronic health records (EHRs) represent a vast but relatively untapped resource for high-throughput phenotyping. As part of the International Cohort Collection for Bipolar Disorder (ICCBD), we previously validated automated EHR-based phenotyping algorithms for BD against in-person diagnostic interviews (Castro et al. 2015). Here, we establish the genetic validity of these phenotypes by determining their genetic correlation with traditionally-ascertained samples. Case and control algorithms were derived from structured and narrative text in the Partners Healthcare system comprising more than 4.6 million patients over 20 years. Genomewide genotype data for 3,330 BD cases and 3,952 controls of European ancestry were used to estimate SNP-based heritability (h2g) and genetic correlation (rg) between EHR-based phenotype definitions and traditionally-ascertained BD cases in GWAS by the ICCBD and Psychiatric Genomics Consortium (PGC) using LD score regression. We evaluated BD cases identified using 4 EHR-based algorithms: an NLP-based algorithm (95-NLP) and 3 rule-based algorithms using codified EHR with decreasing levels of stringency - “coded-strict”, “coded-broad”, and “coded-broad based on a single clinical encounter” (coded-broad-SV). The analytic sample comprised 862 95-NLP, 1,968 coded-strict, 2,581 coded-broad, 408 coded-broad-SV BD cases, and 3,952 controls. The estimated h2g were 0.24 (p=0.015), 0.09 (p=0.064), 0.13 (p=0.003), 0.00 (p=0.591) for 95-NLP, coded-strict, coded-broad and coded-broad-SV BD, respectively. The h2g for all EHR-based cases combined except coded-broad-SV (excluded due to 0 h2g) was 0.12 (p=0.004). These h2g were lower than or similar to the h2g observed by the ICCBD+PGCBD (0.23, p=3.17E-80, total N=33,181). However, the rg between ICCBD+PGCBD and the EHR-based cases were high for 95-NLP (0.66, p=3.69×10⁻⁵), coded-strict (1.00, p=2.40×10⁻⁴), and coded-broad (0.74, p=8.11×10⁻⁷). The rg between EHR-based BDs ranged from 0.90 to 0.98. These results provide the first genetic validation of automated EHR-based phenotyping for BD and suggest that this approach identifies cases that are highly genetically correlated with those ascertained through conventional methods. High-throughput phenotyping using the large data resources available in EHRs represents a viable method for accelerating psychiatric genetic research.


2020 ◽  
Vol 27 (11) ◽  
pp. 1675-1687
Author(s):  
Neil S Zheng ◽  
QiPing Feng ◽  
V Eric Kerchberger ◽  
Juan Zhao ◽  
Todd L Edwards ◽  
...  

Abstract. Objective: Developing algorithms to extract phenotypes from electronic health records (EHRs) can be challenging and time-consuming. We developed PheMap, a high-throughput phenotyping approach that leverages multiple independent, online resources to streamline the phenotyping process within EHRs. Materials and Methods: PheMap is a knowledge base of medical concepts with quantified relationships to phenotypes that have been extracted by natural language processing from publicly available resources. PheMap searches EHRs for each phenotype’s quantified concepts and uses them to calculate an individual’s probability of having this phenotype. We compared PheMap to clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network for type 2 diabetes mellitus (T2DM), dementia, and hypothyroidism using 84 821 individuals from Vanderbilt University Medical Center's BioVU DNA Biobank. We implemented PheMap-based phenotypes for genome-wide association studies (GWAS) for T2DM, dementia, and hypothyroidism, and phenome-wide association studies (PheWAS) for variants in FTO, HLA-DRB1, and TCF7L2. Results: In this initial iteration, the PheMap knowledge base contains quantified concepts for 841 disease phenotypes. For T2DM, dementia, and hypothyroidism, the accuracy of the PheMap phenotypes was >97% using a 50% threshold and eMERGE case-control status as a reference standard. In the GWAS analyses, PheMap-derived phenotype probabilities replicated 43 of 51 previously reported disease-associated variants for the 3 phenotypes. For 9 of the 11 top associations, PheMap provided an equivalent or more significant P value than eMERGE-based phenotypes. The PheMap-based PheWAS performed comparably to or better than a traditional phecode-based PheWAS. PheMap is publicly available online. Conclusions: PheMap significantly streamlines the process of extracting research-quality phenotype information from EHRs, with performance comparable to or better than current phenotyping approaches.
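
The accuracy figure in this abstract comes from converting phenotype probabilities to case/control calls at a 50% threshold and comparing them with a reference standard. The following is a minimal Python sketch of that comparison only; it is not PheMap's code, and the function name, the tie-breaking rule (a probability at or above the threshold counts as a case), and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def threshold_accuracy(
    probabilities: Sequence[float],
    reference_is_case: Sequence[bool],
    threshold: float = 0.5,
) -> float:
    """Return the fraction of individuals whose thresholded call matches the reference."""
    if len(probabilities) != len(reference_is_case):
        raise ValueError("inputs must have the same length")
    if len(probabilities) == 0:
        raise ValueError("inputs must not be empty")
    # Assumption: a probability at or above the threshold is called a case.
    predicted = [p >= threshold for p in probabilities]
    matches = sum(pred == ref for pred, ref in zip(predicted, reference_is_case))
    return matches / len(probabilities)


# Hypothetical toy data, not taken from the study.
toy_probabilities = [0.92, 0.10, 0.55, 0.48, 0.03]
toy_reference = [True, False, True, True, False]
toy_accuracy = threshold_accuracy(toy_probabilities, toy_reference)
```

In the toy data, the fourth individual (probability 0.48, reference case) is the only disagreement, so four of five calls match the reference.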


2019 ◽  
Vol 111 (1) ◽  
pp. 110-121 ◽  
Author(s):  
Bianca Vora ◽  
Elizabeth A E Green ◽  
Natalia Khuri ◽  
Frida Ballgren ◽  
Marina Sirota ◽  
...  

ABSTRACT. Background: Transporter-mediated drug–nutrient interactions have the potential to cause serious adverse events. However, unlike drug–drug interactions, these drug–nutrient interactions receive little attention during drug development. The clinical importance of drug–nutrient interactions was highlighted when a phase III clinical trial was terminated due to severe adverse events resulting from potent inhibition of thiamine transporter 2 (ThTR-2; SLC19A3). Objective: In this study, we tested the hypothesis that therapeutic drugs inhibit the intestinal thiamine transporter ThTR-2, which may lead to thiamine deficiency. Methods: For this exploration, we took a multifaceted approach, starting with a high-throughput in vitro primary screen to identify inhibitors, building in silico models to characterize inhibitors, and leveraging real-world data from electronic health records to begin to understand the clinical relevance of these inhibitors. Results: Our high-throughput screen of 1360 compounds, including many clinically used drugs, identified 146 potential inhibitors at 200 μM. Inhibition kinetics were determined for 28 drugs with half-maximal inhibitory concentration (IC50) values ranging from 1.03 μM to >1 mM. Several oral drugs, including metformin, were predicted to have intestinal concentrations that may result in ThTR-2–mediated drug–nutrient interactions. Complementary analysis using electronic health records suggested that thiamine laboratory values are reduced in individuals receiving prescription drugs found to significantly inhibit ThTR-2, particularly in vulnerable populations (e.g., individuals with alcoholism). Conclusions: Our comprehensive analysis of prescription drugs suggests that several marketed drugs inhibit ThTR-2, which may contribute to thiamine deficiency, especially in at-risk populations.


2020 ◽  
Vol 23 (1) ◽  
pp. 21-26 ◽  
Author(s):  
Nemanja Vaci ◽  
Qiang Liu ◽  
Andrey Kormilitzin ◽  
Franco De Crescenzo ◽  
Ayse Kurtulmus ◽  
...  

Background: Utilisation of routinely collected electronic health records from secondary care offers unprecedented possibilities for medical science research but can also present difficulties. One key issue is that medical information is presented as free-form text and, therefore, requires time commitment from clinicians to manually extract salient information. Natural language processing (NLP) methods can be used to automatically extract clinically relevant information. Objective: Our aim is to use natural language processing (NLP) to capture real-world data on individuals with depression from the Clinical Record Interactive Search (CRIS) clinical text to foster the use of electronic healthcare data in mental health research. Methods: We used a combination of methods to extract salient information from electronic health records. First, clinical experts define the information of interest and subsequently build the training and testing corpora for statistical models. Second, we built and fine-tuned the statistical models using active learning procedures. Findings: Results show a high degree of accuracy in the extraction of drug-related information. Contrastingly, a much lower degree of accuracy is demonstrated in relation to auxiliary variables. In combination with state-of-the-art active learning paradigms, the performance of the model increases considerably. Conclusions: This study illustrates the feasibility of using the natural language processing models and proposes a research pipeline to be used for accurately extracting information from electronic health records. Clinical implications: Real-world, individual patient data are an invaluable source of information, which can be used to better personalise treatment.


PLoS Genetics ◽  
2021 ◽  
Vol 17 (6) ◽  
pp. e1009593
Author(s):  
Neil S. Zheng ◽  
Cosby A. Stone ◽  
Lan Jiang ◽  
Christian M. Shaffer ◽  
V. Eric Kerchberger ◽  
...  

Understanding the contribution of genetic variation to drug response can improve the delivery of precision medicine. However, genome-wide association studies (GWAS) for drug response are uncommon and are often hindered by small sample sizes. We present a high-throughput framework to efficiently identify eligible patients for genetic studies of adverse drug reactions (ADRs) using “drug allergy” labels from electronic health records (EHRs). As a proof-of-concept, we conducted GWAS for ADRs to 14 common drug/drug groups with 81,739 individuals from Vanderbilt University Medical Center’s BioVU DNA Biobank. We identified 7 genetic loci associated with ADRs at P < 5 × 10⁻⁸, including known genetic associations such as CYP2D6 and OPRM1 for CYP2D6-metabolized opioid ADR. Additional expression quantitative trait loci and phenome-wide association analyses added evidence to the observed associations. Our high-throughput framework is both scalable and portable, enabling impactful pharmacogenomic research to improve precision medicine.

