Surrogate-assisted feature extraction for high-throughput phenotyping

2016 ◽  
Vol 24 (e1) ◽  
pp. e143-e149 ◽  
Author(s):  
Sheng Yu ◽  
Abhishek Chakrabortty ◽  
Katherine P Liao ◽  
Tianrun Cai ◽  
Ashwin N Ananthakrishnan ◽  
...  

Objective: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. Methods: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype’s International Classification of Diseases, Ninth Revision (ICD-9) and natural language processing (NLP) counts, acting as noisy surrogates for the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. Results: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn’s disease, and ulcerative colitis using various numbers of labels, to compare the performance of features selected by SAFE, by a previously published automated feature extraction for phenotyping (AFEP) procedure, and by domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two approaches, especially at small label sizes. Conclusion: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the number of gold-standard labels needed. SAFE also potentially identifies important features missed by AFEP or by experts.
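As a rough illustration of SAFE's silver-standard idea, the sketch below derives silver labels from surrogate ICD and NLP counts and keeps the features a sparse regression deems predictive. All data, thresholds, and the L1-penalized model are illustrative assumptions, not the published procedure.

```python
# Minimal sketch of SAFE-style surrogate-assisted feature selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 5000, 200
disease = rng.binomial(1, 0.2, size=n)                 # latent true status (synthetic)
icd = rng.poisson(0.2 + 4 * disease)                   # surrogate: ICD code counts
nlp = rng.poisson(0.2 + 4 * disease)                   # surrogate: NLP mention counts
X = rng.poisson(1.0, size=(n, p)).astype(float)        # candidate concept features
X[:, :5] += 3 * disease[:, None]                       # 5 truly informative features

# Silver-standard labels from the surrogates alone: "likely case" when both
# counts are high, "likely control" when both are zero; others are dropped.
silver = np.where((icd >= 3) & (nlp >= 3), 1,
                  np.where((icd == 0) & (nlp == 0), 0, -1))
mask = silver >= 0

# L1-penalized regression of the silver labels on the candidate features;
# features with nonzero coefficients become the final feature set.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(np.log1p(X[mask]), silver[mask])
selected = np.flatnonzero(lasso.coef_.ravel() != 0)
print(f"selected {selected.size} of {p} candidate features")
```

No gold-standard labels enter the selection step; chart review is only needed afterwards, to train the final classifier on the reduced feature set.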

2015 ◽  
Vol 22 (5) ◽  
pp. 993-1000 ◽  
Author(s):  
Sheng Yu ◽  
Katherine P Liao ◽  
Stanley Y Shaw ◽  
Vivian S Gainer ◽  
Susanne E Churchill ◽  
...  

Abstract Objective: Analysis of narrative (text) data from electronic health records (EHRs) can improve population-scale phenotyping for clinical and genetic research. Currently, selection of text features for phenotyping algorithms is slow and laborious, requiring extensive and iterative involvement by domain experts. This paper introduces a method to develop phenotyping algorithms in an unbiased manner by automatically extracting and selecting informative features, which can be comparable to expert-curated ones in classification accuracy. Materials and methods: Comprehensive medical concepts were collected from publicly available knowledge sources in an automated, unbiased fashion. Natural language processing (NLP) revealed the occurrence patterns of these concepts in EHR narrative notes, which enabled selection of informative features for phenotype classification. When combined with additional codified features, a penalized logistic regression model was trained to classify the target phenotype. Results: We applied the method to develop algorithms identifying patients with rheumatoid arthritis (RA) and, among those with RA, patients with coronary artery disease (CAD), in a large multi-institutional EHR. The areas under the receiver operating characteristic curve (AUCs) for classifying RA and CAD using models trained with automated features were 0.951 and 0.929, respectively, compared with AUCs of 0.938 and 0.929 for models trained with expert-curated features. Discussion: Models trained with NLP text features selected through an unbiased, automated procedure achieved comparable or slightly higher accuracy than those trained with expert-curated features. The majority of the selected model features were interpretable. Conclusion: The proposed automated feature extraction method, generating highly accurate phenotyping algorithms with improved efficiency, is a significant step toward high-throughput phenotyping.
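A minimal sketch of the classification step described above, assuming log-transformed NLP and codified count features; the elastic-net logistic regression and the synthetic labels stand in for the paper's penalized model and gold-standard data.

```python
# Penalized logistic regression over NLP concept counts plus codified features.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
nlp_counts = rng.poisson(2.0, size=(n, 50))     # NLP concept mention counts
codes = rng.poisson(1.0, size=(n, 10))          # codified features (ICD, meds)
X = np.log1p(np.hstack([nlp_counts, codes]))    # log-transformed counts

logit = 2 * X[:, 0] + 2 * X[:, 50] - 3          # a few truly informative features
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # synthetic phenotype labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegressionCV(penalty="elasticnet", solver="saga",
                             l1_ratios=[0.5], Cs=5, max_iter=5000)
model.fit(X_tr, y_tr)
print("AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
```

The penalty shrinks uninformative concepts toward zero, which is what lets a large, automatically harvested feature pool compete with a short expert-curated list.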


Rheumatology ◽  
2020 ◽  
Vol 59 (12) ◽  
pp. 3759-3766 ◽  
Author(s):  
Sicong Huang ◽  
Jie Huang ◽  
Tianrun Cai ◽  
Kumar P Dahal ◽  
Andrew Cagan ◽  
...  

Abstract Objective: The objective of this study was to evaluate the performance of a rheumatoid arthritis (RA) algorithm, developed and trained in 2010 using natural language processing and machine learning, on updated data containing ICD-10 codes, new RA treatments, and a new electronic medical records (EMR) system. Methods: We extracted data from subjects with ≥1 RA International Classification of Diseases (ICD) code from the EMRs of two large academic centres to create a data mart. Gold-standard RA cases were identified by reviewing a random 200 subjects from the data mart and a random 100 subjects who had only RA ICD-10 codes. We compared the performance of the following algorithms on the updated data against the original 2010 data: (i) the published 2010 RA algorithm; (ii) an updated algorithm, incorporating RA ICD-10 codes and new DMARDs; and (iii) a published rule-based algorithm using ICD codes only (≥3 RA ICD codes). Results: The gold-standard RA cases had a mean age of 65.5 years; 78.7% were female and 74.1% were positive for RF or antibodies to cyclic citrullinated peptide (anti-CCP). The positive predictive value (PPV) of ≥3 RA ICD codes was 54%, compared with 56% in 2010. At a specificity of 95%, the PPVs of the 2010 algorithm and the updated version were both 91%, compared with 94% (95% CI: 91, 96%) in 2010. In subjects with ICD-10 data only, the PPV of the updated 2010 RA algorithm was 93%. Conclusion: The 2010 RA algorithm, validated on the updated data, showed performance characteristics similar to those observed with the 2010 data. While the 2010 algorithm continued to perform better than the rule-based approach, the PPV of the latter also remained stable over time.
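The operating point reported above (PPV at 95% specificity) can be illustrated as follows; the scores and chart-review labels are synthetic placeholders, and the cut-off rule is a generic one rather than the study's exact procedure.

```python
# Choose the probability cut-off that gives 95% specificity on labeled data,
# then report the PPV at that cut-off.
import numpy as np

rng = np.random.default_rng(2)
labels = rng.binomial(1, 0.5, size=500)                    # chart-review gold standard
scores = np.clip(labels * 0.3 + rng.normal(0.5, 0.2, 500), 0, 1)

controls = np.sort(scores[labels == 0])
cutoff = controls[int(np.ceil(0.95 * controls.size)) - 1]  # 95% of controls below
predicted_pos = scores > cutoff
ppv = labels[predicted_pos].mean()                         # true cases among flagged
spec = (scores[labels == 0] <= cutoff).mean()
print(f"specificity={spec:.2f}, PPV={ppv:.2f}")
```

Fixing specificity on the labeled sample is what makes the PPVs from 2010 and from the updated data directly comparable.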


2021 ◽  
Vol 4 (1) ◽  
pp. 1-19
Author(s):  
Lisa Bastarache

Electronic health records (EHRs) are a rich source of data for researchers, but extracting meaningful information out of this highly complex data source is challenging. Phecodes represent one strategy for defining phenotypes for research using EHR data. They are a high-throughput phenotyping tool based on ICD (International Classification of Diseases) codes that can be used to rapidly define the case/control status of thousands of clinically meaningful diseases and conditions. Phecodes were originally developed to conduct phenome-wide association studies to scan for phenotypic associations with common genetic variants. Since then, phecodes have been used to support a wide range of EHR-based phenotyping methods, including the phenotype risk score. This review aims to comprehensively describe the development, validation, and applications of phecodes and suggest some future directions for phecodes and high-throughput phenotyping.
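A toy sketch of how phecodes can define case/control status from ICD codes; the two-entry mapping and the "two distinct code dates" rule are simplified, hypothetical stand-ins for the published phecode maps and exclusion logic.

```python
# Illustrative phecode-based case/control assignment.
icd_to_phecode = {"714.0": "714.1", "M05.79": "714.1"}   # toy RA mapping only

def assign_status(patient_codes, phecode, min_dates=2):
    """Return 'case' if the phenotype's phecode occurs on >= min_dates
    distinct dates, else 'control' (exclusion ranges omitted here)."""
    dates = {d for code, d in patient_codes
             if icd_to_phecode.get(code) == phecode}
    return "case" if len(dates) >= min_dates else "control"

codes = [("714.0", "2018-01-05"), ("M05.79", "2019-03-02")]
print(assign_status(codes, "714.1"))                      # -> case
```

Because the mapping is fixed and code-based, the same script scales to thousands of phenotypes at once, which is what makes phenome-wide scans feasible.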


2019 ◽  
Author(s):  
Katherine P. Liao ◽  
Jiehuan Sun ◽  
Tianrun A. Cai ◽  
Nicholas Link ◽  
Chuan Hong ◽  
...  

Abstract Objective: Electronic health records (EHR) linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. The objective of this study was to develop an automated high-throughput phenotyping method integrating International Classification of Diseases (ICD) codes and narrative data extracted using natural language processing (NLP). Methods: We developed a mapping method for automatically identifying relevant ICD and NLP concepts for a specific phenotype, leveraging the UMLS. Aggregated ICD and NLP counts, along with healthcare utilization, were jointly analyzed by fitting an ensemble of latent mixture models. The MAP algorithm yields a predicted probability of the phenotype for each patient and a threshold for classifying subjects as having the phenotype or not. The algorithm was validated using labeled data for 16 phenotypes from a biorepository and further tested in an independent cohort through a phenome-wide association study (PheWAS) for two SNPs with known associations. Results: The MAP algorithm achieved higher or similar AUCs and F-scores compared with the ICD code alone across all 16 phenotypes. The features assembled via the automated approach had accuracy comparable to those assembled via manual curation (AUC 0.943 for MAP vs 0.941 for manual). The PheWAS results suggest that the MAP approach detected previously validated associations with higher power than the standard PheWAS method based on ICD codes. Conclusion: The MAP approach increased the accuracy of phenotype definition while maintaining scalability, facilitating use in studies requiring large-scale phenotyping, such as PheWAS.
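As a rough analogue of MAP's unsupervised step, the sketch below fits a single two-component Gaussian mixture to utilization-adjusted log counts and reads off each patient's posterior probability of being a case; the published method fits an ensemble of latent mixture models, so this is only an illustration.

```python
# Two-component mixture on log-scaled ICD and NLP counts as a MAP-style toy.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
cases = rng.poisson(8.0, size=(300, 2))        # high ICD/NLP counts (synthetic)
controls = rng.poisson(0.7, size=(700, 2))     # low counts (synthetic)
counts = np.vstack([cases, controls])
util = counts.sum(axis=1, keepdims=True) + rng.poisson(5, (1000, 1))

# Adjust for healthcare utilization, then log-transform.
features = np.log1p(counts) - np.log1p(util)

gm = GaussianMixture(n_components=2, random_state=0).fit(features)
case_comp = np.argmax(gm.means_[:, 0])         # component with higher counts
prob_case = gm.predict_proba(features)[:, case_comp]
print("predicted cases:", (prob_case > 0.5).sum())
```

No labels are used in the fit, which is why the approach scales to any phenotype with mappable ICD and NLP concepts.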


2021 ◽  
Author(s):  
Felix M. Bauer ◽  
Lena Lärm ◽  
Shehan Morandage ◽  
Guillaume Lobet ◽  
Jan Vanderborght ◽  
...  

Root systems of crops play a significant role in agro-ecosystems. The root system is essential for water and nutrient uptake, plant stability, symbiosis with microbes, and good soil structure. Minirhizotrons, consisting of transparent tubes that create windows into the soil, have been shown to be effective for non-invasively investigating the root system. Root traits, such as the root length observed around the minirhizotron tubes, can thus be obtained throughout the crop growing season. Analyzing minirhizotron datasets with common manual annotation methods and conventional software tools is time-consuming and labor-intensive, so an objective method for high-throughput image analysis that provides data for field root phenotyping is necessary. In this study we developed a pipeline combining state-of-the-art software tools, using deep neural networks and automated feature extraction, and applied it to large root image datasets from minirhizotrons. The pipeline consists of two major components. First, segmentation is performed by a neural network model trained with a small image sample; training and segmentation are done using “Root-Painter”. Then, automated feature extraction from the segments is carried out by “RhizoVision Explorer”. To validate the results of our automated analysis pipeline, root lengths from manually annotated and automatically processed data were compared for more than 58,000 images. The results show a high correlation (R = 0.81) between manually and automatically determined root lengths, and the new pipeline reduces processing time by 98.1–99.6% compared with manual annotation. Our pipeline, combining state-of-the-art software tools, significantly reduces the processing time for minirhizotron images, so image analysis is no longer the bottleneck in high-throughput phenotyping approaches.
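The validation step above amounts to correlating per-image root lengths from the two sources; a sketch with synthetic measurements (tuned to land near the reported R = 0.81) is given below.

```python
# Correlate automatically extracted root lengths with manual annotations.
import numpy as np

rng = np.random.default_rng(4)
manual = rng.gamma(2.0, 50.0, size=58000)                 # manual root length (mm)
automated = 0.9 * manual + rng.normal(0.0, 50.0, 58000)   # pipeline output (synthetic)

r = np.corrcoef(manual, automated)[0, 1]
print(f"Pearson R = {r:.2f}")                             # lands near 0.8 here
```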


2020 ◽  
Vol 2020 ◽  
pp. 1-8
Author(s):  
Ronghao Wang ◽  
Yumou Qiu ◽  
Yuzhen Zhou ◽  
Zhikai Liang ◽  
James C. Schnable

High-throughput phenotyping systems have become increasingly popular in plant science research. Data analysis for such systems typically involves two steps: plant feature extraction through image processing, and statistical analysis of the extracted features. The current practice is to perform these two steps on different platforms. We developed the R package “implant” for both robust feature extraction and functional data analysis. For image processing, the “implant” package provides methods including thresholding, a hidden Markov random field model, and morphological operations. For statistical analysis, the package produces nonparametric curve fits with confidence regions for plant growth, and provides a functional ANOVA model to test for treatment and genotype effects on plant growth dynamics.
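The “implant” package itself is written in R; the Python/scikit-image fragment below is only a loose analogue of its image-processing step (thresholding followed by morphological clean-up), using assumed synthetic data.

```python
# Threshold a plant image, clean the mask, then measure a simple feature.
import numpy as np
from skimage import filters, morphology

rng = np.random.default_rng(5)
image = rng.random((128, 128))
image[40:90, 50:80] += 1.0                            # bright "plant" region

thresh = filters.threshold_otsu(image)                # global thresholding
mask = image > thresh
mask = morphology.opening(mask, morphology.disk(2))   # remove speckle
mask = morphology.closing(mask, morphology.disk(2))   # fill small gaps

print("plant pixel area:", int(mask.sum()))           # one extracted feature
```

Features extracted this way per imaging day form the growth curves that the package's functional ANOVA then compares across treatments and genotypes.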


JAMIA Open ◽  
2020 ◽  
Vol 3 (2) ◽  
pp. 185-189
Author(s):  
Timothy A Miller ◽  
Paul Avillach ◽  
Kenneth D Mandl

Abstract Objective: To develop scalable natural language processing (NLP) infrastructure for processing the free text in electronic health records (EHRs). Materials and Methods: We extend the open-source Apache cTAKES NLP software with several standard technologies for scalability. We remove processing bottlenecks by monitoring component queue sizes. We process EHR free text for patients in the PrecisionLink Biobank at Boston Children’s Hospital. The extracted concepts are made searchable via a web-based portal. Results: We processed over 1.2 million notes for over 8000 patients, extracting 154 million concepts. Our largest tested configuration processes over 1 million notes per day. Discussion: The unique information represented by extracted NLP concepts has great potential to provide a more complete picture of patient status. Conclusion: NLP on large EHR document collections can be performed efficiently, in service of high-throughput phenotyping.
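A toy sketch of the queue-monitoring idea: when a pipeline is split into queued stages, the stage whose input queue keeps growing is the bottleneck. The stages and timings are hypothetical, not cTAKES internals.

```python
# Two queued pipeline stages; the backed-up queue flags the bottleneck.
import queue
import threading
import time

q_in, q_out = queue.Queue(), queue.Queue()

def stage(src, dst, delay):
    while True:
        note = src.get()
        time.sleep(delay)                 # simulated per-note processing cost
        if dst is not None:
            dst.put(note)

threading.Thread(target=stage, args=(q_in, q_out, 0.001), daemon=True).start()
threading.Thread(target=stage, args=(q_out, None, 0.01), daemon=True).start()

for i in range(200):
    q_in.put(f"note-{i}")
time.sleep(0.5)

# The slow second stage's input queue backs up; in a real deployment that
# stage would get additional worker instances.
print("stage-1 queue:", q_in.qsize(), "| stage-2 queue:", q_out.qsize())
```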


2011 ◽  
Author(s):  
E. Kyzar ◽  
S. Gaikwad ◽  
M. Pham ◽  
J. Green ◽  
A. Roth ◽  
...  

2021 ◽  
Author(s):  
Peng Song ◽  
Jinglu Wang ◽  
Xinyu Guo ◽  
Wanneng Yang ◽  
Chunjiang Zhao
