Developing a scalable FHIR-based clinical data normalization pipeline for standardizing and integrating unstructured and structured electronic health record data

JAMIA Open ◽  
2019 ◽  
Vol 2 (4) ◽  
pp. 570-579 ◽  
Author(s):  
Na Hong ◽  
Andrew Wen ◽  
Feichen Shen ◽  
Sunghwan Sohn ◽  
Chen Wang ◽  
...  

Abstract Objective: To design, develop, and evaluate a scalable clinical data normalization pipeline for standardizing unstructured electronic health record (EHR) data, leveraging the HL7 Fast Healthcare Interoperability Resources (FHIR) specification. Methods: We established an FHIR-based clinical data normalization pipeline, NLP2FHIR, that mainly comprises: (1) a module for a core natural language processing (NLP) engine with an FHIR-based type system; (2) a module for integrating structured data; and (3) a module for content normalization. We evaluated the FHIR modeling capability on core clinical resources such as Condition, Procedure, MedicationStatement (including Medication), and FamilyMemberHistory using Mayo Clinic's unstructured EHR data. We constructed a gold standard by reusing annotation corpora from previous NLP projects. Results: A total of 30 mapping rules, 62 normalization rules, and 11 NLP-specific FHIR extensions were created and implemented in the NLP2FHIR pipeline. For each clinical resource, the elements that require integration of structured data were identified. Unstructured data modeling achieved F scores ranging from 0.69 to 0.99 across FHIR element representations (0.69–0.99 for Condition; 0.75–0.84 for Procedure; 0.71–0.99 for MedicationStatement; and 0.75–0.95 for FamilyMemberHistory). Conclusion: We demonstrated that the NLP2FHIR pipeline is feasible for modeling unstructured EHR data and integrating structured elements into the model. This work provides standards-based clinical data normalization tools that are indispensable for portable EHR-driven phenotyping and large-scale data analytics, as well as useful insights for future development of the FHIR specification with regard to handling unstructured clinical data.
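As a minimal illustration of the kind of NLP-to-FHIR mapping the pipeline performs, the Python sketch below turns a hypothetical NLP mention into a FHIR R4 Condition resource. The resource fields follow the published FHIR Condition structure, but the mention shape and the mapping function are illustrative assumptions, not NLP2FHIR's actual type system or API.

```python
import json

def mention_to_fhir_condition(mention):
    """Map a hypothetical NLP mention (a dict with a patient ID, a SNOMED CT
    code, surface text, and a negation flag) onto a FHIR R4 Condition.
    Illustrative only: NLP2FHIR's real type system and extensions differ."""
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{mention['patient_id']}"},
        "code": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": mention["snomed_code"],
                "display": mention["text"],
            }],
            "text": mention["text"],
        },
        # A negated mention becomes a refuted Condition rather than being dropped.
        "verificationStatus": {
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/condition-ver-status",
                "code": "refuted" if mention.get("negated") else "confirmed",
            }]
        },
    }

# Hypothetical mention as an NLP engine might emit it.
mention = {"patient_id": "123", "snomed_code": "38341003",
           "text": "hypertension", "negated": False}
print(json.dumps(mention_to_fhir_condition(mention), indent=2))
```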

2020 ◽  
Vol 16 (3) ◽  
pp. 531-540 ◽  
Author(s):  
Thomas H. McCoy ◽  
Larry Han ◽  
Amelia M. Pellegrini ◽  
Rudolph E. Tanzi ◽  
Sabina Berretta ◽  
...  

2014 ◽  
Vol 23 (01) ◽  
pp. 97-104 ◽  
Author(s):  
M. K. Ross ◽  
Wei Wei ◽  
L. Ohno-Machado

Summary Objectives: Implementation of Electronic Health Record (EHR) systems continues to expand. The massive number of patient encounters results in large amounts of stored data. Transforming clinical data into knowledge to improve patient care has been the goal of biomedical informatics professionals for many decades, and this work is now increasingly recognized outside our field. In reviewing the literature of the past three years, we focus on "big data" in the context of EHR systems and report examples of how secondary use of data has been put into practice. Methods: We searched the PubMed database for articles published from January 1, 2011 to November 1, 2013. We initiated the search with keywords related to "big data" and EHR, identified relevant articles, and added further keywords drawn from the retrieved articles. Based on the new keywords, we retrieved more articles and manually narrowed the set using predefined inclusion and exclusion criteria. Results: Our final review includes articles categorized into the themes of data mining (pharmacovigilance, phenotyping, natural language processing), data application and integration (clinical decision support, personal monitoring, social media), and privacy and security. Conclusion: The increasing adoption of EHR systems worldwide makes it possible to capture large amounts of clinical data. A growing number of articles address the theme of "big data", and the concepts associated with them vary. The next step is to transform healthcare big data into actionable knowledge.
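The date-bounded keyword search the authors describe can be reproduced programmatically. The sketch below uses Biopython's Entrez wrapper around the NCBI E-utilities; the search term is an illustrative stand-in for the review's full, iteratively expanded keyword set.

```python
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact address

# Keyword query bounded to the review's window (Jan 1, 2011 - Nov 1, 2013).
# The term is an assumption; the actual review iterated over many keywords.
handle = Entrez.esearch(
    db="pubmed",
    term='"big data"[Title/Abstract] AND "electronic health record"[Title/Abstract]',
    datetype="pdat",
    mindate="2011/01/01",
    maxdate="2013/11/01",
    retmax=200,
)
record = Entrez.read(handle)
handle.close()
print(record["Count"], "matches; first PMIDs:", record["IdList"][:5])
```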


2021 ◽  
pp. 103879
Author(s):  
Lin Liu ◽  
Ranier Bustamante ◽  
Ashley Earles ◽  
Joshua Demb ◽  
Karen Messer ◽  
...  

2014 ◽  
Vol 23 (01) ◽  
pp. 215-223 ◽  
Author(s):  
M. M. Horvath ◽  
S. A. Rusincovitch ◽  
R. L. Richesson

Summary Objectives: The goal of this survey is to discuss the impact of the growing availability of electronic health record (EHR) data on the evolving field of Clinical Research Informatics (CRI), which is the union of biomedical research and informatics. Results: Major challenges for the use of EHR-derived data for research include the lack of standard methods for ensuring that data quality, completeness, and provenance are sufficient to assess the appropriateness of its use for research. Areas that need continued emphasis include methods for integrating data from heterogeneous sources, guidelines (including explicit phenotype definitions) for using these data in both pragmatic clinical trials and observational investigations, strong data governance to better understand and control quality of enterprise data, and promotion of national standards for representing and using clinical data. Conclusions: The use of EHR data has become a priority in CRI. Awareness of underlying clinical data collection processes will be essential in order to leverage these data for clinical research and patient care, and will require multi-disciplinary teams representing clinical research, informatics, and healthcare operations. Considerations for the use of EHR data provide a starting point for practical applications and a CRI research agenda, which will be facilitated by CRI’s key role in the infrastructure of a learning healthcare system.


2016 ◽  
Vol 24 (1) ◽  
pp. 162-171 ◽  
Author(s):  
Pedro L Teixeira ◽  
Wei-Qi Wei ◽  
Robert M Cronin ◽  
Huan Mo ◽  
Jacob P VanHouten ◽  
...  

Objective: Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls, and provide a portable module for implementation at other sites. Materials and Methods: We reviewed the EHRs of 631 individuals followed at Vanderbilt to determine hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10 of the best-performing algorithms at the Marshfield Clinic. Results: Random forests using billing codes, medications, vitals, and concepts had the best performance, with a median area under the receiver operating characteristic curve (AUC) of 0.976. Normalized sums of all 4 categories also performed well (0.959 AUC). The best non-NLP algorithm combined normalized ICD9 codes, medications, and blood pressure readings, with a median AUC of 0.948. Blood pressure cutoffs or ICD9 code counts alone had AUCs of 0.854 and 0.908, respectively. Marshfield Clinic results were similar. Conclusion: This work shows that billing codes or blood pressure readings alone yield good hypertension classification performance. However, even simple combinations of input categories improve performance. The most complex algorithms classified hypertension with excellent recall and precision.
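A minimal sketch of the best-performing approach the abstract names, a random forest over several EHR feature categories evaluated by AUC, is shown below using scikit-learn. The features and labels are synthetic stand-ins; the real study drew them from billing codes, medication records, vitals, and NLP-extracted concepts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 631  # cohort size in the study; the features below are synthetic stand-ins

# Hypothetical feature columns: ICD9 hypertension code count,
# antihypertensive medication mention count, median systolic BP.
X = np.column_stack([
    rng.poisson(2, n),           # billing-code counts
    rng.poisson(1, n),           # medication mentions
    rng.normal(135, 15, n),      # median systolic blood pressure
])
y = (X[:, 2] + 5 * X[:, 0] + rng.normal(0, 10, n)) > 150  # synthetic labels

clf = RandomForestClassifier(n_estimators=500, random_state=0)
# Out-of-fold predicted probabilities give an honest AUC estimate.
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
print(f"AUC: {roc_auc_score(y, proba):.3f}")
```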


2021 ◽  
Author(s):  
Yuri Ahuja ◽  
Liang Liang ◽  
Sicong Huang ◽  
Tianxi Cai

Leveraging large-scale electronic health record (EHR) data to estimate survival curves for clinical events can enable more powerful risk estimation and comparative effectiveness research. However, use of EHR data is hindered by a lack of directly observed event times. Occurrence times of relevant diagnostic codes or of target disease mentions in clinical notes are at best a good approximation of the true disease onset time. On the other hand, extracting precise information on the exact event time requires laborious manual chart review and is sometimes altogether infeasible due to a lack of detailed documentation. Current status labels (binary indicators of phenotype status during follow-up) are significantly more efficient and feasible to compile, enabling more precise survival curve estimation given limited resources. Existing survival analysis methods using current status labels focus almost entirely on supervised estimation, and naive incorporation of unlabeled data into these methods may lead to biased results. In this paper we propose Semi-supervised Calibration of Risk with Noisy Event Times (SCORNET), which yields a consistent and efficient survival curve estimator by leveraging a small set of current status labels and a large set of imperfect surrogate features. In addition to providing theoretical justification for SCORNET, we demonstrate in both simulation and real-world EHR settings that SCORNET achieves efficiency akin to that of parametric Weibull regression while exhibiting nonparametric flexibility and relatively low empirical bias across a variety of generative settings.
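SCORNET itself is semi-supervised, but the supervised parametric baseline it is benchmarked against is straightforward to sketch: under current status observation, each subject contributes only a monitoring time C and an indicator of whether the event had occurred by C, so a Weibull survival curve can be fit by maximizing the resulting binary likelihood. The self-contained simulation below illustrates that baseline under assumed Weibull event times; it is not the authors' estimator.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulate current-status data: true event times T are Weibull(shape=1.5,
# scale=2); each subject is checked once at a random time C, and we observe
# only delta = 1{T <= C} (event occurred by follow-up), never T itself.
n = 2000
T = 2.0 * rng.weibull(1.5, n)
C = rng.uniform(0.1, 5.0, n)
delta = (T <= C).astype(float)

def neg_log_lik(params):
    """Negative log-likelihood of Weibull(shape k, scale lam) under
    current-status observation: P(T <= C) = 1 - exp(-(C/lam)**k)."""
    log_k, log_lam = params            # optimize on the log scale for positivity
    k, lam = np.exp(log_k), np.exp(log_lam)
    h = (C / lam) ** k                 # cumulative hazard at C
    logF = np.log1p(-np.exp(-h))       # log P(event by C)
    return -np.sum(delta * logF + (1 - delta) * (-h))

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
k_hat, lam_hat = np.exp(res.x)
print(f"estimated shape {k_hat:.2f} (true 1.5), scale {lam_hat:.2f} (true 2.0)")
```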


Medical Care ◽  
2019 ◽  
Vol 57 (10) ◽  
pp. e60-e64 ◽  
Author(s):  
Ranier Bustamante ◽  
Ashley Earles ◽  
James D. Murphy ◽  
Alex K. Bryant ◽  
Olga V. Patterson ◽  
...  

2021 ◽  
pp. 263208432110612
Author(s):  
Joseph Grant Brazeal ◽  
Alexander V Alekseyenko ◽  
Hong Li ◽  
Mario Fugal ◽  
Katie Kirchoff ◽  
...  

Objective: We evaluate data agreement between an electronic health record (EHR) sample abstracted by automated characterization and a standard abstracted by manual review. Study Design and Setting: We obtained data for an epidemiology cohort study using standard manual abstraction of the EHR and automated identification of the same patients using a structured algorithm to query the EHR. Summary measures of agreement (e.g., Cohen's kappa) are reported for 12 variables commonly used in epidemiological studies. Results: Agreement between abstraction methods is best for demographic characteristics such as age, sex, and race, and for positive history of disease. Agreement is poor for missing data and negative history, suggesting a potential impact on researchers using automated EHR characterization. EHR data quality depends on providers, who may be influenced by both institutional and federal government documentation guidelines. Conclusion: Discrepancies in automated EHR abstraction may decrease power and increase bias; caution is therefore warranted when selecting variables from EHRs for epidemiological study using an automated characterization approach. Validation of automated methods must also continue to advance in sophistication, alongside technologies such as machine learning and natural language processing for extracting unstructured data from the EHR, as these are applied to EHR characterization for clinical epidemiology.
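For a single variable, the agreement statistic the authors report can be computed directly. The sketch below applies scikit-learn's cohen_kappa_score to hypothetical manual versus automated abstractions; the variable, its values, and the pattern of disagreement concentrating in missing data are illustrative assumptions echoing the study's finding.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical abstractions of one variable (smoking history) for the same
# 12 patients: one column from manual chart review, one from the automated
# EHR query. Values are illustrative, not the study's data.
manual    = ["yes", "no", "yes", "missing", "no", "yes",
             "no", "no", "yes", "missing", "yes", "no"]
automated = ["yes", "no", "yes", "no", "no", "yes",
             "no", "missing", "yes", "no", "yes", "no"]

# Kappa corrects raw percent agreement for agreement expected by chance;
# the disagreements here sit in the "missing" category, mirroring the
# study's observation that missing data drive poor agreement.
kappa = cohen_kappa_score(manual, automated)
print(f"Cohen's kappa: {kappa:.2f}")
```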

