A Framework for Criteria-Based Selection and Processing of Fast Healthcare Interoperability Resources (FHIR) Data for Statistical Analysis: Design and Implementation Study

10.2196/25645 ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. e25645
Author(s):  
Julian Gruendner ◽  
Christian Gulden ◽  
Marvin Kampf ◽  
Sebastian Mate ◽  
Hans-Ulrich Prokosch ◽  
...  

Background The harmonization and standardization of digital medical information for research purposes is a challenging and ongoing collaborative effort. Current research data repositories typically require extensive efforts to harmonize and transform the original clinical data. The Fast Healthcare Interoperability Resources (FHIR) format was designed primarily to represent clinical processes; therefore, it closely resembles the clinical data model and is more widely available across modern electronic health records. However, no common standardized data format is directly suitable for statistical analyses, so data must be preprocessed before statistical analysis. Objective This study aimed to elucidate how FHIR data can be queried directly with a preprocessing service and used for statistical analyses. Methods We propose that the binary JavaScript Object Notation format of the PostgreSQL (PSQL) open source database is suitable not only for storing FHIR data but also for extension with preprocessing and filtering services that directly transform data stored in FHIR format into prepared data subsets for statistical analysis. We specified an interface for this preprocessor, implemented and deployed it at University Hospital Erlangen-Nürnberg, generated 3 sample data sets, and analyzed the available data. Results We imported real-world patient data from 2016 to 2018 into a standard PSQL database, generating a dataset of approximately 35.5 million FHIR resources, including “Patient,” “Encounter,” “Condition” (diagnoses specified using International Classification of Diseases codes), “Procedure,” and “Observation” (laboratory test results). We then integrated the developed preprocessing service with the PSQL database and the locally installed web-based KETOS analysis platform. 
Advanced statistical analyses proved feasible with the developed framework in 3 clinically relevant scenarios (data-driven establishment of hemoglobin reference intervals, assessment of anemia prevalence in patients with cancer, and investigation of the adverse effects of drugs). Conclusions This study shows how the standard open source database PSQL can be used to store FHIR data and be integrated with a specifically developed preprocessing and analysis framework. This enables dataset generation with advanced medical criteria as well as subsequent statistical analysis. The web-based preprocessing service can be deployed locally at the hospital level, protecting patients’ privacy while integrating with existing open source data analysis tools currently being developed across Germany.
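The preprocessing step the abstract describes, selecting FHIR resources by medical criteria and flattening them into analysis-ready value lists, can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the resource shapes follow the FHIR R4 Observation structure, the LOINC code and values are invented examples, and in the actual framework the equivalent filtering runs as queries over PostgreSQL's binary JSON (JSONB) storage.

```python
import json

# Toy FHIR resources, as they might be stored as JSONB rows.
# Shapes follow FHIR R4; the codes and values here are invented examples.
resources = [
    json.dumps({"resourceType": "Observation",
                "code": {"coding": [{"system": "http://loinc.org", "code": "718-7"}]},
                "valueQuantity": {"value": 13.8, "unit": "g/dL"}}),
    json.dumps({"resourceType": "Observation",
                "code": {"coding": [{"system": "http://loinc.org", "code": "2345-7"}]},
                "valueQuantity": {"value": 5.2, "unit": "mmol/L"}}),
    json.dumps({"resourceType": "Patient", "id": "p1"}),
]

def filter_observations(rows, loinc_code):
    """Select Observation resources carrying a given LOINC code and
    return (value, unit) pairs ready for statistical analysis."""
    out = []
    for row in rows:
        r = json.loads(row)
        if r.get("resourceType") != "Observation":
            continue
        codes = [c.get("code") for c in r.get("code", {}).get("coding", [])]
        if loinc_code in codes:
            q = r.get("valueQuantity", {})
            out.append((q.get("value"), q.get("unit")))
    return out

# Extract only the observations matching one laboratory code.
print(filter_observations(resources, "718-7"))
```

Inside PSQL itself, a comparable filter could be expressed with JSONB operators (e.g. `WHERE resource->>'resourceType' = 'Observation'`), letting the database do the selection before any data leave it.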



10.2196/18735 ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. e18735
Author(s):  
Yun William Yu ◽  
Griffin M Weber

Background Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks. Objective This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed. Methods We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques. Results In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. 
This was orders of magnitude better than other approaches that guarantee the exact answer. Conclusions Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.
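The core idea above, that merging per-site HyperLogLog sketches deduplicates patients who appear at multiple hospitals, can be illustrated with a minimal sketch. This is a bare-bones HyperLogLog in Python for illustration only: the obfuscation layer (hashing, masking, homomorphic encryption) described in the abstract is omitted, and the register count and patient identifiers are invented.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch with 2^b registers."""
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: low b bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)
        w = h >> self.b
        rank = (64 - self.b) - w.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Register-wise max: merging site sketches automatically avoids
        double counting patients seen at more than one site."""
        for i in range(self.m):
            self.registers[i] = max(self.registers[i], other.registers[i])

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            e = self.m * math.log(self.m / zeros)
        return e

# Two hospitals with 1000 shared patients: true union is 5000, naive sum is 6000.
site_a, site_b = HyperLogLog(), HyperLogLog()
for i in range(3000):
    site_a.add(f"patient-{i}")
for i in range(2000, 5000):
    site_b.add(f"patient-{i}")
site_a.merge(site_b)
print(round(site_a.estimate()))  # close to the true union of 5000
```

Each site only ships its small register array (kilobytes, matching the scale reported above) to the hub, which merges and estimates; the paper's protocol additionally obfuscates those registers before they leave the site.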


2012 ◽  
Vol 23 (1) ◽  
pp. 191-198 ◽  
Author(s):  
Dal-Ho Kim ◽  
Im-Hee Shin ◽  
Jung-Youn Choe ◽  
Sang-Gyung Kim ◽  
Chun-Woo Park ◽  
...  

2019 ◽  
pp. 1-16 ◽  
Author(s):  
Veli-Matti Isoviita ◽  
Liina Salminen ◽  
Jimmy Azar ◽  
Rainer Lehtonen ◽  
Pia Roering ◽  
...  

PURPOSE We have created a cloud-based machine learning system (CLOBNET) that is an open-source, lean infrastructure for electronic health record (EHR) data integration and is capable of extract, transform, and load (ETL) processing. CLOBNET enables comprehensive analysis and visualization of structured EHR data. We demonstrate the utility of CLOBNET by predicting primary therapy outcomes of patients with high-grade serous ovarian cancer (HGSOC) on the basis of EHR data. MATERIALS AND METHODS CLOBNET is built using open-source software to make data preprocessing, analysis, and model training user friendly. The source code of CLOBNET is available on GitHub. The HGSOC data set was based on a prospective cohort of 208 patients with HGSOC who were treated at Turku University Hospital, Finland, from 2009 to 2019 and for whom comprehensive clinical and EHR data were available. RESULTS We trained machine learning (ML) models using clinical data, including a herein developed dissemination score that quantifies the disease burden at the time of diagnosis, to identify patients with progressive disease (PD) or a complete response (CR) on the basis of RECIST (version 1.1). The best performance was achieved with a logistic regression model, which resulted in an area under the receiver operating characteristic curve (AUROC) of 0.86, with a specificity of 73% and a sensitivity of 89%, when distinguishing patients who experienced PD from those with a CR. CONCLUSION We have developed an open-source computational infrastructure, CLOBNET, that enables effective and rapid analysis of EHR and other clinical data. Our results demonstrate that CLOBNET allows predictions to be made on the basis of EHR data to address clinically relevant questions.
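The reported AUROC of 0.86 summarizes how well the model's scores rank PD patients above CR patients. A minimal sketch of computing AUROC from labels and predicted scores via the Mann-Whitney rank identity follows; the data points are invented, and score ties are ignored for brevity.

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U identity: the probability that a
    randomly chosen positive case outranks a randomly chosen negative."""
    pairs = sorted(zip(scores, labels))        # rank all cases by score
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of 1-based ranks of the positive cases.
    rank_sum = sum(rank for rank, (_, y) in enumerate(pairs, start=1) if y == 1)
    u = rank_sum - n_pos * (n_pos + 1) / 2     # Mann-Whitney U statistic
    return u / (n_pos * n_neg)

# Invented example: two positives, two negatives.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUROC of 0.5 would mean the scores rank patients no better than chance; 1.0 would mean every PD patient scores above every CR patient.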




1993 ◽  
Vol 32 (05) ◽  
pp. 365-372 ◽  
Author(s):  
T. Timmeis ◽  
J. H. van Bemmel ◽  
E. M. van Mulligen

Abstract Results are presented of the user evaluation of an integrated medical workstation for support of clinical research. Twenty-seven users were recruited from the medical and scientific staff of the University Hospital Dijkzigt, the Faculty of Medicine of the Erasmus University Rotterdam, and other Dutch medical institutions; all were given a written, self-contained tutorial. Subsequently, an experiment was done in which six clinical data analysis problems had to be solved and an evaluation form was filled out. The aim of this user evaluation was to obtain insight into the benefits of integration for support of clinical data analysis for clinicians and biomedical researchers. The problems were divided into two sets, with gradually more complex problems. In the first set, users were guided in a stepwise fashion to solve the problems. In the second set, each stepwise problem had an open counterpart. During the evaluation, the workstation continuously recorded the user’s actions. From these results, significant differences became apparent between clinicians and non-clinicians for correctness (means 54% and 81%, respectively, p = 0.04), completeness (means 64% and 88%, respectively, p = 0.01), and number of problems solved (means 67% and 90%, respectively, p = 0.02). These differences were absent for the stepwise problems. Physicians tended to skip more problems than biomedical researchers. No statistically significant differences were found between users with and without clinical data analysis experience for correctness (means 74% and 72%, respectively, p = 0.95) and completeness (means 82% and 79%, respectively, p = 0.40). It appeared that various clinical research problems can be solved easily with support of the workstation; the results of this experiment can be used as guidance for the development of the successor of this prototype workstation and serve as a reference for the assessment of next versions.

