Analyses of Original and Computationally-Derived Electronic Health Record Data: The National COVID Cohort Collaborative (Preprint)
Background: Collaborators can generate and share synthetic data to help answer critical research questions about the COVID-19 pandemic. Computationally-derived (“synthetic”) data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record (EHR) data.
Objectives: To compare the results of analyses using synthetic derivatives with those of analyses using the original data, both downloaded from a big-data platform with data-synthesizing capabilities (MDClone Ltd., Beer Sheva, Israel), in order to assess the strengths and limitations of leveraging computationally-derived data for research purposes.
Methods: We used the National COVID Cohort Collaborative’s (N3C) instance of MDClone, comprising EHR data from 34 N3C institutional partners. We tested three use cases: (1) exploring the distributions of key features of the COVID-positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-related measures and outcomes, and constructing their respective epidemic curves. We compared the results of analyses using synthetic derivatives with those using the original data by means of traditional statistics, machine learning approaches, and temporal and spatial representations of the data.
Results: For each use case, the results of the synthetic data analyses successfully mimicked those of the original data: the distributions of the data were similar, and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded nearly identical results overall, there were exceptions. These included an odds ratio that fell on either side of the null in multivariable analyses (0.97 versus 1.01) and epidemic curves constructed for zip codes with low population counts.
Discussion & Conclusion: This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.
Clinical Trial: N/A
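The kind of comparison described in the Methods and Results can be illustrated with a minimal sketch. This is not the authors' analysis pipeline: the function names, the choice of the standardized mean difference as a similarity measure, and the sample ages are all assumptions made for illustration. Only the two odds ratios (0.97 and 1.01) come from the abstract.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_mean_difference(original, synthetic):
    """Difference in means scaled by the pooled standard deviation.

    Hypothetical helper: the abstract does not say which statistic
    was used to compare distributions.
    """
    pooled_sd = sqrt((stdev(original) ** 2 + stdev(synthetic) ** 2) / 2)
    diff = mean(synthetic) - mean(original)
    if pooled_sd == 0:
        # Both samples are constant: identical if the means match.
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled_sd


def straddles_null(or_a, or_b, null=1.0):
    """True when two odds ratios lie on opposite sides of the null value."""
    return (or_a - null) * (or_b - null) < 0


# Made-up ages for illustration only; these are not N3C data.
original_ages = [34, 51, 47, 62, 29, 58, 45, 70]
synthetic_ages = [36, 49, 48, 60, 31, 57, 44, 68]

smd = standardized_mean_difference(original_ages, synthetic_ages)

# Odds ratios reported in the abstract for the multivariable analysis.
opposite_sides = straddles_null(0.97, 1.01)

print(f"standardized mean difference: {smd:.3f}")
print(f"odds ratios on opposite sides of the null: {opposite_sides}")
```

A check like `straddles_null` flags cases where two estimates are numerically close yet point in opposite directions, which is the situation the Results describe for the 0.97 versus 1.01 odds ratio.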