Improving Data Transparency and Accessibility in the Research Community through the Construction of Accurately Simulated Time-to-Event Datasets
Abstract

Background
The limited availability of data and statistical code alongside published journal articles is a significant barrier to open scientific discourse and to the reproducibility of research. Information governance restrictions inhibit the dissemination of individual-level data to accompany published manuscripts. Realistic, accurate synthetic time-to-event data can accelerate methodological developments in survival analysis and beyond by enabling researchers to access and test published methods using data similar to those on which the methods were developed.

Methods
This paper presents methods to accurately replicate the covariate patterns and survival times found in real-world datasets using simulation techniques, without making individual patients identifiable. We model the joint covariate distribution of the original data using covariate-specific sequential conditional regression models, then fit a complex flexible parametric survival model from which to simulate survival times conditional on individual covariate patterns. We recreate the administrative censoring mechanism using the last observed follow-up date information from the original dataset. Metrics for evaluating the accuracy of the synthetic data, and the non-identifiability of individuals from the original dataset, are presented.

Results
We successfully create a synthetic version of an example colon cancer dataset consisting of 9064 patients, which aims to show good similarity to both the covariate distributions and the survival times of the original data without containing any exact information from the original data, therefore allowing it to be published openly alongside research.

Conclusions
We evaluate the effectiveness of the simulation methods for constructing synthetic data, and provide evidence that it is almost impossible for a given patient from the original data to be identified from their individual unique date information.
Simulated datasets produced using this methodology could be published alongside methodological or applied manuscripts, together with the accompanying code, without breaching data privacy protocols, greatly improving the transparency and accessibility of medical research.
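
The three simulation steps summarised in the Methods (sequential conditional covariate models, survival times conditional on covariates, administrative censoring) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two covariates, all coefficients, the use of a Weibull proportional hazards model in place of the flexible parametric survival model, uniformly distributed diagnosis dates, and a single common end-of-follow-up date. The names `simulate_patient` and `simulate_cohort` are hypothetical.

```python
import math
import random

ADMIN_CENSOR_DATE = 10.0  # assumed common end of follow-up, in years since study start


def simulate_patient(rng):
    """Simulate one synthetic patient (illustrative values only)."""
    # Step 1: covariates are drawn sequentially, each conditional on
    # the covariates already simulated.
    age = rng.gauss(70.0, 10.0)
    # P(female | age) from a logistic model with made-up coefficients.
    p_female = 1.0 / (1.0 + math.exp(-(-0.5 + 0.01 * (age - 70.0))))
    female = 1 if rng.random() < p_female else 0

    # Step 2: survival time conditional on covariates, by inverting
    # S(t | x) = exp(-lam * t**gamma * exp(xb)) at a uniform draw u.
    # A Weibull model stands in for the flexible parametric model.
    lam, gamma = 0.1, 1.2
    xb = 0.03 * (age - 70.0) - 0.2 * female
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
    true_time = (-math.log(u) / (lam * math.exp(xb))) ** (1.0 / gamma)

    # Step 3: administrative censoring. Follow-up ends at a fixed
    # calendar date, so potential follow-up depends on diagnosis date.
    diagnosis = rng.uniform(0.0, ADMIN_CENSOR_DATE)
    max_follow_up = ADMIN_CENSOR_DATE - diagnosis
    event = 1 if true_time <= max_follow_up else 0
    time = min(true_time, max_follow_up)

    return {"age": age, "female": female, "time": time, "event": event}


def simulate_cohort(n, seed=0):
    """Simulate n synthetic patients reproducibly from a seed."""
    rng = random.Random(seed)
    return [simulate_patient(rng) for _ in range(n)]


if __name__ == "__main__":
    cohort = simulate_cohort(9064, seed=2024)
    n_events = sum(p["event"] for p in cohort)
    print(f"{len(cohort)} synthetic patients, {n_events} observed events")
```

In practice the coefficients would be estimated from the original data rather than fixed by hand, and the synthetic output would then be checked against the original using the accuracy and non-identifiability metrics the paper describes.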