Improving Data Transparency and Accessibility in the Research Community through the Construction of Accurately Simulated Time-to-Event Datasets

Abstract Background: The lack of data and statistical code published alongside journal articles is a significant barrier to open scientific discourse and to the reproducibility of research. Information governance restrictions inhibit the active dissemination of individual-level data to accompany published manuscripts. Realistic, accurate synthetic time-to-event data can accelerate methodological developments in survival analysis and beyond by enabling researchers to access and test published methods using data similar to those on which they were developed. Methods: This paper presents methods to accurately replicate the covariate patterns and survival times found in real-world datasets using simulation techniques, without making individual patients identifiable. We model the joint covariate distribution of the original data using covariate-specific sequential conditional regression models, then fit a complex flexible parametric survival model from which to simulate survival times conditional on individual covariate patterns. We recreate the administrative censoring mechanism using the last observed follow-up date information from the initial dataset. Metrics for evaluating the accuracy of the synthetic data, and the non-identifiability of individuals from the original dataset, are presented. Results: We successfully create a synthetic version of an example colon cancer dataset of 9064 patients that aims to show good similarity to both the covariate distributions and the survival times of the original data, without containing any exact information from the original data, therefore allowing it to be published openly alongside research. Conclusions: We evaluate the effectiveness of the simulation methods for constructing synthetic data, and provide evidence that it is almost impossible that a given patient from the original data could be identified from their individual unique date information. Simulated datasets using this methodology could be made available alongside published research without breaching data privacy protocols, and allow for data and code to be made available alongside methodological or applied manuscripts to greatly improve the transparency and accessibility of medical research.
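
The pipeline described in this abstract (sequential covariate models, survival times drawn conditional on covariates, then administrative censoring) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes a Weibull proportional hazards model as a simple stand-in for the flexible parametric survival model, a single common study end date with staggered entry as the censoring mechanism, and made-up coefficients. The function name `simulate_patient` and all variable names are hypothetical.

```python
import math
import random


def simulate_patient(rng, study_end=10.0):
    """Return one synthetic patient; every coefficient here is illustrative."""
    # Sequential conditional covariate models: sex, then age given sex,
    # then stage given age.
    female = rng.random() < 0.5
    age = rng.gauss(68.0 if female else 66.0, 10.0)
    p_late = 1.0 / (1.0 + math.exp(-(-1.0 + 0.02 * (age - 65.0))))
    late_stage = rng.random() < p_late

    # Weibull proportional hazards stand-in for the survival model:
    # S(t | x) = exp(-lam * t**gamma * exp(xb)), inverted at u in (0, 1].
    lam, gamma = 0.08, 1.2
    xb = 0.03 * (age - 65.0) + 0.9 * late_stage - 0.1 * female
    u = 1.0 - rng.random()
    event_time = (-math.log(u) / (lam * math.exp(xb))) ** (1.0 / gamma)

    # Administrative censoring: follow-up stops at the study end date.
    entry = rng.uniform(0.0, study_end)
    max_follow_up = study_end - entry
    return {
        "female": female,
        "age": age,
        "late_stage": late_stage,
        "time": min(event_time, max_follow_up),
        "event": event_time <= max_follow_up,
    }


rng = random.Random(2024)
synthetic = [simulate_patient(rng) for _ in range(1000)]
```

In a real application the covariate models and the survival model would be fitted to the original data first; only the fitted parameters, never the original records, would feed the simulation.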


Application of Bayesian networks to generate synthetic health data

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa303 ◽

2020 ◽

Author(s):

Dhamanpreet Kaur ◽

Matthew Sobiesk ◽

Shubham Patil ◽

Jin Liu ◽

Puran Bhagat ◽

...

Keyword(s):

Machine Learning ◽

Bayesian Networks ◽

Data Privacy ◽

Statistical Tests ◽

Synthetic Data ◽

Original Data ◽

Health Data ◽

Minimal Risk ◽

Data Types ◽

Automated Method

Abstract Objective This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data. Materials and Methods We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data. Results Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules. Discussion Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools. Conclusion We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.
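
Simulating records from a Bayesian network amounts to ancestral sampling: each variable is drawn in topological order from its conditional probability table given the values already drawn for its parents. The sketch below assumes a tiny hand-specified network with invented probabilities. The structure is not learned from data, which the study itself does, and the names `NETWORK` and `sample_record` are hypothetical.

```python
import random

# Hypothetical network: node -> (parents, CPT), where the CPT maps a tuple of
# parent values to P(node is True). Nodes are listed in topological order.
NETWORK = {
    "age_over_60": ((), {(): 0.35}),
    "diabetes": (("age_over_60",), {(False,): 0.08, (True,): 0.20}),
    "heart_disease": (
        ("age_over_60", "diabetes"),
        {
            (False, False): 0.03,
            (False, True): 0.10,
            (True, False): 0.12,
            (True, True): 0.30,
        },
    ),
}


def sample_record(rng):
    """Draw one synthetic record by ancestral sampling."""
    record = {}
    for node, (parents, cpt) in NETWORK.items():
        # Parents were sampled earlier because dicts keep insertion order.
        p_true = cpt[tuple(record[parent] for parent in parents)]
        record[node] = rng.random() < p_true
    return record


rng = random.Random(7)
records = [sample_record(rng) for _ in range(5000)]
```

Because every synthetic value is drawn from a fitted probability table rather than copied from a patient, the generated rows need not correspond to any real individual, although disclosure risk still has to be measured, as the abstract notes.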


Generation of Synthetic Data with Conditional Generative Adversarial Networks

Logic Journal of IGPL ◽

10.1093/jigpal/jzaa059 ◽

2020 ◽

Author(s):

Belén Vega-Márquez ◽

Cristina Rubio-Escudero ◽

Isabel Nepomuceno-Chamorro

Keyword(s):

Research Work ◽

Synthetic Data ◽

Original Data ◽

Classification Problem ◽

Generative Adversarial Networks ◽

Data Generation ◽

Generative Adversarial Network ◽

Adversarial Network ◽

Adversarial Networks ◽

Original Dataset

Abstract The generation of synthetic data is becoming a fundamental task in the daily life of any organization due to the new data protection laws that are emerging. Because of the rise in the use of Artificial Intelligence, one of the most recent proposals to address this problem is the use of Generative Adversarial Networks (GANs). These types of networks have demonstrated a great capacity to create synthetic data with very good performance. The goal of synthetic data generation is to create data that will perform similarly to the original dataset for many analysis tasks, such as classification. The problem with GANs is that, in a classification problem, they do not take class labels into account when generating new data; the label is treated as any other attribute. This research work has focused on the creation of new synthetic data from datasets with different characteristics with a Conditional Generative Adversarial Network (CGAN). CGANs are an extension of GANs where the class label is taken into account when the new data is generated. The performance of our results has been measured in two different ways: firstly, by comparing the results obtained with classification algorithms, both in the original datasets and in the data generated; secondly, by checking that the correlation between the original data and those generated is minimal.


Privacy-preserving generative deep neural networks support clinical data sharing

10.1101/159756 ◽

2017 ◽

Cited By ~ 20

Author(s):

Brett K. Beaulieu-Jones ◽

Zhiwei Steven Wu ◽

Chris Williams ◽

Ran Lee ◽

Sanjeev P. Bhavnani ◽

...

Keyword(s):

Neural Networks ◽

Data Sharing ◽

Deep Neural Networks ◽

Differential Privacy ◽

Synthetic Data ◽

Scientific Progress ◽

Patient Privacy ◽

Individual Level ◽

Original Dataset ◽

Level Data

Abstract Background: Data sharing accelerates scientific progress but sharing individual level data while preserving patient privacy presents a barrier. Methods and Results: Using pairs of deep neural networks, we generated simulated, synthetic “participants” that closely resemble participants of the SPRINT trial. We showed that such paired networks can be trained with differential privacy, a formal privacy framework that limits the likelihood that queries of the synthetic participants’ data could identify a real participant in the trial. Machine-learning predictors built on the synthetic population generalize to the original dataset. This finding suggests that the synthetic data can be shared with others, enabling them to perform hypothesis-generating analyses as though they had the original trial data. Conclusions: Deep neural networks that generate synthetic participants facilitate secondary analyses and reproducible investigation of clinical datasets by enhancing data sharing while preserving participant privacy.


Synthetic data use: exploring use cases to optimise data utility

Discover Artificial Intelligence ◽

10.1007/s44163-021-00016-y ◽

2021 ◽

Vol 1 (1) ◽

Author(s):

Stefanie James ◽

Chris Harbron ◽

Janice Branson ◽

Mimmi Sundler

Keyword(s):

Pharmaceutical Industry ◽

Data Privacy ◽

Synthetic Data ◽

Use Cases ◽

Data Utility ◽

Original Dataset ◽

Simulation Based ◽

Synthetic Datasets ◽

Future Direction

Abstract Synthetic data is a rapidly evolving field with growing interest from multiple industry stakeholders and European bodies. In particular, the pharmaceutical industry is starting to realise the value of synthetic data which is being utilised more prevalently as a method to optimise data utility and sharing, ultimately as an innovative response to the growing demand for improved privacy. Synthetic data is data generated by simulation, based upon and mirroring properties of an original dataset. Here, with supporting viewpoints from across the pharmaceutical industry, we set out to explore use cases for synthetic data across seven key but relatable areas for optimising data utility for improved data privacy and protection. We also discuss the various methods which can be used to produce a synthetic dataset and availability of metrics to ensure robust quality of generated synthetic datasets. Lastly, we discuss the potential merits, challenges and future direction of synthetic data within the pharmaceutical industry and the considerations for this privacy enhancing technology.


G-Tric: generating three-way synthetic datasets with triclustering solutions

BMC Bioinformatics ◽

10.1186/s12859-020-03925-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

João Lobo ◽

Rui Henriques ◽

Sara C. Madeira

Keyword(s):

State Of The Art ◽

Synthetic Data ◽

Ground Truth ◽

Real Data ◽

Three Dimensions ◽

Additional Advantage ◽

Urban Dynamics ◽

Data Generator ◽

Real World Datasets ◽

Synthetic Datasets

Abstract Background Three-way data have started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real 3-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric’s potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.


Risk Projection for Time-to-event Outcome Leveraging Summary Statistics With Source Individual-level Data

Journal of the American Statistical Association ◽

10.1080/01621459.2021.1895810 ◽

2021 ◽

pp. 1-34

Author(s):

Jiayin Zheng ◽

Yingye Zheng ◽

Li Hsu

Keyword(s):

Summary Statistics ◽

Time To Event ◽

Individual Level ◽

Level Data ◽

Risk Projection


Bayesian Classifier for Sparsity-Promoting Feature Selection

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001415500226 ◽

2015 ◽

Vol 29 (06) ◽

pp. 1550022 ◽

Cited By ~ 1

Author(s):

Danlei Xu ◽

Lan Du ◽

Hongwei Liu ◽

Penghui Wang

Keyword(s):

Feature Selection ◽

Synthetic Data ◽

Original Data ◽

Radar Data ◽

Bayesian Classifier ◽

Classification Model ◽

Data Sets ◽

Data Set ◽

Classification Boundary ◽

Nonlinear Mappings

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, where a set of nonlinear mappings of the original data is performed as a pre-processing step. The linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct the nonlinear classification boundary, but also realize feature selection for the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of the Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings in our model, respectively. We derive the Variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results based on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability of our method and classification accuracy comparable with that of some other existing classifiers.


The Gender Outcomes International Group: to Further Well-being Development (Going-fwd) Methodology on Identification and Inclusion of Gender Factors in Retrospective Cohort Studies

10.21203/rs.3.rs-51246/v1 ◽

2020 ◽

Author(s):

Valeria Raparelli ◽

Colleen M. Norris ◽

Uri Bender ◽

Maria Trinidad Herrero ◽

Alexandra Kautzky-Willer ◽

...

Keyword(s):

Data Structure ◽

Meta Analysis ◽

Synthetic Data ◽

Well Being ◽

Pooled Analysis ◽

Research Network ◽

Local Analysis ◽

Related Factors ◽

Individual Level ◽

Definition Of

Abstract Background: Gender refers to the socially constructed roles, behaviors, expressions, and identities of girls, women, boys, men, and gender diverse people. It influences self-perception, individuals’ actions and interactions, as well as the distribution of power and resources in society. Gender-related factors are seldom assessed as determinants of health outcomes, despite their powerful contribution. Methods: Investigators of the GOING-FWD project developed a standard methodology applicable to observational studies to retrospectively identify gender-related factors and assess their relationship to outcomes, and applied this method to selected cohorts of non-communicable chronic diseases from Austria, Canada, Spain, and Sweden. Results: The following multistep process was applied. Step 1 (Identification of Gender-related Variables): Based on the gender framework of the Women Health Research Network (i.e. gender identity, role, relations, and institutionalized gender), and available literature for a certain disease, an optimal “wish-list” of gender-related variables/factors was created and discussed by experts. Step 2 (Definition of Outcomes): Each of the cohort data dictionaries was screened for clinical and patient-relevant outcomes, using the ICHOM framework. Step 3 (Building of Feasible Final List): A cross-validation between the gender-related and outcome variables available per database and the “wish-list” was performed. Step 4 (Retrospective Data Harmonization): The harmonization potential of variables was evaluated. Step 5 (Definition of Data Structure and Analysis): Depending on the database data structure, the following analytic strategies were identified: (1) local analysis of data not transferable followed by a meta-analysis combining study-level estimates; (2) centrally performed federated analysis of anonymized data, with the individual-level participant data remaining on local servers; (3) synthesizing the data locally and performing a pooled analysis on the synthetic data; and (4) central analysis of pooled transferable data. Conclusion: The application of the GOING-FWD systematic multistep approach can help guide investigators to analyze gender and its impact on outcomes in previously collected data.


Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

Frontiers in Big Data ◽

10.3389/fdata.2021.679939 ◽

2021 ◽

Vol 4 ◽

Author(s):

Michael Platzer ◽

Thomas Reutterer

Keyword(s):

Mixed Type ◽

Synthetic Data ◽

Training Data ◽

Privacy Risk ◽

Individual Level ◽

Empirical Assessment ◽

Model Free ◽

Private Data ◽

Synthetic Datasets ◽

The Individual

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured by statistical distances between lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distance to the closest record with respect to the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available open-source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
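
The privacy check described here, comparing each synthetic record's distance to its closest training record with its distance to its closest holdout record, can be illustrated with a small sketch. It is not the authors' published implementation: the mixed-type distance below (absolute difference for numeric fields assumed to be pre-scaled to [0, 1], 0/1 mismatch for string fields) and the function names are assumptions made for illustration.

```python
def distance(a, b):
    """Simple mixed-type distance between two records (tuples)."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str):
            total += 0.0 if x == y else 1.0  # categorical mismatch
        else:
            total += abs(x - y)  # numeric field, assumed scaled to [0, 1]
    return total


def distance_to_closest_record(record, dataset):
    """Smallest distance from one record to any record in the dataset."""
    return min(distance(record, other) for other in dataset)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic records nearer to training than to holdout.

    Ties count as one half, so a value near 0.5 suggests the synthetic
    data are no closer to the training records than to unseen records.
    """
    score = 0.0
    for record in synthetic:
        d_train = distance_to_closest_record(record, training)
        d_hold = distance_to_closest_record(record, holdout)
        if d_train < d_hold:
            score += 1.0
        elif d_train == d_hold:
            score += 0.5
    return score / len(synthetic)


training = [(0.25, "a"), (0.75, "b")]
holdout = [(0.5, "a"), (1.0, "b")]
synthetic = [(0.25, "a"), (1.0, "b"), (0.375, "a")]
share = share_closer_to_training(synthetic, training, holdout)
```

A share well above 0.5 would indicate that the synthesizer reproduces individual training records more closely than it reproduces comparable unseen records.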


Privacy Preservation using (L, D) Inference Model Based on Dependency Identification Information Gain

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f1196.0986s319 ◽

2019 ◽

Vol 8 (6S3) ◽

pp. 1170-1173

Keyword(s):

Data Mining ◽

Information Gain ◽

Original Data ◽

Perturbation Approach ◽

Sensitive Information ◽

Functional Dependencies ◽

Inference Model ◽

Data Set ◽

Data Mining Techniques ◽

Original Dataset

With improvements in information processing and memory capacity, vast amounts of data are collected for various data analysis purposes. Data mining techniques are used to extract useful knowledge. In the process of extracting data with data mining techniques, the data become publicly exposed, which leads to breaches of private data. Privacy-preserving data mining is used to protect sensitive information from unwanted or unsanctioned disclosure. In this paper, we analyse the problem of discovering similarity checks for functional dependencies in a given dataset such that applying the (l, d) inference algorithm with generalization can anonymise the microdata without loss in utility. [8] This work presents a functional-dependency-based perturbation approach that hides sensitive information from the user by applying the (l, d) inference model to the dependency attributes based on information gain. This approach works on both categorical and numerical attributes. The perturbed dataset does not affect the original dataset; it maintains the same or very comparable patterns as the original dataset. Hence the utility of the application is always high when compared to other data mining techniques. The accuracy of the original and perturbed datasets is compared and analysed using data mining classification tools and algorithms.
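
The abstract does not define the (l, d) inference model itself, so only the information gain step it relies on is sketched here: the reduction in entropy of a sensitive attribute once another attribute is known. The rows, attribute names, and function names below are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of `target` minus its expected entropy after splitting on `attr`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


rows = [
    {"ward": "A", "shift": "day", "diagnosis": "flu"},
    {"ward": "A", "shift": "night", "diagnosis": "flu"},
    {"ward": "B", "shift": "day", "diagnosis": "asthma"},
    {"ward": "B", "shift": "night", "diagnosis": "asthma"},
]
gain_ward = information_gain(rows, "ward", "diagnosis")
gain_shift = information_gain(rows, "shift", "diagnosis")
```

In this toy table `ward` fully determines `diagnosis` while `shift` carries no information about it, so an approach of this kind would single out `ward` as the attribute that needs perturbing.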
