Anonymiced Shareable Data: Using mice to Create and Analyze Multiply Imputed Synthetic Datasets

Psych ◽  
2021 ◽  
Vol 3 (4) ◽  
pp. 703-716
Author(s):  
Thom Benjamin Volker ◽  
Gerko Vink

Synthetic datasets allow research data to be disseminated while protecting the privacy and confidentiality of respondents. Generating and analyzing synthetic datasets is straightforward, yet a synthetic data analysis pipeline is seldom adopted by applied researchers. We outline a simple procedure for generating and analyzing synthetic datasets with the multiple imputation software mice (Version 3.13.15) in R. We demonstrate through simulations that analyses of the synthetic data yield unbiased and valid inferences and that the synthetic records cannot be distinguished from the true data records. The ease of synthesizing data with mice, along with the validity of inferences obtained through this procedure, opens up a wealth of possibilities for data dissemination and for further research on initially private data.
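The paper's workflow is implemented in R with mice; purely as an illustration of the generate-analyze-pool loop the abstract describes (and not the mice algorithm itself), a minimal Python sketch might look as follows, with a toy parametric synthesizer standing in for mice's fully conditional specification:

```python
# Hedged sketch of the synthesize -> analyze -> pool pipeline; the synthesizer below is a
# toy stand-in, not mice's fully conditional specification.
import numpy as np

rng = np.random.default_rng(42)

# "True" data: y depends linearly on x
n = 1000
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)

def synthesize(x, y, rng):
    """Draw one fully synthetic copy: bootstrap x, then draw y from the model fitted to the real data."""
    slope, intercept = np.polyfit(x, y, 1)
    resid_sd = np.std(y - (intercept + slope * x))
    x_syn = rng.choice(x, size=len(x), replace=True)
    y_syn = intercept + slope * x_syn + rng.normal(0, resid_sd, len(x))
    return x_syn, y_syn

m = 20                                       # number of synthetic datasets
estimates = []
for _ in range(m):
    xs, ys = synthesize(x, y, rng)
    est_slope, _ = np.polyfit(xs, ys, 1)     # analyst's model, fitted to synthetic data only
    estimates.append(est_slope)

# The pooled point estimate averages the m synthetic analyses; its variance would be
# obtained with the synthetic-data combining rules rather than by naive averaging.
print(f"pooled slope: {np.mean(estimates):.3f} (true value 1.5)")
```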

2021 ◽  
Vol 4 ◽  
Author(s):  
Michael Platzer ◽  
Thomas Reutterer

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured via statistical distances between lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating individual-level distances to the closest record in the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare them to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available as open source. The results highlight the need to systematically assess both the fidelity and the privacy of this emerging class of synthetic data generators.
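As a rough illustration of the two metrics (not the authors' released Python implementation), the sketch below computes a marginal-distribution distance for numeric columns and the distance to the closest record (DCR) against both training and holdout data; the function names, binning, and use of Euclidean distance are assumptions, and mixed-type handling is omitted:

```python
# Minimal sketch of (i) fidelity via distances between 1-way marginals and
# (ii) privacy via distance-to-closest-record, compared between training and holdout.
import numpy as np
from scipy.spatial import cKDTree

def marginal_tv_distance(real, synth, bins=20):
    """Mean total-variation distance between per-column histograms (1-way marginals)."""
    dists = []
    for j in range(real.shape[1]):
        lo, hi = real[:, j].min(), real[:, j].max()
        p, _ = np.histogram(real[:, j], bins=bins, range=(lo, hi))
        q, _ = np.histogram(synth[:, j], bins=bins, range=(lo, hi))
        p = p / p.sum()
        q = q / max(q.sum(), 1)
        dists.append(0.5 * np.abs(p - q).sum())
    return float(np.mean(dists))

def dcr(synth, reference):
    """Distance from each synthetic record to its closest record in a reference set."""
    tree = cKDTree(reference)
    d, _ = tree.query(synth, k=1)
    return d

# Usage: if DCRs w.r.t. the training data are not systematically smaller than DCRs
# w.r.t. the holdout data, the synthesizer has not simply memorized training records.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
holdout = rng.normal(size=(500, 4))
synth = rng.normal(size=(500, 4))
print(marginal_tv_distance(train, synth))
print(np.median(dcr(synth, train)), np.median(dcr(synth, holdout)))
```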


2018 ◽  
Author(s):  
Steven M. Bellovin ◽  
Preetam Dutta ◽  
Nathan Reitinger

Sharing is a virtue, instilled in us from childhood. Unfortunately, when it comes to big data—i.e., databases possessing the potential to usher in a whole new world of scientific progress—the legal landscape prefers a hoggish motif. The historic approach to the resulting database–privacy problem has been anonymization, a subtractive technique incurring not only poor privacy results but also lackluster utility. In anonymization's stead, differential privacy arose; it provides better, near-perfect privacy, but is nonetheless subtractive in terms of utility. Today, another solution is coming to the fore: synthetic data. Using the magic of machine learning, synthetic data offers a generative, additive approach—the creation of almost-but-not-quite replica data. In fact, as we recommend, synthetic data may be combined with differential privacy to achieve a best-of-both-worlds scenario. After unpacking the technical nuances of synthetic data, we analyze its legal implications, finding both over- and under-inclusive applications. Privacy statutes either overstate or downplay the potential for synthetic data to leak secrets, inviting ambiguity. We conclude that synthetic data is a valid, privacy-conscious alternative to raw data, but not a cure-all for every situation. In the end, progress in computer science must be met with proper policy in order to move the area of useful data dissemination forward.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract. Background: Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed on real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility of planting triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social domains, with the additional advantage of providing the ground truth (the triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlap). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
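To make the idea of planting ground-truth triclusters concrete, here is an illustrative NumPy sketch (not G-Tric itself); the array sizes, the constant-shift pattern, and the missing-value rate are arbitrary choices for the example:

```python
# Generate a background three-way array and plant one tricluster, keeping the planted
# indices as ground truth for evaluating a triclustering algorithm.
import numpy as np

rng = np.random.default_rng(7)

n_obs, n_feat, n_ctx = 100, 50, 10
data = rng.normal(0.0, 1.0, size=(n_obs, n_feat, n_ctx))   # background distribution

# Plant one constant-pattern tricluster (a subspace with a shifted mean)
obs_idx  = rng.choice(n_obs,  size=15, replace=False)
feat_idx = rng.choice(n_feat, size=8,  replace=False)
ctx_idx  = rng.choice(n_ctx,  size=4,  replace=False)
data[np.ix_(obs_idx, feat_idx, ctx_idx)] = rng.normal(3.0, 0.2, size=(15, 8, 4))

# Optionally degrade data quality, as the generator allows: missing values
mask = rng.random(data.shape) < 0.01
data[mask] = np.nan

ground_truth = {"observations": obs_idx, "features": feat_idx, "contexts": ctx_idx}
```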


2021 ◽  
Author(s):  
Ville N Pimenoff ◽  
Ramon Cleries

Viruses infecting humans are manifold, and several of them cause significant morbidity and mortality. Simulations that create large synthetic datasets from observed multiple viral strain infections in a limited population sample can be a powerful tool to infer significant pathogen occurrence and interaction patterns, particularly when only a limited number of observed data units is available. Here, to demonstrate diverse human papillomavirus (HPV) strain occurrence patterns, we used log-linear models combined with a Bayesian framework for graphical independence network (GIN) analysis; that is, we simulated datasets by modeling the probabilistic associations between observed viral data points, i.e., different viral strain infections in a set of population samples. Our GIN analysis outperformed, in precision, all oversampling methods tested for simulating a large synthetic strain-level prevalence dataset from the observed set of HPV data. Altogether, we demonstrate that network modeling is a potent tool for creating synthetic viral datasets for comprehensive pathogen occurrence and interaction pattern estimation.
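A heavily simplified sketch of the simulation idea follows: the empirical joint distribution over infection profiles stands in for the log-linear / graphical-independence-network model fitted in the paper, and the column names and sample sizes are invented for illustration:

```python
# Model the joint occurrence of binary strain indicators, then draw a much larger
# synthetic sample from the fitted cell probabilities.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Observed sample: rows = individuals, columns = presence/absence of each HPV type
observed = pd.DataFrame(rng.integers(0, 2, size=(200, 4)),
                        columns=["HPV16", "HPV18", "HPV31", "HPV45"])

# Empirical joint distribution over infection profiles (contingency-table cells);
# a log-linear or GIN model would smooth these probabilities instead.
profiles = observed.value_counts(normalize=True)

# Draw a large synthetic dataset from the estimated cell probabilities
n_synth = 100_000
idx = rng.choice(len(profiles), size=n_synth, p=profiles.values)
synthetic = pd.DataFrame(list(profiles.index[idx]), columns=observed.columns)

print(synthetic.mean())   # synthetic strain-level prevalences
```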


Author(s):  
Trivellore E. Raghunathan

Demand for access to data, especially data collected using public funds, is ever growing. At the same time, concerns about the disclosure of the identities of and sensitive information about the respondents providing the data are making the data collectors limit the access to data. Synthetic data sets, generated to emulate certain key information found in the actual data and provide the ability to draw valid statistical inferences, are an attractive framework to afford widespread access to data for analysis while mitigating privacy and confidentiality concerns. The goal of this article is to provide a review of various approaches for generating and analyzing synthetic data sets, inferential justification, limitations of the approaches, and directions for future research. Expected final online publication date for the Annual Review of Statistics, Volume 8 is March 8, 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
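For context, the combining rules underlying the inferential justification discussed in this literature take roughly the following form for fully synthetic data (a hedged summary, not a quotation from the review). With $q^{(l)}$ and $u^{(l)}$ the point estimate and its variance estimate computed from the $l$-th of $m$ synthetic datasets,

$$\bar{q}_m = \frac{1}{m}\sum_{l=1}^{m} q^{(l)}, \qquad b_m = \frac{1}{m-1}\sum_{l=1}^{m}\bigl(q^{(l)}-\bar{q}_m\bigr)^2, \qquad \bar{u}_m = \frac{1}{m}\sum_{l=1}^{m} u^{(l)},$$

and the variance of $\bar{q}_m$ is estimated by $T_f = \bigl(1+\tfrac{1}{m}\bigr)b_m - \bar{u}_m$, with an adjustment when this quantity turns out negative.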


2019 ◽  
Vol 12 (6) ◽  
pp. 3067-3079
Author(s):  
Sebastian J. O'Shea ◽  
Jonathan Crosier ◽  
James Dorsey ◽  
Waldemar Schledewitz ◽  
Ian Crawford ◽  
...  

Abstract. In situ observations from research aircraft and instrumented ground sites are important contributions to developing our collective understanding of clouds and are used to inform and validate numerical weather and climate models. Unfortunately, biases in these datasets may be present, which can limit their value. In this paper, we discuss artefacts which may bias data from a widely used family of instrumentation in the field of cloud physics, optical array probes (OAPs). Using laboratory and synthetic datasets, we demonstrate how greyscale analysis can be used to filter data, constraining the sample volume of the OAP and improving data quality, particularly at small sizes where OAP data are considered unreliable. We apply the new methodology to ambient data from two contrasting case studies: one warm cloud and one cirrus cloud. In both cases the new methodology reduces the concentration of small particles (<60 µm) by approximately an order of magnitude. This significantly improves agreement with a Mie-scattering spectrometer for the liquid case and with a holographic imaging probe for the cirrus case. Based on these results, we make specific recommendations to instrument manufacturers, instrument operators and data processors about the optimal use of greyscale OAPs. The data from monoscale OAPs are unreliable and should not be used for particle diameters below approximately 100 µm.
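As a sketch of how such greyscale-based filtering of OAP particle images might look in code: the idea is that the distribution of greyscale shadow levels within an image indicates whether a particle was in focus, so out-of-focus (poorly sized) particles can be rejected, constraining the sample volume. The specific ratio and threshold below are placeholders, not the criteria recommended in the paper.

```python
# Hedged sketch of a greyscale acceptance test for a single OAP particle image.
import numpy as np

def accept_particle(image: np.ndarray, min_dark_fraction: float = 0.5) -> bool:
    """image: 2-D array of greyscale shadow levels (0 = no shadow, 1/2/3 = increasing obscuration).
    Accept only particles with a sufficiently large fraction of darkest-level pixels,
    used here as a proxy for being within the depth of field."""
    shadowed = np.count_nonzero(image > 0)
    if shadowed == 0:
        return False
    dark = np.count_nonzero(image == 3)
    return dark / shadowed >= min_dark_fraction

# Applying such a filter, and recomputing the (now smaller) effective sample volume,
# is what reduces the spurious concentrations of particles < 60 µm described above.
```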


2020 ◽  
Vol 35 (4) ◽  
pp. 262-275
Author(s):  
Nicola Döbelin

Validating phase quantification procedures for powder X-ray diffraction (XRD) data for implementation in an ISO/IEC 17025 accredited environment has been challenging due to a general lack of suitable certified reference materials. Preparing highly pure and crystalline reference materials, and mixtures thereof, may incur costs that exceed what a profitable and justifiable implementation allows. This study presents a method for validating XRD phase quantifications based on semi-synthetic datasets that drastically reduces the effort of a full method validation. Datasets of nearly pure reference substances are stripped of impurity signals and rescaled to 100% crystallinity, thus eliminating the need to prepare ultra-pure and ultra-crystalline materials. The processed datasets are then combined numerically while preserving all sample- and instrument-characteristic features of the peak profile, thereby creating multi-phase diffraction patterns of precisely known composition. The number of compositions and repetitions is limited only by computational power and storage capacity. These datasets can be used as input files for the phase quantification procedure, from which statistical validation parameters such as precision, accuracy, linearity, and limits of detection and quantification can be determined over a statistically sound number of datasets and compositions.
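The numerical combination step lends itself to a very small sketch; the weighted-sum mixing below, the optional noise term, and the data layout are illustrative assumptions, not the exact procedure used in the study:

```python
# Combine processed single-phase patterns (impurity-stripped, rescaled to 100 % crystallinity)
# into a multi-phase pattern of exactly known composition.
import numpy as np

def mix_patterns(patterns, weights, noise_sd=0.0, rng=None):
    """patterns: {phase name: intensity array on a common 2-theta grid};
    weights: {phase name: weight fraction} (fractions should sum to 1)."""
    rng = rng or np.random.default_rng()
    grid = next(iter(patterns.values()))
    mixture = np.zeros_like(grid, dtype=float)
    for phase, w in weights.items():
        mixture += w * patterns[phase]
    if noise_sd > 0:                      # optional counting-noise-like perturbation
        mixture += rng.normal(0.0, noise_sd, size=mixture.shape)
    return mixture

# Many compositions can be generated in a loop and fed to the quantification procedure
# to estimate accuracy, linearity, and detection/quantification limits.
```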


2013 ◽  
Vol 2013 ◽  
pp. 1-6 ◽  
Author(s):  
Matteo Carrara ◽  
Marco Beccuti ◽  
Fulvio Lazzarato ◽  
Federica Cavallo ◽  
Francesca Cordero ◽  
...  

Background. Gene fusions arising from chromosomal translocations have been implicated in cancer. RNA-seq has the potential to discover such rearrangements that generate functional proteins (chimeras/fusions). Recently, many methods for chimera detection have been published. However, the specificity and sensitivity of those tools have not been extensively investigated in a comparative way. Results. We tested eight fusion-detection tools (FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse, Bellerophontes, ChimeraScan, and TopHat-fusion) to detect fusion events using synthetic and real datasets encompassing chimeras. A comparison run only on synthetic data could generate misleading results, since we found no counterpart in the real dataset. Furthermore, most tools report a very high number of false positive chimeras. In particular, the most sensitive tool, ChimeraScan, reports a large number of false positives that we were able to reduce significantly by devising and applying two filters that remove fusions not supported by junction-spanning reads or encompassing large intronic regions. Conclusions. The discordant results obtained using synthetic and real datasets suggest that synthetic datasets encompassing fusion events may not fully capture the complexity of real RNA-seq experiments. Moreover, fusion-detection tools are still limited in sensitivity or specificity; thus, there is room for further improvement in fusion-finder algorithms.
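The two post-filters might look roughly like the following pandas sketch; the column names and the 100 kb span cutoff are assumptions made for illustration, not the exact settings used in the study:

```python
# Drop fusion calls with no junction-spanning read support and intra-chromosomal calls
# spanning very large (likely intronic) genomic distances.
import pandas as pd

def filter_chimeras(calls: pd.DataFrame,
                    min_spanning_reads: int = 1,
                    max_intronic_span: int = 100_000) -> pd.DataFrame:
    supported = calls["junction_spanning_reads"] >= min_spanning_reads
    # approximate the genomic span of intra-chromosomal candidates
    same_chrom = calls["chrom_5p"] == calls["chrom_3p"]
    span = (calls["breakpoint_3p"] - calls["breakpoint_5p"]).abs()
    not_too_long = ~same_chrom | (span <= max_intronic_span)
    return calls[supported & not_too_long]
```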


2019 ◽  
Author(s):  
Sebastian J. O'Shea ◽  
Jonathan Crosier ◽  
James Dorsey ◽  
Waldemar Schledewitz ◽  
Ian Crawford ◽  
...  

Abstract. In situ observations from research aircraft and instrumented ground sites are important contributions to developing our collective understanding of clouds, and are used to inform and validate numerical weather and climate models. Unfortunately, biases in these datasets may be present, which can limit their value. In this paper, we discuss artefacts which may bias data from a widely used family of instrumentation in the field of cloud physics, Optical Array Probes (OAPs). Using laboratory and synthetic datasets, we demonstrate how greyscale analysis can be used to filter data, constraining the sample volume of the OAP and improving data quality, particularly at small sizes where OAP data are considered unreliable. We apply the new methodology to ambient data from two contrasting case studies: one warm cloud and one cirrus cloud. In both cases the new methodology reduces the concentration of small particles (< 60 µm) by approximately an order of magnitude. This significantly improves agreement with a Mie scattering spectrometer for the liquid case and with a holographic imaging probe for the cirrus case. Based on these results, we make specific recommendations to instrument manufacturers, instrument operators, and data processors about the optimal use of greyscale OAPs. We also raise the issue of bias in OAPs which have no greyscale capability.


2019 ◽  
Author(s):  
Jakob Voigts ◽  
Jonathan P. Newman ◽  
Matthew A. Wilson ◽  
Mark T. Harnett

Abstract. Tetrode arrays are the gold-standard method for neuronal recordings in many studies with behaving animals, especially for deep structures and chronic recordings. Here we outline an improved drive design for use in freely behaving animals. Our design makes use of recently developed technologies to reduce the complexity and build time of the drive while maintaining a low weight. The design also presents an improvement over many existing designs in terms of robustness and ease of use. We describe two variants: a 16-tetrode implant weighing ∼2 g for mice, bats, tree shrews, and similar animals, and a 64-tetrode implant weighing ∼16 g for rats and similar animals. These designs were co-developed and optimized alongside a new class of drive-mounted, feature-rich amplifier boards with ultra-thin RF tethers, as described in an upcoming paper (Newman, Zhang et al., in prep). This design significantly improves the data yield of chronic electrophysiology experiments.

