Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

2021 · Vol 4
Author(s): Michael Platzer, Thomas Reutterer

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets remains an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured via statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating individual-level distances to the closest record with respect to the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we provide strong evidence that the synthesizer has indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available as open source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
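The distance-to-closest-record comparison described above can be sketched in a few lines of Python. The data below are random stand-ins and assume the mixed-type records have already been numerically encoded; the paper's actual encoding and distance metric may differ:

```python
import numpy as np
from scipy.spatial import cKDTree

def dcr(synthetic, reference):
    """Distance to closest record: for each synthetic row, the
    Euclidean distance to its nearest neighbour in `reference`."""
    tree = cKDTree(reference)
    dist, _ = tree.query(synthetic, k=1)
    return dist

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))      # stand-ins for encoded records
holdout = rng.normal(size=(500, 4))
synthetic = rng.normal(size=(500, 4))

dcr_train = dcr(synthetic, train)
dcr_holdout = dcr(synthetic, holdout)

# If the generator generalizes, synthetic records should be (on average)
# no closer to training records than to unseen holdout records.
print(np.median(dcr_train), np.median(dcr_holdout))
```

Comparing the two DCR distributions (rather than a single threshold) is what lets the holdout act as a reference point for "how close is too close."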

2019 · Vol 10 (1) · pp. 20190048
Author(s): Wasiur R. KhudaBukhsh, Boseung Choi, Eben Kenah, Grzegorz A. Rempała

In this paper, we show that solutions to ordinary differential equations describing the large-population limits of Markovian stochastic epidemic models can be interpreted as survival or cumulative hazard functions when analysing data on individuals sampled from the population. We refer to the individual-level survival and hazard functions derived from population-level equations as a survival dynamical system (SDS). To illustrate how population-level dynamics imply probability laws for individual-level infection and recovery times that can be used for statistical inference, we show numerical examples based on synthetic data. In these examples, we show that an SDS analysis compares favourably with a complete-data maximum-likelihood analysis. Finally, we use the SDS approach to analyse data from a 2009 influenza A(H1N1) outbreak at Washington State University.
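The survival-dynamical-system idea can be illustrated with a standard SIR model: the ODE solution for the susceptible fraction, rescaled by its initial value, is read as the survival function of a single individual's infection time. A minimal sketch with illustrative rates (not the paper's parameterization):

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 2.0, 1.0  # illustrative transmission / recovery rates

def sir(t, y):
    s, i = y
    return [-beta * s * i, beta * s * i - gamma * i]

t_grid = np.linspace(0.0, 20.0, 201)
sol = solve_ivp(sir, (0.0, 20.0), [0.99, 0.01], t_eval=t_grid, rtol=1e-8)
s = sol.y[0]

# SDS reading: S_T(t) = s(t) / s(0) is the survival function of one
# individual's infection time T, so it starts at 1 and is non-increasing.
surv = s / s[0]
```

The corresponding hazard of infection at time t is beta * i(t), which is what makes individual-level likelihoods available from the population-level solution.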


2020
Author(s): Rich Colbaugh, Kristin Glass

Abstract. There is great interest in personalized medicine, in which treatment is tailored to the individual characteristics of patients. Achieving the objectives of precision healthcare will require clinically-grounded, evidence-based approaches, which in turn demand rigorous, scalable predictive analytics. Standard strategies for deriving prediction models for medicine involve acquiring ‘training’ data for large numbers of patients, labeling each patient according to the outcome of interest, and then using the labeled examples to learn to predict the outcome for new patients. Unfortunately, labeling individuals is time-consuming and expertise-intensive in medical applications and thus represents a major impediment to practical personalized medicine. We overcome this obstacle with a novel machine learning algorithm that enables individual-level prediction models to be induced from aggregate-level labeled data, which is readily available in many health domains. The utility of the proposed learning methodology is demonstrated by: i.) leveraging US county-level mental health statistics to create a screening tool which detects individuals suffering from depression based upon their Twitter activity; ii.) designing a decision-support system that exploits aggregate clinical trials data on multiple sclerosis (MS) treatment to predict which therapy would work best for the presenting patient; iii.) employing group-level clinical trials data to induce a model able to find those MS patients likely to be helped by an experimental therapy.
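Inducing individual-level predictors from aggregate-level labels is closely related to what the machine-learning literature calls learning from label proportions. A minimal sketch under that framing, with illustrative data and a simple mean-matching loss; this is not the authors' algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in: three "counties" with individual-level features,
# where only each county's positive rate is observed -- no individual labels.
groups = [rng.normal(loc=m, size=(200, 2)) for m in (-1.0, 0.0, 1.0)]
group_rates = np.array([0.1, 0.4, 0.8])  # aggregate-level labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a logistic model by matching each group's mean predicted probability
# to its observed rate (squared-error loss, plain gradient descent).
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(3000):
    gw, gb = np.zeros(2), 0.0
    for X, rate in zip(groups, group_rates):
        p = sigmoid(X @ w + b)
        err = p.mean() - rate
        dp = p * (1.0 - p) / len(X)          # d(mean p) / d(logits)
        gw += 2.0 * err * (X * dp[:, None]).sum(axis=0)
        gb += 2.0 * err * dp.sum()
    w -= lr * gw
    b -= lr * gb

# The fitted model now scores single records, although it was
# trained only on group-level rates.
fitted = np.array([sigmoid(X @ w + b).mean() for X in groups])
```

The point of the sketch is the asymmetry it shares with the abstract: supervision arrives only at the group level, yet the resulting model makes individual-level predictions.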


Psych · 2021 · Vol 3 (4) · pp. 703-716
Author(s): Thom Benjamin Volker, Gerko Vink

Synthetic datasets allow for the dissemination of research data while protecting the privacy and confidentiality of respondents. Generating and analyzing synthetic datasets is straightforward, yet a synthetic data analysis pipeline is seldom adopted by applied researchers. We outline a simple procedure for generating and analyzing synthetic datasets with the multiple imputation software mice (Version 3.13.15) in R. We demonstrate through simulations that the analysis results obtained on synthetic data yield unbiased and valid inferences, and that the procedure leads to synthetic records that cannot be distinguished from the true data records. The ease of use when synthesizing data with mice, along with the validity of inferences obtained through this procedure, opens up a wealth of possibilities for data dissemination and further research on initially private data.
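The pooling step behind this kind of multiple-imputation synthesis can be sketched in a few lines. The example below is in Python with a simple normal model standing in for mice's chained-equation imputations, and applies the standard combining rules for partially synthetic data (pooled estimate, between- and within-synthesis variance); treat it as an illustration of the workflow, not the authors' R pipeline:

```python
import numpy as np

rng = np.random.default_rng(2)
true_data = rng.normal(loc=5.0, scale=2.0, size=300)  # confidential data

m = 10
estimates, variances = [], []
for _ in range(m):
    # Synthesize by drawing new values from a model fitted to the real data
    # (mice would instead use chained-equation imputation models).
    syn = rng.normal(true_data.mean(), true_data.std(ddof=1),
                     size=true_data.size)
    estimates.append(syn.mean())
    variances.append(syn.var(ddof=1) / syn.size)

q_bar = np.mean(estimates)      # pooled point estimate
b = np.var(estimates, ddof=1)   # between-synthesis variance
u_bar = np.mean(variances)      # mean within-synthesis variance
T = u_bar + b / m               # total variance, partially synthetic rule
```

An analyst then reports q_bar with standard error sqrt(T), exactly as one would after ordinary multiple imputation, just with a different variance rule.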


Author(s): David R. McClure, Jerome P. Reiter

When releasing individual-level data to the public, statistical agencies typically alter data values to protect the confidentiality of individuals’ identities and sensitive attributes. When data undergo substantial perturbation, secondary data analysts’ inferences can be distorted in ways that they typically cannot determine from the released data alone. This is problematic, in that analysts have no way of knowing whether they should trust the results based on the altered data. To ameliorate this problem, agencies can establish verification servers: remote computers that analysts query for measures of the quality of inferences obtained from disclosure-protected data. The reported quality measures reflect the similarity between the analysis done with the altered data and the analysis done with the confidential data. However, quality measures can leak information about the confidential values, so they too must be subject to disclosure protections. In this article, we discuss several approaches to releasing quality measures for verification servers when the public-use data are generated via multiple imputation, also known as synthetic data. The methods can be modified for other stochastic perturbation methods.
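One interval-based quality measure a verification server can report is the overlap between the confidence interval computed on the confidential data and the one computed on the released data, in the spirit of the overlap measure of Karr et al.; the abstract does not specify which measures the authors use, so this is an illustrative sketch:

```python
def ci_overlap(orig, alt):
    """Average fractional overlap of two confidence intervals.
    Returns 1.0 for identical intervals; values <= 0 indicate
    disjoint intervals."""
    lo, uo = orig
    la, ua = alt
    inter = min(uo, ua) - max(lo, la)
    return 0.5 * (inter / (uo - lo) + inter / (ua - la))

# Identical intervals overlap fully; a half-shifted interval scores 0.5.
print(ci_overlap((0.0, 1.0), (0.0, 1.0)))   # 1.0
print(ci_overlap((0.0, 1.0), (0.5, 1.5)))   # 0.5
```

Because the measure depends on the confidential interval's endpoints, releasing it verbatim can itself leak information, which is precisely the disclosure problem the article addresses.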


2020 · Vol 10 (2)
Author(s): Aidong Adam Ding, Guanhong Miao, Samuel Shangwu Wu

Privacy protection is an important requirement in many statistical studies. A recently proposed data collection method, triple matrix-masking, retains exact summary statistics without exposing the raw data at any point in the process. In this paper, we provide theoretical formulation and proofs showing that a modified version of the procedure is strong collection obfuscating: no party in the data collection process is able to gain knowledge of the individual-level data, even with some partially masked data information in addition to the publicly published data. This provides a theoretical foundation for the usage of such a procedure to collect masked data that allows exact statistical inference for linear models, while preserving a well-defined notion of privacy protection for each individual participant in the study. This paper fits into a line of work tackling the problem of how to create useful synthetic data without having a trustworthy data aggregator. We achieve this by splitting the trust between two parties, the "masking service provider" and the "data collector."
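The core algebraic fact behind matrix-masking schemes of this kind is that left-multiplying both the design matrix and the response by an orthogonal matrix leaves least-squares inference intact. A minimal sketch of that identity alone; the actual triple matrix-masking protocol involves three masking steps split between the two parties and is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# A random orthogonal mask Q (via QR decomposition). Because Q.T @ Q = I,
# (Q X).T @ (Q X) = X.T @ X and (Q X).T @ (Q y) = X.T @ y, so the OLS
# estimate computed from the masked data equals the one from the raw data.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
X_masked, y_masked = Q @ X, Q @ y

beta = np.linalg.lstsq(X, y, rcond=None)[0]
beta_masked = np.linalg.lstsq(X_masked, y_masked, rcond=None)[0]
```

The masked rows no longer correspond to individual participants, yet every quantity entering the normal equations is preserved exactly, which is what "exact statistical inference for linear models" refers to.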


2020 · Vol 51 (3) · pp. 183-198
Author(s): Wiktor Soral, Mirosław Kofta

Abstract. The importance of various trait dimensions in explaining positive global self-esteem has been the subject of numerous studies. While some have provided support for the importance of agency, others have highlighted the importance of communion. This discrepancy can be explained if one takes into account that people define and value the self in both individual and collective terms. Two studies (N = 367 and N = 263) examined the extent to which competence (an aspect of agency), morality, and sociability (aspects of communion) promote high self-esteem at the individual and the collective level. In both studies, competence was the strongest predictor of self-esteem at the individual level, whereas morality was the strongest predictor of self-esteem at the collective level.


2019 · Vol 37 (1) · pp. 18-34
Author(s): Edward C. Warburton

This essay considers metonymy in dance from the perspective of cognitive science. My goal is to unpack the roles of metaphor and metonymy in dance thought and action: how do they arise, how are they understood, how are they to be explained, and in what ways do they determine a person's doing of dance? The premise of this essay is that language matters at the cultural level and can be determinative at the individual level. I contend that some figures of speech, especially metonymic labels like ‘bunhead’, can not only discourage but dehumanize young dancers, treating them not as subjects who dance but as objects to be danced. The use of metonymy to sort young dancers may undermine the development of healthy self-image, impede strong identity formation, and retard creative-artistic development. The paper concludes with a discussion of the influence of metonymy in dance and implications for dance educators.


Author(s): Pauline Oustric, Kristine Beaulieu, Nuno Casanova, Francois Husson, Catherine Gibbons, ...

2020
Author(s): Christopher James Hopwood, Ted Schwaba, Wiebke Bleidorn

Personal concerns about climate change and the environment are a powerful motivator of sustainable behavior. People’s level of concern varies as a function of a variety of social and individual factors. Using data from 58,748 participants from a nationally representative German sample, we tested preregistered hypotheses about factors that impact concerns about the environment over time. We found that environmental concerns increased modestly from 2009 to 2017 in the German population. However, individuals in middle adulthood tended to be more concerned and showed more consistent increases in concern over time than younger or older people. Consistent with previous research, Big Five personality traits were correlated with environmental concerns. We present novel evidence that increases in concern were related to increases in the personality traits neuroticism and openness to experience. Indeed, changes in openness explained roughly 50% of the variance in changes in environmental concerns. These findings highlight the importance of understanding the individual-level factors associated with changes in environmental concerns over time for the promotion of more sustainable behavior.

