Utility Analysis of Horizontally Merged Multi-Party Synthetic Data with Differential Privacy

Author(s):  
Bingyue Su ◽  
Fang Liu


2020 ◽
Vol 69 ◽  
pp. 1127-1164
Author(s):  
Yuan Luo ◽  
Nicholas R. Jennings

In crowdsourcing systems, it is important for the crowdsourcing campaign initiator to incentivize users to share their data in order to produce results of the desired computational accuracy. This problem becomes especially challenging when users are concerned about the privacy of their data. To overcome this challenge, existing work often aims to provide users with differential privacy guarantees to incentivize privacy-sensitive users to share their data. However, this work neglects the network effect that a user enjoys greater privacy protection when he aligns his participation behaviour with that of other users. To explore this network effect, we formulate the interaction among users regarding their participation decisions as a population game, because a user's welfare from the interaction depends not only on his own participation decision but also on the distribution of others' decisions. We show that the Nash equilibrium of this game is a threshold strategy: all users whose privacy sensitivity is below a certain threshold participate, and the remaining users do not. We characterize the existence and uniqueness of this equilibrium, which depends on the privacy guarantee, the reward provided by the initiator and the population size. Based on this equilibrium analysis, we design the PINE (Privacy Incentivization with Network Effects) mechanism and prove that it maximizes the initiator's payoff while providing participating users with a guaranteed degree of privacy protection. Numerical simulations, on both real and synthetic data, show that (i) PINE improves the initiator's expected payoff by up to 75% compared to state-of-the-art mechanisms that do not consider this effect; and (ii) the performance gain from exploiting the network effect is particularly large when the majority of users are flexible in their privacy attitudes and when there are a large number of low-quality task performers.
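To make the threshold structure concrete, the following minimal sketch searches for a participation threshold as a fixed point, under an assumed model in which a user participates when the reward outweighs his privacy cost and the effective privacy loss shrinks as more users participate (the network effect). The 1/sqrt(n) decay, the linear cost model and all parameter values are illustrative assumptions, not the paper's actual formulation.

import numpy as np

def effective_epsilon(eps_base, n_participants):
    # Assumed form of the network effect: a user's effective privacy loss
    # shrinks as more users participate (hypothetical 1/sqrt(n) decay).
    return eps_base / np.sqrt(max(n_participants, 1))

def threshold_equilibrium(sensitivities, reward, eps_base, iters=100):
    # Fixed-point search for the participation threshold: a user with privacy
    # sensitivity c participates iff reward >= c * effective_epsilon, i.e.
    # iff c <= reward / effective_epsilon (the threshold).
    threshold = np.inf  # start from "everyone participates" and iterate
    for _ in range(iters):
        n = int(np.sum(sensitivities <= threshold))
        new_threshold = reward / effective_epsilon(eps_base, n)
        if np.isclose(new_threshold, threshold):
            break
        threshold = new_threshold
    return threshold, int(np.sum(sensitivities <= threshold))

rng = np.random.default_rng(0)
costs = rng.uniform(0.0, 20.0, size=1000)  # hypothetical privacy sensitivities
t, n = threshold_equilibrium(costs, reward=1.0, eps_base=2.0)
print(f"threshold={t:.2f}, participants={n}")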


2018 ◽  
Vol 8 (1) ◽  
Author(s):  
Natalie Shlomo

An overview of traditional types of data dissemination at statistical agencies is provided, including definitions of disclosure risk, the quantification of disclosure risk and data utility, and common statistical disclosure limitation (SDL) methods. However, with technological advancements and the increasing push by governments for open and accessible data, new forms of data dissemination are currently being explored. We focus on web-based applications such as flexible table builders and remote analysis servers, synthetic data and remote access. Many of these applications introduce new challenges for statistical agencies as they gradually relinquish some of their control over what data is released. There is now more recognition of the need for perturbative methods to protect the confidentiality of data subjects. These new forms of data dissemination are changing the landscape of how disclosure risks are conceptualized and the types of SDL methods that need to be applied to protect the data. In particular, inferential disclosure is the main disclosure risk of concern and encompasses the traditional types of disclosure risk based on identity and attribute disclosures. These challenges have led statisticians to explore the computer science definition of differential privacy and privacy-by-design applications. We explore how differential privacy can be a useful addition to the current SDL framework within statistical agencies.
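As an illustration of the kind of perturbative protection discussed here, the sketch below applies the Laplace mechanism to a contingency table before release, roughly what a flexible table builder might do server-side. The sensitivity assumption (each respondent contributes to exactly one cell) and the rounding step are simplifying choices for the example, not a recommendation from the article.

import numpy as np

def laplace_table(counts, epsilon, sensitivity=1.0, rng=None):
    # Perturb a table of counts with the Laplace mechanism before release.
    # Sensitivity is 1 when each respondent contributes to exactly one cell.
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=np.shape(counts))
    noisy = np.asarray(counts, dtype=float) + noise
    # Optional post-processing: round and force non-negative cell counts.
    return np.clip(np.round(noisy), 0, None)

table = np.array([[120, 35], [18, 4]])  # hypothetical 2x2 cross-tabulation
print(laplace_table(table, epsilon=1.0))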


2021 ◽  
Vol 11 (3) ◽  
Author(s):  
Ryan McKenna ◽  
Gerome Miklau ◽  
Daniel Sheldon

We propose a general approach for differentially private synthetic data generation that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise-addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that can work in more general settings while still performing comparably to NIST-MST. We believe our general approach should be of broad interest and can be adopted in future mechanisms for synthetic data generation.
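A rough sketch of the select-measure-generate recipe is given below, assuming a pandas DataFrame of categorical attributes. The fixed list of attribute pairs, the crude per-marginal noise calibration and the naive per-marginal sampling in step (3) are all placeholders: the actual mechanisms select marginals adaptively, calibrate Gaussian noise under a formal privacy accounting, and use Private-PGM to fit a single model consistent with every noisy marginal.

import numpy as np
import pandas as pd

def measure_marginals(df, pairs, epsilon_total, rng=None):
    # Steps (1)-(2): pick low-dimensional marginals and measure each with
    # Gaussian noise. The noise scale below is a crude illustrative choice,
    # not the calibrated accounting used by MST / NIST-MST.
    rng = rng or np.random.default_rng()
    sigma = len(pairs) / epsilon_total
    noisy = {}
    for a, b in pairs:
        m = pd.crosstab(df[a], df[b]).to_numpy().astype(float)
        noisy[(a, b)] = np.clip(m + rng.normal(0.0, sigma, m.shape), 0, None)
    return noisy

def sample_from_marginal(noisy_marginal, n, rng=None):
    # Naive stand-in for step (3): sample records from one noisy 2-way marginal.
    # Private-PGM instead estimates a single graphical model consistent with
    # all measured marginals and samples from that.
    rng = rng or np.random.default_rng()
    p = noisy_marginal / noisy_marginal.sum()
    idx = rng.choice(p.size, size=n, p=p.ravel())
    return np.column_stack(np.unravel_index(idx, p.shape))

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(500, 3)), columns=list("ABC"))
noisy = measure_marginals(df, [("A", "B"), ("B", "C")], epsilon_total=1.0, rng=rng)
synthetic_ab = sample_from_marginal(noisy[("A", "B")], n=500, rng=rng)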


2021 ◽  
Vol 14 (11) ◽  
pp. 2190-2202
Author(s):  
Kuntai Cai ◽  
Xiaoyu Lei ◽  
Jianxin Wei ◽  
Xiaokui Xiao

This paper studies the synthesis of high-dimensional datasets with differential privacy (DP). The state-of-the-art solution addresses this problem by first generating a set M of noisy low-dimensional marginals of the input data D, and then using them to approximate the data distribution in D for synthetic data generation. However, it imposes several constraints on M that considerably limit the choice of marginals. This makes it difficult to capture all important correlations among attributes, which in turn degrades the quality of the resulting synthetic data. To address this deficiency, we propose PrivMRF, a method that (i) also utilizes a set M of low-dimensional marginals for synthesizing high-dimensional data with DP, but (ii) provides a high degree of flexibility in the choice of marginals. The key idea of PrivMRF is to select an appropriate M to construct a Markov random field (MRF) that models the correlations among the attributes in the input data, and then use the MRF for data synthesis. Experimental results on four benchmark datasets show that PrivMRF consistently outperforms the state of the art in terms of the accuracy of counting queries and classification tasks conducted on the generated synthetic data.
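The sketch below illustrates one ingredient of this kind of pipeline: scoring candidate two-way marginals by how strongly the attributes are correlated, here via (non-private) empirical mutual information. In PrivMRF itself the selection is carried out under DP and must keep the induced Markov random field tractable, so this ranking is only a hypothetical stand-in for the selection step.

from itertools import combinations
import numpy as np
import pandas as pd

def mutual_information(df, a, b):
    # Empirical mutual information between two categorical attributes.
    joint = pd.crosstab(df[a], df[b], normalize=True).to_numpy()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log(joint / (px * py))
    return np.nansum(terms)  # zero cells contribute nothing

def score_candidate_marginals(df, budget=10):
    # Hypothetical stand-in for marginal selection: rank attribute pairs by
    # mutual information and keep the top `budget` pairs. The real method
    # performs this selection privately and under tractability constraints.
    scores = {(a, b): mutual_information(df, a, b)
              for a, b in combinations(df.columns, 2)}
    return sorted(scores, key=scores.get, reverse=True)[:budget]

df = pd.DataFrame(np.random.default_rng(0).integers(0, 4, size=(1000, 5)),
                  columns=list("ABCDE"))
print(score_candidate_marginals(df, budget=3))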


2021 ◽  
Author(s):  
Bangzhou Xin ◽  
Yangyang Geng ◽  
Teng Hu ◽  
Sheng Chen ◽  
Wei Yang ◽  
...  

Author(s):  
Daniel Kifer ◽  
Bing-Rong Lin

"Privacy" and "utility" are words that frequently appear in the literature on statistical privacy. But what do these words really mean? In recent years, many problems with intuitive notions of privacy and utility have been uncovered. Thus more formal notions of privacy and utility, which are amenable to mathematical analysis, are needed. In this paper we present our initial work on an axiomatization of privacy and utility. We present two privacy axioms which describe how privacy is affected by post-processing data and by randomly selecting a privacy mechanism. We present three axioms for utility measures which also describe how measured utility is affected by post-processing. Our analysis of these axioms yields new insights into the construction of privacy definitions and utility measures. In particular, we characterize the class of relaxations of differential privacy that can be obtained by changing constraints on probabilities; we show that the resulting constraints must be formed from concave functions. We also present several classes of utility metrics satisfying our axioms and explicitly show that measures of utility borrowed from statistics can lead to utility paradoxes when applied to statistical privacy. Finally, we show that the outputs of differentially private algorithms are best interpreted in terms of graphs or likelihood functions rather than query answers or synthetic data.


2021 ◽  
Vol 2022 (1) ◽  
pp. 481-500
Author(s):  
Xue Jiang ◽  
Xuebing Zhou ◽  
Jens Grossklags

Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information about individuals, direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data, not only because of the increase in computation and communication cost but also because of poor data utility. In this paper, we aim at addressing the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the ideas of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. By combining a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high-utility synthetic data on the server side without revealing users' private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and in generating high-utility synthetic data. With a local privacy guarantee ε = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm suffer an accuracy loss of 10% to 30%, whereas the accuracy loss is significantly reduced to less than 3%, and at best even less than 1%, with our framework. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.
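One building block in a framework of this kind is the differentially private treatment of the model updates that clients send to the server. The minimal sketch below clips an update to bound its sensitivity and adds Gaussian noise; the clipping norm, noise multiplier, and the assumption that this is where the privacy noise is injected are illustrative and not the authors' exact DP-Fed-Wae recipe.

import numpy as np

def privatize_update(update, clip_norm, noise_multiplier, rng=None):
    # Clip the client's (flattened) model update to bound its L2 sensitivity,
    # then add Gaussian noise before it is sent to the server. A sketch of the
    # generic clip-and-noise step, not the paper's specific scheme.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

update = np.random.default_rng(1).normal(size=1000)  # a flattened model delta
private_update = privatize_update(update, clip_norm=1.0, noise_multiplier=1.1)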

