A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms

Differential privacy has emerged as a popular model to provably limit privacy risks associated with a given data release. However releasing high dimensional synthetic data under differential privacy remains a challenging problem. In this paper, we study the problem of releasing synthetic data in the form of a high dimensional histogram under the constraint of differential privacy.We develop an $(\epsilon, \delta)$-differentially private categorical data synthesizer called \emph{Stability Based Hashed Gibbs Sampler} (SBHG). SBHG works by combining a stability based sparse histogram estimation algorithm with Gibbs sampling and feature selection to approximate the empirical joint distribution of a discrete dataset. SBHG offers a competitive alternative to state-of-the art synthetic data generators while preserving the sparsity structure of the original dataset, which leads to improved statistical utility as illustrated on simulated data. Finally, to study the utility of the resulting synthetic data sets generated by SBHG, we also perform logistic regression using the synthetic datasets and compare the classification accuracy with those from using the original dataset.

Download Full-text

Using Self-Similarity to Incorporate Dimensionality Reduction and Cluster Evolution Tracking

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.336-338.2242 ◽

2013 ◽

Vol 336-338 ◽

pp. 2242-2247

Author(s):

Guang Hui Yan ◽

Yong Chen ◽

Hong Yun Zhao ◽

Ya Jin Ren ◽

Zhi Cheng Ma

Keyword(s):

Dimensionality Reduction ◽

Synthetic Data ◽

Fractal Dimensionality ◽

High Dimensional ◽

Data Sets ◽

Stream Data ◽

Self Similarity ◽

Cluster Evolution ◽

Dimensionality Reduction Technique ◽

On Line

Cluster evolution tracking and dimensionality reduction have been studied intensively but separately in the time decayed and high dimensional stream data environment during the past decades. However, the interaction between the cluster evolution and the dimensionality reduction is the most common scenario in the time decayed stream data. Therefore, the dimensionality reduction should interact with cluster operation in the endless life cycle of stream data. In this paper, we first investigate the interaction between dimensionality reduction and cluster evolution in the high dimensional time decayed stream data. Then, we integrate the on-line sequential forward fractal dimensionality reduction technique with self-adaptive technique for cluster evolution tracking based on multi-fractal. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness and efficiency provided by our approach.

Download Full-text

Local Differential Privacy Protection of High-Dimensional Perceptual Data by the Refined Bayes Network

Sensors ◽

10.3390/s20092516 ◽

2020 ◽

Vol 20 (9) ◽

pp. 2516

Author(s):

Chunhua Ju ◽

Qiuyang Gu ◽

Gongxing Wu ◽

Shuangzhu Zhang

Keyword(s):

Privacy Protection ◽

Data Privacy ◽

Differential Privacy ◽

High Dimensional Data ◽

Statistical Characteristics ◽

High Dimensional ◽

Bayes Network ◽

Crowd Sensing ◽

Original Dataset ◽

Perception System

Although the Crowd-Sensing perception system brings great data value to people through the release and analysis of high-dimensional perception data, it causes great hidden danger to the privacy of participants in the meantime. Currently, various privacy protection methods based on differential privacy have been proposed, but most of them cannot simultaneously solve the complex attribute association problem between high-dimensional perception data and the privacy threat problems from untrustworthy servers. To address this problem, we put forward a local privacy protection based on Bayes network for high-dimensional perceptual data in this paper. This mechanism realizes the local data protection of the users at the very beginning, eliminates the possibility of other parties directly accessing the user’s original data, and fundamentally protects the user’s data privacy. During this process, after receiving the data of the user’s local privacy protection, the perception server recognizes the dimensional correlation of the high-dimensional data based on the Bayes network, divides the high-dimensional data attribute set into multiple relatively independent low-dimensional attribute sets, and then sequentially synthesizes the new dataset. It can effectively retain the attribute dimension correlation of the original perception data, and ensure that the synthetic dataset and the original dataset have as similar statistical characteristics as possible. To verify its effectiveness, we conduct a multitude of simulation experiments. Results have shown that the synthetic data of this mechanism under the effective local privacy protection has relatively high data utility.

Download Full-text

Differential Privacy Principal Component Analysis for Support Vector Machines

Security and Communication Networks ◽

10.1155/2021/5542283 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Yuxian Huang ◽

Geng Yang ◽

Yahong Xu ◽

Hao Zhou

Keyword(s):

Principal Component Analysis ◽

Classification Accuracy ◽

Differential Privacy ◽

Principal Component ◽

Component Analysis ◽

High Dimensional ◽

Support Vector ◽

Data Sets ◽

Vector Machines ◽

Fast Classification

In big data era, massive and high-dimensional data is produced at all times, increasing the difficulty of analyzing and protecting data. In this paper, in order to realize dimensionality reduction and privacy protection of data, principal component analysis (PCA) and differential privacy (DP) are combined to handle these data. Moreover, support vector machine (SVM) is used to measure the availability of processed data in our paper. Specifically, we introduced differential privacy mechanisms at different stages of the algorithm PCA-SVM and obtained the algorithms DPPCA-SVM and PCADP-SVM. Both algorithms satisfy ε , 0 -DP while achieving fast classification. In addition, we evaluate the performance of two algorithms in terms of noise expectation and classification accuracy from the perspective of theoretical proof and experimental verification. To verify the performance of DPPCA-SVM, we also compare our DPPCA-SVM with other algorithms. Results show that DPPCA-SVM provides excellent utility for different data sets despite guaranteeing stricter privacy.

Download Full-text

Winning the NIST Contest: A scalable and general approach to differentially private synthetic data

Journal of Privacy and Confidentiality ◽

10.29012/jpc.778 ◽

2021 ◽

Vol 11 (3) ◽

Author(s):

Ryan McKenna ◽

Gerome Miklau ◽

Daniel Sheldon

Keyword(s):

Differential Privacy ◽

Data Distribution ◽

Synthetic Data ◽

Processing Method ◽

High Dimensional ◽

Data Generation ◽

Synthetic Data Generation ◽

Low Dimensional ◽

Noisy Measurements ◽

Broad Interest

We propose a general approach for differentially private synthetic data generation, that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that can work in more general settings, while still performing comparably to NIST-MST. We believe our general approach should be of broad interest, and can be adopted in future mechanisms for synthetic data generation.

Download Full-text

Data synthesis via differentially private markov random fields

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476272 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2190-2202

Author(s):

Kuntai Cai ◽

Xiaoyu Lei ◽

Jianxin Wei ◽

Xiaokui Xiao

Keyword(s):

Markov Random Fields ◽

Input Data ◽

Differential Privacy ◽

State Of The Art ◽

Synthetic Data ◽

The State ◽

High Dimensional ◽

Data Synthesis ◽

Markov Random ◽

Low Dimensional

This paper studies the synthesis of high-dimensional datasets with differential privacy (DP). The state-of-the-art solution addresses this problem by first generating a set M of noisy low-dimensional marginals of the input data D , and then use them to approximate the data distribution in D for synthetic data generation. However, it imposes several constraints on M that considerably limits the choices of marginals. This makes it difficult to capture all important correlations among attributes, which in turn degrades the quality of the resulting synthetic data. To address the above deficiency, we propose PrivMRF, a method that (i) also utilizes a set M of low-dimensional marginals for synthesizing high-dimensional data with DP, but (ii) provides a high degree of flexibility in the choices of marginals. The key idea of PrivMRF is to select an appropriate M to construct a Markov random field (MRF) that models the correlations among the attributes in the input data, and then use the MRF for data synthesis. Experimental results on four benchmark datasets show that PrivMRF consistently outperforms the state of the art in terms of the accuracy of counting queries and classification tasks conducted on the synthetic data generated.

Download Full-text

Nonparametric e-Mixture Estimation

Neural Computation ◽

10.1162/neco_a_00888 ◽

2016 ◽

Vol 28 (12) ◽

pp. 2687-2725 ◽

Cited By ~ 4

Author(s):

Ken Takano ◽

Hideitsu Hino ◽

Shotaro Akaho ◽

Noboru Murata

Keyword(s):

Synthetic Data ◽

Estimation Algorithm ◽

Data Sets ◽

Real World Data ◽

Nonparametric Modeling ◽

Data Set ◽

Target Distribution ◽

Nonparametric Models ◽

Research Fields ◽

Using Data

This study considers the common situation in data analysis when there are few observations of the distribution of interest or the target distribution, while abundant observations are available from auxiliary distributions. In this situation, it is natural to compensate for the lack of data from the target distribution by using data sets from these auxiliary distributions—in other words, approximating the target distribution in a subspace spanned by a set of auxiliary distributions. Mixture modeling is one of the simplest ways to integrate information from the target and auxiliary distributions in order to express the target distribution as accurately as possible. There are two typical mixtures in the context of information geometry: the [Formula: see text]- and [Formula: see text]-mixtures. The [Formula: see text]-mixture is applied in a variety of research fields because of the presence of the well-known expectation-maximazation algorithm for parameter estimation, whereas the [Formula: see text]-mixture is rarely used because of its difficulty of estimation, particularly for nonparametric models. The [Formula: see text]-mixture, however, is a well-tempered distribution that satisfies the principle of maximum entropy. To model a target distribution with scarce observations accurately, this letter proposes a novel framework for a nonparametric modeling of the [Formula: see text]-mixture and a geometrically inspired estimation algorithm. As numerical examples of the proposed framework, a transfer learning setup is considered. The experimental results show that this framework works well for three types of synthetic data sets, as well as an EEG real-world data set.

Download Full-text

Differentially private density estimation with skew-normal mixtures model

Scientific Reports ◽

10.1038/s41598-021-90276-6 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Weisan Wu

Keyword(s):

Density Estimation ◽

Differential Privacy ◽

Real Data ◽

Estimation Algorithm ◽

Data Sets ◽

Normal Mixtures ◽

Estimated Parameters ◽

Private Data ◽

Asymmetric Problem ◽

Skew Normal

AbstractThe protection of private data is a hot research issue in the era of big data. Differential privacy is a strong privacy guarantees in data analysis. In this paper, we propose DP-MSNM, a parametric density estimation algorithm using multivariate skew-normal mixtures (MSNM) model to differential privacy. MSNM can solve the asymmetric problem of data sets, and it is could approximate any distribution through expectation–maximization (EM) algorithm. In this model, we add two extra steps on the estimated parameters in the M step of each iteration. The first step is adding calibrated noise to the estimated parameters based on Laplacian mechanism. The second step is post-processes those noisy parameters to ensure their intrinsic characteristics based on the theory of vector normalize and positive semi definition matrix. Extensive experiments using both real data sets evaluate the performance of DP-MSNM, and demonstrate that the proposed method outperforms DPGMM.

Download Full-text

Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder

Proceedings on Privacy Enhancing Technologies ◽

10.2478/popets-2022-0024 ◽

2021 ◽

Vol 2022 (1) ◽

pp. 481-500

Author(s):

Xue Jiang ◽

Xuebing Zhou ◽

Jens Grossklags

Keyword(s):

Machine Learning ◽

Data Collection ◽

Private Information ◽

Differential Privacy ◽

High Dimensional Data ◽

Synthetic Data ◽

Privacy Preserving ◽

High Dimensional ◽

Data Utility ◽

High Utility

Abstract Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information of individuals, the direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-ofthe-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data; not only because of the increase in computation and communication cost, but also poor data utility. In this paper, we aim at addressing the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the idea of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. With the combination of a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high utility synthetic data on the server side without revealing users’ private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and generating high-utility synthetic data. With a local privacy guarantee ∈ = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm cause an accuracy loss of 10% ~ 30%, whereas the accuracy loss is significantly reduced to less than 3% and at best even less than 1% with our framework. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.

Download Full-text

Privacy-preserving generative deep neural networks support clinical data sharing

10.1101/159756 ◽

2017 ◽

Cited By ~ 20

Author(s):

Brett K. Beaulieu-Jones ◽

Zhiwei Steven Wu ◽

Chris Williams ◽

Ran Lee ◽

Sanjeev P. Bhavnani ◽

...

Keyword(s):

Neural Networks ◽

Data Sharing ◽

Deep Neural Networks ◽

Differential Privacy ◽

Synthetic Data ◽

Scientific Progress ◽

Patient Privacy ◽

Individual Level ◽

Original Dataset ◽

Level Data

AbstractBackgroundData sharing accelerates scientific progress but sharing individual level data while preserving patient privacy presents a barrier.Methods and ResultsUsing pairs of deep neural networks, we generated simulated, synthetic “participants” that closely resemble participants of the SPRINT trial. We showed that such paired networks can be trained with differential privacy, a formal privacy framework that limits the likelihood that queries of the synthetic participants’ data could identify a real a participant in the trial. Machine-learning predictors built on the synthetic population generalize to the original dataset. This finding suggests that the synthetic data can be shared with others, enabling them to perform hypothesis-generating analyses as though they had the original trial data.ConclusionsDeep neural networks that generate synthetic participants facilitate secondary analyses and reproducible investigation of clinical datasets by enhancing data sharing while preserving participant privacy.

Download Full-text

Synthetic data use: exploring use cases to optimise data utility

Discover Artificial Intelligence ◽

10.1007/s44163-021-00016-y ◽

2021 ◽

Vol 1 (1) ◽

Author(s):

Stefanie James ◽

Chris Harbron ◽

Janice Branson ◽

Mimmi Sundler

Keyword(s):

Pharmaceutical Industry ◽

Data Privacy ◽

Synthetic Data ◽

Use Cases ◽

Data Utility ◽

Original Dataset ◽

Simulation Based ◽

Synthetic Datasets ◽

Future Direction

AbstractSynthetic data is a rapidly evolving field with growing interest from multiple industry stakeholders and European bodies. In particular, the pharmaceutical industry is starting to realise the value of synthetic data which is being utilised more prevalently as a method to optimise data utility and sharing, ultimately as an innovative response to the growing demand for improved privacy. Synthetic data is data generated by simulation, based upon and mirroring properties of an original dataset. Here, with supporting viewpoints from across the pharmaceutical industry, we set out to explore use cases for synthetic data across seven key but relatable areas for optimising data utility for improved data privacy and protection. We also discuss the various methods which can be used to produce a synthetic dataset and availability of metrics to ensure robust quality of generated synthetic datasets. Lastly, we discuss the potential merits, challenges and future direction of synthetic data within the pharmaceutical industry and the considerations for this privacy enhancing technology.

Download Full-text