Data synthesis via differentially private Markov random fields

2021 ◽  
Vol 14 (11) ◽  
pp. 2190-2202
Author(s):  
Kuntai Cai ◽  
Xiaoyu Lei ◽  
Jianxin Wei ◽  
Xiaokui Xiao

This paper studies the synthesis of high-dimensional datasets with differential privacy (DP). The state-of-the-art solution addresses this problem by first generating a set M of noisy low-dimensional marginals of the input data D, and then using them to approximate the data distribution in D for synthetic data generation. However, it imposes several constraints on M that considerably limit the choice of marginals. This makes it difficult to capture all important correlations among attributes, which in turn degrades the quality of the resulting synthetic data. To address this deficiency, we propose PrivMRF, a method that (i) also utilizes a set M of low-dimensional marginals for synthesizing high-dimensional data with DP, but (ii) provides a high degree of flexibility in the choice of marginals. The key idea of PrivMRF is to select an appropriate M to construct a Markov random field (MRF) that models the correlations among the attributes in the input data, and then use the MRF for data synthesis. Experimental results on four benchmark datasets show that PrivMRF consistently outperforms the state of the art in terms of the accuracy of counting queries and classification tasks conducted on the generated synthetic data.

2021 ◽  
Vol 11 (3) ◽  
Author(s):  
Ryan McKenna ◽  
Gerome Miklau ◽  
Daniel Sheldon

We propose a general approach for differentially private synthetic data generation that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that can work in more general settings while still performing comparably to NIST-MST. We believe our general approach should be of broad interest and can be adopted in future mechanisms for synthetic data generation.
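The select, measure, generate pipeline described in this abstract can be illustrated with a minimal sketch. It is not Private-PGM, NIST-MST, or MST: the marginal is chosen by hand, the noise is Laplace with scale 1/epsilon (assuming add/remove-one-record neighbouring datasets, so a histogram has L1 sensitivity 1), and generation simply samples from a single clipped noisy marginal. All function names, the toy attributes, and the toy records are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(records, attrs, domains, epsilon, rng):
    """Step 2 (measure): count every cell of the marginal over `attrs`
    and add Laplace noise of scale 1/epsilon (assumed sensitivity 1)."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    cells = product(*(domains[a] for a in attrs))
    return {c: counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng) for c in cells}


def sample_from_marginal(noisy, n, rng):
    """Step 3 (generate): clip negative counts, then draw n cells with
    probability proportional to the clipped noisy counts."""
    cells = list(noisy)
    weights = [max(noisy[c], 0.0) for c in cells]
    if sum(weights) == 0:
        weights = [1.0] * len(cells)  # fall back to uniform sampling
    return rng.choices(cells, weights=weights, k=n)


rng = random.Random(0)
domains = {"age": ["young", "old"], "smoker": [0, 1]}  # toy schema
records = [{"age": "young", "smoker": 0}] * 60 + [{"age": "old", "smoker": 1}] * 40

# Step 1 (select): the marginal is fixed by hand in this sketch.
selected = ["age", "smoker"]
marginal = noisy_marginal(records, selected, domains, epsilon=1.0, rng=rng)
synthetic = sample_from_marginal(marginal, 100, rng)
```

Clipping and sampling are post-processing of the noisy counts, so they consume no additional privacy budget. A real mechanism would measure several marginals and reconcile them, which is the role the abstract assigns to Private-PGM.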


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 227
Author(s):  
Eckart Michaelsen ◽  
Stéphane Vujasinovic

Representative input data are a necessary requirement for the assessment of machine-vision systems. For symmetry-seeing machines in particular, such imagery should provide symmetries as well as asymmetric clutter. Moreover, there must be reliable ground truth with the data. It should be possible to estimate the recognition performance and the computational efforts by providing different grades of difficulty and complexity. Recent competitions used real imagery labeled by human subjects with appropriate ground truth. The paper at hand proposes to use synthetic data instead. Such data contain symmetry, clutter, and nothing else. This is preferable because interference with other perceptive capabilities, such as object recognition, or prior knowledge, can be avoided. The data are given sparsely, i.e., as sets of primitive objects. However, images can be generated from them, so that the same data can also be fed into machines requiring dense input, such as multilayered perceptrons. Sparse representations are preferred, because the author’s own system requires such data, and in this way, any influence of the primitive extraction method is excluded. The presented format allows hierarchies of symmetries. This is important because hierarchy constitutes a natural and dominant part in symmetry-seeing. The paper reports some experiments using the author’s Gestalt algebra system as the symmetry-seeing machine. Additionally included is a comparative test run with the state-of-the-art symmetry-seeing deep learning convolutional perceptron of the PSU. The computational efforts and recognition performance are assessed.


2015 ◽  
Vol 23 (3) ◽  
pp. 303-313 ◽  
Author(s):  
Lianli Gao ◽  
Jingkuan Song ◽  
Xingyi Liu ◽  
Junming Shao ◽  
Jiajun Liu ◽  
...  

2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Binbin Zhang ◽  
Weiwei Wang ◽  
Xiangchu Feng

Subspace clustering aims to group a set of data points drawn from a union of subspaces according to the subspace from which each was drawn. It has become a popular method for recovering the low-dimensional structure underlying high-dimensional datasets. The state-of-the-art methods construct an affinity matrix based on the self-representation of the dataset and then use a spectral clustering method to obtain the final clustering result. These methods show that the sparsity and grouping effect of the affinity matrix are important in recovering the low-dimensional structure. In this work, we propose a weighted sparse penalty and a weighted grouping effect penalty in modeling the self-representation of data points. The experimental results on the Extended Yale B, USPS, and Berkeley 500 image segmentation datasets show that the proposed model is more effective than state-of-the-art methods in revealing the subspace structure underlying high-dimensional datasets.


2021 ◽  
Vol 12 ◽  
Author(s):  
Jianping Zhao ◽  
Na Wang ◽  
Haiyun Wang ◽  
Chunhou Zheng ◽  
Yansen Su

Dimensionality reduction of high-dimensional data is crucial for single-cell RNA sequencing (scRNA-seq) visualization and clustering. One prominent challenge in scRNA-seq studies comes from dropout events, which lead to zero-inflated data. To address this issue, we propose an scRNA-seq data dimensionality reduction algorithm based on a hierarchical autoencoder, termed SCDRHA. SCDRHA consists of two core modules: the first is a deep count autoencoder (DCA) that is used to denoise the data, and the second is a graph autoencoder that projects the data into a low-dimensional space. Experimental results demonstrate that SCDRHA performs better than existing state-of-the-art algorithms on dimension reduction and noise reduction in five real scRNA-seq datasets. In addition, SCDRHA can dramatically improve the performance of data visualization and cell clustering.


2020 ◽  
Author(s):  
Fatima Zahra Errounda ◽  
Yan Liu

Location and trajectory data are routinely collected to generate valuable knowledge about users' behavior patterns. However, releasing location data may jeopardize the privacy of the involved individuals. Differential privacy is a powerful technique that prevents an adversary from inferring the presence or absence of an individual in the original data solely based on the observed data. The first challenge in applying differential privacy to location data is that it usually involves a single user. This shifts the adversary's target to the user's locations instead of presence or absence in the original data. The second challenge is that the inherent correlation between location data, due to people's movement regularity and predictability, gives the adversary an advantage in inferring information about individuals. In this paper, we review the differentially private approaches to tackle these challenges. Our goal is to help newcomers to the field better understand the state of the art by providing a research map that highlights the different challenges in designing differentially private frameworks that tackle the characteristics of location data. We find that in protecting an individual's location privacy, the attention of differential privacy mechanisms shifts to preventing the adversary from inferring the original location based on the observed one. Moreover, we find that the privacy-preserving mechanisms make use of the predictability and regularity of users' movements to design and protect the users' privacy in trajectory data. Finally, we explore how well the presented frameworks succeed in protecting users' locations and trajectories against well-known privacy attacks.


Author(s):  
Shenghua Liu ◽  
Houdong Zheng ◽  
Huawei Shen ◽  
Xueqi Cheng ◽  
Xiangwen Liao

Whereas it is well known that social network users influence each other, a fundamental problem in influence maximization, opinion formation and viral marketing is that users' influences are difficult to quantify. Previous work has directly defined an independent model parameter to capture the interpersonal influence between each pair of users. However, such models do not consider how influences depend on each other if they originate from the same user or if they act on the same user. Moreover, these models need a parameter for each pair of users, which results in high-dimensional models that are prone to overfitting. Given these problems, another way of defining the parameters is needed to consider the dependencies. Thus we propose a model that defines parameters for every user with a latent influence vector and a susceptibility vector. Such low-dimensional and distributed representations naturally cause the interpersonal influences involving the same user to be coupled with each other, thus reducing the model's complexity. Additionally, the model can easily consider the sentimental polarities of users' messages and how sentiment affects users' influences. In this study, we conduct extensive experiments on real Microblog data, showing that our model with distributed representations achieves better accuracy than the state-of-the-art and pair-wise models, and that learning influences on sentiments benefits performance.
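The complexity argument in this abstract can be made concrete by counting parameters. The sketch below compares one parameter per ordered pair of users with one influence vector and one susceptibility vector per user. The inner-product scoring rule, the function names, and the sizes (1000 users, 10 dimensions) are assumptions made for illustration; the abstract does not state how the two vectors are combined.

```python
def pairwise_param_count(n_users):
    # One independent parameter for every ordered pair of distinct users.
    return n_users * (n_users - 1)


def latent_param_count(n_users, dim):
    # One influence vector and one susceptibility vector per user.
    return 2 * n_users * dim


def influence(influence_vecs, susceptibility_vecs, u, v):
    # Assumed scoring rule: inner product of u's influence vector
    # and v's susceptibility vector.
    return sum(a * b for a, b in zip(influence_vecs[u], susceptibility_vecs[v]))


n_users, dim = 1000, 10
print(pairwise_param_count(n_users))     # 999000
print(latent_param_count(n_users, dim))  # 20000

# Every score involving user 0 as the source reuses the same influence
# vector, so those influences are coupled rather than independent.
inf_vecs = {0: [1.0, 0.5]}
sus_vecs = {1: [0.2, 0.4], 2: [0.0, 1.0]}
print(influence(inf_vecs, sus_vecs, 0, 1))  # 0.4
print(influence(inf_vecs, sus_vecs, 0, 2))  # 0.5
```

Under these assumed sizes the per-user representation needs roughly fifty times fewer parameters, which is one way to read the abstract's claim about reduced model complexity.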


Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 389
Author(s):  
Sonali Parbhoo ◽  
Mario Wieser ◽  
Aleksander Wieczorek ◽  
Volker Roth

Estimating the effects of an intervention from high-dimensional observational data is a challenging problem due to the existence of confounding. The task is often further complicated in healthcare applications where a set of observations may be entirely missing for certain patients at test time, thereby prohibiting accurate inference. In this paper, we address this issue using an approach based on the information bottleneck to reason about the effects of interventions. To this end, we first train an information bottleneck to perform a low-dimensional compression of covariates by explicitly considering the relevance of information for treatment effects. As a second step, we subsequently use the compressed covariates to perform a transfer of relevant information to cases where data are missing during testing. In doing so, we can reliably and accurately estimate treatment effects even in the absence of a full set of covariate information at test time. Our results on two causal inference benchmarks and a real application for treating sepsis show that our method achieves state-of-the-art performance, without compromising interpretability.


2013 ◽  
Vol 303-306 ◽  
pp. 2412-2415
Author(s):  
Bo Chen ◽  
Yu Le Deng ◽  
Tie Ming Chen

The aim of dimensionality reduction is to construct a low-dimensional representation of high-dimensional input data in such a way that important parts of the structure of the input data are preserved. This paper proposes to apply dimensionality reduction to intrusion detection data based on the parallel Lanczos-SVD (PLSVD) with cloud technologies. The massive input data is stored on a distributed file system such as HDFS, and the Map/Reduce method is used for parallel analysis on many cluster nodes. Our experimental results show that, compared with the PCA algorithm, the PLSVD algorithm has better scalability and flexibility.
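To illustrate SVD-based dimensionality reduction in general, the sketch below finds the leading right singular vector of a small data matrix and projects each row onto it. It uses plain power iteration on A^T A in place of the Lanczos method, and it runs on a single machine rather than on Map/Reduce, so it is not the PLSVD algorithm. The function names and the toy matrix are hypothetical.

```python
import math
import random


def top_right_singular_vector(rows, iters=200, seed=0):
    """Power iteration on A^T A; returns a unit vector approximating
    the leading right singular vector of the matrix given by `rows`."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        # Per-row dot products (A v): independent work for each row.
        av = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        # Sum of per-row contributions (A^T (A v)): an aggregation step.
        w = [sum(rows[i][j] * av[i] for i in range(len(rows))) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, v):
    # One-dimensional coordinate of each row along direction v.
    return [sum(r[j] * v[j] for j in range(len(v))) for r in rows]


# Toy data lying exactly on the line y = x.
rows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [-1.0, -1.0]]
v = top_right_singular_vector(rows)
coords = project(rows, v)
```

The two inner computations have a map-then-aggregate shape: each row contributes independently and the contributions are summed. That is the kind of work a Map/Reduce job could distribute, although how PLSVD actually partitions it is not described in the abstract.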

