Differentially Private Data Sets Based on Microaggregation and Record Perturbation

Author(s):  
Jordi Soria-Comas ◽  
Josep Domingo-Ferrer
Keyword(s):  
2019 ◽  
Author(s):  
Sven Festag ◽  
Cord Spreckelsen

BACKGROUND Collaborative privacy-preserving training methods allow for the integration of locally stored private data sets into machine learning approaches while ensuring confidentiality and nondisclosure. OBJECTIVE In this work we assess the performance of a state-of-the-art neural network approach for the detection of protected health information in texts trained in a collaborative privacy-preserving way. METHODS The training adopts distributed selective stochastic gradient descent (ie, it works by exchanging local learning results achieved on private data sets). Five networks were trained on separated real-world clinical data sets by using the privacy-protecting protocol. In total, the data sets contain 1304 real longitudinal patient records for 296 patients. RESULTS These networks reached a mean F1 value of 0.955. The gold standard centralized training that is based on the union of all sets and does not take data security into consideration reaches a final value of 0.962. CONCLUSIONS Using real-world clinical data, our study shows that detection of protected health information can be secured by collaborative privacy-preserving training. In general, the approach shows the feasibility of deep learning on distributed and confidential clinical data while ensuring data protection.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Weisan Wu

AbstractThe protection of private data is a hot research issue in the era of big data. Differential privacy is a strong privacy guarantees in data analysis. In this paper, we propose DP-MSNM, a parametric density estimation algorithm using multivariate skew-normal mixtures (MSNM) model to differential privacy. MSNM can solve the asymmetric problem of data sets, and it is could approximate any distribution through expectation–maximization (EM) algorithm. In this model, we add two extra steps on the estimated parameters in the M step of each iteration. The first step is adding calibrated noise to the estimated parameters based on Laplacian mechanism. The second step is post-processes those noisy parameters to ensure their intrinsic characteristics based on the theory of vector normalize and positive semi definition matrix. Extensive experiments using both real data sets evaluate the performance of DP-MSNM, and demonstrate that the proposed method outperforms DPGMM.


2021 ◽  
Vol 4 ◽  
Author(s):  
Zhanhong Jiang ◽  
Aditya Balu ◽  
Chinmay Hegde ◽  
Soumik Sarkar

In distributed machine learning, where agents collaboratively learn from diverse private data sets, there is a fundamental tension between consensus and optimality. In this paper, we build on recent algorithmic progresses in distributed deep learning to explore various consensus-optimality trade-offs over a fixed communication topology. First, we propose the incremental consensus-based distributed stochastic gradient descent (i-CDSGD) algorithm, which involves multiple consensus steps (where each agent communicates information with its neighbors) within each SGD iteration. Second, we propose the generalized consensus-based distributed SGD (g-CDSGD) algorithm that enables us to navigate the full spectrum from complete consensus (all agents agree) to complete disagreement (each agent converges to individual model parameters). We analytically establish convergence of the proposed algorithms for strongly convex and nonconvex objective functions; we also analyze the momentum variants of the algorithms for the strongly convex case. We support our algorithms via numerical experiments, and demonstrate significant improvements over existing methods for collaborative deep learning.


2021 ◽  
Vol 71 (5) ◽  
pp. 647-655
Author(s):  
Girish Mishra ◽  
S. K. Pal ◽  
S. V. S. S. N. V. G. Krishna Murthy ◽  
Kanishk Vats ◽  
Rakshak Raina

Modern day lightweight block ciphers provide powerful encryption methods for securing IoT communication data. Tiny digital devices exchange private data which the individual users might not be willing to get disclosed. On the other hand, the adversaries try their level best to capture this private data. The first step towards this is to identify the encryption scheme. This work is an effort to construct a distinguisher to identify the cipher used in encrypting the traffic data. We try to establish a deep learning based method to identify the encryption scheme used from a set of three lightweight block ciphers viz. LBlock, PRESENT and SPECK. We make use of images from MNIST and fashion MNIST data sets for establishing the cryptographic distinguisher. Our results show that the overall classification accuracy depends firstly on the type of key used in encryption and secondly on how frequently the pixel values change in original input image.


Author(s):  
H. Lakshmi H. Lakshmi

At present in our digital world, data comes and leaves cyberspace at huge rates. A representative organization transfers millions of email messages and downloads, stores, and transmits millions of data sets via various channels on a regular basis. Companies always hold private data of customers, stake holders, industry partners, regulators and they expect them to protect. Unfortunately, today’s industries constantly fall victim to massive data loss, and high-profile data leakages involving sensitive personal and corporate data continue to appear (http://opensecurityfoundation.org). Loss of data could significantly damage a company’s goodwill and reputation and could also invite legal issues or regulatory consequences for negligent security. That’s why, organizations should take measures to manage the sensitive data they carried out, how it’s restricted, and how to prevent the loss from being leaked or compromised. In this respect, over the years the database security community has developed a number of different techniques and approaches to assure data confidentiality, integrity, and availability[14]. Thus data loss prevention and in particular protection of data from unauthorized accesses remain important goal of any data management system. Multi Category Security labeling from a user and system administrator standpoint is straightforward. It consists of configuring a set of categories, which are simply text labels, such as "Company_Confidential" or "Medical_Records", and then assigning users to those categories. The system administrator first configures the categories, then assigns users to them as required. The users can then use the labels as they see fit. A system in a home environment may have only one category of "Private", and be configured so that only trusted local users are assigned to this category. In this paper, we first survey the most relevant concepts underlying the notion of database security, types of losses and summarize the menaces to databases and different categories of vulnerabilities in database. This paper focused on Virtual private database, stops various sensitive data from leaving the corporation’s private confines. This paper illustrates and demonstrates how to enable mutli-level access restrictions which ensures accuracy and security,


Author(s):  
Ching-Hua Chuan ◽  
Aleksey Charapko

In this paper, the authors use statistical models to predict the difficulty of recognizing musical keys from polyphonic audio signals. The key recognition difficulty provides important background information when comparing the performance of audio key finding algorithms that often evaluated using different private data sets. Given an audio recording, represented as extracted acoustic features, the authors applied multiple linear regression and proportional odds model to predict the difficulty level of the recording, annotated by three musicians as an integer on a 5-point Likert scale. The authors evaluated the predictions by using root mean square error, Pearson correlation coefficient, exact accuracy, and adjacent accuracy. The authors also discussed issues such as differences found between the musicians' annotations and the consistency of those annotations. To identify potential causes to the perceived difficulty for the individual musicians, the authors applied decision tree-based filtering with bagging. By using weighted naïve Bayes, the authors examined the effectiveness of each identified feature via a classification task.


Author(s):  
Zhiqiang Gao ◽  
Yixiao Sun ◽  
Xiaolong Cui ◽  
Yutao Wang ◽  
Yanyu Duan ◽  
...  

This article describes how the most widely used clustering, k-means, is prone to fall into a local optimum. Notably, traditional clustering approaches are directly performed on private data and fail to cope with malicious attacks in massive data mining tasks against attackers' arbitrary background knowledge. It would result in violation of individuals' privacy, as well as leaks through system resources and clustering outputs. To address these issues, the authors propose an efficient privacy-preserving hybrid k-means under Spark. In the first stage, particle swarm optimization is executed in resilient distributed datasets to initiate the selection of clustering centroids in the k-means on Spark. In the second stage, k-means is executed on the condition that a privacy budget is set as ε/2t with Laplace noise added in each round of iterations. Extensive experimentation on public UCI data sets show that on the premise of guaranteeing utility of privacy data and scalability, their approach outperforms the state-of-the-art varieties of k-means by utilizing swarm intelligence and rigorous paradigms of differential privacy.


2018 ◽  
Vol 14 (2) ◽  
pp. 1-17 ◽  
Author(s):  
Zhiqiang Gao ◽  
Yixiao Sun ◽  
Xiaolong Cui ◽  
Yutao Wang ◽  
Yanyu Duan ◽  
...  

This article describes how the most widely used clustering, k-means, is prone to fall into a local optimum. Notably, traditional clustering approaches are directly performed on private data and fail to cope with malicious attacks in massive data mining tasks against attackers' arbitrary background knowledge. It would result in violation of individuals' privacy, as well as leaks through system resources and clustering outputs. To address these issues, the authors propose an efficient privacy-preserving hybrid k-means under Spark. In the first stage, particle swarm optimization is executed in resilient distributed datasets to initiate the selection of clustering centroids in the k-means on Spark. In the second stage, k-means is executed on the condition that a privacy budget is set as ε/2t with Laplace noise added in each round of iterations. Extensive experimentation on public UCI data sets show that on the premise of guaranteeing utility of privacy data and scalability, their approach outperforms the state-of-the-art varieties of k-means by utilizing swarm intelligence and rigorous paradigms of differential privacy.


2019 ◽  
Vol 9 (2) ◽  
pp. 39-64
Author(s):  
Sumit Kumar Debnath

Electronic information is increasingly shared among unreliable entities. In this context, one interesting problem involves two parties that secretly want to determine an intersection of their respective private data sets while none of them wish to disclose the whole set to the other. One can adopt a Private Set Intersection (PSI) protocol to address this problem preserving the associated security and privacy issues. In this article, the authors present the first PSI protocol that incurs constant (p(k)) communication complexity with linear computation overhead and is fast even for the case of large input sets, where p(k) is a polynomial in security parameter k. Security of this scheme is proven in the standard model against semi-honest entities. The authors combine somewhere statistically binding (SSB) hash function with indistinguishability obfuscation (iO) and space-efficient probabilistic data structure Bloom filter to design the scheme.


Sign in / Sign up

Export Citation Format

Share Document