Differentially Private Data Sets Based on Microaggregation and Record Perturbation

BACKGROUND Collaborative privacy-preserving training methods allow for the integration of locally stored private data sets into machine learning approaches while ensuring confidentiality and nondisclosure. OBJECTIVE In this work we assess the performance of a state-of-the-art neural network approach for the detection of protected health information in texts trained in a collaborative privacy-preserving way. METHODS The training adopts distributed selective stochastic gradient descent (ie, it works by exchanging local learning results achieved on private data sets). Five networks were trained on separated real-world clinical data sets by using the privacy-protecting protocol. In total, the data sets contain 1304 real longitudinal patient records for 296 patients. RESULTS These networks reached a mean F1 value of 0.955. The gold standard centralized training that is based on the union of all sets and does not take data security into consideration reaches a final value of 0.962. CONCLUSIONS Using real-world clinical data, our study shows that detection of protected health information can be secured by collaborative privacy-preserving training. In general, the approach shows the feasibility of deep learning on distributed and confidential clinical data while ensuring data protection.
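To illustrate what "exchanging local learning results" can look like, here is a minimal sketch of one selective gradient exchange. It assumes the selection rule commonly described for distributed selective SGD, where each participant shares only the largest-magnitude fraction of its local gradient. The abstract does not state the rule, so `select_gradients`, `apply_shared`, `fraction`, and the toy numbers are all hypothetical.

```python
def select_gradients(gradient, fraction):
    """Return {index: value} for the largest-magnitude share of a local gradient."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, round(fraction * len(gradient)))
    # Rank coordinates by absolute gradient value, largest first.
    ranked = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True)
    return {i: gradient[i] for i in ranked[:keep]}


def apply_shared(global_params, shared, learning_rate):
    """Apply only the shared coordinates to a copy of the global parameters."""
    updated = list(global_params)
    for index, value in shared.items():
        updated[index] -= learning_rate * value
    return updated


# Toy local gradient computed on a private data set (made-up values).
local_gradient = [0.02, -0.9, 0.1, 0.5]
shared = select_gradients(local_gradient, fraction=0.5)
global_params = apply_shared([1.0, 1.0, 1.0, 1.0], shared, learning_rate=0.1)
```

Only the two selected coordinates leave the participant; the raw records and the remaining gradient entries stay local.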


Differentially private density estimation with skew-normal mixtures model

Scientific Reports ◽

10.1038/s41598-021-90276-6 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Weisan Wu

Keyword(s):

Density Estimation ◽

Differential Privacy ◽

Real Data ◽

Estimation Algorithm ◽

Data Sets ◽

Normal Mixtures ◽

Estimated Parameters ◽

Private Data ◽

Asymmetric Problem ◽

Skew Normal

The protection of private data is an active research topic in the era of big data, and differential privacy provides a strong privacy guarantee for data analysis. In this paper, we propose DP-MSNM, a parametric density estimation algorithm that applies differential privacy to the multivariate skew-normal mixtures (MSNM) model. MSNM can handle asymmetry in data sets, and it can approximate any distribution through the expectation–maximization (EM) algorithm. In this model, we add two extra steps on the estimated parameters in the M step of each iteration. The first step adds calibrated noise to the estimated parameters using the Laplace mechanism. The second step post-processes those noisy parameters to restore their intrinsic characteristics, based on vector normalization and the theory of positive semidefinite matrices. Extensive experiments on real data sets evaluate the performance of DP-MSNM and demonstrate that the proposed method outperforms DPGMM.
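The two extra M-step operations can be sketched for the mixture weights alone. This is a simplified illustration, not the DP-MSNM algorithm: the `sensitivity` value, the clipping `floor`, and the function names are assumptions, and the positive semidefinite repair of the scale matrices is omitted.

```python
import random


def laplace_noise(rng, scale):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_weights(weights, sensitivity, epsilon, rng, floor=1e-6):
    """Add Laplace noise to mixture weights, then restore a valid weight vector."""
    scale = sensitivity / epsilon  # Laplace mechanism: scale = sensitivity / epsilon
    noisy = [w + laplace_noise(rng, scale) for w in weights]
    # Post-processing: noise may push a weight below zero, so clip to a small
    # positive floor and renormalize so that the weights sum to one again.
    clipped = [max(v, floor) for v in noisy]
    total = sum(clipped)
    return [v / total for v in clipped]


rng = random.Random(0)
private_weights = privatize_weights(
    [0.5, 0.3, 0.2], sensitivity=0.01, epsilon=1.0, rng=rng
)
```

Because the second step only transforms already-noised values, it does not consume additional privacy budget.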


On Consensus-Optimality Trade-offs in Collaborative Deep Learning

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.573731 ◽

2021 ◽

Vol 4 ◽

Author(s):

Zhanhong Jiang ◽

Aditya Balu ◽

Chinmay Hegde ◽

Soumik Sarkar

Keyword(s):

Deep Learning ◽

Stochastic Gradient Descent ◽

Model Parameters ◽

Data Sets ◽

Full Spectrum ◽

Strongly Convex ◽

Trade Offs ◽

Private Data ◽

Convex Case ◽

Fundamental Tension

In distributed machine learning, where agents collaboratively learn from diverse private data sets, there is a fundamental tension between consensus and optimality. In this paper, we build on recent algorithmic progress in distributed deep learning to explore various consensus-optimality trade-offs over a fixed communication topology. First, we propose the incremental consensus-based distributed stochastic gradient descent (i-CDSGD) algorithm, which involves multiple consensus steps (where each agent communicates information with its neighbors) within each SGD iteration. Second, we propose the generalized consensus-based distributed SGD (g-CDSGD) algorithm that enables us to navigate the full spectrum from complete consensus (all agents agree) to complete disagreement (each agent converges to individual model parameters). We analytically establish convergence of the proposed algorithms for strongly convex and nonconvex objective functions; we also analyze the momentum variants of the algorithms for the strongly convex case. We support our algorithms via numerical experiments, and demonstrate significant improvements over existing methods for collaborative deep learning.
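The idea of several consensus steps per iteration can be sketched on a toy problem. This is a generic consensus-plus-gradient update under assumptions, not the authors' exact i-CDSGD rule: each agent holds a scalar parameter and a private quadratic loss, full gradients stand in for stochastic ones so the run is deterministic, and the ring mixing matrix, step size, and function names are hypothetical.

```python
def mix(values, weights):
    """One consensus step: each agent takes a weighted average over its neighbors."""
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n)) for i in range(n)]


def consensus_gd(targets, weights, learning_rate, consensus_steps, iterations):
    """Agent i minimizes 0.5 * (x - targets[i])**2 while mixing with neighbors."""
    params = [0.0] * len(targets)
    for _ in range(iterations):
        # Local gradients are evaluated at the agents' current parameters.
        grads = [x - t for x, t in zip(params, targets)]
        mixed = params
        for _ in range(consensus_steps):
            mixed = mix(mixed, weights)
        params = [m - learning_rate * g for m, g in zip(mixed, grads)]
    return params


# Doubly stochastic mixing matrix for four agents on a ring (fixed topology).
ring = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
params = consensus_gd(
    [0.0, 1.0, 2.0, 3.0], ring, learning_rate=0.05, consensus_steps=3, iterations=500
)
```

In this toy setting the agents' average settles at the minimizer of the summed loss (1.5), while the individual parameters stay slightly apart, which is the consensus-optimality tension in miniature.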


Distinguishing Lightweight Block Ciphers in Encrypted Images

Defence Science Journal ◽

10.14429/dsj.71.16843 ◽

2021 ◽

Vol 71 (5) ◽

pp. 647-655

Author(s):

Girish Mishra ◽

S. K. Pal ◽

S. V. S. S. N. V. G. Krishna Murthy ◽

Kanishk Vats ◽

Rakshak Raina

Keyword(s):

Classification Accuracy ◽

Block Ciphers ◽

Input Image ◽

Data Sets ◽

Encryption Scheme ◽

Traffic Data ◽

Digital Devices ◽

Private Data ◽

Encrypted Images ◽

The Individual

Modern lightweight block ciphers provide powerful encryption methods for securing IoT communication data. Tiny digital devices exchange private data that individual users may not want disclosed, while adversaries do their best to capture it. An adversary's first step is to identify the encryption scheme. This work constructs a distinguisher to identify the cipher used to encrypt the traffic data. We establish a deep learning based method to identify the encryption scheme from a set of three lightweight block ciphers, viz. LBlock, PRESENT and SPECK. We use images from the MNIST and Fashion-MNIST data sets to establish the cryptographic distinguisher. Our results show that the overall classification accuracy depends firstly on the type of key used in encryption and secondly on how frequently the pixel values change in the original input image.


A Survey Research of Satisfaction Levels on Preventing Data Loss and Preserving Privacy

International Journal for Research in Engineering Application & Management ◽

10.35291/2454-9150.2020.0001 ◽

2020 ◽

pp. 1-6

Author(s):

H. Lakshmi

Keyword(s):

Home Environment ◽

Database Security ◽

Data Management System ◽

Data Sets ◽

Data Loss ◽

Sensitive Data ◽

High Profile ◽

Digital World ◽

Private Data ◽

System Administrator

In today's digital world, data enters and leaves cyberspace at huge rates. A typical organization transfers millions of email messages and downloads, stores, and transmits millions of data sets via various channels on a regular basis. Companies hold private data belonging to customers, stakeholders, industry partners, and regulators, all of whom expect it to be protected. Unfortunately, today's industries constantly fall victim to massive data loss, and high-profile data leakages involving sensitive personal and corporate data continue to appear (http://opensecurityfoundation.org). Loss of data can significantly damage a company's goodwill and reputation and can also invite legal issues or regulatory consequences for negligent security. Organizations should therefore take measures to understand what sensitive data they hold, how access to it is restricted, and how to prevent it from being leaked or compromised. In this respect, over the years the database security community has developed a number of different techniques and approaches to assure data confidentiality, integrity, and availability[14]. Thus data loss prevention, and in particular protection of data from unauthorized access, remains an important goal of any data management system. Multi Category Security labeling is straightforward from both a user and a system administrator standpoint. It consists of configuring a set of categories, which are simply text labels, such as "Company_Confidential" or "Medical_Records", and then assigning users to those categories. The system administrator first configures the categories, then assigns users to them as required. The users can then use the labels as they see fit. A system in a home environment may have only one category, "Private", and be configured so that only trusted local users are assigned to this category.
In this paper, we first survey the most relevant concepts underlying the notion of database security and the types of losses, and summarize the threats to databases and the different categories of database vulnerabilities. The paper focuses on the virtual private database, which stops sensitive data from leaving the corporation's private confines. It illustrates and demonstrates how to enable multi-level access restrictions that ensure accuracy and security.
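The category workflow described above (configure labels, assign users, then label data) can be modeled in a few lines. This is a toy in-memory sketch, not the interface of any real security system: the function names, the user names, and the rule that a user needs every category on a resource are assumptions made for illustration.

```python
# Step 1: the administrator configures the categories (plain text labels).
categories = {"Company_Confidential", "Medical_Records"}

# Step 2: the administrator assigns users to categories (hypothetical users).
user_categories = {}


def assign(user, category):
    """Assign a user to a configured category; reject unknown labels."""
    if category not in categories:
        raise ValueError(f"unknown category: {category}")
    user_categories.setdefault(user, set()).add(category)


def can_access(user, resource_labels):
    """Assumed rule: the user must hold every category the resource carries."""
    return set(resource_labels) <= user_categories.get(user, set())


assign("alice", "Company_Confidential")
assign("alice", "Medical_Records")
assign("bob", "Company_Confidential")

# Step 3: users label their own data with categories they hold.
patient_file = {"Medical_Records"}
alice_allowed = can_access("alice", patient_file)
bob_allowed = can_access("bob", patient_file)
```

The home-environment case corresponds to a single category, "Private", assigned only to trusted local users.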


Predicting Key Recognition Difficulty in Music Using Statistical Learning Techniques

International Journal of Multimedia Data Engineering and Management ◽

10.4018/ijmdem.2014040104 ◽

2014 ◽

Vol 5 (2) ◽

pp. 54-69

Author(s):

Ching-Hua Chuan ◽

Aleksey Charapko

Keyword(s):

Pearson Correlation ◽

Background Information ◽

Difficulty Level ◽

Data Sets ◽

Audio Signals ◽

Private Data ◽

Learning Techniques ◽

Perceived Difficulty ◽

Statistical Learning Techniques ◽

The Individual

In this paper, the authors use statistical models to predict the difficulty of recognizing musical keys from polyphonic audio signals. The key recognition difficulty provides important background information when comparing the performance of audio key finding algorithms, which are often evaluated using different private data sets. Given an audio recording, represented as extracted acoustic features, the authors applied multiple linear regression and a proportional odds model to predict the difficulty level of the recording, annotated by three musicians as an integer on a 5-point Likert scale. The authors evaluated the predictions by using root mean square error, Pearson correlation coefficient, exact accuracy, and adjacent accuracy. The authors also discussed issues such as differences found between the musicians' annotations and the consistency of those annotations. To identify potential causes of the perceived difficulty for the individual musicians, the authors applied decision tree-based filtering with bagging. By using weighted naïve Bayes, the authors examined the effectiveness of each identified feature via a classification task.
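The four evaluation measures named above can be computed as follows. This sketch assumes that "adjacent accuracy" counts a prediction as correct when it is within one Likert level of the annotation; the abstract does not define the term, and the ratings below are invented for illustration.

```python
import math


def rmse(truth, pred):
    """Root mean square error between annotated and predicted levels."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def pearson(truth, pred):
    """Pearson correlation coefficient, computed from centered sums."""
    mean_t = sum(truth) / len(truth)
    mean_p = sum(pred) / len(pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_t * var_p)


def accuracy_within(truth, pred, tolerance):
    """Share of predictions within `tolerance` levels of the annotation."""
    hits = sum(1 for t, p in zip(truth, pred) if abs(t - p) <= tolerance)
    return hits / len(truth)


# Invented difficulty ratings on a 5-point scale.
annotated = [1, 2, 3, 4, 5]
predicted = [1, 3, 3, 2, 5]

error = rmse(annotated, predicted)
correlation = pearson(annotated, predicted)
exact = accuracy_within(annotated, predicted, tolerance=0)
adjacent = accuracy_within(annotated, predicted, tolerance=1)
```

Regression outputs would need rounding to the nearest integer level before the two accuracy measures are applied.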


Privacy-Preserving Hybrid K-Means

Censorship, Surveillance, and Privacy ◽

10.4018/978-1-5225-7113-1.ch049 ◽

2019 ◽

pp. 1009-1026

Author(s):

Zhiqiang Gao ◽

Yixiao Sun ◽

Xiaolong Cui ◽

Yutao Wang ◽

Yanyu Duan ◽

...

Keyword(s):

Data Mining ◽

Differential Privacy ◽

Privacy Preserving ◽

Local Optimum ◽

Data Sets ◽

Swarm Optimization ◽

Second Stage ◽

Private Data ◽

Privacy Budget ◽

Selection Of

This article describes how the most widely used clustering algorithm, k-means, is prone to fall into a local optimum. Notably, traditional clustering approaches are performed directly on private data and fail to cope with malicious attacks in massive data mining tasks against attackers' arbitrary background knowledge. This can result in violation of individuals' privacy, as well as leaks through system resources and clustering outputs. To address these issues, the authors propose an efficient privacy-preserving hybrid k-means under Spark. In the first stage, particle swarm optimization is executed in resilient distributed datasets to initiate the selection of clustering centroids in the k-means on Spark. In the second stage, k-means is executed on the condition that a privacy budget is set as ε/2t with Laplace noise added in each round of iterations. Extensive experimentation on public UCI data sets shows that, while guaranteeing the utility of private data and scalability, their approach outperforms state-of-the-art variants of k-means by utilizing swarm intelligence and rigorous paradigms of differential privacy.
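The second stage can be sketched for one-dimensional data in [0, 1]. The sketch assumes that the per-round budget ε/2t is spent once on noisy cluster counts and once on noisy cluster sums, so that t rounds consume ε in total; the abstract does not spell out this split. It also uses fixed starting centroids in place of the particle swarm stage, runs on a single machine rather than Spark, and all function names and data are hypothetical.

```python
import random


def laplace_noise(rng, scale):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_kmeans_1d(points, centroids, epsilon, rounds, rng):
    """k-means on values in [0, 1] with Laplace noise on counts and sums."""
    eps_step = epsilon / (2 * rounds)  # budget per noisy query family per round
    scale = 1.0 / eps_step             # one record changes a count or sum by at most 1
    centroids = list(centroids)
    for _ in range(rounds):
        sums = [0.0] * len(centroids)
        counts = [0] * len(centroids)
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            sums[nearest] += p
            counts[nearest] += 1
        for j in range(len(centroids)):
            noisy_count = counts[j] + laplace_noise(rng, scale)
            noisy_sum = sums[j] + laplace_noise(rng, scale)
            # Keep the old centroid when the noisy count is too small to trust.
            if noisy_count >= 1.0:
                centroids[j] = min(1.0, max(0.0, noisy_sum / noisy_count))
    return centroids


points = [0.18, 0.20, 0.22] * 10 + [0.78, 0.80, 0.82] * 10
# A very large epsilon means almost no noise; it is used here only to make
# the behavior easy to inspect, not as a realistic privacy setting.
centers = dp_kmeans_1d(points, [0.0, 1.0], epsilon=1e6, rounds=5, rng=random.Random(1))
```

Smaller values of `epsilon` or more rounds increase the noise scale 2t/ε, which is the utility cost of the privacy guarantee.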


Privacy-Preserving Hybrid K-Means

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2018040101 ◽

2018 ◽

Vol 14 (2) ◽

pp. 1-17 ◽

Cited By ~ 4

Author(s):

Zhiqiang Gao ◽

Yixiao Sun ◽

Xiaolong Cui ◽

Yutao Wang ◽

Yanyu Duan ◽

...

Keyword(s):

Differential Privacy ◽

State Of The Art ◽

Privacy Preserving ◽

Local Optimum ◽

Massive Data ◽

Data Sets ◽

Second Stage ◽

Private Data ◽

Privacy Budget ◽

Selection Of

This article describes how the most widely used clustering algorithm, k-means, is prone to fall into a local optimum. Notably, traditional clustering approaches are performed directly on private data and fail to cope with malicious attacks in massive data mining tasks against attackers' arbitrary background knowledge. This can result in violation of individuals' privacy, as well as leaks through system resources and clustering outputs. To address these issues, the authors propose an efficient privacy-preserving hybrid k-means under Spark. In the first stage, particle swarm optimization is executed in resilient distributed datasets to initiate the selection of clustering centroids in the k-means on Spark. In the second stage, k-means is executed on the condition that a privacy budget is set as ε/2t with Laplace noise added in each round of iterations. Extensive experimentation on public UCI data sets shows that, while guaranteeing the utility of private data and scalability, their approach outperforms state-of-the-art variants of k-means by utilizing swarm intelligence and rigorous paradigms of differential privacy.


Provably Secure Private Set Intersection With Constant Communication Complexity

International Journal of Cyber Warfare and Terrorism ◽

10.4018/ijcwt.2019040104 ◽

2019 ◽

Vol 9 (2) ◽

pp. 39-64

Author(s):

Sumit Kumar Debnath

Keyword(s):

Communication Complexity ◽

Bloom Filter ◽

Security And Privacy ◽

Data Sets ◽

Indistinguishability Obfuscation ◽

Security Parameter ◽

Private Data ◽

Set Intersection ◽

Private Set Intersection ◽

Probabilistic Data Structure

Electronic information is increasingly shared among unreliable entities. In this context, one interesting problem involves two parties that secretly want to determine the intersection of their respective private data sets while neither wishes to disclose its whole set to the other. One can adopt a Private Set Intersection (PSI) protocol to address this problem while handling the associated security and privacy issues. In this article, the authors present the first PSI protocol that incurs constant (p(k)) communication complexity with linear computation overhead and is fast even for the case of large input sets, where p(k) is a polynomial in security parameter k. Security of this scheme is proven in the standard model against semi-honest entities. The authors combine a somewhere statistically binding (SSB) hash function with indistinguishability obfuscation (iO) and the space-efficient probabilistic data structure known as a Bloom filter to design the scheme.
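Of the three building blocks named above, only the Bloom filter is simple enough to sketch here. The code below shows the data structure alone and offers no privacy on its own: the SSB hash and iO components that make the protocol secure are not modeled, and the class, its parameters, and the sample items are hypothetical.

```python
import hashlib


class BloomFilter:
    """Space-efficient set summary with false positives but no false negatives."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def __contains__(self, item):
        return all(self.bits[position] for position in self._positions(item))


# One party summarizes its set; the other tests its own items against it.
bloom = BloomFilter(size_bits=1024, num_hashes=4)
for item in ["alice@example.com", "bob@example.com", "carol@example.com"]:
    bloom.add(item)

other_set = ["bob@example.com", "dave@example.com", "carol@example.com"]
candidates = [item for item in other_set if item in bloom]
```

Every true common element appears in `candidates`; a non-member may appear too, with a probability governed by the filter size and the number of hash functions.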
