Variational Bayes In Private Settings (VIPS)

Journal of Artificial Intelligence Research ◽

10.1613/jair.1.11763 ◽

2020 ◽

Vol 68 ◽

pp. 109-157

Author(s):

Mijung Park ◽

James Foulds ◽

Kamalika Chaudhuri ◽

Max Welling

Keyword(s):

Large Scale ◽

Latent Dirichlet Allocation ◽

Probabilistic Models ◽

Data Augmentation ◽

Differential Privacy ◽

Variational Bayes ◽

Sensitive Information ◽

Bayesian Data Analysis ◽

Bayes Algorithm ◽

Real World Datasets

Many applications of Bayesian data analysis involve sensitive information such as personal documents or medical records, motivating methods which ensure that privacy is protected. We introduce a general privacy-preserving framework for Variational Bayes (VB), a widely used optimization-based Bayesian inference method. Our framework respects differential privacy, the gold-standard privacy criterion, and encompasses a large class of probabilistic models, called the Conjugate Exponential (CE) family. We observe that we can straightforwardly privatise VB’s approximate posterior distributions for models in the CE family, by perturbing the expected sufficient statistics of the complete-data likelihood. For a broadly-used class of non-CE models, those with binomial likelihoods, we show how to bring such models into the CE family, such that inferences in the modified model resemble the private variational Bayes algorithm as closely as possible, using the Pólya-Gamma data augmentation scheme. The iterative nature of variational Bayes presents a further challenge since iterations increase the amount of noise needed. We overcome this by combining: (1) an improved composition method for differential privacy, called the moments accountant, which provides a tight bound on the privacy cost of multiple VB iterations and thus significantly decreases the amount of additive noise; and (2) the privacy amplification effect of subsampling mini-batches from large-scale data in stochastic learning. We empirically demonstrate the effectiveness of our method in CE and non-CE models including latent Dirichlet allocation, Bayesian logistic regression, and sigmoid belief networks, evaluated on real-world datasets.
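
The perturbation step described in the abstract above can be illustrated with a minimal sketch. It assumes each data point contributes a vector of expected sufficient statistics, bounds each contribution by clipping its L2 norm, and adds Gaussian noise to the sum. The function names, the clipping step, and the `noise_multiplier` parameter are illustrative assumptions rather than the paper's notation, and the moments accountant and mini-batch subsampling are not implemented here.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [v * max_norm / norm for v in vec]


def private_sufficient_stats(per_example_stats, max_norm, noise_multiplier, rng):
    """Sum clipped per-example statistics and add Gaussian noise.

    Assumption: adding or removing one example changes the clipped sum by at
    most max_norm in L2 norm, so the noise standard deviation is set to
    noise_multiplier * max_norm.
    """
    dim = len(per_example_stats[0])
    total = [0.0] * dim
    for stats in per_example_stats:
        clipped = clip_to_norm(stats, max_norm)
        for i in range(dim):
            total[i] += clipped[i]
    std = noise_multiplier * max_norm
    return [t + rng.gauss(0.0, std) for t in total]


# Hypothetical usage: two examples with two-dimensional statistics.
rng = random.Random(0)
noisy = private_sufficient_stats([[3.0, 4.0], [0.3, 0.4]], 1.0, 1.0, rng)
```

In a variational Bayes loop, the noisy statistics would replace the exact ones in each parameter update, so the privacy cost accumulates over iterations. That accumulation is what the composition analysis in the abstract addresses.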


Variational Bayes in Private Settings (VIPS) (Extended Abstract)

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/705 ◽

2020 ◽

Author(s):

James R. Foulds ◽

Mijung Park ◽

Kamalika Chaudhuri ◽

Max Welling

Keyword(s):

Large Scale ◽

Differential Privacy ◽

Broad Class ◽

Variational Bayes ◽

Sensitive Information ◽

Bayesian Data Analysis ◽

Large Scale Data ◽

Stochastic Learning ◽

Bayesian Inference Method ◽

Bayesian Logistic Regression


On Privacy Protection of Latent Dirichlet Allocation Model Training

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/675 ◽

2019 ◽

Cited By ~ 1

Author(s):

Fangyuan Zhao ◽

Xuebin Ren ◽

Shusen Yang ◽

Xinyu Yang

Keyword(s):

Machine Learning ◽

Latent Dirichlet Allocation ◽

Differential Privacy ◽

Machine Learning Algorithms ◽

Sensitive Information ◽

Training Algorithm ◽

Allocation Model ◽

Model Training ◽

Real World Datasets ◽

Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering the hidden semantic architecture of text datasets, and plays a fundamental role in many machine learning applications. However, as with many other machine learning algorithms, the process of training an LDA model may leak sensitive information from the training datasets and bring significant privacy risks. To mitigate the privacy issues in LDA, we study privacy-preserving algorithms for LDA model training in this paper. In particular, we first develop a privacy monitoring algorithm to investigate the privacy guarantee obtained from the inherent randomness of the Collapsed Gibbs Sampling (CGS) process in a typical LDA training algorithm on centralized curated datasets. We then propose a locally private LDA training algorithm on crowdsourced data to provide local differential privacy for individual data contributors. The experimental results on real-world datasets demonstrate the effectiveness of our proposed algorithms.


Pairwise Learning with Differential Privacy Guarantees

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i01.5411 ◽

2020 ◽

Vol 34 (01) ◽

pp. 694-701

Author(s):

Mengdi Huai ◽

Di Wang ◽

Chenglin Miao ◽

Jinhui Xu ◽

Aidong Zhang

Keyword(s):

Differential Privacy ◽

Metric Learning ◽

Loss Functions ◽

Sensitive Information ◽

Training Set ◽

Pairwise Learning ◽

Auc Maximization ◽

General Convex ◽

Convex Loss ◽

Real World Datasets

Pairwise learning has received much attention recently as it is more capable of modeling the relative relationship between pairs of samples. Many machine learning tasks can be categorized as pairwise learning, such as AUC maximization and metric learning. Existing techniques for pairwise learning all fail to take into consideration a critical issue in their design, i.e., the protection of sensitive information in the training set. Models learned by such algorithms can implicitly memorize the details of sensitive information, which gives malicious parties an opportunity to infer it from the learned models. To address this challenging issue, in this paper, we propose several differentially private pairwise learning algorithms for both online and offline settings. Specifically, for the online setting, we first introduce a differentially private algorithm (called OnPairStrC) for strongly convex loss functions. Then, we extend this algorithm to general convex loss functions and give another differentially private algorithm (called OnPairC). For the offline setting, we also present two differentially private algorithms (called OffPairStrC and OffPairC) for strongly convex and general convex loss functions, respectively. These proposed algorithms can not only learn the model effectively from the data but also provide a strong privacy protection guarantee for sensitive information in the training set. Extensive experiments on real-world datasets are conducted to evaluate the proposed algorithms, and the experimental results support our theoretical analysis.


A Neuron Noise-Injection Technique for Privacy Preserving Deep Neural Networks

Open Computer Science ◽

10.1515/comp-2020-0133 ◽

2020 ◽

Vol 10 (1) ◽

pp. 137-152

Author(s):

Tosin A. Adesuyi ◽

Byeong Man Kim

Keyword(s):

Differential Privacy ◽

Real Life ◽

Privacy Preserving ◽

Training Dataset ◽

Injection Technique ◽

Sensitive Information ◽

Contribution Ratio ◽

Noise Injection ◽

Real World Datasets ◽

The Right

Data is the key to information mining that unveils hidden knowledge. The ability to reveal knowledge relies on the extractable features of a dataset and likewise on the depth of the mining model. However, several of these datasets embed sensitive information that can engender privacy violations and are subsequently used to build deep neural network (DNN) models. Recent approaches to enforcing privacy and protecting data sensitivity in DNN models reduce accuracy, giving rise to a significant accuracy disparity between a non-private DNN and a privacy-preserving DNN model. This accuracy gap is due to the enormous uncalculated noise flooding and the inability to quantify the right level of noise required to perturb distinct neurons in the DNN model. Consequently, this has hindered the use of privacy-protected DNN models in real-life applications. In this paper, we present a neuron noise-injection technique based on layer-wise buffered contribution ratio forwarding and the ϵ-differential privacy technique to preserve privacy in a DNN model. We adapt a layer-wise relevance propagation technique to compute the contribution ratio for each neuron in our network at the pre-training phase. Based on the proportion of each neuron’s contribution ratio, we generate a noise-tuple via the Laplace mechanism, and this helps to eliminate unwanted noise flooding. The noise-tuple is subsequently injected into the training network through its neurons to preserve the privacy of the training dataset in a differentially private manner. Hence, each neuron receives the right proportion of noise as estimated via its contribution ratio, and as a result, unquantifiable noise that drops the accuracy of privacy-preserving DNN models is avoided. Extensive experiments were conducted on three real-world datasets, and the results show that our approach narrows the existing accuracy gap to a close proximity and outperforms the state-of-the-art approaches in this context.
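
As a rough illustration of generating a noise-tuple from contribution ratios, the sketch below normalizes per-neuron relevance scores and draws one Laplace sample per neuron. The abstract does not state the allocation rule, so the choice here (Laplace scale proportional to each neuron's ratio) is purely an assumption, as are all function and parameter names. The sketch only shows the sampling mechanics and makes no claim about the resulting privacy guarantee.

```python
import random


def contribution_ratios(relevances):
    """Normalize non-negative relevance scores so that they sum to 1."""
    total = sum(relevances)
    if total <= 0:
        raise ValueError("relevances must have a positive sum")
    return [r / total for r in relevances]


def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    if scale == 0.0:
        return 0.0
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noise_tuple(relevances, sensitivity, epsilon, rng):
    """Return one Laplace sample per neuron.

    Assumption: each neuron's scale is (sensitivity / epsilon) times its
    contribution ratio. The source abstract does not specify this rule.
    """
    base_scale = sensitivity / epsilon
    ratios = contribution_ratios(relevances)
    return [sample_laplace(base_scale * ratio, rng) for ratio in ratios]


# Hypothetical usage: three neurons, the last with zero relevance.
noise = noise_tuple([2.0, 6.0, 0.0], sensitivity=1.0, epsilon=0.5,
                    rng=random.Random(0))
```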


Learning in a Large Function Space: Privacy-Preserving Mechanisms for SVM Learning

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v4i1.612 ◽

2012 ◽

Vol 4 (1) ◽

Cited By ~ 35

Author(s):

Benjamin I. P. Rubinstein ◽

Peter L. Bartlett ◽

Ling Huang ◽

Nina Taft

Keyword(s):

High Probability ◽

Large Scale ◽

Differential Privacy ◽

Feature Space ◽

Privacy Preserving ◽

Training Data ◽

Inner Product ◽

Support Vector ◽

Sensitive Information ◽

Finite Dimensional

The ubiquitous need for analyzing privacy-sensitive information—including health records, personal communications, product ratings and social network data—is driving significant interest in privacy-preserving data analysis across several research communities. This paper explores the release of Support Vector Machine (SVM) classifiers while preserving the privacy of training data. The SVM is a popular machine learning method that maps data to a high-dimensional feature space before learning a linear decision boundary. We present efficient mechanisms for finite-dimensional feature mappings and for (potentially infinite-dimensional) mappings with translation-invariant kernels. In the latter case, our mechanism borrows a technique from large-scale learning to learn in a finite-dimensional feature space whose inner-product uniformly approximates the desired feature space inner-product (the desired kernel) with high probability. Differential privacy is established using algorithmic stability, a property used in learning theory to bound generalization error. Utility—when the private classifier is pointwise close to the non-private classifier with high probability—is proven using smoothness of regularized empirical risk minimization with respect to small perturbations to the feature mapping. Finally we conclude with lower bounds on the differential privacy of any mechanism approximating the SVM.
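
One well-known way to approximate a translation-invariant kernel with a finite-dimensional inner product is random Fourier features. The sketch below assumes that approach and an RBF kernel exp(-gamma * ||x - y||^2); the abstract does not name the technique or the kernel, so both are assumptions, as are the function names. No privacy mechanism is included.

```python
import math
import random


def make_rff_map(dim, num_features, gamma, rng):
    """Build a random feature map z with z(x) . z(y) ~ exp(-gamma * ||x - y||^2)."""
    std = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma * I)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, for comparison with the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical usage: compare the approximate and exact kernel values.
phi = make_rff_map(dim=3, num_features=2000, gamma=0.5, rng=random.Random(0))
x, y = [0.1, 0.2, 0.3], [0.4, 0.0, -0.1]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, 0.5)
```

A linear classifier trained on `phi(x)` then stands in for the kernel classifier, and the approximation error shrinks as `num_features` grows.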


Autonomic Workload Performance Modeling for Large-Scale Databases and Data Warehouses through Deep Belief Network with Data Augmentation using Conditional Generative Adversarial Networks

IEEE Access ◽

10.1109/access.2021.3096039 ◽

2021 ◽

pp. 1-1

Author(s):

Nusrat Shaheen ◽

Basit Raza ◽

Ahmad Raza Shahid ◽

Ahmad Kamran Malik

Keyword(s):

Large Scale ◽

Performance Modeling ◽

Data Augmentation ◽

Deep Belief Network ◽

Generative Adversarial Networks ◽

Data Warehouses ◽

Belief Network ◽

Adversarial Networks


Gravity Control-Based Data Augmentation Technique for Improving VR User Activity Recognition

Symmetry ◽

10.3390/sym13050845 ◽

2021 ◽

Vol 13 (5) ◽

pp. 845

Author(s):

Dongheun Han ◽

Chulwoo Lee ◽

Hyeongyeop Kang

Keyword(s):

Activity Recognition ◽

Large Scale ◽

Data Augmentation ◽

Training Data ◽

Measurement Unit ◽

Gravitational Acceleration ◽

The Neural Network ◽

Typical Data ◽

Robust Recognition ◽

Gravity Acceleration

The neural-network-based human activity recognition (HAR) technique is increasingly used to recognize the activities of virtual reality (VR) users. The major issue with such a technique is the collection of large-scale training datasets, which are key to deriving a robust recognition model. However, collecting large-scale data is a costly and time-consuming process. Furthermore, increasing the number of activities to be classified requires a much larger number of training datasets. Since training the model with a sparse dataset provides only limited features to recognition models, it can cause problems such as overfitting and suboptimal results. In this paper, we present a data augmentation technique named gravity control-based augmentation (GCDA) to alleviate the sparse data problem by generating new training data based on the existing data. The benefit of the symmetrical structure of the data is that it increases the amount of data while preserving the properties of the data. The core concept of GCDA is two-fold: (1) decomposing the acceleration data obtained from the inertial measurement unit (IMU) into zero-gravity acceleration and gravitational acceleration, and augmenting them separately, and (2) exploiting gravity as a directional feature and controlling it to augment training datasets. Through comparative evaluations, we validated that applying GCDA to training datasets showed a larger improvement in classification accuracy (96.39%) than applying typical data augmentation methods (92.29%) or applying no augmentation (85.21%).
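
The decomposition in step (1) can be sketched as follows. The abstract does not say how gravity is separated, so this sketch assumes a simple exponential low-pass filter as the gravity estimate, and it assumes that "controlling" gravity means rotating the gravity component before recombining. The function names, the `alpha` smoothing factor, and the rotation about the x-axis are all hypothetical.

```python
import math


def split_gravity(samples, alpha=0.9):
    """Split (ax, ay, az) samples into gravity and zero-gravity components.

    Assumption: gravity is the low-frequency part of the signal, estimated
    with an exponential moving average.
    """
    gravity, linear = [], []
    g = list(samples[0])
    for s in samples:
        g = [alpha * g_i + (1.0 - alpha) * s_i for g_i, s_i in zip(g, s)]
        gravity.append(tuple(g))
        linear.append(tuple(s_i - g_i for s_i, g_i in zip(s, g)))
    return gravity, linear


def rotate_x(vec, angle):
    """Rotate a 3D vector about the x-axis by angle (radians)."""
    x, y, z = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x, c * y - s * z, s * y + c * z)


def augment(samples, angle, alpha=0.9):
    """Rotate only the gravity component, then recombine with the rest."""
    gravity, linear = split_gravity(samples, alpha)
    return [tuple(g_i + l_i for g_i, l_i in zip(rotate_x(g, angle), l))
            for g, l in zip(gravity, linear)]


# Hypothetical usage: a device at rest, then tilted by 10 degrees in software.
resting = [(0.0, 0.0, 9.81)] * 5
tilted = augment(resting, math.radians(10.0))
```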


UVLens

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies ◽

10.1145/3463495 ◽

2021 ◽

Vol 5 (2) ◽

pp. 1-26

Author(s):

Longbiao Chen ◽

Chenhui Lu ◽

Fangxu Yuan ◽

Zhihan Jiang ◽

Leye Wang ◽

...

Keyword(s):

Large Scale ◽

Floating Population ◽

Open Government ◽

Residential Areas ◽

Rapid Urbanization ◽

Urban Village ◽

Open Government Data ◽

Government Data ◽

Real World Datasets ◽

Urban Villages

Urban villages refer to the residential areas lagging behind the rapid urbanization process in many developing countries. These areas usually have overcrowded buildings, high population density, and low living standards, bringing potential risks to public safety and hindering urban development. Therefore, it is crucial for urban authorities to identify the boundaries of urban villages and estimate their resident and floating populations so as to better renovate and manage these areas. Traditional approaches, such as field surveys and demographic censuses, are time consuming and labor intensive, and lack a comprehensive understanding of urban villages. Against this background, we propose a two-phase framework for urban village boundary identification and population estimation. Specifically, based on heterogeneous open government data, the proposed framework can not only accurately identify the boundaries of urban villages from large-scale satellite imagery by fusing road-network-guided patches with bike-sharing drop-off patterns, but also accurately estimate the resident and floating populations of urban villages with a proposed multi-view neural network model. We evaluate our method on real-world datasets collected from Xiamen Island. Results show that our framework can accurately identify the urban village boundaries with an IoU of 0.827, and estimate the resident population and floating population with R² of 0.92 and 0.94 respectively, outperforming the baseline methods. We also deploy our system on the Xiamen Open Government Data Platform to provide services to both urban authorities and citizens.
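
For readers unfamiliar with the boundary metric reported above, intersection over union (IoU) is the area shared by the predicted and true regions divided by the area covered by either. The sketch below computes it on small binary grids. The grid representation and the function name are illustrative assumptions, not the paper's evaluation code.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two equally sized binary grids."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match by convention here.
    return intersection / union if union else 1.0


# Hypothetical 2x3 grids: 2 cells overlap, 4 cells are covered in total.
predicted = [[1, 1, 0],
             [0, 1, 0]]
actual = [[1, 0, 0],
          [0, 1, 1]]
score = iou(predicted, actual)  # 2 / 4 = 0.5
```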


The information complexity of learning tasks, their structure and their distance

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaaa033 ◽

2021 ◽

Author(s):

Alessandro Achille ◽

Giovanni Paolini ◽

Glen Mbeng ◽

Stefano Soatto

Keyword(s):

Kolmogorov Complexity ◽

Large Scale ◽

Parametric Model ◽

Training Dataset ◽

Optimization Scheme ◽

Learning Tasks ◽

Asymmetric Distance ◽

Special Cases ◽

Scale Models ◽

Real World Datasets

We introduce an asymmetric distance in the space of learning tasks and a framework to compute their complexity. These concepts are foundational for the practice of transfer learning, whereby a parametric model is pre-trained for a task, and then fine-tuned for another. The framework we develop is non-asymptotic, captures the finite nature of the training dataset and allows distinguishing learning from memorization. It encompasses, as special cases, classical notions from Kolmogorov complexity and Shannon and Fisher information. However, unlike some of those frameworks, it can be applied to large-scale models and real-world datasets. Our framework is the first to measure complexity in a way that accounts for the effect of the optimization scheme, which is critical in deep learning.


MetaTP

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies ◽

10.1145/3478083 ◽

2021 ◽

Vol 5 (3) ◽

pp. 1-28

Author(s):

Weida Zhong ◽

Qiuling Suo ◽

Abhishek Gupta ◽

Xiaowei Jia ◽

Chunming Qiao ◽

...

Keyword(s):

Time Series ◽

Large Scale ◽

Multivariate Time Series ◽

Modern Society ◽

Training Data ◽

Traffic Prediction ◽

Temporal Prediction ◽

Reference Space ◽

Meta Learning ◽

Real World Datasets

With the popularity of smartphones, large-scale road sensing data is being collected to perform traffic prediction, which is an important task in modern society. Due to the nature of the roving sensors on smartphones, the collected traffic data, which is in the form of multivariate time series, is often temporally sparse and unevenly distributed across regions. Moreover, different regions can have different traffic patterns, which makes it challenging to adapt models learned from regions with sufficient training data to target regions. Given that many regions may have very sparse data, it is also impossible to build individual models for each region separately. In this paper, we propose a meta-learning based framework named MetaTP to overcome these challenges. MetaTP has two key parts, i.e., a basic traffic prediction network (base model) and meta-knowledge transfer. In the base model, a two-layer interpolation network is employed to map the original time series onto uniformly spaced reference time points, so that temporal prediction can be effectively performed in the reference space. The meta-learning framework is employed to transfer knowledge from source regions with a large amount of data to target regions with few data examples via fast adaptation, in order to improve model generalizability on target regions. Moreover, we use two memory networks to capture the global patterns of spatial and temporal information across regions. We evaluate the proposed framework on two real-world datasets, and experimental results show the effectiveness of the proposed framework.
