A Defense Framework for Privacy Risks in Remote Machine Learning Service

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Yang Bai ◽  
Yu Li ◽  
Mingchuang Xie ◽  
Mingyu Fan

In recent years, machine learning approaches have been widely adopted for many applications, including classification. Machine learning models that deal with collective sensitive data are usually trained on a remote public cloud server, for instance, in a machine learning as a service (MLaaS) system. In this scenario, users upload their local data and utilize the server's computation capability to train models, or users directly query models trained by MLaaS. Unfortunately, recent works reveal that both the curious server (which trains the model on users' sensitive local data and is curious to learn information about individuals) and the malicious MLaaS user (who abuses queries to the MLaaS system) pose privacy risks. The adversarial method, as one typical mitigation, has been studied in several recent works. However, most of them focus on privacy preservation against the malicious user; in other words, they commonly consider the data owner and the model provider as one role. Under this assumption, the privacy leakage risks from the curious server are neglected. Differential privacy methods can defend against privacy threats from both the curious server and the malicious MLaaS user by directly adding noise to the training data. Nonetheless, differential privacy heavily decreases the classification accuracy of the target model. In this work, we propose a generic privacy-preserving framework based on the adversarial method to defend against both the curious server and the malicious MLaaS user. The framework can be adapted to several adversarial algorithms to generate adversarial examples directly from data owners' original data. By doing so, sensitive information about the original data is hidden. Then, we explore the constraint conditions of this framework, which help us find the balance between privacy protection and model utility. The experimental results show that our defense framework with the AdvGAN method is effective against membership inference attacks (MIA) and our defense framework with the FGSM method can protect sensitive data from direct content exposure attacks. In addition, our method achieves a better privacy-utility balance compared to the existing method.
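To make the FGSM variant of the idea concrete, the following is a minimal sketch (in PyTorch) of generating an adversarial perturbation from a data owner's original sample before it leaves the owner's side. The surrogate model, the step size epsilon, and the input range are illustrative assumptions; the paper's framework and constraint conditions are more involved.

```python
# A minimal FGSM sketch: perturb a sample along the sign of the loss gradient
# so the uploaded data no longer matches the original content.
# `model`, `epsilon`, and the [0, 1] input range are assumptions for illustration.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.03):
    """Return x perturbed by one FGSM step of size epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step each input element by epsilon in the direction that increases the loss.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```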

Author(s):  
George Leal Jamil ◽  
Alexis Rocha da Silva

Users' personal, highly sensitive data, such as photos and voice recordings, are kept indefinitely by the companies that collect them. Users can neither delete the data nor restrict the purposes for which they are used. By learning how to do machine learning in a way that protects privacy, we can make a huge difference in solving many social issues, such as curing disease. Deep neural networks are susceptible to various inference attacks because they remember information about their training data. In this chapter, the authors introduce differential privacy, which ensures that various kinds of statistical analysis do not compromise privacy, and federated learning, which trains a machine learning model on data to which we do not have direct access.
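As a small illustration of the differential-privacy idea mentioned here, the sketch below applies the standard Laplace mechanism to a counting query. The dataset, predicate, and epsilon value are illustrative assumptions, not examples from the chapter.

```python
# A minimal Laplace-mechanism sketch: answer a counting query with noise
# calibrated to the query's sensitivity (1 for a count), giving epsilon-DP.
import numpy as np

def private_count(records, predicate, epsilon=1.0):
    """Return a noisy count of records satisfying the predicate."""
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 29, 41, 52, 38]                      # illustrative data
print(private_count(ages, lambda a: a > 35, epsilon=0.5))
```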


Author(s):  
Benjamin I. P. Rubinstein ◽  
Peter L. Bartlett ◽  
Ling Huang ◽  
Nina Taft

The ubiquitous need for analyzing privacy-sensitive information—including health records, personal communications, product ratings and social network data—is driving significant interest in privacy-preserving data analysis across several research communities. This paper explores the release of Support Vector Machine (SVM) classifiers while preserving the privacy of training data. The SVM is a popular machine learning method that maps data to a high-dimensional feature space before learning a linear decision boundary. We present efficient mechanisms for finite-dimensional feature mappings and for (potentially infinite-dimensional) mappings with translation-invariant kernels. In the latter case, our mechanism borrows a technique from large-scale learning to learn in a finite-dimensional feature space whose inner product uniformly approximates the desired feature-space inner product (the desired kernel) with high probability. Differential privacy is established using algorithmic stability, a property used in learning theory to bound generalization error. Utility—the property that the private classifier is pointwise close to the non-private classifier with high probability—is proven using the smoothness of regularized empirical risk minimization with respect to small perturbations of the feature mapping. Finally, we conclude with lower bounds on the differential privacy of any mechanism approximating the SVM.
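For the translation-invariant kernel case, the finite-dimensional approximation described above can be sketched with random Fourier features, assuming an RBF kernel; the paper's actual mechanism additionally adds calibrated noise to obtain differential privacy, which is omitted here. The dimensionality and bandwidth below are illustrative assumptions.

```python
# A minimal random-Fourier-features sketch: map data into a finite-dimensional
# space whose inner products approximate the RBF kernel exp(-gamma * ||x - x'||^2).
import numpy as np

def random_fourier_features(X, n_features=2000, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.randn(10, 3)
Z = random_fourier_features(X)
# Z @ Z.T approximates the desired kernel matrix with high probability.
```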


2020 ◽  
Vol 27 (6) ◽  
pp. 901-907
Author(s):  
Mohamed Abdalla ◽  
Moustafa Abdalla ◽  
Frank Rudzicz ◽  
Graeme Hirst

Objective: In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques). Materials and Methods: We employ a new “random replacement” paradigm (replacing each token in clinical notes with neighboring word vectors from the embedding space) to achieve 100% recall on the removal of sensitive information, which is unachievable with current “search-and-secure” paradigms. We demonstrate the utility of this paradigm on multiple corpora in a diverse set of classification tasks. Results: We empirically evaluate the effect of our anonymization technique on both upstream and downstream natural language processing tasks to show that our perturbations, while increasing security (i.e., achieving 100% recall on any dataset), do not greatly impact the results of end-to-end machine learning approaches. Discussion: As long as current approaches utilize precision and recall to evaluate deidentification algorithms, there will remain a risk of overlooking sensitive information. Inspired by differential privacy, we sought to make it statistically infeasible to recreate the original data, although at the cost of readability. We hope that this work will serve as a catalyst for further research into alternative deidentification methods that can address current weaknesses. Conclusion: Our proposed technique can secure clinical texts at low cost and with extremely high recall, with a readability trade-off, while remaining useful for natural language processing classification tasks. We hope that our work can be used by risk-averse data holders to release clinical texts to researchers.
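A minimal sketch of the "random replacement" idea follows: every token is replaced by one of its nearest neighbours in an embedding space, so no original token survives. The toy vocabulary layout, similarity measure, and neighbourhood size k are illustrative assumptions; the paper's actual embeddings and sampling scheme may differ.

```python
# Replace each token with a randomly chosen near neighbour in embedding space.
# `vocab` is a list of tokens; `vectors` is an array of shape (len(vocab), dim).
import numpy as np

def anonymize(tokens, vocab, vectors, k=5, seed=0):
    rng = np.random.default_rng(seed)
    tok2id = {t: i for i, t in enumerate(vocab)}
    norms = np.linalg.norm(vectors, axis=1)
    out = []
    for tok in tokens:
        v = vectors[tok2id[tok]]
        # Cosine similarity against the whole vocabulary.
        sims = vectors @ v / (norms * np.linalg.norm(v) + 1e-9)
        neighbours = np.argsort(-sims)[1:k + 1]   # skip the token itself
        out.append(vocab[rng.choice(neighbours)])
    return out
```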


2019 ◽  
Vol 11 (3) ◽  
pp. 284 ◽  
Author(s):  
Linglin Zeng ◽  
Shun Hu ◽  
Daxiang Xiang ◽  
Xiang Zhang ◽  
Deren Li ◽  
...  

Soil moisture mapping at a regional scale is commonplace since these data are required in many applications, such as hydrological and agricultural analyses. The use of remotely sensed data for the estimation of deep soil moisture at a regional scale has received far less emphasis. The objective of this study was to map the 500-m, 8-day average and daily soil moisture at different soil depths in Oklahoma from remotely sensed and ground-measured data using the random forest (RF) method, a machine-learning approach. In order to investigate the estimation accuracy of the RF method at both spatial and temporal scales, two independent soil moisture estimation experiments were conducted using data from 2010 to 2014: a year-to-year experiment (with a root mean square error (RMSE) ranging from 0.038 to 0.050 m3/m3) and a station-to-station experiment (with an RMSE ranging from 0.044 to 0.057 m3/m3). Then, the data requirements, importance factors, and spatial and temporal variations in estimation accuracy were discussed based on the results, using training data selected by iterated random sampling. The highly accurate estimates of both surface and deep soil moisture for the study area reveal the potential of RF methods for mapping soil moisture at a regional scale, especially considering the high heterogeneity of land-cover types and topography in the study area.
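A minimal sketch of the RF regression setup is given below, assuming a feature matrix of remotely sensed and ground-measured predictors and soil-moisture targets; the placeholder arrays, number of trees, and features are illustrative assumptions rather than the study's actual configuration.

```python
# Random forest regression for soil moisture: fit, report RMSE (m3/m3),
# and inspect feature importances ("importance factors" in the paper).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X_train, y_train = np.random.rand(1000, 8), np.random.rand(1000)  # placeholders
X_test, y_test = np.random.rand(200, 8), np.random.rand(200)      # placeholders

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))
print(f"RMSE: {rmse:.3f} m3/m3")
print(rf.feature_importances_)   # relative importance of each predictor
```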


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Huu-Thanh Duong ◽  
Tram-Anh Nguyen-Thi

In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires pre-labeled datasets that are large enough in certain domains. Obviously, such datasets are tedious, expensive, and time-consuming to build, and the resulting models handle unseen data poorly. This paper approaches semi-supervised learning for Vietnamese sentiment analysis, which has limited datasets. We summarize many preprocessing techniques performed to clean and normalize the data, including negation handling and intensification handling, to improve performance. Moreover, data augmentation techniques, which generate new data from the original data to enrich the training data without user intervention, are also presented. In the experiments, we evaluated various aspects and obtained competitive results, which may motivate future work.
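To illustrate the data-augmentation idea, the sketch below applies two simple token-level operators (random swap and random deletion) to a sentence. These operators and parameters are illustrative assumptions; the paper's augmentation techniques may differ.

```python
# Simple text augmentation: generate new training sentences from originals.
import random

def random_swap(tokens, n_swaps=1):
    """Swap n_swaps random pairs of tokens."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

sentence = "phim này thật sự rất hay".split()
print(random_swap(sentence), random_deletion(sentence))
```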


Author(s):  
Ahmed Imteaj ◽  
M. Hadi Amini

Federated Learning (FL) is a recently introduced distributed machine learning technique that allows network clients to perform model training at the edge, rather than sharing their raw data with a centralized server. Unlike conventional distributed machine learning approaches, the hallmark feature of FL is that local computation and model generation happen on the client side, ultimately protecting sensitive information. Most existing FL approaches assume that each FL client has sufficient computational resources and can accomplish a given task without facing any resource-related issues. However, if we consider FL in a heterogeneous Internet of Things (IoT) environment, a major portion of the FL clients may face low resource availability (e.g., lower computational power, limited bandwidth, and battery life). Consequently, resource-constrained FL clients may respond very slowly or may be unable to execute the expected number of local iterations. Further, any FL client can inject an inappropriate model during a training phase, which can prolong convergence time and waste the resources of all network clients. In this paper, we propose a novel tri-layer FL scheme, the Federated Proximal, Activity and Resource-Aware Lightweight model (FedPARL), that reduces model size by performing sample-based pruning, avoids misbehaving clients by examining their trust scores, and allows a partial amount of work by considering their resource availability. The pruning mechanism is particularly useful when dealing with resource-constrained FL-based IoT (FL-IoT) clients. In this scenario, the lightweight training model consumes fewer resources to reach a target convergence. We evaluate each interested client's resource availability before assigning a task, monitor their activities, and update their trust scores based on their previous performance. To tackle system and statistical heterogeneity, we adopt a re-parameterization and generalization of the current state-of-the-art Federated Averaging (FedAvg) algorithm. This modification of the FedAvg algorithm allows clients to perform variable or partial amounts of work in accordance with their resource constraints. We demonstrate that simultaneously coupling pruning, resource and activity awareness, and the re-parameterization of the FedAvg algorithm leads to more robust convergence of FL in IoT environments.
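The sketch below shows a FedAvg-style server aggregation step in which clients may have completed different (partial) amounts of local work and low-trust clients are filtered out. The weighting, trust threshold, and update format are illustrative assumptions, not the exact FedPARL algorithm.

```python
# FedAvg-style aggregation with trust filtering and variable client workloads.
import numpy as np

def aggregate(client_updates, trust_threshold=0.5):
    """client_updates: list of (local_weights, n_samples, trust_score)."""
    trusted = [(w, n) for (w, n, t) in client_updates if t >= trust_threshold]
    total = sum(n for _, n in trusted)
    # Sample-weighted average over trusted clients, regardless of how many
    # local iterations each one managed to complete.
    return sum((n / total) * w for w, n in trusted)

updates = [
    (np.ones(4), 100, 0.9),        # well-behaved client, full work
    (2 * np.ones(4), 50, 0.8),     # well-behaved client, partial work
    (np.full(4, 99.0), 80, 0.1),   # low-trust client, excluded
]
print(aggregate(updates))
```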


2021 ◽  
Author(s):  
Jude TCHAYE-KONDI ◽  
Yanlong Zhai ◽  
Liehuang Zhu

We address privacy and latency issues in the edge/cloud computing environment while training a centralized AI model. In our particular case, the edge devices are the only data source for the model trained on the central server. Current solutions for preserving privacy and reducing network latency rely on a pre-trained feature extractor deployed on the devices to extract only important features from the sensitive dataset. However, finding a pre-trained model or a public dataset from which to build a feature extractor for certain tasks may turn out to be very challenging. With the large amount of data generated by edge devices, the edge environment does not really lack data, but improper access to that data may lead to privacy concerns. In this paper, we present DeepGuess, a new privacy-preserving and latency-aware deep learning framework. DeepGuess uses a new learning mechanism enabled by the autoencoder (AE) architecture, called inductive learning, which makes it possible to train a central neural network using the data produced by end devices while preserving their privacy. With inductive learning, sensitive data remain on the devices and are not explicitly involved in any backpropagation process. The AE's encoder is deployed on the devices to extract and transfer important features to the server. To enhance privacy, we propose a new locally differentially private algorithm that allows the edge devices to apply random noise to the features extracted from their sensitive data before they are transferred to an untrusted server. The experimental evaluation of DeepGuess demonstrates its effectiveness and ability to converge across a series of experiments.
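A minimal sketch of the device-side step is given below: extract features with an encoder and add noise before sending them to the untrusted server. The encoder architecture, clipping bound, and noise scale are illustrative assumptions and do not reproduce the paper's calibrated local-DP algorithm.

```python
# Device-side privatization: encode the raw sample, bound the feature norm,
# add Laplace noise, and ship only the noisy features to the server.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))

def privatize_features(x, clip=1.0, noise_scale=0.5):
    with torch.no_grad():
        z = encoder(x)
        # Clip each feature vector's norm so the added noise is meaningful.
        z = z * torch.clamp(clip / z.norm(dim=1, keepdim=True), max=1.0)
        noise = torch.distributions.Laplace(0.0, noise_scale).sample(z.shape)
        return z + noise

features = privatize_features(torch.rand(16, 784))   # sent to the server
```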


Sensors ◽  
2021 ◽  
Vol 21 (20) ◽  
pp. 6743
Author(s):  
Vasiliki Kelli ◽  
Vasileios Argyriou ◽  
Thomas Lagkas ◽  
George Fragulis ◽  
Elisavet Grigoriou ◽  
...  

The Internet of Things (IoT) is a concept adopted in nearly every aspect of human life, leading to an explosive utilization of intelligent devices. Notably, such solutions are especially integrated in the industrial sector, to allow the remote monitoring and control of critical infrastructure. This global integration of IoT solutions has led to an expanded attack surface against IoT-enabled infrastructures. Artificial intelligence and machine learning have demonstrated their ability to resolve issues that would otherwise be impossible or difficult to address; thus, such solutions are closely associated with securing IoT. Classical collaborative and distributed machine learning approaches are known to compromise sensitive information. In our paper, we demonstrate the creation of a network flow-based Intrusion Detection System (IDS) aiming to protect critical infrastructures, stemming from the pairing of two machine learning techniques, namely, federated learning and active learning. The former is utilized for privately training models in federation, while the latter is a semi-supervised approach applied for global model adaptation to each participant's traffic. Experimental results indicate that global models perform significantly better for each participant when locally personalized with just a few active learning queries. Specifically, we demonstrate how the accuracy increase can reach 7.07% in only 10 queries.
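To illustrate the active-learning side of the pairing, the sketch below picks the local flows the global model is least certain about, so labels are requested only for those. The margin-based uncertainty measure, feature shapes, and query budget are illustrative assumptions.

```python
# Uncertainty sampling: choose the flows with the smallest gap between the
# top two predicted class probabilities and query labels only for those.
import numpy as np

def uncertainty_query(probabilities, budget=10):
    """probabilities: (n_flows, n_classes) predicted by the global model."""
    sorted_p = np.sort(probabilities, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]     # small margin = uncertain
    return np.argsort(margin)[:budget]

probs = np.random.dirichlet(np.ones(5), size=1000)  # placeholder predictions
to_label = uncertainty_query(probs, budget=10)       # indices sent for labeling
print(to_label)
```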


2020 ◽  
Author(s):  
Paul Francoeur ◽  
Tomohide Masuda ◽  
David R. Koes

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there does not exist a standard dataset of sufficient size to compare performance between models. We present a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network models on this dataset. We also demonstrate how the partitioning of the training and test data can impact the results of models trained with the PDBbind dataset, how performance improves by adding more, lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of five densely connected convolutional networks, achieves a root mean squared error of 1.42 and a Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized dataset for training machine learning models to recognize ligands in non-cognate target structures, while also greatly expanding the number of poses available for training. To facilitate community adoption of this dataset for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
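For readers reproducing the affinity-prediction metrics, a minimal scoring sketch is shown below; the prediction and label arrays are placeholders, not CrossDocked2020 results.

```python
# Compute the two affinity-prediction metrics reported above: RMSE and Pearson R.
import numpy as np
from scipy.stats import pearsonr

y_true = np.array([6.1, 4.8, 7.3, 5.5])   # experimental affinities (placeholder)
y_pred = np.array([5.9, 5.2, 6.8, 5.7])   # model predictions (placeholder)

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
r, _ = pearsonr(y_true, y_pred)
print(f"RMSE = {rmse:.2f}, Pearson R = {r:.3f}")
```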


2019 ◽  
Vol 1 (1) ◽  
pp. 483-491 ◽  
Author(s):  
Makhamisa Senekane

The ubiquity of data, including multimedia data such as images, enables easy mining and analysis of such data. However, such an analysis might involve the use of sensitive data such as medical records (including radiological images) and financial records. Privacy-preserving machine learning is an approach aimed at analyzing such data in a way that does not compromise privacy. There are various privacy-preserving data analysis approaches, such as k-anonymity, l-diversity, t-closeness, and Differential Privacy (DP). Currently, DP is the gold standard of privacy-preserving data analysis due to its robustness against background knowledge attacks. In this paper, we report a scheme for privacy-preserving image classification using a Support Vector Machine (SVM) and DP. SVM is chosen as the classification algorithm because, unlike variants of artificial neural networks, it converges to a global optimum. The SVM kernels used are linear and the Radial Basis Function (RBF), while ϵ-differential privacy is the DP framework used. The proposed scheme achieved an accuracy of up to 98%. The results obtained underline the utility of using SVM and DP for privacy-preserving image classification.
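One common way to combine a linear SVM with DP is output perturbation: train the classifier, then add noise to its parameters before release. The sketch below illustrates that pattern with scikit-learn; the noise scale is a simplified stand-in and does not reproduce the sensitivity analysis a rigorous ϵ-DP guarantee (or the paper's exact scheme) requires.

```python
# Output-perturbation sketch: fit a linear SVM, then add Laplace noise to its
# weights and intercept before releasing the classifier.
import numpy as np
from sklearn.svm import LinearSVC

X, y = np.random.rand(200, 10), np.random.randint(0, 2, 200)   # placeholder data
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)

epsilon = 1.0
scale = 2.0 / (epsilon * len(X))    # illustrative stand-in for the true sensitivity
clf.coef_ += np.random.laplace(0.0, scale, size=clf.coef_.shape)
clf.intercept_ += np.random.laplace(0.0, scale, size=clf.intercept_.shape)
print(clf.predict(X[:5]))           # predictions from the released private model
```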

