Federated Model Distillation with Noise-Free Differential Privacy

Author(s):  
Lichao Sun ◽  
Lingjuan Lyu

Conventional federated learning directly averages model weights, which is only possible for collaboration between models with homogeneous architectures. Sharing predictions instead of weights removes this obstacle and eliminates the risk of white-box inference attacks in conventional federated learning. However, the predictions from local models are sensitive and would leak training data privacy to the public. To address this issue, one naive approach is to add differentially private random noise to the predictions, which, however, brings a substantial trade-off between the privacy budget and model performance. In this paper, we propose a novel framework called FEDMD-NFDP, which applies a Noise-Free Differential Privacy (NFDP) mechanism to a federated model distillation framework. Our extensive experimental results on various datasets validate that FEDMD-NFDP can deliver not only comparable utility and communication efficiency but also a noise-free differential privacy guarantee. We also demonstrate the feasibility of our FEDMD-NFDP by considering both IID and non-IID settings, heterogeneous model architectures, and unlabelled public datasets from a different distribution.
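To make the contrast concrete, the sketch below (illustrative only, not the authors' implementation) compares the naive noisy-prediction baseline with an aggregation that shares predictions unperturbed and relies on subsampling for privacy; the array names, shapes, and noise scale are all assumptions.

```python
# Illustrative sketch of prediction sharing in federated model distillation.
# Array names, shapes, and the noise scale are assumptions, not the FEDMD-NFDP code.
import numpy as np

rng = np.random.default_rng(0)
n_clients, n_public, n_classes = 5, 100, 10

# each client's soft predictions on a shared public dataset
client_preds = rng.random((n_clients, n_public, n_classes))

# naive DP baseline: perturb every client's predictions before sharing
sigma = 0.5  # noise scale calibrated to some target (epsilon, delta)
noisy_consensus = (client_preds + rng.normal(0.0, sigma, client_preds.shape)).mean(axis=0)

# noise-free alternative: predictions are shared as-is; the privacy guarantee
# instead comes from each client training on a random subsample of its private data
consensus = client_preds.mean(axis=0)

# the consensus would then be distilled back into each client's local model
print("mean perturbation introduced by the noisy baseline:",
      float(np.abs(noisy_consensus - consensus).mean()))
```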

Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1688
Author(s):  
Luqman Ali ◽  
Fady Alnajjar ◽  
Hamad Al Jassmi ◽  
Munkhjargal Gochoo ◽  
Wasif Khan ◽  
...  

This paper proposes a customized convolutional neural network for crack detection in concrete structures. The proposed method is compared to four existing deep learning methods based on training data size, data heterogeneity, network complexity, and the number of epochs. The performance of the proposed convolutional neural network (CNN) model is evaluated and compared to pretrained networks, i.e., the VGG-16, VGG-19, ResNet-50, and Inception V3 models, on eight datasets of different sizes created from two public datasets. For each model, the evaluation considered computational time, crack localization results, and classification measures, e.g., accuracy, precision, recall, and F1-score. Experimental results demonstrated that training data size and heterogeneity among data samples significantly affect model performance. All models demonstrated promising performance on a small amount of diverse training data; however, increasing the training data size while reducing its diversity lowered generalization performance and led to overfitting. The proposed customized CNN and VGG-16 models outperformed the other methods in terms of classification, localization, and computational time on a small amount of data, and the results indicate that these two models are superior for crack detection and localization in concrete structures.
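As a starting point for readers, a minimal transfer-learning setup with a frozen VGG-16 backbone (one of the pretrained networks compared above) might look as follows; the head layers, input size, and hyperparameters are assumptions rather than the paper's configuration.

```python
# A minimal transfer-learning sketch with a frozen VGG-16 backbone for binary
# crack / no-crack classification; head layers and hyperparameters are assumptions.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # crack vs. no crack
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])

# train_ds / val_ds would be tf.data.Dataset objects of labelled image patches:
# model.fit(train_ds, validation_data=val_ds, epochs=20)
```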


2020 ◽  
Vol 34 (01) ◽  
pp. 784-791 ◽  
Author(s):  
Qinbin Li ◽  
Zhaomin Wu ◽  
Zeyi Wen ◽  
Bingsheng He

The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to obtain a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the properties of gradients and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of the training data in each iteration and to clip leaf node values in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach achieves much better model accuracy than other baselines.
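The two core ingredients, clipping gradients to bound sensitivity and adding calibrated noise to leaf values, can be sketched as below; this is a simplified illustration with assumed parameters, not the paper's full algorithm with adaptive budget allocation across trees.

```python
# Simplified illustration of bounding sensitivity via clipping and adding
# Laplace noise to a leaf value; not the paper's full algorithm.
import numpy as np

def dp_leaf_value(residuals, clip=1.0, epsilon=0.5, rng=None):
    """Differentially private leaf value for one boosting iteration.

    Clipping each residual (gradient) to [-clip, clip] bounds the sensitivity
    of the leaf sum by `clip`, so Laplace noise with scale clip / epsilon
    suffices for epsilon-DP on this leaf."""
    rng = rng or np.random.default_rng()
    g = np.clip(residuals, -clip, clip)
    noisy_sum = g.sum() + rng.laplace(scale=clip / epsilon)
    return noisy_sum / max(len(g), 1)

# residuals of the training samples routed to one leaf
print(dp_leaf_value(np.array([0.8, -2.3, 0.1, 1.7]), clip=1.0, epsilon=0.5))
```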


2021 ◽  
Vol 33 (5) ◽  
pp. 83-104
Author(s):  
Aleksandr Igorevich Getman ◽  
Maxim Nikolaevich Goryunov ◽  
Andrey Georgievich Matskevich ◽  
Dmitry Aleksandrovich Rybolovlev

The paper discusses the issues of training models for detecting computer attacks using machine learning methods. The results of an analysis of publicly available training datasets and of tools for analyzing network traffic and extracting features of network sessions are presented sequentially. The drawbacks of existing tools and possible errors in the datasets formed with their help are noted. It is concluded that collecting one's own training data is necessary when there are no guarantees of the reliability of public datasets, and that pre-trained models are of limited use in networks whose characteristics differ from those of the network in which the training traffic was collected. A practical approach to generating training data for computer attack detection models is proposed. The proposed solutions have been tested to evaluate the quality of model training on the collected data and the quality of attack detection under the conditions of a real network infrastructure.
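As a rough illustration of how captured traffic can be turned into per-session training records, the sketch below groups packets into flows and computes a few simple features; the capture file name, feature set, and labelling step are hypothetical and not the paper's pipeline.

```python
# Rough sketch of turning a traffic capture into per-flow training records with
# scapy; the capture file name, feature set, and labelling step are hypothetical.
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP, UDP

flows = defaultdict(list)
for pkt in rdpcap("train.pcap"):          # assumed local capture file
    if not pkt.haslayer(IP):
        continue
    proto = "TCP" if pkt.haslayer(TCP) else "UDP" if pkt.haslayer(UDP) else "OTHER"
    sport = pkt.sport if hasattr(pkt, "sport") else 0
    dport = pkt.dport if hasattr(pkt, "dport") else 0
    flows[(pkt[IP].src, pkt[IP].dst, sport, dport, proto)].append(pkt)

records = []
for key, pkts in flows.items():
    times = [float(p.time) for p in pkts]
    records.append({
        "flow": key,
        "n_packets": len(pkts),
        "total_bytes": sum(len(p) for p in pkts),
        "duration": max(times) - min(times),
        "label": None,  # assigned later from the attack schedule of the test bed
    })
```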


2021 ◽  
Vol 2022 (1) ◽  
pp. 460-480
Author(s):  
Bogdan Kulynych ◽  
Mohammad Yaghini ◽  
Giovanni Cherubin ◽  
Michael Veale ◽  
Carmela Troncoso

Abstract A membership inference attack (MIA) against a machine-learning model enables an attacker to determine whether a given data record was part of the model’s training data or not. In this paper, we provide an in-depth study of the phenomenon of disparate vulnerability against MIAs: the unequal success rate of MIAs against different population subgroups. We first establish necessary and sufficient conditions for MIAs to be prevented, both on average and for population subgroups, using a notion of distributional generalization. Second, we derive connections of disparate vulnerability to algorithmic fairness and to differential privacy. We show that fairness can only prevent disparate vulnerability against limited classes of adversaries. Differential privacy bounds disparate vulnerability but can significantly reduce the accuracy of the model. We show that estimating disparate vulnerability by naïvely applying existing attacks can lead to overestimation. We then establish which attacks are suitable for estimating disparate vulnerability, and provide a statistical framework for doing so reliably. We conduct experiments on synthetic and real-world data, finding significant evidence of disparate vulnerability in realistic settings.
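A minimal way to see disparate vulnerability is to run a simple loss-threshold MIA and compare its accuracy across subgroups, as in the toy sketch below; the data, threshold choice, and subgroup structure are assumptions, and the paper's statistical framework is considerably more careful about estimation.

```python
# Toy sketch: a loss-threshold membership inference attack evaluated per
# subgroup; data, threshold choice, and subgroup structure are assumptions.
import numpy as np

def threshold_mia(loss, threshold):
    """Predict 'member' (1) when the per-example loss is below the threshold."""
    return (loss < threshold).astype(int)

def subgroup_vulnerability(loss, is_member, group):
    """Attack accuracy per subgroup; large gaps indicate disparate vulnerability."""
    preds = threshold_mia(loss, threshold=np.median(loss))
    return {int(g): float((preds[group == g] == is_member[group == g]).mean())
            for g in np.unique(group)}

# synthetic example: members have lower loss; subgroup 1 is memorized more strongly
rng = np.random.default_rng(0)
is_member = rng.integers(0, 2, 2000)
group = rng.integers(0, 2, 2000)
loss = rng.exponential(1.0, 2000) - 0.4 * is_member * (group + 1)
print(subgroup_vulnerability(loss, is_member, group))
```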


2021 ◽  
Author(s):  
Ali Hatamizadeh ◽  
Hongxu Yin ◽  
Pavlo Molchanov ◽  
Andriy Myronenko ◽  
Wenqi Li ◽  
...  

Abstract Federated learning (FL) allows the collaborative training of AI models without needing to share raw data. This capability makes it especially interesting for healthcare applications where patient and data privacy is of utmost concern. However, recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data. In this work, we show that these attacks presented in the literature are impractical in real FL use-cases and provide a new baseline attack that works for more realistic scenarios where the clients’ training involves updating the Batch Normalization (BN) statistics. Furthermore, we present new ways to measure and visualize potential data leakage in FL. Our work is a step towards establishing reproducible methods of measuring data leakage in FL and could help determine the optimal tradeoffs between privacy-preserving techniques, such as differential privacy, and model accuracy based on quantifiable metrics.
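For intuition, gradient-inversion attacks of this family typically optimize a dummy input so that its gradients match the gradients observed from a client. The toy sketch below shows that idea on a small model and is not the authors' baseline attack, which additionally exploits the updated Batch Normalization statistics.

```python
# Toy gradient-matching sketch in the spirit of gradient-inversion attacks:
# optimize a dummy input so its gradients match the observed client gradients.
# Illustrative only; not the authors' attack.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))
criterion = nn.CrossEntropyLoss()

# gradients "observed" by the server for one private example
x_true, y_true = torch.randn(1, 32), torch.tensor([2])
true_grads = [g.detach() for g in
              torch.autograd.grad(criterion(model(x_true), y_true),
                                  model.parameters())]

# the attacker reconstructs the input (label assumed known for simplicity)
x_dummy = torch.randn(1, 32, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    dummy_grads = torch.autograd.grad(criterion(model(x_dummy), y_true),
                                      model.parameters(), create_graph=True)
    match = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    match.backward()
    opt.step()

print("reconstruction MSE:", float(((x_dummy - x_true) ** 2).mean()))
```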


2019 ◽  
Vol 6 (1) ◽  
pp. 205395171984878
Author(s):  
Luke Munn ◽  
Tsvetelina Hristova ◽  
Liam Magee

Personal data is highly vulnerable to security exploits, spurring moves to lock it down through encryption, to cryptographically ‘cloud’ it. But personal data is also highly valuable to corporations and states, triggering moves to unlock its insights by relocating it in the cloud. We characterise this twinned condition as ‘clouded data’. Clouded data constructs a political and technological notion of privacy that operates through the intersection of corporate power, computational resources and the ability to obfuscate, gain insights from and valorise a dependency between public and private. First, we survey prominent clouded data approaches (blockchain, multiparty computation, differential privacy, and homomorphic encryption), suggesting their particular affordances produce distinctive versions of privacy. Next, we perform two notional code-based experiments using synthetic datasets. In the field of health, we submit a patient’s blood pressure to a notional cloud-based diagnostics service; in education, we construct a student survey that enables aggregate reporting without individual identification. We argue that these technical affordances legitimate new political claims to capture and commodify personal data. The final section broadens the discussion to consider the political force of clouded data and its reconstitution of traditional notions such as the public and the private.
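The second notional experiment, aggregate reporting without individual identification, is commonly realized with a differentially private count; a minimal Laplace-mechanism sketch with assumed survey values and privacy budget is shown below.

```python
# Minimal Laplace-mechanism sketch for a privacy-preserving survey aggregate;
# the responses and privacy budget are assumed values, not the article's experiment.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=500)   # 1 = "agrees", 0 = "disagrees"

epsilon = 1.0                              # privacy budget (assumption)
sensitivity = 1                            # one student changes the count by at most 1
noisy_count = responses.sum() + rng.laplace(scale=sensitivity / epsilon)

print(f"true count: {responses.sum()}, reported count: {noisy_count:.1f}")
```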


Author(s):  
George Leal Jamil ◽  
Alexis Rocha da Silva

Users' personal, highly sensitive data, such as photos and voice recordings, are kept indefinitely by the companies that collect them, and users can neither delete these data nor restrict the purposes for which they are used. By learning how to do machine learning in a way that protects privacy, we can make a real difference in solving many social issues, such as curing disease. Deep neural networks are susceptible to various inference attacks because they memorize information about their training data. In this chapter, the authors introduce differential privacy, which ensures that different kinds of statistical analysis do not compromise privacy, and federated learning, which trains a machine learning model on data to which we do not have direct access.
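To illustrate the federated learning idea of training on data we never see directly, here is a minimal federated-averaging sketch; the toy data, model, and hyperparameters are assumptions, not the chapter's code.

```python
# Minimal federated-averaging sketch (illustrative, not the chapter's code):
# each client runs local SGD on its private data; only weights reach the server.
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """A few epochs of logistic-regression gradient descent on one client's data."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 5)), rng.integers(0, 2, 50)) for _ in range(4)]
w_global = np.zeros(5)

for _ in range(10):                       # communication rounds
    updates = [local_sgd(w_global.copy(), X, y) for X, y in clients]
    w_global = np.mean(updates, axis=0)   # FedAvg: average the client weights

print("global weights after 10 rounds:", np.round(w_global, 3))
```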


Author(s):  
Kyoohyung Han ◽  
Seungwan Hong ◽  
Jung Hee Cheon ◽  
Daejun Park

Machine learning on (homomorphically) encrypted data is a cryptographic method for analyzing private and/or sensitive data while preserving privacy. In the training phase, it takes encrypted training data as input and outputs an encrypted model without ever decrypting. In the prediction phase, it uses the encrypted model to predict results on new encrypted data. In neither phase is a decryption key needed, and thus data privacy is ultimately guaranteed. It has many applications in areas such as finance, education, genomics, and medicine that involve sensitive private data. While several studies have been reported on the prediction phase, few studies have been conducted on the training phase. In this paper, we present an efficient algorithm for logistic regression on homomorphically encrypted data, and evaluate our algorithm on real financial data consisting of 422,108 samples over 200 features. Our experiments show that an encrypted model with a sufficient Kolmogorov–Smirnov statistic value can be obtained in ∼17 hours on a single machine. We also evaluate our algorithm on the public MNIST dataset, where it takes ∼2 hours to learn an encrypted model with 96.4% accuracy. Considering the inefficiency of homomorphic encryption, our result is encouraging and demonstrates, for the first time to the best of our knowledge, the practical feasibility of logistic regression training on large encrypted data.
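Homomorphic schemes support only additions and multiplications, so a standard trick in this line of work is to replace the sigmoid with a low-degree polynomial. The plaintext sketch below shows that surrogate-training idea; the data, coefficients, and learning rate are assumptions, and the paper's algorithm performs these steps on ciphertexts rather than in the clear.

```python
# Plaintext sketch of the HE-friendly trick used for encrypted logistic
# regression: replace the sigmoid with a low-degree polynomial so training needs
# only additions and multiplications. Everything below runs in the clear for
# illustration; the real algorithm operates on ciphertexts.
import numpy as np

def poly_sigmoid(z):
    # a commonly used degree-3 approximation of the sigmoid on a bounded range
    return 0.5 + 0.15012 * z - 0.001593 * z ** 3

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = 0.5 * rng.normal(size=10)
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(10)
for _ in range(30):                     # gradient descent with the polynomial surrogate
    p = poly_sigmoid(X @ w)             # inputs are scaled so X @ w stays in range
    w -= 0.1 * X.T @ (p - y) / len(y)

accuracy = float(((X @ w > 0) == (y == 1)).mean())
print(f"accuracy with polynomial-sigmoid training: {accuracy:.3f}")
```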


Diagnostics ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. 2288
Author(s):  
Kaixiang Su ◽  
Jiao Wu ◽  
Dongxiao Gu ◽  
Shanlin Yang ◽  
Shuyuan Deng ◽  
...  

Increasingly, machine learning methods have been applied to aid in diagnosis, with good results. However, some complex models can confuse physicians because they are difficult to understand, while data differences across diagnostic tasks and institutions can cause fluctuations in model performance. To address this challenge, we combined the Deep Ensemble Model (DEM) and the Tree-structured Parzen Estimator (TPE) and proposed an adaptive deep ensemble learning method (TPE-DEM) for dynamically evolving diagnostic task scenarios. Unlike previous research that focuses on achieving better performance with a fixed-structure model, our proposed model uses TPE to efficiently aggregate simple models that are more easily understood by physicians and that require less training data. In addition, our proposed model can choose the optimal number of layers and the type and number of base learners to achieve the best performance in different diagnostic task scenarios, based on the data distribution and characteristics of the current diagnostic task. We tested our model on one dataset constructed with a partner hospital and on five UCI public datasets with different characteristics and volumes, covering various diagnostic tasks. Our performance evaluation results show that our proposed model outperforms the other baseline models on different datasets. Our study provides a novel approach for building simple and understandable machine learning models in tasks with variable datasets and feature sets, and the findings have important implications for the application of machine learning models in computer-aided diagnosis.
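One way to prototype TPE-driven ensemble selection is with the hyperopt library, letting TPE choose the number and type of base learners for a stacking ensemble, as sketched below; the search space, dataset, and scoring are assumptions and not the TPE-DEM implementation.

```python
# Prototype of TPE-driven ensemble selection using hyperopt: TPE chooses how
# many base trees to stack and whether to add an SVM. The search space,
# dataset, and scoring are assumptions, not the TPE-DEM implementation.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

space = {
    "n_trees": hp.quniform("n_trees", 2, 6, 1),
    "use_svm": hp.choice("use_svm", [False, True]),
    "C": hp.loguniform("C", -3, 3),
}

def objective(params):
    estimators = [(f"tree{i}", DecisionTreeClassifier(max_depth=i + 2))
                  for i in range(int(params["n_trees"]))]
    if params["use_svm"]:
        estimators.append(("svm", SVC(C=params["C"], probability=True)))
    ensemble = StackingClassifier(estimators=estimators,
                                  final_estimator=LogisticRegression(max_iter=1000))
    score = cross_val_score(ensemble, X, y, cv=3).mean()
    return {"loss": -score, "status": STATUS_OK}

best = fmin(objective, space, algo=tpe.suggest, max_evals=20, trials=Trials())
print("best configuration found by TPE:", best)
```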

