Differentially Private and Fair Classification via Calibrated Functional Mechanism

2020 ◽  
Vol 34 (01) ◽  
pp. 622-629
Author(s):  
Jiahao Ding ◽  
Xinyue Zhang ◽  
Xiaohuan Li ◽  
Junyi Wang ◽  
Rong Yu ◽  
...  

Machine learning is increasingly becoming a powerful tool for making decisions in a wide variety of applications, such as medical diagnosis and autonomous driving. Privacy concerns related to the training data and unfair behaviors of some decisions with regard to certain attributes (e.g., sex, race) are becoming more critical. Thus, constructing a fair machine learning model while simultaneously providing privacy protection becomes a challenging problem. In this paper, we focus on the design of a classification model with fairness and differential privacy guarantees by combining the functional mechanism with decision boundary fairness. To enforce ϵ-differential privacy and fairness, we leverage the functional mechanism to add different amounts of Laplace noise, depending on the attribute, to the polynomial coefficients of the objective function under the fairness constraint. We further propose a utility-enhancement scheme, called the relaxed functional mechanism, which adds Gaussian noise instead of Laplace noise and hence achieves (ϵ, δ)-differential privacy. Based on the relaxed functional mechanism, we design an (ϵ, δ)-differentially private and fair classification model. Moreover, our theoretical analysis and empirical results demonstrate that both approaches achieve fairness and differential privacy while preserving good utility, and that they outperform state-of-the-art algorithms.
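The contrast between Laplace and Gaussian perturbation of objective-function coefficients can be illustrated with a minimal sketch. It is not the authors' calibrated mechanism: the function names are hypothetical, a single noise scale is applied to every coefficient (the paper varies it by attribute), and the Gaussian scale uses the classical bound, which assumes 0 < ϵ < 1.

```python
import math
import random


def laplace_scale(l1_sensitivity, epsilon):
    """Laplace noise scale b = sensitivity / epsilon (pure epsilon-DP)."""
    return l1_sensitivity / epsilon


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism sigma; assumes 0 < epsilon < 1."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def perturb(coefficients, scale, rng, kind="laplace"):
    """Add independent noise to each polynomial coefficient (hypothetical helper)."""
    noisy = []
    for c in coefficients:
        if kind == "laplace":
            # The difference of two Exp(rate=1/scale) draws is Laplace(0, scale).
            noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        else:
            noise = rng.gauss(0.0, scale)
        noisy.append(c + noise)
    return noisy


rng = random.Random(0)
noisy_laplace = perturb([1.0, -0.5, 0.25], laplace_scale(2.0, 0.5), rng)
noisy_gauss = perturb([1.0, -0.5, 0.25], gaussian_sigma(1.0, 0.5, 1e-5), rng, "gaussian")
```

The model would then be fitted by minimizing the objective built from the noisy coefficients rather than the exact ones.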

2020 ◽  
Vol 34 (01) ◽  
pp. 784-791 ◽  
Author(s):  
Qinbin Li ◽  
Zhaomin Wu ◽  
Zeyi Wen ◽  
Bingsheng He

The Gradient Boosting Decision Tree (GBDT) has become a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to reach a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the property of gradients and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of the training data in each iteration and to clip leaf nodes in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach achieves much better model accuracy than other baselines.
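Why clipping tightens sensitivity can be shown in a few lines. This is a generic sketch, not the paper's algorithm: `clip_gradients` and `even_budget_split` are hypothetical names, and the even split is the naive sequential-composition baseline, not the boosting framework the authors propose.

```python
def clip_gradients(gradients, bound):
    """Clamp each per-record gradient to [-bound, bound].

    After clipping, adding or removing one record changes the sum of
    gradients by at most `bound`, which caps the sensitivity.
    """
    return [max(-bound, min(bound, g)) for g in gradients]


def even_budget_split(total_epsilon, n_trees):
    """Naive baseline: under sequential composition the per-tree budgets add up,
    so each of n_trees trees receives total_epsilon / n_trees."""
    return [total_epsilon / n_trees] * n_trees


clipped = clip_gradients([-3.0, 0.5, 2.0], bound=1.0)
budgets = even_budget_split(1.0, 4)
```

With an even split the per-tree budget shrinks linearly in the number of trees, which is the accuracy problem the abstract attributes to ineffective allocations.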


Author(s):  
George Leal Jamil ◽  
Alexis Rocha da Silva

Users' personal, highly sensitive data such as photos and voice recordings are kept indefinitely by the companies that collect them. Users can neither delete the data nor restrict the purposes for which they are used. By learning how to do machine learning in a way that protects privacy, we can make a huge difference in addressing many social issues, such as curing disease. Deep neural networks are susceptible to various inference attacks because they remember information about their training data. In this chapter, the authors introduce differential privacy, which ensures that different kinds of statistical analysis do not compromise privacy, and federated learning, which trains a machine learning model on data to which we do not have access.


2021 ◽  
Vol 14 (13) ◽  
pp. 3335-3347
Author(s):  
Daniel Bernau ◽  
Günther Eibl ◽  
Philip W. Grassal ◽  
Hannah Keller ◽  
Florian Kerschbaum

Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters (ϵ, δ). Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the theoretical upper bound on privacy loss (ϵ, δ) might be loose, depending on the chosen sensitivity and data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which (ϵ, δ) are only indirectly related. We transform (ϵ, δ) to a bound on the Bayesian posterior belief of the adversary assumed by differential privacy concerning the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical (ϵ, δ).
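The link between ϵ and an adversary's posterior belief can be sketched for the simplest case. The sketch assumes pure ϵ-differential privacy (δ is ignored) and a single release; the function name and the `prior` parameter are hypothetical, and the paper's own bound covers (ϵ, δ) and composition. Because ϵ-DP bounds the likelihood ratio between neighbouring datasets by e^ϵ, the posterior odds are at most the prior odds times e^ϵ.

```python
import math


def posterior_belief_bound(epsilon, prior=0.5):
    """Upper bound on the adversary's posterior belief that a record is present,
    assuming pure epsilon-DP and the given prior belief."""
    odds = prior / (1.0 - prior) * math.exp(epsilon)
    return odds / (1.0 + odds)


no_leakage = posterior_belief_bound(0.0)
moderate = posterior_belief_bound(math.log(3.0))
```

With a uniform prior this reduces to 1 / (1 + e^(-ϵ)), so ϵ = 0 leaves the adversary at 0.5 and ϵ = ln 3 caps the belief at 0.75.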


2022 ◽  
Author(s):  
Maxat Kulmanov ◽  
Robert Hoehndorf

Motivation: Protein functions are often described using the Gene Ontology (GO), which is an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes which have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero


Author(s):  
Xianping Du ◽  
Onur Bilgen ◽  
Hongyi Xu

Abstract Machine learning for classification has been used widely in engineering design, for example, in feasible domain recognition and hidden pattern discovery. Training an accurate machine learning model requires a large dataset; however, high computational or experimental costs are major issues in obtaining a large dataset for real-world problems. One possible solution is to generate a large pseudo dataset with surrogate models, which are established with a smaller set of real training data. However, it is not well understood whether the pseudo dataset benefits the classification model by providing more information or degrades the machine learning performance due to the prediction errors and uncertainties introduced by the surrogate model. This paper presents a preliminary investigation of this research question. A classification-and-regression-tree model is employed to recognize the design subspaces to support design decision-making. It is implemented on the geometric design of a vehicle energy-absorbing structure based on finite element simulations. Based on a small set of real-world data obtained by simulations, a surrogate model based on Gaussian process regression is employed to generate pseudo datasets for training. The results showed that the tree-based method could help recognize feasible design domains efficiently. Furthermore, the additional information provided by the surrogate model enhances the accuracy of classification. One important conclusion is that the accuracy of the surrogate model determines the quality of the pseudo dataset and, hence, the improvements in the machine learning model.


2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.


2022 ◽  
pp. 1559-1575
Author(s):  
Mário Pereira Véstias

Machine learning is the study of algorithms and models for computing systems to do tasks based on pattern identification and inference. When it is difficult or infeasible to develop an algorithm to do a particular task, machine learning algorithms can provide an output based on previous training data. A well-known machine learning model is deep learning. The most recent deep learning models are based on artificial neural networks (ANN). There exist several types of artificial neural networks including the feedforward neural network, the Kohonen self-organizing neural network, the recurrent neural network, the convolutional neural network, the modular neural network, among others. This article focuses on convolutional neural networks with a description of the model, the training and inference processes and its applicability. It will also give an overview of the most used CNN models and what to expect from the next generation of CNN models.


2020 ◽  
Vol 35 (Supplement_3) ◽  
Author(s):  
Jerry Yu ◽  
Andrew Long ◽  
Maria Hanson ◽  
Aleetha Ellis ◽  
Michael Macarthur ◽  
...  

Abstract Background and Aims There are many benefits to performing dialysis at home, including more flexibility and more frequent treatments. A possible barrier to election of home therapy (HT) by in-center patients is a lack of adequate HT education. To aid efficient education efforts, a predictive model was developed to help identify patients who are more likely to switch from in-center and succeed on HT. Method We developed a model using machine learning to predict which patients who are treated in-center without prior HT history are most likely to switch to HT in the next 90 days and stay on HT for at least 90 days. Training data was extracted from 2016–2019 for approximately 300,000 patients. We randomly sampled one in-center treatment date per patient and determined if the patient would switch and succeed on HT. The input features consisted of treatment vitals, laboratories, absence history, comprehensive assessments, facility information, county-level housing, and patient characteristics. Patients were excluded if they had fewer than 30 days on dialysis due to lack of data. A machine learning model (XGBoost classifier) was deployed monthly in a pilot with a team of HT educators to investigate the model’s utility for identifying HT candidates. Results There were approximately 1,200 patients starting a home therapy per month in a large dialysis provider, with approximately one-third being in-center patients. The prevalence of switching and succeeding to HT in this population was 2.54%. The predictive model achieved an area under the curve of 0.87, a sensitivity of 0.77, and a specificity of 0.80 on a hold-out test dataset. The pilot was successfully executed for several months and two major lessons were learned: 1) some patients who reappeared on each month’s list should be removed from the list after expressing no interest in HT, and 2) a data collection mechanism should be put in place to capture the reasons why patients are not interested in HT.
Conclusion This quality-improvement initiative demonstrates that predictive modeling can be used to identify patients likely to switch and succeed on home therapy. Integration of the model in existing workflows requires creating a feedback loop which can help improve future worklists.
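How the reported sensitivity and specificity interact with a low prevalence can be worked through with Bayes' rule. The sketch below is illustrative arithmetic only: the positive predictive value is not reported in the abstract, the function name is hypothetical, and the calculation assumes the hold-out set shares the stated 2.54% prevalence.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of flagged patients who are true positives, by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Figures taken from the abstract; the prevalence assumption is ours.
ppv = positive_predictive_value(prevalence=0.0254, sensitivity=0.77, specificity=0.80)
```

Under these assumptions roughly 9% of flagged patients would switch and succeed, so most entries on a worklist are false positives, which fits the abstract's point that a feedback loop is needed to refine future lists.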


Author(s):  
Gustavo Assunção ◽  
Paulo Menezes ◽  
Fernando Perdigão

The idea of recognizing human emotion through speech (SER) has recently received considerable attention from the research community, mostly due to the current machine learning trend. Nevertheless, even the most successful methods are still rather lacking in terms of adaptation to specific speakers and scenarios, evidently reducing their performance when compared to humans. In this paper, we evaluate a large-scale machine learning model for classification of emotional states. This model has been trained for speaker identification but is instead used here as a front-end for extracting robust features from emotional speech. We aim to verify that SER improves when some speaker’s emotional prosody cues are considered. Experiments using various state-of-the-art classifiers are carried out, using the Weka software, so as to evaluate the robustness of the extracted features. Considerable improvement is observed when comparing our results with other SER state-of-the-art techniques.


2021 ◽  
Vol 2022 (1) ◽  
pp. 460-480
Author(s):  
Bogdan Kulynych ◽  
Mohammad Yaghini ◽  
Giovanni Cherubin ◽  
Michael Veale ◽  
Carmela Troncoso

Abstract A membership inference attack (MIA) against a machine-learning model enables an attacker to determine whether a given data record was part of the model’s training data or not. In this paper, we provide an in-depth study of the phenomenon of disparate vulnerability against MIAs: unequal success rate of MIAs against different population subgroups. We first establish necessary and sufficient conditions for MIAs to be prevented, both on average and for population subgroups, using a notion of distributional generalization. Second, we derive connections of disparate vulnerability to algorithmic fairness and to differential privacy. We show that fairness can only prevent disparate vulnerability against limited classes of adversaries. Differential privacy bounds disparate vulnerability but can significantly reduce the accuracy of the model. We show that estimating disparate vulnerability by naïvely applying existing attacks can lead to overestimation. We then establish which attacks are suitable for estimating disparate vulnerability, and provide a statistical framework for doing so reliably. We conduct experiments on synthetic and real-world data finding significant evidence of disparate vulnerability in realistic settings.
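Disparate vulnerability can be made concrete by measuring attack success per subgroup. The sketch below is the naive estimate, which the abstract warns can overestimate the effect; it is not the authors' statistical framework. The function names and the `(group, guess, is_member)` record layout are hypothetical, and each subgroup is assumed to contain both members and non-members.

```python
from collections import defaultdict


def attack_advantage(guesses, members):
    """Membership-inference advantage: true positive rate minus false positive rate."""
    n_in = sum(1 for m in members if m)
    n_out = len(members) - n_in
    tpr = sum(1 for g, m in zip(guesses, members) if g and m) / n_in
    fpr = sum(1 for g, m in zip(guesses, members) if g and not m) / n_out
    return tpr - fpr


def disparate_vulnerability(records):
    """Per-subgroup advantage and the largest gap between subgroups.

    `records` is an iterable of (group, attacker_guess, is_member) tuples.
    """
    by_group = defaultdict(lambda: ([], []))
    for group, guess, member in records:
        by_group[group][0].append(guess)
        by_group[group][1].append(member)
    advantages = {g: attack_advantage(gs, ms) for g, (gs, ms) in by_group.items()}
    gap = max(advantages.values()) - min(advantages.values())
    return advantages, gap


records = [
    ("a", True, True), ("a", True, True), ("a", False, False), ("a", False, False),
    ("b", True, True), ("b", False, True), ("b", True, False), ("b", False, False),
]
advantages, gap = disparate_vulnerability(records)
```

In this toy data the attack is perfect against subgroup "a" and no better than chance against subgroup "b", so the gap equals the full advantage range.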

