On the Intrinsic Differential Privacy of Bagging

Differentially private machine learning trains models while protecting privacy of the sensitive training data. The key to obtain differentially private models is to introduce noise/randomness to the training process. In particular, existing differentially private machine learning methods add noise to the training data, the gradients, the loss function, and/or the model itself. Bagging, a popular ensemble learning framework, randomly creates some subsamples of the training data, trains a base model for each subsample using a base learner, and takes majority vote among the base models when making predictions. Bagging has intrinsic randomness in the training process as it randomly creates subsamples. Our major theoretical results show that such intrinsic randomness already makes Bagging differentially private without the needs of additional noise. Moreover, we prove that if no assumptions about the base learner are made, our derived privacy guarantees are tight. We empirically evaluate Bagging on MNIST and CIFAR10. Our experimental results demonstrate that Bagging achieves significantly higher accuracies than state-of-the-art differentially private machine learning methods with the same privacy budgets.

Download Full-text

Possibility of Autonomous Estimation of Shiba Goat’s Estrus and Non-Estrus Behavior by Machine Learning Methods

Animals ◽

10.3390/ani10050771 ◽

2020 ◽

Vol 10 (5) ◽

pp. 771

Author(s):

Toshiya Arakawa

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Markov Models ◽

Tracking System ◽

Video Tracking ◽

Training Data ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods

Mammalian behavior is typically monitored by observation. However, direct observation requires a substantial amount of effort and time, if the number of mammals to be observed is sufficiently large or if the observation is conducted for a prolonged period. In this study, machine learning methods as hidden Markov models (HMMs), random forests, support vector machines (SVMs), and neural networks, were applied to detect and estimate whether a goat is in estrus based on the goat’s behavior; thus, the adequacy of the method was verified. Goat’s tracking data was obtained using a video tracking system and used to estimate whether they, which are in “estrus” or “non-estrus”, were in either states: “approaching the male”, or “standing near the male”. Totally, the PC of random forest seems to be the highest. However, The percentage concordance (PC) value besides the goats whose data were used for training data sets is relatively low. It is suggested that random forest tend to over-fit to training data. Besides random forest, the PC of HMMs and SVMs is high. However, considering the calculation time and HMM’s advantage in that it is a time series model, HMM is better method. The PC of neural network is totally low, however, if the more goat’s data were acquired, neural network would be an adequate method for estimation.

Download Full-text

Multi-label pathway prediction based on active dataset subsampling

10.1101/2020.09.14.297424 ◽

2020 ◽

Author(s):

Abdur Rahman M. A. Basher ◽

Steven J. Hallam

Keyword(s):

Machine Learning ◽

Microbial Communities ◽

Performance Metrics ◽

Class Imbalance ◽

Training Data ◽

Great Promise ◽

Biological Organization ◽

Learning Methods ◽

Machine Learning Methods ◽

Pathway Prediction

AbstractMachine learning methods show great promise in predicting metabolic pathways at different levels of biological organization. However, several complications remain that can degrade prediction performance including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of enzymes and pathways within a dataset. This class imbalance problem is commonly encountered by the machine learning community when the proportion of instances over class labels within a dataset are uneven, resulting in poor predictive performance for underrepresented classes. Here, we present leADS, multi-label learning based on active dataset subsampling, that leverages the idea of subsampling points from a pool of data to reduce the negative impact of training loss due to class imbalance. Specifically, leADS performs an iterative process to: (i)-construct an acquisition model in an ensemble framework; (ii) select informative points using an appropriate acquisition function; and (iii) train on selected samples. Multiple base learners are implemented in parallel where each is assigned a portion of labeled training data to learn pathways. We benchmark leADS using a corpora of 10 experimental datasets manifesting diverse multi-label properties used in previous pathway prediction studies, including manually curated organismal genomes, synthetic microbial communities and low complexity microbial communities. Resulting performance metrics equaled or exceeded previously reported machine learning methods for both organismal and multi-organismal genomes while establishing an extensible framework for navigating class imbalances across diverse real world datasets.Availability and implementationThe software package, and installation instructions are published on github.com/[email protected]

Download Full-text

Machine Learning and Value-Based Software Engineering

Software and Intelligent Sciences ◽

10.4018/978-1-4666-0261-8.ch017 ◽

2012 ◽

pp. 287-301

Author(s):

Du Zhang

Keyword(s):

Machine Learning ◽

Software Engineering ◽

Full Range ◽

Training Data ◽

Learning Methods ◽

Engineering Research ◽

Machine Learning Methods ◽

Machine Learning Applications ◽

Value Propositions ◽

Software Engineering Research

Software engineering research and practice thus far are primarily conducted in a value-neutral setting where each artifact in software development such as requirement, use case, test case, and defect, is treated as equally important during a software system development process. There are a number of shortcomings of such value-neutral software engineering. Value-based software engineering is to integrate value considerations into the full range of existing and emerging software engineering principles and practices. Machine learning has been playing an increasingly important role in helping develop and maintain large and complex software systems. However, machine learning applications to software engineering have been largely confined to the value-neutral software engineering setting. In this paper, the general message to be conveyed is to apply machine learning methods and algorithms to value-based software engineering. The training data or the background knowledge or domain theory or heuristics or bias used by machine learning methods in generating target models or functions should be aligned with stakeholders’ value propositions. An initial research agenda is proposed for machine learning in value-based software engineering.

Download Full-text

TADA: phylogenetic augmentation of microbiome samples enhances phenotype classification

Bioinformatics ◽

10.1093/bioinformatics/btz394 ◽

2019 ◽

Vol 35 (14) ◽

pp. i31-i40 ◽

Cited By ~ 1

Author(s):

Erfan Sayyari ◽

Ban Kawas ◽

Siavash Mirarab

Keyword(s):

Machine Learning ◽

Sample Size ◽

Data Augmentation ◽

Training Data ◽

Supplementary Information ◽

High Dimensional ◽

Learning Methods ◽

Machine Learning Methods ◽

Phenotype Classification ◽

Microbiome Data

Abstract Motivation Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant; leading to a high-dimensional low-sample-size under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data; with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from microbiome. Data augmentation consists of building synthetic samples and adding them to the training data and is a technique that has proved helpful for many machine learning tasks. Results In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low-sample-size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. Availability and implementation TADA is available at https://github.com/tada-alg/TADA. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Prediction of Collapsibility of Loess of Construction Sites in Xining Based on Machine Learning Methods

10.21203/rs.3.rs-307514/v1 ◽

2021 ◽

Author(s):

Qifei Zhao ◽

Xiaojun Li ◽

Yunning Cao ◽

Zhikun Li ◽

Jixin Fan

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Training Data ◽

Support Vector ◽

Engineering Practice ◽

Burial Depth ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods ◽

North East

Abstract Collapsibility of loess is a significant factor affecting engineering construction in loess area, and testing the collapsibility of loess is costly. In this study, A total of 4,256 loess samples are collected from the north, east, west and middle regions of Xining. 70% of the samples are used to generate training data set, and the rest are used to generate verification data set, so as to construct and validate the machine learning models. The most important six factors are selected from thirteen factors by using Grey Relational analysis and multicollinearity analysis: burial depth、water content、specific gravity of soil particles、void rate、geostatic stress and plasticity limit. In order to predict the collapsibility of loess, four machine learning methods: Support Vector Machine (SVM), Random Subspace Based Support Vector Machine (RSSVM), Random Forest (RF) and Naïve Bayes Tree (NBTree), are studied and compared. The receiver operating characteristic (ROC) curve indicators, standard error (SD) and 95% confidence interval (CI) are used to verify and compare the models in different research areas. The results show that: RF model is the most efficient in predicting the collapsibility of loess in Xining, and its AUC average is above 80%, which can be used in engineering practice.

Download Full-text

Quantified uncertainties in fission yields from machine learning

EPJ Web of Conferences ◽

10.1051/epjconf/202024205003 ◽

2020 ◽

Vol 242 ◽

pp. 05003

Author(s):

A.E. Lovell ◽

A.T. Mohan ◽

P. Talou ◽

M. Chertkov

Keyword(s):

Machine Learning ◽

Nuclear Physics ◽

Experimental Error ◽

Training Data ◽

Learning Methods ◽

Mixture Density ◽

Machine Learning Methods ◽

Predicted Values ◽

The Impact ◽

Fission Yields

As machine learning methods gain traction in the nuclear physics community, especially those methods that aim to propagate uncertainties to unmeasured quantities, it is important to understand how the uncertainty in the training data coming either from theory or experiment propagates to the uncertainty in the predicted values. Gaussian Processes and Bayesian Neural Networks are being more and more widely used, in particular to extrapolate beyond measured data. However, studies are typically not performed on the impact of the experimental errors on these extrapolated values. In this work, we focus on understanding how uncertainties propagate from input to prediction when using machine learning methods. We use a Mixture Density Network (MDN) to incorporate experimental error into the training of the network and construct uncertainties for the associated predicted quantities. Systematically, we study the effect of the size of the experimental error, both on the reproduced training data and extrapolated predictions for fission yields of actinides.

Download Full-text

Majority vote of machine learning methods for real-time epileptic seizure prediction applied on EEG pediatric data

Neurological Disorders and Imaging Physics, Volume 5 ◽

10.1088/978-0-7503-2723-7ch1 ◽

2020 ◽

Author(s):

Medina Bandic ◽

Jasmin Kevric ◽

Dino Keco ◽

Samed Jukic

Keyword(s):

Machine Learning ◽

Real Time ◽

Epileptic Seizure ◽

Majority Vote ◽

Seizure Prediction ◽

Learning Methods ◽

Machine Learning Methods ◽

Epileptic Seizure Prediction

Download Full-text

Using Machine Learning Methods To Identify Coal Pay Zones from Drilling and Logging-While-Drilling (LWD) Data

SPE Journal ◽

10.2118/198288-pa ◽

2020 ◽

Vol 25 (03) ◽

pp. 1241-1258 ◽

Cited By ~ 2

Author(s):

Ruizhi Zhong ◽

Raymond L. Johnson ◽

Zhongwei Chen

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data ◽

Support Vector ◽

Learning Methods ◽

Well Completion ◽

Machine Learning Methods ◽

Extreme Gradient Boosting ◽

Logging While Drilling

Summary Accurate coal identification is critical in coal seam gas (CSG) (also known as coalbed methane or CBM) developments because it determines well completion design and directly affects gas production. Density logging using radioactive source tools is the primary tool for coal identification, adding well trips to condition the hole and additional well costs for logging runs. In this paper, machine learning methods are applied to identify coals from drilling and logging-while-drilling (LWD) data to reduce overall well costs. Machine learning algorithms include logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF), and extreme gradient boosting (XGBoost). The precision, recall, and F1 score are used as evaluation metrics. Because coal identification is an imbalanced data problem, the performance on the minority class (i.e., coals) is limited. To enhance the performance on coal prediction, two data manipulation techniques [naive random oversampling (NROS) technique and synthetic minority oversampling technique (SMOTE)] are separately coupled with machine learning algorithms. Case studies are performed with data from six wells in the Surat Basin, Australia. For the first set of experiments (single-well experiments), both the training data and test data are in the same well. The machine learning methods can identify coal pay zones for sections with poor or missing logs. It is found that rate of penetration (ROP) is the most important feature. The second set of experiments (multiple-well experiments) uses the training data from multiple nearby wells, which can predict coal pay zones in a new well. The most important feature is gamma ray. After placing slotted casings, all wells have coal identification rates greater than 90%, and three wells have coal identification rates greater than 99%. This indicates that machine learning methods (either XGBoost or ANN/RF with NROS/SMOTE) can be an effective way to identify coal pay zones and reduce coring or logging costs in CSG developments.

Download Full-text