AutoDAL: Distributed Active Learning with Automatic Hyperparameter Selection

Xu Chen; Brett Wujek

doi:10.1609/aaai.v34i04.5759

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5759 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3537-3544

Author(s):

Xu Chen ◽

Brett Wujek

Keyword(s):

Machine Learning ◽

Active Learning ◽

Supervised Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Learning System ◽

Automated Learning ◽

Benchmark Datasets ◽

Hyperparameter Selection ◽

Query Selection

Automated machine learning (AutoML) strives to establish an appropriate machine learning model for any dataset automatically with minimal human intervention. Although extensive research has been conducted on AutoML, most of it has focused on supervised learning. Research on automated semi-supervised learning and active learning algorithms is still limited. Implementation becomes more challenging when the algorithm is designed for a distributed computing environment. With this as motivation, we propose a novel automated learning system for distributed active learning (AutoDAL) to address these challenges. First, automated graph-based semi-supervised learning is conducted by aggregating the proposed cost functions from different compute nodes in a distributed manner. Subsequently, automated active learning is addressed by jointly optimizing hyperparameters in both the classification and query selection stages, leveraging graph loss minimization and entropy regularization. Moreover, we propose an efficient distributed active learning algorithm that scales to big data by first partitioning the unlabeled data and replicating the labeled data to different worker nodes in the classification stage, and then aggregating the data in the controller in the query selection stage. The proposed AutoDAL algorithm is applied to multiple benchmark datasets and a real-world electrocardiogram (ECG) dataset for classification. We demonstrate that the proposed AutoDAL algorithm is capable of achieving significantly better performance compared to several state-of-the-art AutoML approaches and active learning algorithms.
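The data flow described in this abstract (partition the unlabeled pool, replicate the labeled set to each worker, then aggregate in a controller for query selection) can be illustrated with a small sketch. This is not the AutoDAL algorithm: the nearest-centroid softmax classifier, the plain entropy ranking, the one-dimensional data, and all function names are assumptions made for illustration, and the workers are simulated sequentially in one process.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def partition(items, n_workers):
    """Round-robin split of the unlabeled pool across workers."""
    return [items[i::n_workers] for i in range(n_workers)]

def centroid_probs(x, labeled):
    """Toy classifier (assumption): softmax over negative squared distance to class means."""
    by_class = {}
    for value, label in labeled:
        by_class.setdefault(label, []).append(value)
    classes = sorted(by_class)
    scores = [-(x - sum(by_class[c]) / len(by_class[c])) ** 2 for c in classes]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def worker_step(shard, labeled):
    """Classification stage on one worker: score its shard using the replicated labeled set."""
    return [(entropy(centroid_probs(x, labeled)), x) for x in shard]

def select_queries(unlabeled, labeled, n_workers, k):
    """Controller: gather the workers' scores and return the k most uncertain points."""
    scored = []
    for shard in partition(unlabeled, n_workers):
        scored.extend(worker_step(shard, labeled))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]

# Hypothetical 1-D data: class "a" near 0.5, class "b" near 9.5.
labeled = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
unlabeled = [0.5, 4.9, 9.5, 5.2, 2.0, 8.0]
queries = select_queries(unlabeled, labeled, n_workers=3, k=2)
print(queries)
```

The points closest to the midpoint between the two class means receive the highest entropy and are therefore the ones sent for labeling.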

Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features

BMC Genomics ◽

10.1186/s12864-020-07033-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Zhixun Zhao ◽

Xiaocai Zhang ◽

Fang Chen ◽

Liang Fang ◽

Jinyan Li

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Case Studies ◽

Learning Algorithm ◽

State Of The Art ◽

Learning Algorithms ◽

Feature Space ◽

Sequence Features ◽

Independent Test ◽

Benchmark Datasets

Background: DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because of the lack of effective sequence features and the ad hoc choice of learning algorithms to cope with this problem. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. Results: The feature importance score distributions in datasets of six species are first reported and analyzed. Then the impact of feature selection on model performance is evaluated by independent testing on benchmark datasets, where ACC and MCC after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using an independent test and 10-fold cross-validation, and our method outperforms them on all datasets, improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. Two detailed case studies with the proposed method confirmed its excellent overall performance: it correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene. Conclusions: The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies prove the effectiveness of our method in practical situations.
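For readers unfamiliar with the two metrics reported above, the sketch below computes accuracy (ACC) and the Matthews correlation coefficient (MCC) from binary predictions using their standard definitions. The label vectors are made-up illustrative values, not data from the paper, and the function names are hypothetical.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels coded as 1 (site) and 0 (non-site)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical predictions for eight sequences.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), mcc(*counts))
```

Unlike ACC, MCC uses all four confusion-matrix cells, so it stays informative when positive and negative classes are imbalanced.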

Online Active Learning of Reject Option Classifiers

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6019 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5652-5659

Author(s):

Kulin Shah ◽

Naresh Manwani

Keyword(s):

Machine Learning ◽

Active Learning ◽

Supervised Learning ◽

Loss Function ◽

Learning Algorithm ◽

Binary Classification ◽

Experimental Results ◽

Reject Option ◽

Novel Algorithms ◽

Ramp Loss Function

Active learning is an important technique to reduce the number of labeled examples in supervised learning. Active learning for binary classification has been well addressed in machine learning. However, active learning of the reject option classifier remains unaddressed. In this paper, we propose novel algorithms for active learning of reject option classifiers. We develop an active learning algorithm using the double ramp loss function. We provide mistake bounds for this algorithm. We also propose a new loss function for the reject option, called the double sigmoid loss function, and a corresponding active learning algorithm. We offer a convergence guarantee for this algorithm. We provide extensive experimental results to show the effectiveness of the proposed algorithms. The proposed algorithms efficiently reduce the number of labeled examples required.
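As background for the reject option setting, here is a minimal sketch of a reject option decision rule together with the commonly used 0-d-1 loss, where rejecting costs d and a misclassification costs 1. This is not the double ramp or double sigmoid loss from the paper, which are not specified in the abstract; the band width `rho`, the cost `d`, and the example scores are assumptions chosen for illustration.

```python
def predict_with_reject(score, rho):
    """Return +1 or -1 outside the band [-rho, rho], and 0 (reject) inside it."""
    if score > rho:
        return 1
    if score < -rho:
        return -1
    return 0

def zero_d_one_loss(score, label, rho, d):
    """0 for a correct prediction, d for a rejection, 1 for a misclassification."""
    pred = predict_with_reject(score, rho)
    if pred == 0:
        return d
    return 0.0 if pred == label else 1.0

# Hypothetical classifier scores f(x) and true labels in {-1, +1}.
scores = [2.0, 0.3, -1.5, -0.1, 0.9]
labels = [1, -1, 1, -1, 1]
losses = [zero_d_one_loss(s, y, rho=0.5, d=0.2) for s, y in zip(scores, labels)]
print(losses)
```

Widening the band (larger `rho`) trades misclassifications for rejections, which pays off only when the rejection cost d is small relative to the cost of an error.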

scikit-activeml: A Library and Toolbox for Active Learning Algorithms

10.20944/preprints202103.0194.v1 ◽

2021 ◽

Author(s):

Daniel Kottke ◽

Marek Herde ◽

Tuan Pham Minh ◽

Alexander Benz ◽

Pascal Mergard ◽

...

Keyword(s):

Machine Learning ◽

Active Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Unlabeled Data ◽

Training Data ◽

Partially Labeled Data ◽

Difficult Time ◽

Machine Learning Applications ◽

Data Points

Machine learning applications often need large amounts of training data to perform well. Whereas unlabeled data can be easily gathered, the labeling process is difficult, time-consuming, or expensive in most applications. Active learning can help solve this problem by querying labels for those data points that will improve the performance the most. Thereby, the goal is that the learning algorithm performs sufficiently well with fewer labels. We provide a library called scikit-activeml that covers the most relevant query strategies and implements tools to work with partially labeled data. It is programmed in Python and builds on top of scikit-learn.

Towards personalized guidelines: using machine-learning algorithms to guide antimicrobial selection

Journal of Antimicrobial Chemotherapy ◽

10.1093/jac/dkaa222 ◽

2020 ◽

Vol 75 (9) ◽

pp. 2677-2680 ◽

Cited By ~ 1

Author(s):

Ed Moran ◽

Esther Robinson ◽

Christopher Green ◽

Matt Keeling ◽

Benjamin Collyer

Keyword(s):

Machine Learning ◽

Open Source ◽

Learning Algorithm ◽

Bacterial Species ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Learning System ◽

Machine Learning Algorithm ◽

Gram Negative ◽

The Impact

Background: Electronic decision support systems could reduce the use of inappropriate or ineffective empirical antibiotics. We assessed the accuracy of an open-source machine-learning algorithm trained to predict antibiotic resistance for three Gram-negative bacterial species isolated from patients’ blood and urine within 48 h of hospital admission. Methods: This retrospective, observational study used routine clinical information collected between January 2010 and October 2016 in Birmingham, UK. Patients from whose blood or urine cultures Escherichia coli, Klebsiella pneumoniae or Pseudomonas aeruginosa was isolated were identified. Their demographic, microbiology and prescribing data were used to train an open-source machine-learning algorithm, XGBoost, to predict resistance to co-amoxiclav and piperacillin/tazobactam. Multivariate analysis was performed to identify predictors of resistance and create a point-scoring tool. The performance of both methods was compared with that of the original prescribers. Results: There were 15 695 admissions. The AUC of the receiver operating characteristic curve for the point-scoring tools ranged from 0.61 to 0.67, and the tools performed no better than medical staff in the selection of appropriate antibiotics. The machine-learning system performed statistically significantly but only marginally better (AUC 0.70) and could have reduced the use of unnecessary broad-spectrum antibiotics by as much as 40% among those given co-amoxiclav, piperacillin/tazobactam or carbapenems. A validation study is required. Conclusions: Machine-learning algorithms have the potential to help clinicians predict antimicrobial resistance in patients found to have a Gram-negative infection of blood or urine. Prospective studies are required to assess performance in an unselected patient cohort, understand the acceptability of such systems to clinicians and patients, and assess the impact on patient outcome.
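The AUC values quoted above can be read as the probability that a randomly chosen resistant isolate receives a higher predicted score than a randomly chosen susceptible one. The sketch below computes the AUC directly from that pairwise definition, counting ties as one half. The labels and scores are invented for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = resistant, 0 = susceptible, with predicted resistance scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported range of 0.61 to 0.70 in context.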

Towards Automated Semi-Supervised Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014237 ◽

2019 ◽

Vol 33 ◽

pp. 4237-4244 ◽

Cited By ~ 1

Author(s):

Yu-Feng Li ◽

Hai Wang ◽

Tong Wei ◽

Wei-Wei Tu

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Semisupervised Learning ◽

Learning System ◽

Large Margin ◽

Automated Learning ◽

Machine Learning Model ◽

Meta Learning ◽

Performance Deterioration ◽

Automated Machine Learning

Automated Machine Learning (AutoML) aims to build an appropriate machine learning model for any unseen dataset automatically, i.e., without human intervention. Great effort has been devoted to AutoML, but it typically focuses on supervised learning. In many applications, however, semi-supervised learning (SSL) is widespread, and current AutoML systems cannot address SSL problems well. In this paper, we present an automated learning system for SSL (AUTO-SSL). First, meta-learning with enhanced meta-features is employed to quickly suggest some instantiations of the SSL techniques which are likely to perform quite well. Second, a large margin separation method is proposed to fine-tune the hyperparameters and, more importantly, alleviate performance deterioration. The basic idea is that, if a certain hyperparameter setting is of high quality, its predictive results on unlabeled data may have a large margin separation. Extensive empirical results over 200 cases demonstrate that, on the one hand, our proposal achieves highly competitive or better performance compared to the state-of-the-art AutoML system AUTO-SKLEARN and classical SSL techniques; on the other hand, unlike classical SSL techniques, which often significantly degrade performance, our proposal seldom suffers from such a deficiency.

Geometric morphometrics and machine learning challenge currently accepted species limits of the land snail Placostylus (Pulmonata: Bothriembryontidae) on the Isle of Pines, New Caledonia

Journal of Molluscan Studies ◽

10.1093/mollus/eyz031 ◽

2020 ◽

Vol 86 (1) ◽

pp. 35-41

Author(s):

Mathieu Quenu ◽

Steven A Trewick ◽

Fabrice Brescia ◽

Mary Morgan-Richards

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

New Caledonia ◽

Learning Algorithm ◽

Learning Algorithms ◽

Land Snail ◽

Machine Learning Algorithms ◽

Snail Species ◽

Size And Shape

Size and shape variations of shells can be used to identify natural phenotypic clusters and thus delimit snail species. Here, we apply both supervised and unsupervised machine learning algorithms to a geometric morphometric dataset to investigate size and shape variations of the shells of the endemic land snail Placostylus from New Caledonia. We sampled eight populations of Placostylus from the Isle of Pines, where two species of this genus reportedly coexist. We used neural network analysis as a supervised learning algorithm and Gaussian mixture models as an unsupervised learning algorithm. Using a training dataset of individuals assigned to species using nuclear markers, we found that supervised learning algorithms could not unambiguously classify all individuals of our expanded dataset using shell size and shape. Unsupervised learning showed that the optimal division of our data consisted of three phenotypic clusters. Two of these clusters correspond to the established species Placostylus fibratus and P. porphyrostomus, while the third cluster was intermediate in both shape and size. Most of the individuals that were not clearly classified using supervised learning were classified to this intermediate phenotype by unsupervised learning, and most of these individuals came from previously unsampled populations. These results may indicate the presence of persistent putative-hybrid populations of Placostylus in the Isle of Pines.

Intelligent system of English composition scoring model based on improved machine learning algorithm

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189235 ◽

2020 ◽

pp. 1-11

Author(s):

Jie Liu ◽

Lin Lin ◽

Xiufang Liang

Keyword(s):

Machine Learning ◽

Evaluation System ◽

Intelligent System ◽

Learning Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Assessment System ◽

English Composition ◽

Region Extraction ◽

Constraint Model

Online English teaching systems place certain requirements on intelligent scoring, and the most difficult stage of intelligent scoring in an English test is scoring the English composition with an intelligent model. To improve the intelligence of English composition scoring, this study combines machine learning algorithms with intelligent image recognition technology, and proposes an improved MSER-based character candidate region extraction algorithm and a convolutional neural network-based pseudo-character region filtering algorithm. In addition, to verify whether the proposed algorithm model meets the requirements of the group text, that is, to verify the feasibility of the algorithm, the performance of the model is analyzed through designed experiments. Moreover, the basic conditions for composition scoring are input into the model as a constraint model. The results show that the proposed algorithm has a certain practical effect and can be applied to English assessment systems and to online homework evaluation systems.

A State of Art Techniques on Machine Learning Algorithms: A Perspective of Supervised Learning Approaches in Data Classification

2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS) ◽

10.1109/iccons.2018.8663155 ◽

2018 ◽

Cited By ~ 15

Author(s):

R. Saravanan ◽

Pothula Sujatha

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Learning Algorithms ◽

Data Classification ◽

Machine Learning Algorithms ◽

Learning Approaches ◽

State Of Art ◽

Art Techniques

An IoT-Focused Intrusion Detection System Approach Based on Preprocessing Characterization for Cybersecurity Datasets

Sensors ◽

10.3390/s21020656 ◽

2021 ◽

Vol 21 (2) ◽

pp. 656

Author(s):

Xavier Larriva-Novo ◽

Víctor A. Villagrá ◽

Mario Vega-Barbas ◽

Diego Rivera ◽

Mario Sanz Rodrigo

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

High Performance ◽

Learning Algorithm ◽

Detection System ◽

Machine Learning Algorithms ◽

Statistical Characteristics ◽

Detection Techniques ◽

Traffic Characteristics ◽

Benchmark Datasets

Security in IoT networks is currently mandatory because of the large amount of data that has to be handled. These systems are vulnerable to several cybersecurity attacks, which are increasing in number and sophistication. For this reason, new intrusion detection techniques have to be developed that are as accurate as possible for these scenarios. Intrusion detection systems based on machine learning algorithms have already shown high performance in terms of accuracy. This research proposes the study and evaluation of several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. For its evaluation, this research uses two benchmark datasets, namely UGR16 and UNSW-NB15, and one of the most used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied through different sets of characteristics based on a categorization composed of four groups of features: basic connection features, content characteristics, statistical characteristics and, finally, a group composed of traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. Preprocessing a specific group of characteristics yields greater accuracy, allowing the machine learning algorithm to correctly classify these parameters related to possible attacks.
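The abstract does not say which scalar and normalization functions were used. Assuming they include the two most common feature scalers, min-max scaling and z-score standardization, a minimal sketch looks like this. The function names and the `durations` feature values are hypothetical and not taken from UGR16, UNSW-NB15, or KDD99.

```python
import math

def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant feature maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical connection durations (seconds) for eight flows.
durations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = min_max_scale(durations)
standardized = z_score(durations)
print(scaled)
print(standardized)
```

Neural networks are sensitive to feature magnitude, so applying a different scaler to each feature group is one plausible way a preprocessing choice could change accuracy.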

Application of a Rough Set-Based Inductive Learning System

Fundamenta Informaticae ◽

10.3233/fi-1993-182-409 ◽

1993 ◽

Vol 18 (2-4) ◽

pp. 209-220

Author(s):

Michael Hadjimichael ◽

Anita Wasilewska

Keyword(s):

Machine Learning ◽

Rough Set ◽

Presidential Election ◽

Predictive Accuracy ◽

Learning Algorithm ◽

Inductive Learning ◽

Real Data ◽

Semantic Content ◽

Learning System ◽

Voter Preferences

We present here an application of Rough Set formalism to Machine Learning. The resulting Inductive Learning algorithm is described, and its application to a set of real data is examined. The data consists of a survey of voter preferences taken during the 1988 presidential election in the U.S.A. Results include an analysis of the predictive accuracy of the generated rules, and an analysis of the semantic content of the rules.
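The core of rough set formalism is approximating a target set by the equivalence classes induced by a chosen set of attributes. The sketch below computes the lower approximation (objects that certainly belong to the target) and the upper approximation (objects that possibly belong). The voter records, attribute names, and candidates are invented for illustration; they are not the 1988 survey data, and this is not the paper's rule-induction algorithm.

```python
from collections import defaultdict

def equivalence_classes(records, attributes):
    """Group objects that are indiscernible on the given attributes."""
    classes = defaultdict(set)
    for name, row in records.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(name)
    return list(classes.values())

def approximations(records, attributes, target):
    """Return (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for block in equivalence_classes(records, attributes):
        if block <= target:   # class lies entirely inside the target
            lower |= block
        if block & target:    # class overlaps the target
            upper |= block
    return lower, upper

# Hypothetical survey rows: two condition attributes and a decision (vote).
voters = {
    "v1": {"age": "young", "income": "low", "vote": "A"},
    "v2": {"age": "young", "income": "low", "vote": "B"},
    "v3": {"age": "old", "income": "high", "vote": "A"},
    "v4": {"age": "old", "income": "low", "vote": "B"},
    "v5": {"age": "old", "income": "high", "vote": "A"},
}
votes_a = {name for name, row in voters.items() if row["vote"] == "A"}
lower, upper = approximations(voters, ["age", "income"], votes_a)
print(sorted(lower), sorted(upper))
```

Here the lower approximation supports a certain rule (old and high income implies a vote for A), while v1 and v2 fall in the boundary region because identical attribute values lead to different votes.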
