scikit-activeml: A Library and Toolbox for Active Learning Algorithms

Author(s):  
Daniel Kottke ◽  
Marek Herde ◽  
Tuan Pham Minh ◽  
Alexander Benz ◽  
Pascal Mergard ◽  
...  

Machine learning applications often need large amounts of training data to perform well. Whereas unlabeled data can be gathered easily, the labeling process is difficult, time-consuming, or expensive in most applications. Active learning can help solve this problem by querying labels for those data points that will improve performance the most. The goal is for the learning algorithm to perform sufficiently well with fewer labels. We provide a library called scikit-activeml that covers the most relevant query strategies and implements tools to work with partially labeled data. It is programmed in Python and builds on top of scikit-learn.
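As an illustration of the kind of workflow the library supports, the sketch below runs a minimal pool-based uncertainty-sampling loop. It follows the documented skactiveml API (SklearnClassifier, UncertaintySampling, MISSING_LABEL), but the dataset, classifier, and query budget are illustrative choices, and exact signatures may differ between library versions.

```python
# Minimal pool-based active learning loop with scikit-activeml (a sketch;
# names follow the documented skactiveml API but may vary across versions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL

X, y_true = make_classification(n_samples=200, random_state=0)
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)   # fully unlabeled pool

clf = SklearnClassifier(LogisticRegression(), classes=np.unique(y_true))
qs = UncertaintySampling(method="entropy", random_state=0)

for _ in range(20):                                # query 20 labels in total
    query_idx = qs.query(X=X, y=y, clf=clf, batch_size=1)
    y[query_idx] = y_true[query_idx]               # simulate the labeling oracle
    clf.fit(X, y)                                  # refit on partially labeled data

print("Accuracy with 20 labels:", accuracy_score(y_true, clf.predict(X)))
```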

Author(s):  
Sotiris Kotsiantis ◽  
Dimitris Kanellopoulos ◽  
Panayotis Pintelas

In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples (see Table 1). Formally, the problem can be stated as follows: given training data {(x1, y1), …, (xn, yn)}, produce a classifier h: X → Y that maps an object x ∈ X to its classification label y ∈ Y. A large number of classification techniques have been developed based on artificial intelligence (logic-based techniques, perceptron-based techniques) and statistics (Bayesian networks, instance-based techniques). No single learning algorithm can uniformly outperform other algorithms over all data sets. The concept of combining classifiers is proposed as a new direction for the improvement of the performance of individual machine learning algorithms. Numerous methods have been suggested for the creation of ensembles of classifiers (Dietterich, 2000). Although, or perhaps because, many methods of ensemble creation have been proposed, there is as yet no clear picture of which method is best.
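The combination of classifiers can be illustrated with a short, hedged sketch: a majority-vote ensemble over three base learners with different inductive biases, built with scikit-learn. The dataset and estimator choices are illustrative, not taken from the article.

```python
# Illustrative majority-vote ensemble of heterogeneous classifiers (a sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each base learner is a classifier h: X -> Y with a different inductive bias.
ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",            # simple majority vote over the predicted labels
)

print("Ensemble CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```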


2020 ◽  
Vol 34 (04) ◽  
pp. 3537-3544
Author(s):  
Xu Chen ◽  
Brett Wujek

Automated machine learning (AutoML) strives to establish an appropriate machine learning model for any dataset automatically with minimal human intervention. Although extensive research has been conducted on AutoML, most of it has focused on supervised learning. Research on automated semi-supervised learning and active learning algorithms is still limited. Implementation becomes more challenging when the algorithm is designed for a distributed computing environment. With this as motivation, we propose a novel automated learning system for distributed active learning (AutoDAL) to address these challenges. First, automated graph-based semi-supervised learning is conducted by aggregating the proposed cost functions from different compute nodes in a distributed manner. Subsequently, automated active learning is addressed by jointly optimizing hyperparameters in both the classification and query selection stages, leveraging graph loss minimization and entropy regularization. Moreover, we propose an efficient distributed active learning algorithm that scales to big data by first partitioning the unlabeled data and replicating the labeled data across worker nodes in the classification stage, and then aggregating the data in the controller in the query selection stage. The proposed AutoDAL algorithm is applied to multiple benchmark datasets and a real-world electrocardiogram (ECG) dataset for classification. We demonstrate that the proposed AutoDAL algorithm achieves significantly better performance than several state-of-the-art AutoML approaches and active learning algorithms.
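The partition/replicate/aggregate pattern in the query selection stage can be made concrete with the loose sketch below. This is not the authors' AutoDAL implementation; it only illustrates the data flow, with each simulated worker scoring its own partition of the unlabeled pool by predictive entropy and the controller aggregating the candidates.

```python
# Loose sketch of the partition/replicate/aggregate pattern described above.
# This is NOT the authors' AutoDAL code; it only illustrates the data flow.
import numpy as np
from sklearn.linear_model import LogisticRegression

def worker_scores(clf, X_part):
    """A worker scores its partition of the unlabeled pool by predictive entropy."""
    proba = clf.predict_proba(X_part)
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

def controller_select(X_labeled, y_labeled, X_unlabeled, n_workers=4, batch_size=8):
    # Replicate labeled data: every worker trains on the same labeled set.
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    # Partition the unlabeled pool across the workers.
    parts = np.array_split(np.arange(len(X_unlabeled)), n_workers)
    candidates = []
    for idx in parts:                               # one iteration per worker
        scores = worker_scores(clf, X_unlabeled[idx])
        candidates.extend(zip(idx, scores))
    # The controller aggregates and picks the globally most uncertain points.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [int(i) for i, _ in candidates[:batch_size]]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)
print(controller_select(X[:50], y[:50], X[50:]))    # indices into the unlabeled pool
```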


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Jacob Cantrell ◽  
Kolten Kersey ◽  
Anush Motaganahalli ◽  
Amy Li ◽  
Hunter Maxwell ◽  
...  

Background & Hypothesis: Treatment decisions for medical management, endovascular therapy, open surgery, and hybrid approaches for peripheral artery disease (PAD) are largely driven by imaging. While catheter-directed angiography remains the gold standard for endoluminal vessel analysis, there is currently no widespread clinical use of machine learning to provide automated segmentation. This project aims to develop an active learning pipeline to automate the labeling of vascular structures in angiographic images.  Methods: We queried the picture archiving and communication system (PACS) database for Indiana University Health and Eskenazi Health to identify studies with catheter-directed angiograms of the extremities. From this dataset we randomly selected an initial convenience sample of 50 angiograms to manually label using the 3D Slicer software. We compared three workflow approaches for labeling this training data: (1) human-only single-pass labeling, whereby one person labels each image; (2) human-only multi-pass labeling, whereby three humans label a vessel with increasing precision; (3) a “human-in-the-middle” approach using NVIDIA’s AI-Assisted Annotation client, whereby the image is auto-segmented and then manually checked for accuracy.  Results: We are currently evaluating speed and accuracy for each of these approaches. However, our preliminary data suggest that human-only multi-pass labeling is the most efficient approach. We will be validating the following three-step process. First, a thresholding tool was used to leverage differences in contrast gradations to approximate the location of vascular structures. Second, the eraser tool was utilized to refine the vessel boundaries. Finally, major blood vessels contributing to axial flow to the foot were manually labeled. These labeled angiograms will be used to develop an active learning algorithm to automate future labeling of the remaining dataset.  Conclusion: A machine learning approach to interpreting lower extremity images can dramatically improve the efficiency of triaging patients with PAD. Further work is underway to develop and implement this program clinically.


Author(s):  
Tobias Scheffer

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data. Semi-supervised learning (see Seeger, 2001) has a long tradition in statistics (Cooper & Freeman, 1970); much early work has focused on Bayesian discrimination of Gaussians. The Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is the most popular method for learning generative models from labeled and unlabeled data. Model-based, generative learning algorithms find model parameters (e.g., the parameters of a Gaussian mixture model) that best explain the available labeled and unlabeled data, and they derive the discriminating classification hypothesis from this model. In discriminative learning, unlabeled data are typically incorporated via the integration of some model assumption into the discriminative framework (Miller & Uyar, 1997; Titterington, Smith, & Makov, 1985). The Transductive Support Vector Machine (Vapnik, 1998; Joachims, 1999) uses unlabeled data to identify a hyperplane that has a large distance not only from the labeled data but also from all unlabeled data. This results in a bias toward placing the hyperplane in regions of low density p(x). Recently, studies have covered graph-based approaches that rely on the assumption that neighboring instances are more likely to belong to the same class than remote instances (Blum & Chawla, 2001). A distinct approach to utilizing unlabeled data has been proposed by de Sa (1994), Yarowsky (1995), and Blum and Mitchell (1998). When the available attributes can be split into independent and compatible subsets, multi-view learning algorithms can be employed. Multi-view algorithms, such as co-training (Blum & Mitchell, 1998) and co-EM (Nigam & Ghani, 2000), learn two independent hypotheses, which bootstrap by providing each other with labels for the unlabeled data. An analysis of why training two independent hypotheses that provide each other with conjectured class labels for unlabeled data might be better than EM-like self-training has been provided by Dasgupta, Littman, and McAllester (2001) and simplified by Abney (2002). The disagreement rate of two independent hypotheses is an upper bound on the error rate of either hypothesis. Multi-view algorithms minimize the disagreement rate between the peer hypotheses (a situation that is most apparent for the algorithm of Collins & Singer, 1999) and thereby the error rate. Semi-supervised learning is related to active learning. Active learning algorithms are able to actively query the class labels of unlabeled data. By contrast, semi-supervised algorithms are bound to learn from the given data.
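A compact, hedged sketch of the co-training idea follows: two hypotheses trained on separate attribute subsets (views) repeatedly label the unlabeled examples they are most confident about, and those conjectured labels feed the peer in the next round. The view split, the number of rounds, and the per-round label count are illustrative choices, not part of Blum and Mitchell's original formulation.

```python
# Hedged sketch of co-training (Blum & Mitchell, 1998): two classifiers on
# separate feature views supply each other with labels for unlabeled data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, n_rounds=10, per_round=5):
    """X1 and X2 are the two feature views; y uses -1 for unlabeled examples."""
    y = y.copy()
    for _ in range(n_rounds):
        labeled = y != -1
        unlabeled = np.flatnonzero(~labeled)
        if unlabeled.size == 0:
            break
        h1 = GaussianNB().fit(X1[labeled], y[labeled])
        h2 = GaussianNB().fit(X2[labeled], y[labeled])
        # Each hypothesis labels the unlabeled points it is most confident
        # about; the conjectured labels are available to the peer next round.
        for h, X in ((h1, X1), (h2, X2)):
            proba = h.predict_proba(X[unlabeled])
            top = unlabeled[np.argsort(proba.max(axis=1))[-per_round:]]
            y[top] = h.predict(X[top])
    return y
```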


Author(s):  
Anna Nikolajeva ◽  
Artis Teilans

The research is dedicated to the use of artificial intelligence technology in digital marketing personalization. The doctoral thesis aims to create a machine learning algorithm that will increase sales through personalized marketing on an electronic commerce website. Machine learning algorithms can be used to find the unobservable probability density function in density estimation problems. Learning algorithms learn on their own based on previous experience and generate their own sequences of learning experiences, acquiring new skills through self-guided exploration and social interaction with humans. An entirely personalized advertising experience can become a reality in the near future by using learning algorithms with training data and detecting the appearance of new behaviour patterns with unsupervised learning algorithms. Artificial intelligence technology will create website-specific adverts individually for all sales funnels.


Author(s):  
Ji Guan ◽  
Wang Fang ◽  
Mingsheng Ying

Several important models of machine learning algorithms have been successfully generalized to the quantum world, with potential speedups over training classical classifiers and applications to data analytics in quantum physics that can be implemented on near-future quantum computers. However, quantum noise is a major obstacle to the practical implementation of quantum machine learning. In this work, we define a formal framework for the robustness verification and analysis of quantum machine learning algorithms against noise. A robust bound is derived and an algorithm is developed to check whether or not a quantum machine learning algorithm is robust with respect to quantum training data. In particular, this algorithm can find adversarial examples during checking. Our approach is implemented on Google’s TensorFlow Quantum and can verify the robustness of quantum machine learning algorithms with respect to small noise disturbances derived from the surrounding environment. The effectiveness of our robust bound and algorithm is confirmed by the experimental results, including quantum bit classification as the “Hello World” example, quantum phase recognition and cluster excitation detection from real-world intractable physical problems, and the classification of MNIST from the classical world.


2020 ◽  
pp. 1-11
Author(s):  
Jie Liu ◽  
Lin Lin ◽  
Xiufang Liang

The online English teaching system places certain requirements on the intelligent scoring system, and the most difficult stage of intelligent scoring in English tests is scoring English compositions with an intelligent model. In order to improve the intelligence of English composition scoring, this study combines machine learning algorithms with intelligent image recognition technology and proposes an improved MSER-based character candidate region extraction algorithm and a convolutional neural network-based pseudo-character region filtering algorithm. In addition, in order to verify whether the proposed algorithm model meets the requirements of composition text, that is, to verify the feasibility of the algorithm, the performance of the proposed model is analyzed through designed experiments. Moreover, the basic conditions for composition scoring are input into the model as constraints. The research results show that the proposed algorithm has practical effect and can be applied to English assessment systems and online homework evaluation systems.
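As a hedged illustration of the character candidate extraction step only, the sketch below uses OpenCV's stock MSER detector rather than the paper's improved variant, and it omits the CNN-based pseudo-character filtering stage; the input filename is a placeholder.

```python
# Hedged sketch: extract character candidate regions with OpenCV's stock MSER
# detector (not the paper's improved algorithm, and without CNN filtering).
import cv2

image = cv2.imread("composition_scan.png")          # placeholder input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
regions, _ = mser.detectRegions(gray)               # candidate character regions

for points in regions:
    x, y, w, h = cv2.boundingRect(points)           # box each candidate region
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite("candidates.png", image)
```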


2019 ◽  
Author(s):  
Andrew Medford ◽  
Shengchun Yang ◽  
Fuzhu Liu

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx, and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N^1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.
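The iterative, uncertainty-driven growth of the training set can be sketched generically with a Gaussian process surrogate in scikit-learn, standing in for the authors' framework; the random features, the stand-in target, and the fixed query budget are placeholders rather than details from the paper.

```python
# Generic sketch of uncertainty-driven training-set selection with a GP
# surrogate (scikit-learn); features, targets, and budget are placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 8))            # featurized candidate structures
y_pool = X_pool @ rng.normal(size=8)          # stand-in for DFT adsorption energies

train = list(rng.choice(len(X_pool), size=10, replace=False))
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)

for _ in range(40):                           # iterative training-set growth
    gp.fit(X_pool[train], y_pool[train])
    _, std = gp.predict(X_pool, return_std=True)
    std[train] = -np.inf                      # never re-select known structures
    train.append(int(np.argmax(std)))         # add the most uncertain candidate

print("Final training-set size:", len(train))
```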


2018 ◽  
Vol 6 (2) ◽  
pp. 283-286
Author(s):  
M. Samba Siva Rao ◽  
M. Yaswanth ◽  
K. Raghavendra Swamy ◽  
...  

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Zhikuan Zhao ◽  
Jack K. Fitzsimons ◽  
Patrick Rebentrost ◽  
Vedran Dunjko ◽  
Joseph F. Fitzsimons

Machine learning has recently emerged as a fruitful area for finding potential quantum computational advantage. Many of the quantum-enhanced machine learning algorithms critically hinge upon the ability to efficiently produce states proportional to high-dimensional data points stored in a quantum accessible memory. Even given query access to exponentially many entries stored in a database, the construction of which is considered a one-off overhead, it has been argued that the cost of preparing such amplitude-encoded states may offset any exponential quantum advantage. Here we prove, using smoothed analysis, that if the data analysis algorithm is robust against small entry-wise input perturbations, state preparation can always be achieved with constant queries. This criterion is typically satisfied in realistic machine learning applications, where input data is subject to moderate noise. Our results are equally applicable to the recent seminal progress in quantum-inspired algorithms, where specially constructed databases suffice for polylogarithmic classical algorithms in low-rank cases. The consequence of our finding is that, for the purpose of practical machine learning, polylogarithmic processing time is possible under a general and flexible input model with quantum algorithms or quantum-inspired classical algorithms in the low-rank cases.

