BATCH MODE ACTIVE LEARNING FOR GRAPH-BASED SEMI-SUPERVISED LEARNING

Author(s):  
CHEONG HEE PARK

In semi-supervised learning, when the number of data samples with class label information is very small, information from unlabeled data is utilized in the learning process. Many semi-supervised learning methods have been presented and have exhibited competitive performance. Active learning also aims to overcome the shortage of labeled data by obtaining class labels for some selected unlabeled data from experts. However, the selection process for the most informative unlabeled data samples can be demanding when the search is performed over a large set of unlabeled data. In this paper, we propose a method for batch mode active learning in graph-based semi-supervised learning. Instead of acquiring class label information of one unlabeled data sample at a time, we obtain information about several data samples at once, reducing time complexity while preserving the beneficial effects of active learning. Experimental results demonstrate the improved performance of the proposed method.
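The batch selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact criterion: it uses scikit-learn's LabelSpreading for the graph-based semi-supervised step and picks the batch of unlabeled points with the highest predictive entropy.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Toy data: mark most points as unlabeled (-1 by sklearn convention),
# keeping three labeled examples per class.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
y_train = np.full_like(y, -1)
for c in (0, 1):
    y_train[np.where(y == c)[0][:3]] = c

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_train)

# Batch-mode query: instead of one sample at a time, select the k
# unlabeled points whose predicted label distributions have the
# highest entropy (i.e., the most uncertain ones) in a single pass.
probs = model.label_distributions_
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
unlabeled = np.where(y_train == -1)[0]
batch = unlabeled[np.argsort(-entropy[unlabeled])[:10]]
print(batch)  # indices to send to the expert for labeling
```

After the expert labels the batch, the graph-based model is refit once per batch rather than once per sample, which is where the time saving comes from.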

Entropy ◽  
2019 ◽  
Vol 21 (10) ◽  
pp. 988 ◽  
Author(s):  
Fazakis ◽  
Kanas ◽  
Aridas ◽  
Karlos ◽  
Kotsiantis

One of the major factors affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a useful manner. In this context, a variety of learning methods have been studied in the literature that aim to efficiently utilize the vast amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying either active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed under a common self-training scheme, in order to efficiently utilize the available unlabeled data. The entropy and the probability distribution over the unlabeled set, two effective and robust metrics, are used to select the most suitable unlabeled examples for augmenting the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the baseline approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.
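The combined use of both metrics in a self-training round can be sketched as follows. This is an illustrative simplification, not the authors' exact scheme: a logistic-regression base learner stands in for whatever classifier is used, high-entropy points are sent to the oracle (active learning), and the lowest-entropy points are pseudo-labeled (self-training).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
# Tiny initial labeled set with both classes represented.
labeled = np.concatenate([np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]])
unlabeled = np.setdiff1d(np.arange(300), labeled)

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
probs = clf.predict_proba(X[unlabeled])
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

order = np.argsort(entropy)
query = unlabeled[order[-5:]]       # most uncertain: ask the oracle
confident = unlabeled[order[:20]]   # most certain: pseudo-label
pseudo = clf.classes_[probs[order[:20]].argmax(axis=1)]
```

Both index sets are then merged into the labeled pool (with true labels for `query`, pseudo-labels for `confident`) and the round repeats.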


Author(s):  
Klym Yamkovyi

The paper is dedicated to the development and comparative experimental analysis of semi-supervised learning approaches based on a mix of unsupervised and supervised approaches for the classification of datasets with a small amount of labeled data, namely, identifying to which of a set of categories a new observation belongs using a training set of data containing observations whose category membership is known. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Unlabeled data, when used in combination with a small quantity of labeled data, can produce a significant improvement in learning accuracy. The goal is the development and analysis of semi-supervised methods, along with a comparison of their accuracy and robustness on different synthetic datasets. The first proposed approach is based on the unsupervised K-medoids method, also known as the Partitioning Around Medoids algorithm; however, unlike K-medoids, the proposed algorithm first calculates medoids using only labeled data and then processes the unlabeled points, assigning each the label of its nearest medoid. Another proposed approach mixes the supervised K-nearest neighbors method with unsupervised K-means, so the learning algorithm uses information about both the nearest points and the class centers of mass. The methods were implemented in the Python programming language and experimentally investigated on classification problems using datasets with different distribution and spatial characteristics, generated with the scikit-learn library. The developed approaches were compared by their average accuracy across all these datasets. It was shown that even small amounts of labeled data make semi-supervised learning viable, and the proposed modifications improve accuracy and algorithm performance, as demonstrated in the experiments. Accuracy grows further as more label information becomes available, since the developed algorithms use a distance metric that takes the available label information into account.
Keywords: unsupervised learning, supervised learning, semi-supervised learning, clustering, distance, distance function, nearest neighbor, medoid, center of mass.
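The semi-supervised K-medoids idea described in the abstract (medoids computed from labeled data only, unlabeled points assigned to the nearest medoid) can be sketched directly. This is a minimal NumPy reconstruction from the abstract's description, not the author's implementation; Euclidean distance is an assumption.

```python
import numpy as np

def ss_kmedoids_assign(X, y, labeled_mask):
    """Semi-supervised K-medoids sketch: one medoid per class is
    computed from the labeled points only, then each unlabeled point
    receives the label of its nearest medoid."""
    medoids, labels = [], []
    for c in np.unique(y[labeled_mask]):
        pts = X[labeled_mask & (y == c)]
        # Medoid: the class point minimizing total distance to the others.
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        medoids.append(pts[d.sum(axis=1).argmin()])
        labels.append(c)
    medoids = np.array(medoids)
    y_out = y.copy()
    for i in np.where(~labeled_mask)[0]:
        y_out[i] = labels[np.linalg.norm(medoids - X[i], axis=1).argmin()]
    return y_out

# Usage on a tiny toy set (-1 marks unlabeled points)
X = np.array([[0.0, 0], [0.1, 0], [5, 5], [5.1, 5], [0.2, 0.1], [4.9, 5.2]])
y = np.array([0, 0, 1, 1, -1, -1])
mask = y != -1
print(ss_kmedoids_assign(X, y, mask))  # → [0 0 1 1 0 1]
```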


Author(s):  
Tobias Scheffer

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data. Semi-supervised learning (for an overview, see Seeger, 2001) has a long tradition in statistics (Cooper & Freeman, 1970); much early work focused on Bayesian discrimination of Gaussians. The Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is the most popular method for learning generative models from labeled and unlabeled data. Model-based, generative learning algorithms find model parameters (e.g., the parameters of a Gaussian mixture model) that best explain the available labeled and unlabeled data, and they derive the discriminating classification hypothesis from this model. In discriminative learning, unlabeled data are typically incorporated via the integration of some model assumption into the discriminative framework (Miller & Uyar, 1997; Titterington, Smith, & Makov, 1985). The Transductive Support Vector Machine (Vapnik, 1998; Joachims, 1999) uses unlabeled data to identify a hyperplane that has a large distance not only from the labeled data but also from all unlabeled data. This results in a bias toward placing the hyperplane in regions of low density p(x). Recently, studies have covered graph-based approaches that rely on the assumption that neighboring instances are more likely to belong to the same class than remote instances (Blum & Chawla, 2001). A distinct approach to utilizing unlabeled data has been proposed by de Sa (1994), Yarowsky (1995), and Blum and Mitchell (1998). When the available attributes can be split into independent and compatible subsets, multi-view learning algorithms can be employed.
Multi-view algorithms, such as co-training (Blum & Mitchell, 1998) and co-EM (Nigam & Ghani, 2000), learn two independent hypotheses, which bootstrap by providing each other with labels for the unlabeled data. An analysis of why training two independent hypotheses that provide each other with conjectured class labels for unlabeled data might be better than EM-like self-training has been provided by Dasgupta, Littman, and McAllester (2001) and has been simplified by Abney (2002). The disagreement rate of two independent hypotheses is an upper bound on the error rate of either hypothesis. Multi-view algorithms minimize the disagreement rate between the peer hypotheses (a situation that is most apparent for the algorithm of Collins & Singer, 1999) and thereby the error rate. Semi-supervised learning is related to active learning. Active learning algorithms are able to actively query the class labels of unlabeled data. By contrast, semi-supervised algorithms are bound to learn from the given data.
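One round of the bootstrap described above can be sketched as follows. This is an illustrative simplification of co-training, not the exact algorithm of Blum and Mitchell (1998): Gaussian naive Bayes stands in for the per-view learners, and each view's classifier labels its most confident unlabeled points for the shared pool.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training_step(Xa, Xb, y, k=5):
    """One co-training round (sketch): train one hypothesis per view
    on the labeled pool, then let each hypothesis label its k most
    confident unlabeled points for its peer."""
    labeled = y != -1
    ha = GaussianNB().fit(Xa[labeled], y[labeled])
    hb = GaussianNB().fit(Xb[labeled], y[labeled])
    y_new = y.copy()
    for clf, X in ((ha, Xa), (hb, Xb)):
        unl = np.where(y_new == -1)[0]
        if len(unl) == 0:
            break
        p = clf.predict_proba(X[unl])
        order = np.argsort(-p.max(axis=1))[:k]  # most confident first
        y_new[unl[order]] = clf.classes_[p[order].argmax(axis=1)]
    return y_new

# Usage: two correlated views of a two-cluster problem, 4 labels given.
rng = np.random.default_rng(0)
Xa = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
Xb = Xa + rng.normal(0, 0.1, Xa.shape)
y = np.full(100, -1)
y[[0, 1, 50, 51]] = [0, 0, 1, 1]
y2 = co_training_step(Xa, Xb, y)
```

Iterating this step shrinks the unlabeled pool while each hypothesis is trained on labels conjectured by its peer, which is the mechanism the disagreement-rate analysis above refers to.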


2019 ◽  
Author(s):  
Jhonatan Candao ◽  
Lilian Berton

The scarcity of labeled data is a common problem in many applications. Semi-supervised learning (SSL) aims to minimize the need for human annotation by combining a small set of labeled data with a huge amount of unlabeled data. Similarly, active learning (AL) reduces the annotation effort by selecting the most informative points for annotation. Few works explore AL together with graph-based SSL; in this work, we combine both strategies and explore different techniques: two graph-based SSL methods and two AL query strategies in a pool-based scenario. Experimental results on artificial and real datasets indicate that our approach requires significantly fewer labeled instances to reach the same performance as random label selection.


2019 ◽  
Vol 18 (01) ◽  
pp. 79-108
Author(s):  
Baohuai Sheng ◽  
Haizhang Zhang

It is known that one aim of semi-supervised learning is to improve prediction performance using a few labeled data together with a large set of unlabeled data. Recently, a Laplacian regularized semi-supervised learning gradient (LapRSSLG) algorithm associated with data adjacency graph edge weights was proposed in the literature. The algorithm has been successful in applications, but no theory on its performance has been available. In this paper, an explicit learning rate estimate for the algorithm is provided, which shows that the convergence is indeed controlled by the unlabeled data.
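For context, Laplacian-regularized semi-supervised objectives of this family typically take the standard manifold-regularization form below (a generic statement of the setup; the paper's exact functional and gradient scheme may differ):

```latex
\min_{f \in \mathcal{H}_K}\;
\frac{1}{l}\sum_{i=1}^{l} V\bigl(f(x_i), y_i\bigr)
\;+\; \gamma_A \,\|f\|_K^2
\;+\; \frac{\gamma_I}{(l+u)^2} \sum_{i,j=1}^{l+u} w_{ij}\,\bigl(f(x_i)-f(x_j)\bigr)^2
```

Here $l$ and $u$ are the numbers of labeled and unlabeled points, $V$ is a loss on the labeled data, $w_{ij}$ are the adjacency-graph edge weights mentioned in the abstract, and the last term penalizes functions that vary across strongly connected (hence presumably same-class) pairs; since this term sums over all $l+u$ points, it is the channel through which the unlabeled data control convergence.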


Author(s):  
Fariborz Taherkhani ◽  
Hadi Kazemi ◽  
Nasser M. Nasrabadi

Convolutional Neural Networks (CNNs) have achieved promising results on image classification problems. However, training a CNN model relies on a large amount of labeled data. Considering the vast amount of unlabeled data available on the web, it is important to use these data in conjunction with a small set of labeled data to train a deep learning model. In this paper, we introduce a new iterative Graph-based Semi-Supervised Learning (GSSL) method to train a CNN-based classifier using a large amount of unlabeled data and a small amount of labeled data. In this method, we first construct a similarity graph in which the nodes represent the CNN features corresponding to the data points (labeled and unlabeled), while the edges tend to connect data points with the same class label. In this graph, the missing labels of the unlabeled nodes are predicted using a matrix completion method based on a rank-minimization criterion. In the next step, we use the constructed graph to compute a triplet regularization loss, which is added to the supervised loss on the initially labeled data to update the CNN network parameters.
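The triplet regularization term added to the supervised loss has, in its standard form, the shape sketched below. This is the generic margin-based triplet loss on feature embeddings, shown for illustration; the paper's graph-derived triplet sampling and exact formulation may differ.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss sketch: pull features with the same
    (predicted) label together, push different-label features apart
    by at least `margin` in squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])   # same label: already close
n = np.array([[2.0, 0.0]])   # different label: already far
print(triplet_loss(a, p, n))  # → 0.0 (the margin is satisfied)
```

In the method above, the anchor/positive/negative roles would be assigned from the graph (edges connecting same-label nodes supply positives), and this term is summed with the supervised loss before backpropagation.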


2020 ◽  
Vol 34 (04) ◽  
pp. 6583-6590
Author(s):  
Yi-Fan Yan ◽  
Sheng-Jun Huang ◽  
Shaoyi Chen ◽  
Meng Liao ◽  
Jin Xu

Labeling a text document is usually time-consuming because it requires the annotator to read the whole document and check its relevance to each possible class label. It thus becomes rather expensive to train an effective model for text classification on a large dataset of long documents. In this paper, we propose an active learning approach for text classification with lower annotation cost. Instead of scanning all the examples in the unlabeled data pool to select the best one to query, the proposed method automatically generates the most informative examples based on the classification model, and thus can be applied to tasks with large-scale or even infinite unlabeled data. Furthermore, we propose to approximate each generated example with a few summary words via sparse reconstruction, which allows the annotators to assign the class label by reading a few words rather than the whole long document. Experiments on different datasets demonstrate that the proposed approach can effectively improve classification performance while significantly reducing the annotation cost.


Chemosensors ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 78
Author(s):  
Jianhua Cao ◽  
Tao Liu ◽  
Jianjun Chen ◽  
Tao Yang ◽  
Xiuxiu Zhu ◽  
...  

Gas sensor drift is an important issue for electronic nose (E-nose) systems. This study addresses the setting that requires instant drift compensation over massive online E-nose responses. Recently, an active learning paradigm has been introduced for this setting. However, it does not consider the "noisy label" problem caused by the unreliability of the labeling process in real applications. We therefore propose a class-label appraisal methodology and an associated active learning framework to assess and correct noisy labels. To evaluate the performance of the proposed methodologies, we used datasets from two E-nose systems. The experimental results show that the proposed methodology helps the E-noses achieve higher accuracy with lower computation than the reference methods. Finally, we conclude that the proposed class-label appraisal mechanism is an effective means of enhancing the robustness of active-learning-based E-nose drift compensation.

