Combining Active Learning and Semi-Supervised Learning by Using Selective Label Spreading

Author(s):  
Xu Chen ◽  
Tao Wang
2014 ◽  
Vol 41 (5) ◽  
pp. 2372-2378 ◽  
Author(s):  
Yihao Zhang ◽  
Junhao Wen ◽  
Xibin Wang ◽  
Zhuo Jiang

Author(s):  
Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, that is, parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network. The pre-trained network is subsequently fine-tuned on human-annotated disfluency detection data. This self-supervised learning method captures task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated by simple heuristics and cannot cover all disfluency patterns, a performance gap remains compared to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation cost by choosing the most critical examples to label, and can thus address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
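The pseudo-data construction lends itself to a short sketch. Below is a minimal, hedged illustration of the random add/delete corruption described above, producing word-level tags for the tagging task and a sentence-level label for the classification task; the function name and corruption rates are illustrative assumptions, not the authors' exact procedure.

```python
import random

def make_pseudo_example(tokens, p_insert=0.1, p_delete=0.1, vocab=None):
    """Corrupt a clean sentence by randomly inserting or deleting words.

    Returns (noisy_tokens, tags, is_corrupted): tags mark inserted words
    with 1 (targets for the tagging task), and is_corrupted feeds the
    sentence-classification task. Deleted words leave no tag, but they
    still make the sentence grammatically incorrect.
    """
    vocab = vocab or tokens  # fallback: sample insertions from the sentence itself
    noisy, tags = [], []
    corrupted = False
    for tok in tokens:
        if random.random() < p_insert:   # randomly insert a noise word
            noisy.append(random.choice(vocab))
            tags.append(1)               # 1 = added noise word
            corrupted = True
        if random.random() < p_delete:   # randomly drop the current word
            corrupted = True
            continue
        noisy.append(tok)
        tags.append(0)                   # 0 = original word
    return noisy, tags, corrupted

# usage: build large-scale pseudo training pairs from unlabeled text
sent = "i want a flight to boston".split()
noisy, tags, label = make_pseudo_example(sent)
```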


Entropy ◽  
2019 ◽  
Vol 21 (10) ◽  
pp. 988 ◽  
Author(s):  
Fazakis ◽  
Kanas ◽  
Aridas ◽  
Karlos ◽  
Kotsiantis

One of the major aspects affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that actually enriches the information contained in a dataset. In this context, a variety of learning methods have been studied in the literature that aim to efficiently utilize large amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to efficiently utilize the available unlabeled data. Entropy and the distribution of class probabilities over the unlabeled set, two effective and robust metrics, are used to select the most informative unlabeled examples for augmenting the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the baseline supervised, semi-supervised, and active learning approaches on a wide range of fifty-five benchmark datasets.
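As a concrete illustration of the selection step, here is a minimal sketch, assuming a probabilistic classifier: predictive entropy routes the most uncertain examples to the human oracle (active learning), while the most confident predictions are self-labeled (semi-supervised learning). The classifier, threshold, and function names are illustrative, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def split_unlabeled_pool(model, X_unlabeled, n_query=10, conf_threshold=0.95):
    """Rank the unlabeled pool by predictive entropy.

    High-entropy examples are sent to the oracle (active learning);
    high-confidence examples are self-labeled (semi-supervised learning).
    """
    proba = model.predict_proba(X_unlabeled)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

    query_idx = np.argsort(entropy)[-n_query:]            # most uncertain -> oracle
    confident = np.max(proba, axis=1) >= conf_threshold   # most certain -> pseudo-label
    pseudo_idx = np.setdiff1d(np.where(confident)[0], query_idx)
    pseudo_labels = proba[pseudo_idx].argmax(axis=1)
    return query_idx, pseudo_idx, pseudo_labels

# usage inside a self-training loop (model re-fit after each augmentation):
# model = LogisticRegression().fit(X_labeled, y_labeled)
# query_idx, pseudo_idx, pseudo_y = split_unlabeled_pool(model, X_pool)
```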


2020 ◽  
Vol 12 (2) ◽  
pp. 297 ◽  
Author(s):  
Nasehe Jamshidpour ◽  
Abdolreza Safari ◽  
Saeid Homayouni

This paper introduces a novel multi-view multi-learner (MVML) active learning (AL) method, in which the different views are generated by a genetic algorithm (GA). The GA-based view generation method attempts to construct diverse, sufficient, and independent views by considering both inter- and intra-view confidences. Hyperspectral data are inherently high-dimensional, which makes them well suited to multi-view learning algorithms. Furthermore, by employing multiple learners in each view, a more accurate estimate of the underlying data distribution can be obtained. We also implemented a spectral-spatial graph-based semi-supervised learning (SSL) method as the classifier, which improved classification performance compared with supervised learning. The proposed method was evaluated on three benchmark hyperspectral data sets, and the results were compared with other state-of-the-art AL-SSL methods. The experimental results demonstrated the efficiency and statistically significant superiority of the proposed method: the GA-MVML AL method improved classification performance by 16.68%, 18.37%, and 15.1% on the three data sets after 40 iterations.
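The GA view-generation step is easier to picture with a sketch. Below is a minimal, hedged illustration under simplifying assumptions: the genome assigns each spectral band to one of n_views views, and fitness is approximated by mean per-view cross-validated accuracy (an intra-view confidence proxy only; the paper's fitness also accounts for inter-view confidence and independence, omitted here for brevity). All names and GA settings are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def view_fitness(genome, X, y, n_views):
    """Fitness of a feature-to-view assignment: mean per-view accuracy
    (intra-view confidence proxy); empty views score zero."""
    scores = []
    for v in range(n_views):
        cols = np.where(genome == v)[0]
        if len(cols) == 0:
            return 0.0
        clf = KNeighborsClassifier(n_neighbors=3)
        scores.append(cross_val_score(clf, X[:, cols], y, cv=3).mean())
    return float(np.mean(scores))

def ga_views(X, y, n_views=3, pop=20, gens=15, mut=0.1, seed=0):
    """Evolve a partition of the feature bands into n_views views."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    population = rng.integers(0, n_views, size=(pop, d))
    for _ in range(gens):
        fit = np.array([view_fitness(g, X, y, n_views) for g in population])
        parents = population[np.argsort(fit)[-pop // 2:]]   # truncation selection
        cuts = rng.integers(1, d, size=pop // 2)            # one-point crossover
        kids = np.array([np.concatenate([parents[i][:c], parents[-i - 1][c:]])
                         for i, c in enumerate(cuts)])
        mask = rng.random(kids.shape) < mut                 # random mutation
        kids[mask] = rng.integers(0, n_views, size=mask.sum())
        population = np.vstack([parents, kids])
    best = population[np.argmax([view_fitness(g, X, y, n_views)
                                 for g in population])]
    return [np.where(best == v)[0] for v in range(n_views)]  # band indices per view
```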


2019 ◽  
Vol 10 (35) ◽  
pp. 8154-8163 ◽  
Author(s):  
Yao Zhang ◽  
Alpha A. Lee

We report a statistically principled method to quantify the uncertainty of machine learning models for molecular property prediction. We show that this uncertainty estimate can be used to judiciously design experiments.
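The abstract does not specify the uncertainty estimator, so the sketch below uses a common stand-in rather than the authors' method: a bootstrap ensemble whose member disagreement (standard deviation of predictions) serves as the uncertainty estimate for ranking candidate experiments. All names and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ensemble_uncertainty(X_train, y_train, X_candidates, n_models=10):
    """Per-candidate predictive mean and spread from a bootstrap ensemble.

    The spread (std across ensemble members) serves as an uncertainty
    estimate; the most uncertain candidates are the most informative
    experiments to run next.
    """
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap resample
        model = RandomForestRegressor(n_estimators=50)
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_candidates))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# usage: pick the next experiments by highest uncertainty
# mean, std = ensemble_uncertainty(X_train, y_train, X_pool)
# next_idx = np.argsort(std)[-5:]
```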


2010 ◽  
Vol 108-111 ◽  
pp. 201-206 ◽  
Author(s):  
Hui Liu ◽  
Cai Ming Zhang ◽  
Hua Han

Among the various content-based image retrieval (CBIR) methods based on active learning, support vector machine (SVM) active learning is popular for its application to relevance feedback. However, regular SVM active learning has two main drawbacks when used for relevance feedback. Moreover, it is difficult to collect vast amounts of labeled image examples, while unlabeled images are easy to obtain, so it is necessary to define conditions under which the unlabeled examples can be fully exploited. This paper presents a semi-supervised SVM-based method for relevance feedback in medical image CBIR. It also introduces an algorithm that defines two learners; both learners are re-trained after every relevance feedback round, and each then assigns a rank to every image. Experiments show that applying semi-supervised learning ideas to CBIR is beneficial, and that the proposed method achieves better performance than some existing methods.
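As a hedged sketch of the two-learner scheme described above (not the authors' exact algorithm): two SVMs, each trained on its own feature view of the labeled images, are re-trained after each feedback round and each ranks the whole pool; their ranks are then averaged. The feature split and the rank-combination rule are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def feedback_round(X_labeled_a, X_labeled_b, y, X_pool_a, X_pool_b):
    """One relevance-feedback round with two SVM learners.

    Each learner is trained on its own feature view of the labeled
    images (y: 1 = relevant, 0 = irrelevant) and ranks the pool;
    ranks are averaged so images both learners consider relevant
    rise to the top.
    """
    ranks = []
    for Xl, Xp in [(X_labeled_a, X_pool_a), (X_labeled_b, X_pool_b)]:
        svm = SVC(kernel="rbf", probability=True).fit(Xl, y)
        score = svm.predict_proba(Xp)[:, 1]           # relevance score per image
        ranks.append(np.argsort(np.argsort(-score)))  # rank (0 = most relevant)
    combined = np.mean(ranks, axis=0)
    return np.argsort(combined)  # pool indices ordered by combined rank
```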


2016 ◽  
Vol 173 ◽  
pp. 1288-1298 ◽  
Author(s):  
Xibin Wang ◽  
Junhao Wen ◽  
Shafiq Alam ◽  
Zhuo Jiang ◽  
Yingbo Wu
