Combining Active Learning and Semi-Supervised Learning by Using Selective Label Spreading

Author(s):  
Xu Chen ◽  
Tao Wang
2014 ◽  
Vol 41 (5) ◽  
pp. 2372-2378 ◽  
Author(s):  
Yihao Zhang ◽  
Junhao Wen ◽  
Xibin Wang ◽  
Zhuo Jiang

Author(s):  
Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, that is, parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network. The pre-trained network is subsequently fine-tuned on human-annotated disfluency detection data. This self-supervised learning method captures task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated by simple heuristics and cannot cover all disfluency patterns, a performance gap remains compared to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation cost by choosing the most critical examples to label, and can thus address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
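The pseudo-data construction lends itself to a short sketch. Below is a minimal, hedged illustration of the random add/delete corruption described above, producing word-level tags for the tagging task and a sentence-level label for the classification task; the function name and corruption rates are illustrative assumptions, not the authors' exact procedure.

```python
import random

def make_pseudo_example(tokens, p_insert=0.1, p_delete=0.1, vocab=None):
    """Corrupt a clean sentence by randomly inserting or deleting words.

    Returns (noisy_tokens, tags, is_corrupted): tags mark inserted words
    with 1 (targets for the tagging task), and is_corrupted feeds the
    sentence-classification task. Deleted words leave no tag, but they
    still make the sentence grammatically incorrect.
    """
    vocab = vocab or tokens  # fallback: sample insertions from the sentence itself
    noisy, tags = [], []
    corrupted = False
    for tok in tokens:
        if random.random() < p_insert:   # randomly insert a noise word
            noisy.append(random.choice(vocab))
            tags.append(1)               # 1 = added noise word
            corrupted = True
        if random.random() < p_delete:   # randomly drop the current word
            corrupted = True
            continue
        noisy.append(tok)
        tags.append(0)                   # 0 = original word
    return noisy, tags, corrupted

# usage: build large-scale pseudo training pairs from unlabeled text
sent = "i want a flight to boston".split()
noisy, tags, label = make_pseudo_example(sent)
```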


Entropy ◽  
2019 ◽  
Vol 21 (10) ◽  
pp. 988 ◽  
Author(s):  
Fazakis ◽  
Kanas ◽  
Aridas ◽  
Karlos ◽  
Kotsiantis

One of the major aspects affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that actually enriches the information contained in a dataset. In this context, a variety of learning methods have been studied in the literature that aim to efficiently utilize large amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to efficiently utilize the available unlabeled data. Entropy and the distribution of class probabilities over the unlabeled set, two effective and robust metrics, are used to select the most informative unlabeled examples for augmenting the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the baseline supervised, semi-supervised, and active learning approaches on a wide range of fifty-five benchmark datasets.
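As a concrete illustration of the selection step, here is a minimal sketch, assuming a probabilistic classifier: predictive entropy routes the most uncertain examples to the human oracle (active learning), while the most confident predictions are self-labeled (semi-supervised learning). The classifier, threshold, and function names are illustrative, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def split_unlabeled_pool(model, X_unlabeled, n_query=10, conf_threshold=0.95):
    """Rank the unlabeled pool by predictive entropy.

    High-entropy examples are sent to the oracle (active learning);
    high-confidence examples are self-labeled (semi-supervised learning).
    """
    proba = model.predict_proba(X_unlabeled)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

    query_idx = np.argsort(entropy)[-n_query:]            # most uncertain -> oracle
    confident = np.max(proba, axis=1) >= conf_threshold   # most certain -> pseudo-label
    pseudo_idx = np.setdiff1d(np.where(confident)[0], query_idx)
    pseudo_labels = proba[pseudo_idx].argmax(axis=1)
    return query_idx, pseudo_idx, pseudo_labels

# usage inside a self-training loop (model re-fit after each augmentation):
# model = LogisticRegression().fit(X_labeled, y_labeled)
# query_idx, pseudo_idx, pseudo_y = split_unlabeled_pool(model, X_pool)
```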


2020 ◽  
Vol 12 (2) ◽  
pp. 297 ◽  
Author(s):  
Nasehe Jamshidpour ◽  
Abdolreza Safari ◽  
Saeid Homayouni

This paper introduces a novel multi-view multi-learner (MVML) active learning (AL) method, in which the different views are generated by a genetic algorithm (GA). The GA-based view generation method attempts to construct diverse, sufficient, and independent views by considering both inter- and intra-view confidences. Hyperspectral data are inherently high-dimensional, which makes them well suited to multi-view learning algorithms. Furthermore, by employing multiple learners in each view, a more accurate estimate of the underlying data distribution can be obtained. We also implemented a spectral-spatial graph-based semi-supervised learning (SSL) method as the classifier, which improved classification performance compared with supervised learning. The proposed method was evaluated on three benchmark hyperspectral data sets, and the results were compared with other state-of-the-art AL-SSL methods. The experimental results demonstrated the efficiency and statistically significant superiority of the proposed method: the GA-MVML AL method improved classification performance by 16.68%, 18.37%, and 15.1% on the three data sets after 40 iterations.
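The GA view-generation step is easier to picture with a sketch. Below is a minimal, hedged illustration under simplifying assumptions: the genome assigns each spectral band to one of n_views views, and fitness is approximated by mean per-view cross-validated accuracy (an intra-view confidence proxy only; the paper's fitness also accounts for inter-view confidence and independence, omitted here for brevity). All names and GA settings are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def view_fitness(genome, X, y, n_views):
    """Fitness of a feature-to-view assignment: mean per-view accuracy
    (intra-view confidence proxy); empty views score zero."""
    scores = []
    for v in range(n_views):
        cols = np.where(genome == v)[0]
        if len(cols) == 0:
            return 0.0
        clf = KNeighborsClassifier(n_neighbors=3)
        scores.append(cross_val_score(clf, X[:, cols], y, cv=3).mean())
    return float(np.mean(scores))

def ga_views(X, y, n_views=3, pop=20, gens=15, mut=0.1, seed=0):
    """Evolve a partition of the feature bands into n_views views."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    population = rng.integers(0, n_views, size=(pop, d))
    for _ in range(gens):
        fit = np.array([view_fitness(g, X, y, n_views) for g in population])
        parents = population[np.argsort(fit)[-pop // 2:]]   # truncation selection
        cuts = rng.integers(1, d, size=pop // 2)            # one-point crossover
        kids = np.array([np.concatenate([parents[i][:c], parents[-i - 1][c:]])
                         for i, c in enumerate(cuts)])
        mask = rng.random(kids.shape) < mut                 # random mutation
        kids[mask] = rng.integers(0, n_views, size=mask.sum())
        population = np.vstack([parents, kids])
    best = population[np.argmax([view_fitness(g, X, y, n_views)
                                 for g in population])]
    return [np.where(best == v)[0] for v in range(n_views)]  # band indices per view
```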


2019 ◽  
Vol 10 (35) ◽  
pp. 8154-8163 ◽  
Author(s):  
Yao Zhang ◽  
Alpha A. Lee

We report a statistically principled method to quantify the uncertainty of machine learning models for molecular property prediction. We show that this uncertainty estimate can be used to judiciously design experiments.
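The abstract does not specify the uncertainty estimator, so the sketch below uses a common stand-in rather than the authors' method: a bootstrap ensemble whose member disagreement (standard deviation of predictions) serves as the uncertainty estimate for ranking candidate experiments. All names and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ensemble_uncertainty(X_train, y_train, X_candidates, n_models=10):
    """Per-candidate predictive mean and spread from a bootstrap ensemble.

    The spread (std across ensemble members) serves as an uncertainty
    estimate; the most uncertain candidates are the most informative
    experiments to run next.
    """
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap resample
        model = RandomForestRegressor(n_estimators=50)
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_candidates))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# usage: pick the next experiments by highest uncertainty
# mean, std = ensemble_uncertainty(X_train, y_train, X_pool)
# next_idx = np.argsort(std)[-5:]
```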


2010 ◽  
Vol 108-111 ◽  
pp. 201-206 ◽  
Author(s):  
Hui Liu ◽  
Cai Ming Zhang ◽  
Hua Han

Among the various content-based image retrieval (CBIR) methods based on active learning, support vector machine (SVM) active learning is popular for its application to relevance feedback. However, regular SVM active learning has two main drawbacks when used for relevance feedback. Moreover, it is difficult to collect vast amounts of labeled image examples, while unlabeled images are easy to obtain, so it is necessary to define conditions under which the unlabeled examples can be fully exploited. This paper presents a semi-supervised SVM-based method for relevance feedback in medical image CBIR. It also introduces an algorithm that defines two learners; both learners are re-trained after every relevance feedback round, and each then assigns a rank to every image. Experiments show that applying semi-supervised learning ideas to CBIR is beneficial, and that the proposed method achieves better performance than some existing methods.
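As a hedged sketch of the two-learner scheme described above (not the authors' exact algorithm): two SVMs, each trained on its own feature view of the labeled images, are re-trained after each feedback round and each ranks the whole pool; their ranks are then averaged. The feature split and the rank-combination rule are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def feedback_round(X_labeled_a, X_labeled_b, y, X_pool_a, X_pool_b):
    """One relevance-feedback round with two SVM learners.

    Each learner is trained on its own feature view of the labeled
    images (y: 1 = relevant, 0 = irrelevant) and ranks the pool;
    ranks are averaged so images both learners consider relevant
    rise to the top.
    """
    ranks = []
    for Xl, Xp in [(X_labeled_a, X_pool_a), (X_labeled_b, X_pool_b)]:
        svm = SVC(kernel="rbf", probability=True).fit(Xl, y)
        score = svm.predict_proba(Xp)[:, 1]           # relevance score per image
        ranks.append(np.argsort(np.argsort(-score)))  # rank (0 = most relevant)
    combined = np.mean(ranks, axis=0)
    return np.argsort(combined)  # pool indices ordered by combined rank
```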


2016 ◽  
Vol 173 ◽  
pp. 1288-1298 ◽  
Author(s):  
Xibin Wang ◽  
Junhao Wen ◽  
Shafiq Alam ◽  
Zhuo Jiang ◽  
Yingbo Wu
