Risks of Semi-Supervised Learning: How Unlabeled Data Can Degrade Performance of Generative Classifiers

Recently, the fields of machine learning, pattern recognition, and data mining have witnessed a new research stream: learning with partial supervision (LPS), also known as semi-supervised learning. This learning scheme is motivated by the fact that acquiring label information for data can be quite costly and is sometimes prone to mislabeling. The general spectrum of learning from data is envisioned in Figure 1. As shown, in many situations, the data is neither perfectly nor completely labeled.

LPS aims at using the available labeled samples to guide the process of building classification and clustering machinery and to help boost its accuracy. Basically, LPS combines two learning paradigms, supervised and unsupervised: the former deals exclusively with labeled data, and the latter is concerned with unlabeled data. Hence the following questions:

- Can we improve supervised learning with unlabeled data?
- Can we guide unsupervised learning by incorporating a few labeled samples?

Typical LPS applications are medical diagnosis (Bouchachia & Pedrycz, 2006a), facial expression recognition (Cohen et al., 2004), text classification (Nigam et al., 2000), protein classification (Weston et al., 2003), and several natural language processing applications such as word sense disambiguation (Niu et al., 2005) and text chunking (Ando & Zhang, 2005).

Because LPS is still a young but active research field, it lacks a survey outlining the existing approaches and research trends. In this chapter, we take a step towards such an overview. We discuss (i) the background of LPS, (iii) the main focus of our LPS research and the underlying assumptions behind LPS, and (iv) future directions and challenges of LPS research.

Effective Anomaly Detection Model Training with only Unlabeled Data by Weakly Supervised Learning Techniques

10.1007/978-3-030-86890-1_23 ◽

2021 ◽

pp. 402-425

Author(s):

Wenzhuo Yang ◽

Kwok-Yan Lam

Keyword(s):

Anomaly Detection ◽

Supervised Learning ◽

Unlabeled Data ◽

Weakly Supervised Learning ◽

Detection Model ◽

Learning Techniques ◽

Model Training ◽

Weakly Supervised

Semi-Supervised Learning

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch192 ◽

2011 ◽

pp. 1022-1027

Author(s):

Tobias Scheffer

Keyword(s):

Supervised Learning ◽

Supervised Classification ◽

Unlabeled Data ◽

Training Data ◽

Classification Algorithms ◽

Classification Problems

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data.

Clustering-Based Transductive Semi-Supervised Learning for Learning-to-Rank

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001419510078 ◽

2019 ◽

Vol 33 (12) ◽

pp. 1951007 ◽

Cited By ~ 1

Author(s):

Ashwini Rahangdale ◽

Shital Raut

Keyword(s):

Supervised Learning ◽

Learning To Rank ◽

Cost Effective ◽

Unlabeled Data ◽

Training Data ◽

Density Region ◽

The Arts ◽

Proposed Model ◽

Supervised Learning Algorithms ◽

Multiple Loss

Learning-to-rank (LTR) is an active research topic in information retrieval (IR). An LTR framework usually learns the ranking function from available training data, which are costly and time-consuming to obtain and often biased. When a sufficient amount of training data is not available, semi-supervised learning is one machine learning paradigm that can be applied to obtain pseudo-labels for unlabeled data. Cluster-and-label is a basic semi-supervised approach that identifies high-density regions in the data space, which are then used to support supervised learning. However, clustering with conventional methods may lead to prediction performance worse than that of supervised learning algorithms when applied to LTR. Thus, we propose rank-preserving clustering (RPC) with PLocalSearch to obtain pseudo-labels for unlabeled data. We present a semi-supervised learning method that adopts a clustering-based transductive approach and combines it with a non-measure-specific listwise approach to learn the LTR model. Moreover, each cluster follows multi-task learning to avoid optimizing multiple loss functions. This reduces the training complexity of the adopted listwise approach from exponential to polynomial order. Empirical analysis on the standard LETOR datasets shows that the proposed model gives better results than other state-of-the-art methods.
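As a rough illustration of the generic cluster-and-label idea mentioned in this abstract, the sketch below clusters labeled and unlabeled points together with plain k-means and gives each unlabeled point the majority label of its cluster. It is not the authors' rank-preserving clustering or PLocalSearch, whose details the abstract does not give. The function names, the toy data, and the choice of the first k points as initial centroids are all assumptions made for the example.

```python
from collections import Counter

def nearest(p, centers):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

def kmeans(points, centers, iters=20):
    """Plain Lloyd's k-means; `centers` holds the initial centroids."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Recompute each centroid as the mean of its members; keep empty ones fixed.
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

def cluster_and_label(labeled, unlabeled, k):
    """labeled: list of (point, label) pairs; unlabeled: list of points.
    Returns one pseudo-label per unlabeled point (None if its cluster
    contains no labeled point)."""
    points = [p for p, _ in labeled] + list(unlabeled)
    centers = kmeans(points, points[:k])  # assumed init: first k points
    votes = [Counter() for _ in range(k)]
    for p, y in labeled:
        votes[nearest(p, centers)][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[nearest(p, centers)] for p in unlabeled]

# Toy data (hypothetical): two well-separated groups, one labeled point each.
labeled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(0.5, 0.2), (9.5, 10.1), (0.1, 0.9), (10.2, 9.7)]
pseudo_labels = cluster_and_label(labeled, unlabeled, k=2)
print(pseudo_labels)
```

The pseudo-labeled points could then be added to the training set of a supervised learner. The abstract's caveat applies to this naive version: if the clusters do not respect the structure of the task, the pseudo-labels can make the final model worse.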

Comparison of Adjusted Methods for Selecting Useful Unlabeled Data for Semi-Supervised Learning Algorithms

Current Approaches in Applied Artificial Intelligence - Lecture Notes in Computer Science ◽

10.1007/978-3-319-19066-2_51 ◽

2015 ◽

pp. 526-535

Author(s):

Thanh-Binh Le ◽

Sang-Woon Kim

Keyword(s):

Supervised Learning ◽

Learning Algorithms ◽

Unlabeled Data ◽

Supervised Learning Algorithms

Combination of Active Learning and Semi-Supervised Learning under a Self-Training Scheme

Entropy ◽

10.3390/e21100988 ◽

2019 ◽

Vol 21 (10) ◽

pp. 988 ◽

Cited By ~ 4

Author(s):

Fazakis ◽

Kanas ◽

Aridas ◽

Karlos ◽

Kotsiantis

Keyword(s):

Active Learning ◽

Supervised Learning ◽

Unlabeled Data ◽

Classification Algorithms ◽

Training Phase ◽

Learning Methods ◽

Training Scheme ◽

Wide Range ◽

Benchmark Datasets ◽

Scientific Fields

One of the major factors affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that improves the information contained in a given dataset. In this context, a variety of learning methods have been studied in the literature that aim to utilize vast amounts of unlabeled data efficiently during the learning process. The most common approaches tackle problems of this kind by applying active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to efficiently utilize the available unlabeled data. The entropy and the distribution of probabilities over the unlabeled set are used as effective and robust metrics to select the most suitable unlabeled examples for augmenting the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the base approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.
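One common way to combine the two paradigms under self-training can be sketched as follows: score each unlabeled example by the Shannon entropy of the classifier's predicted class probabilities, pseudo-label the lowest-entropy (most confident) examples, and send the highest-entropy ones to a human annotator. The abstract does not state the exact selection rule used in the paper, so the rule below, the function names, and the example probabilities are assumptions for illustration only.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def select(probs, n_confident, n_query):
    """Split unlabeled examples by prediction entropy.

    probs: one predicted class distribution per unlabeled example.
    Returns (pseudo, query): `pseudo` maps the indices of the n_confident
    lowest-entropy examples to their most probable class; `query` lists the
    indices of the n_query highest-entropy examples, to be labeled by a human.
    Assumes n_confident + n_query <= len(probs) so the two sets do not overlap.
    """
    order = sorted(range(len(probs)), key=lambda i: entropy(probs[i]))
    confident = order[:n_confident]
    query = order[len(order) - n_query:] if n_query else []
    pseudo = {i: max(range(len(probs[i])), key=probs[i].__getitem__)
              for i in confident}
    return pseudo, query

# Hypothetical classifier outputs for four unlabeled examples, two classes.
probs = [[0.98, 0.02], [0.5, 0.5], [0.1, 0.9], [0.6, 0.4]]
pseudo, query = select(probs, n_confident=2, n_query=1)
print(pseudo, query)
```

In a full self-training loop, both the pseudo-labeled and the human-labeled examples would be moved into the labeled set, the classifier retrained, and the selection repeated until the unlabeled pool or the labeling budget is exhausted.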

Exploiting Unlabeled Data in CNNs by Self-Supervised Learning to Rank

IEEE Transactions on Pattern Analysis and Machine Intelligence ◽

10.1109/tpami.2019.2899857 ◽

2019 ◽

Vol 41 (8) ◽

pp. 1862-1878 ◽

Cited By ~ 23

Author(s):

Xialei Liu ◽

Joost van de Weijer ◽

Andrew D. Bagdanov

Keyword(s):

Supervised Learning ◽

Learning To Rank ◽

Unlabeled Data

Time-Series Laplacian Semi-Supervised Learning for Indoor Localization

Sensors ◽

10.3390/s19183867 ◽

2019 ◽

Vol 19 (18) ◽

pp. 3867 ◽

Cited By ~ 2

Author(s):

Jaehyun Yoo

Keyword(s):

Time Series ◽

Supervised Learning ◽

Indoor Localization ◽

Learning Algorithm ◽

Unlabeled Data ◽

Training Data ◽

Practical Implementation ◽

Additional Information ◽

Localization Scheme ◽

The Impact

Machine learning-based indoor localization has suffered from the burden of collecting, constructing, and maintaining labeled training databases for practical implementation. Semi-supervised learning methods have been developed as efficient indoor localization methods that reduce the use of labeled training data. To boost the efficiency and accuracy of indoor localization, this paper proposes a new time-series semi-supervised learning algorithm. The key aspect of the developed method, which distinguishes it from conventional semi-supervised algorithms, is the use of unlabeled data. The learning algorithm finds spatio-temporal relationships in the unlabeled data, and pseudolabels are generated to compensate for the lack of labeled training data. In the next step, another balancing-optimization learning algorithm learns a positioning model. The proposed method is evaluated by estimating the location of a smartphone user from Wi-Fi received signal strength indicator (RSSI) measurements. The experimental results show that the developed learning algorithm outperforms some existing semi-supervised algorithms as the number of training data and access points varies. The paper also discusses why the proposed method gives better performance by analyzing the impact of the learning parameters. Moreover, the extended localization scheme is run in conjunction with a particle filter to include additional information, such as a floor plan.
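To give a feel for how a graph Laplacian can turn temporal structure into pseudolabels, the sketch below treats consecutive samples of a time series as neighbours on a chain graph and fills in missing positions by minimising a fit term on the labeled samples plus a Laplacian smoothness penalty. This is a generic Laplacian-regularised estimator, not the algorithm proposed in the paper. The one-dimensional positions, the unit edge weights, the parameter `lam`, and the Gauss-Seidel solver are all assumptions of the example.

```python
def laplacian_smooth(n, labels, lam=1.0, sweeps=2000):
    """Estimate a value f_i for each of n time-ordered samples.

    Minimises  sum_{i in labels} (f_i - y_i)^2  +  lam * sum_i (f_i - f_{i+1})^2
    by Gauss-Seidel sweeps. `labels` maps sample index -> known value y_i.
    Assumes n >= 2, lam > 0, and at least one labeled sample.
    """
    f = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Temporal neighbours on the chain graph.
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
            num = lam * sum(f[j] for j in nbrs)
            den = lam * len(nbrs)
            if i in labels:
                num += labels[i]
                den += 1.0
            # Closed-form minimiser of the objective in f_i, others held fixed.
            f[i] = num / den
    return f

# Hypothetical walk: five samples, position known only at the first and last.
estimates = laplacian_smooth(5, {0: 0.0, 4: 4.0})
print([round(v, 3) for v in estimates])
```

The estimates at the unlabeled indices play the role of pseudolabels. A larger `lam` trusts the temporal smoothness more and the labeled values less, which is the kind of trade-off an analysis of learning parameters would examine.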