Leveraging Unlabeled Data for Classification

Author(s):  
Yinghui Yang ◽  
Balaji Padmanabhan

Classification is a form of data analysis that can be used to extract models to predict categorical class labels (Han & Kamber, 2001). Data classification has proven to be very useful in a wide variety of applications. For example, a classification model can be built to categorize bank loan applications as either safe or risky. In order to build a classification model, training data containing multiple independent variables and a dependent variable (the class label) is needed. If a data record has a known value for its class label, the record is termed "labeled"; if the value is unknown, it is "unlabeled". There are situations with a large amount of unlabeled data and a small amount of labeled data. Using only labeled data to build classification models can ignore useful information contained in the unlabeled data. Furthermore, unlabeled data are often much cheaper and more plentiful than labeled data, so if useful information can be extracted from them that reduces the need for labeled examples, this can be a significant benefit (Balcan & Blum, 2005). The default practice is to use only the labeled data to build a classification model and then assign class labels to the unlabeled data. However, when the amount of labeled data is insufficient, a classification model built only on the labeled data can be biased and far from accurate, and the class labels it assigns to the unlabeled data can in turn be inaccurate. How to leverage the information contained in the unlabeled data to improve the accuracy of the classification model is therefore an important research question. Two streams of research address the challenging issue of how to appropriately use unlabeled data for building classification models. The details are discussed below.
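As a concrete illustration of the gap between the default practice and leveraging unlabeled data, the following is a minimal self-training sketch: a model is fit on the few labeled records, and its high-confidence predictions on the unlabeled records are iteratively adopted as labels. The synthetic data, base learner, and confidence threshold are illustrative assumptions, not a method from the text.

```python
# Minimal self-training sketch (illustrative assumptions throughout).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=1000, n_informative=5, random_state=0)
y = np.full(len(y_true), -1)   # -1 marks an "unlabeled" record
y[:50] = y_true[:50]           # a small labeled set

model = LogisticRegression(max_iter=1000)
for _ in range(5):                          # self-training rounds
    mask = y != -1
    model.fit(X[mask], y[mask])
    if mask.all():
        break                               # everything has a label now
    proba = model.predict_proba(X[~mask])
    confident = proba.max(axis=1) > 0.95    # only adopt confident predictions
    if not confident.any():
        break
    idx = np.flatnonzero(~mask)[confident]
    y[idx] = model.classes_[proba[confident].argmax(axis=1)]  # pseudo-labels
```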

2021 ◽  
Vol 16 (1) ◽  
pp. 1-23
Author(s):  
Min-Ling Zhang ◽  
Jun-Peng Fang ◽  
Yi-Bo Wang

In multi-label classification, the task is to induce predictive models which can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced from its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, the predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for the unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.
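The pipeline described (per-pair prototype selection, embedding, ensembling of pairwise predictions) can be sketched roughly as below. The prototype counts, the distance embedding, and the base learner are assumptions for illustration; the paper's actual heuristics are not reproduced.

```python
# Rough sketch of the BiLabel-specific-features idea (assumed details).
from itertools import combinations
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def select_prototypes(X, Y, j, k, n_proto=3):
    """Cluster the positives of each label in the pair; centers = prototypes.
    Assumes every label has both positive and negative instances."""
    centers = [KMeans(n_clusters=min(n_proto, int((Y[:, l] == 1).sum())),
                      n_init=10, random_state=0)
               .fit(X[Y[:, l] == 1]).cluster_centers_ for l in (j, k)]
    return np.vstack(centers)

def embed(X, protos):
    # tailored features for the pair = distances to its prototypes
    return np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)

def fit_bilabel(X, Y):
    models = {}
    for j, k in combinations(range(Y.shape[1]), 2):
        P = select_prototypes(X, Y, j, k)
        Z = embed(X, P)
        clfs = [LogisticRegression(max_iter=1000).fit(Z, Y[:, l]) for l in (j, k)]
        models[(j, k)] = (P, clfs)
    return models

def predict_bilabel(models, X, n_labels):
    votes, counts = np.zeros((len(X), n_labels)), np.zeros(n_labels)
    for (j, k), (P, clfs) in models.items():
        Z = embed(X, P)
        for l, clf in zip((j, k), clfs):
            votes[:, l] += clf.predict_proba(Z)[:, 1]  # relevancy vote
            counts[l] += 1
    return (votes / counts) > 0.5   # ensemble the pairwise predictions
```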


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
S Gao ◽  
D Stojanovski ◽  
A Parker ◽  
P Marques ◽  
S Heitner ◽  
...  

Abstract
Background: Correctly identifying the views acquired in a 2D echocardiographic examination is paramount to the post-processing and quantification steps often performed as part of most clinical workflows. In many exams, particularly in stress echocardiography, microbubble contrast is used, which greatly affects the appearance of the cardiac views. Here we present a bespoke, fully automated convolutional neural network (CNN) which identifies apical 2, 3, and 4 chamber, and short axis (SAX) views acquired with and without contrast. The CNN was tested on a completely independent, external dataset acquired in a different country than the data used to train the network.
Methods: Training data comprising 2D echocardiograms were taken from 1014 subjects in a prospective multisite, multi-vendor UK trial, with more than 17,500 frames in each view. Prior to view classification model training, images were processed using standard techniques to ensure homogeneous and normalised image inputs to the training pipeline. A bespoke CNN was built using the minimum number of convolutional layers required, with batch normalisation and dropout to reduce overfitting. Before processing, the data were split into 90% for model training (211,958 frames) and 10% for validation (23,946 frames). Image frames from different subjects were separated entirely between the training and validation datasets. Further, a separate trial dataset of 240 studies acquired in the USA was used as an independent test dataset (39,401 frames).
Results: Figure 1 shows the confusion matrices for both the validation data (left) and the independent test data (right), with an overall accuracy of 96% and 95% for the validation and test datasets respectively. The accuracy of >99% for the non-contrast cardiac views exceeds that reported in other work. The combined datasets included images acquired across ultrasound manufacturers and models from 12 clinical sites.
Conclusion: We have developed a CNN capable of automatically and accurately identifying all relevant cardiac views used in "real world" echo exams, including views acquired with contrast. Use of the CNN in a routine clinical workflow could improve the efficiency of quantification steps performed after image acquisition. The model was tested on an independent dataset acquired in a different country to that used for training and was found to perform similarly, indicating the generalisability of the model.
Figure 1. Confusion matrices.
Funding Acknowledgement: Type of funding source: Private company. Main funding source(s): Ultromics Ltd.
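The abstract names only the network's ingredients (a minimal stack of convolutional layers, batch normalisation, dropout), so the following PyTorch sketch is an assumed instantiation, not the paper's architecture; the eight view classes (four views, with and without contrast) and the input size are likewise assumptions.

```python
# Assumed minimal CNN for echo view classification (not the paper's network).
import torch
import torch.nn as nn

class ViewCNN(nn.Module):
    def __init__(self, n_views=8):   # 4 views x (contrast / non-contrast)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5), nn.Linear(64, n_views))

    def forward(self, x):
        return self.classifier(self.features(x))

model = ViewCNN()
logits = model(torch.randn(8, 1, 128, 128))  # batch of normalised frames
```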


Author(s):  
SHI ZHONG

Using unlabeled data to help supervised learning has become an increasingly attractive methodology and has proven effective in many applications. This paper applies semi-supervised classification algorithms, based on hidden Markov models, to classify sequences. For model-based classification, semi-supervised learning amounts to using both labeled and unlabeled data to train model parameters. We examine three different strategies for using labeled and unlabeled data in the model training process. These strategies differ in how and when labeled and unlabeled data contribute to the model training process. We also compare regular semi-supervised learning, where there are separate unlabeled training data and unlabeled test data, with transductive learning, where we do not differentiate between unlabeled training data and unlabeled test data. Our experimental results on synthetic and real EEG time series show that substantially improved classification accuracy can be achieved by these semi-supervised learning strategies. The effect of model complexity on semi-supervised learning is also studied in our experiments.
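One strategy of this model-based kind can be sketched as alternating between fitting one HMM per class and reassigning unlabeled sequences to their most likely class. The hmmlearn library, the Gaussian emissions, and the state count below are stand-in assumptions, not the paper's implementation.

```python
# Sketch: model-based semi-supervised sequence classification with HMMs.
import numpy as np
from hmmlearn import hmm

def fit_hmm(seqs, n_states=3):
    # hmmlearn takes stacked observations plus per-sequence lengths
    return hmm.GaussianHMM(n_components=n_states, n_iter=50).fit(
        np.concatenate(seqs), [len(s) for s in seqs])

def semi_supervised_hmm(labeled, unlabeled, rounds=3):
    """labeled: {class: [seq, ...]}; unlabeled: [seq, ...]; seq is (T, d)."""
    pseudo = {c: [] for c in labeled}
    for _ in range(rounds):
        # one HMM per class on labeled + currently pseudo-labeled sequences
        models = {c: fit_hmm(labeled[c] + pseudo[c]) for c in labeled}
        # reassign every unlabeled sequence to its most likely class
        pseudo = {c: [] for c in labeled}
        for s in unlabeled:
            pseudo[max(models, key=lambda c: models[c].score(s))].append(s)
    return models
```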


Sensors ◽  
2020 ◽  
Vol 20 (16) ◽  
pp. 4402
Author(s):  
Pekka Siirtola ◽  
Juha Röning

In this article, regression and classification models are compared for stress detection. Both personal and user-independent models are experimented with. The article is based on a publicly open dataset called AffectiveROAD, which contains data gathered using the Empatica E4 sensor and, unlike most other stress detection datasets, contains continuous target variables. The classification model used is a random forest, and the regression model is a bagged-tree ensemble. Based on the experiments, regression models outperform classification models when classifying observations as stressed or not-stressed. The best user-independent results are obtained using a combination of blood volume pulse and skin temperature features; with these, the average balanced accuracy was 74.1% with the classification model and 82.3% with the regression model. In addition, regression models can be used to estimate the level of stress. However, the results based on models trained using personal data are not encouraging, showing that biosignals have a lot of variation not only between study subjects but also between sessions gathered from the same person. On the other hand, it is shown that with subject-wise feature selection for the user-independent model, it is possible to improve recognition models more than by using personal training data to build personal models. In fact, it is shown that with subject-wise feature selection, the average detection rate can be improved by as much as 4 percentage points, and it is especially useful for reducing the variance in recognition rates between study subjects.
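The comparison drawn here can be mimicked with scikit-learn: a random-forest classifier trained on binarised labels versus a bagged-tree regressor trained on the continuous stress metric and thresholded afterwards. The synthetic features and the median threshold are placeholders, not the AffectiveROAD setup.

```python
# Illustrative classification-vs-regression comparison (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))           # stand-in for BVP/skin-temp features
stress = X[:, 0] + X[:, 1] + rng.normal(size=500)     # continuous target
y = (stress > np.median(stress)).astype(int)          # binarised labels
tr, te = np.arange(400), np.arange(400, 500)

clf = RandomForestClassifier(random_state=0).fit(X[tr], y[tr])
print("classification:", balanced_accuracy_score(y[te], clf.predict(X[te])))

reg = BaggingRegressor(DecisionTreeRegressor(), random_state=0)
reg.fit(X[tr], stress[tr])
pred = (reg.predict(X[te]) > np.median(stress[tr])).astype(int)
print("regression + threshold:", balanced_accuracy_score(y[te], pred))
```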


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 2017
Author(s):  
Jinya Wang ◽  
Zhenye Li ◽  
Qihang Chen ◽  
Kun Ding ◽  
Tingting Zhu ◽  
...  

Defective hard candies are usually produced due to inadequate feeding or insufficient cooling during the candy production process. The human-based inspection strategy needs to be brought up to date with the rapid developments in the confectionery industry. In this paper, a detection and classification method for defective hard candies based on convolutional neural networks (CNNs) is proposed. First, the threshold_li method is used to distinguish between hard candy and background. Second, a segmentation algorithm based on concave point detection and ellipse fitting is used to split hard candies that stick together. Finally, a classification model based on CNNs is constructed for defective hard candies. According to the types of defective hard candies, 2552 hard candy samples were collected; 70% were used for model training, 15% for validation, and 15% for testing. Defective hard candy classification models based on CNNs (AlexNet, GoogLeNet, VGG16, ResNet-18, ResNet-34, ResNet-50, MobileNetV2, and MnasNet0_5) were constructed and tested. The results show that the classification performances of these deep learning models are similar, except for MnasNet0_5 with a classification accuracy of 84.28%, and that the ResNet-50-based classification model performs best (98.71%). This research has theoretical reference value for the intelligent classification of granular products.
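The first and last pipeline stages map directly onto standard library calls; a hedged sketch follows. The file name, class count, and the use of an ImageNet-pretrained ResNet-50 are assumptions, and the concave-point/ellipse-fitting split of touching candies (stage 2) is omitted.

```python
# Sketch of stages 1 and 3 of the candy pipeline (assumed details).
import torch.nn as nn
from torchvision import models
from skimage import io, color, measure
from skimage.filters import threshold_li

img = io.imread("candies.png")                       # hypothetical image
gray = color.rgb2gray(img)
mask = gray > threshold_li(gray)                     # stage 1: Li threshold
blobs = measure.regionprops(measure.label(mask))     # candidate candy regions
# ...stage 2 (concave point detection + ellipse fitting) would split
# touching candies here; each region is then cropped and resized.

n_classes = 5                                        # assumed defect categories
net = models.resnet50(weights="IMAGENET1K_V1")       # stage 3: CNN classifier
net.fc = nn.Linear(net.fc.in_features, n_classes)
```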


Author(s):  
Pengyi Yang ◽  
Wei Liu ◽  
Jean Yang

Learning from positive and unlabeled data frequently occurs in applications where only a subset of positive instances is available, while the rest of the data are unlabeled. In such scenarios, the goal is often to create a discriminant model that can accurately classify both positive and negative data by modelling from labeled and unlabeled instances. In this study, we propose an adaptive sampling (AdaSampling) approach that utilises prediction probabilities from a model to iteratively update the training data. Starting with equal prior probabilities for all unlabeled data, our method "wraps" around a predictive model to iteratively update these probabilities so as to distinguish positive and negative instances in the unlabeled data. Subsequently, one or more robust negative sets can be drawn from the unlabeled data, according to the likelihood of each instance being negative, to train a single classification model or an ensemble of models.
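The loop described can be sketched compactly: unlabeled instances start with equal prior probabilities, a wrapped model is refit on the positives plus a sampled negative set, and the model's own predictions update the sampling probabilities. The base learner and round count below are assumptions.

```python
# Hedged sketch of the AdaSampling wrapper loop.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adasampling(X_pos, X_unl, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    p_neg = np.full(len(X_unl), 0.5)     # equal priors for all unlabeled data
    model = None
    for _ in range(rounds):
        # draw a negative set in proportion to current negative probabilities
        neg = rng.random(len(X_unl)) < p_neg
        X = np.vstack([X_pos, X_unl[neg]])
        y = np.r_[np.ones(len(X_pos)), np.zeros(neg.sum())]
        model = LogisticRegression(max_iter=1000).fit(X, y)
        p_neg = model.predict_proba(X_unl)[:, 0]     # update the probabilities
    return model, p_neg
```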


Author(s):  
Tobias Scheffer

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data. Semi-supervised learning (for an overview, see Seeger, 2001) has a long tradition in statistics (Cooper & Freeman, 1970); much early work focused on Bayesian discrimination of Gaussians. The Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is the most popular method for learning generative models from labeled and unlabeled data. Model-based, generative learning algorithms find model parameters (e.g., the parameters of a Gaussian mixture model) that best explain the available labeled and unlabeled data, and they derive the discriminating classification hypothesis from this model. In discriminative learning, unlabeled data are typically incorporated via the integration of some model assumption into the discriminative framework (Miller & Uyar, 1997; Titterington, Smith, & Makov, 1985). The Transductive Support Vector Machine (Vapnik, 1998; Joachims, 1999) uses unlabeled data to identify a hyperplane that has a large distance not only from the labeled data but also from all unlabeled data. This results in a bias toward placing the hyperplane in regions of low density p(x). More recently, graph-based approaches have been studied that rely on the assumption that neighboring instances are more likely to belong to the same class than remote instances (Blum & Chawla, 2001). A distinct approach to utilizing unlabeled data has been proposed by de Sa (1994), Yarowsky (1995), and Blum and Mitchell (1998). When the available attributes can be split into independent and compatible subsets, multi-view learning algorithms can be employed. Multi-view algorithms, such as co-training (Blum & Mitchell, 1998) and co-EM (Nigam & Ghani, 2000), learn two independent hypotheses which bootstrap by providing each other with labels for the unlabeled data. An analysis of why training two independent hypotheses that provide each other with conjectured class labels for unlabeled data can be better than EM-like self-training has been provided by Dasgupta, Littman, and McAllester (2001) and simplified by Abney (2002): the disagreement rate of two independent hypotheses is an upper bound on the error rate of either hypothesis. Multi-view algorithms minimize the disagreement rate between the peer hypotheses (a situation that is most apparent for the algorithm of Collins & Singer, 1999) and thereby the error rate. Semi-supervised learning is related to active learning. Active learning algorithms are able to actively query the class labels of unlabeled data; by contrast, semi-supervised algorithms are bound to learn from the given data.
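Of the families surveyed, multi-view co-training is perhaps the easiest to sketch: two classifiers, one per attribute view, repeatedly label the unlabeled pool for each other. The base learner, confidence cut-off, and round count below are assumptions; a faithful implementation would also control how many conjectured labels are adopted per round.

```python
# Simplified co-training sketch in the spirit of Blum & Mitchell (1998).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, rounds=5, threshold=0.9):
    """X1/X2: the two attribute views; y uses -1 for unlabeled instances."""
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        mask = y != -1
        h1.fit(X1[mask], y[mask])
        h2.fit(X2[mask], y[mask])
        for h, X in ((h1, X1), (h2, X2)):  # each hypothesis labels for its peer
            if mask.all():
                break                      # nothing left to label
            proba = h.predict_proba(X[~mask])
            conf = proba.max(axis=1) >= threshold
            idx = np.flatnonzero(~mask)[conf]
            y[idx] = h.classes_[proba[conf].argmax(axis=1)]
            mask = y != -1                 # refresh after adopting labels
    return h1, h2
```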


2021 ◽  
Vol 11 (22) ◽  
pp. 10536
Author(s):  
Hua Cheng ◽  
Renjie Yu ◽  
Yixin Tang ◽  
Yiquan Fang ◽  
Tao Cheng

Generic language models pretrained on large, unspecific domains are currently the foundation of NLP. Labeled data are limited in most model training due to the cost of manual annotation, especially in domains with many proper nouns, such as mathematics and biology, where the shortage affects the accuracy and robustness of model prediction. However, directly applying a generic language model to a specific domain does not work well. This paper introduces a BERT-based text classification model enhanced by unlabeled data (UL-BERT) in the LaTeX formula domain. A two-stage pretraining model based on BERT (TP-BERT) is pretrained on unlabeled data in the LaTeX formula domain. A double-prediction pseudo-labeling (DPP) method is introduced to obtain high-confidence pseudo-labels for unlabeled data by self-training. Moreover, a multi-round teacher–student model training approach is proposed for training UL-BERT with few labeled data and more unlabeled data carrying pseudo-labels. Experiments on classification in the LaTeX formula domain show that classification accuracy is significantly improved by UL-BERT, with the F1 score enhanced by up to 2.76%, and that fewer resources are needed for model training. We conclude that our method may be applicable to other specific domains with enormous unlabeled data and limited labeled data.
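The DPP step can be isolated from the rest of the pipeline: keep a pseudo-label only when two predictions for the same unlabeled text agree with high confidence. Where the two probability matrices come from (e.g., two forward passes or two teacher checkpoints) is an assumption here; the helper below is hypothetical, not the paper's code.

```python
# Sketch of double-prediction pseudo-labeling (DPP) as described.
import numpy as np

def dpp_pseudo_labels(proba_a, proba_b, threshold=0.9):
    """proba_a/proba_b: (n_unlabeled, n_classes) predictions from two passes."""
    la, lb = proba_a.argmax(axis=1), proba_b.argmax(axis=1)
    conf = np.minimum(proba_a.max(axis=1), proba_b.max(axis=1))
    keep = (la == lb) & (conf >= threshold)   # confident agreement only
    return np.flatnonzero(keep), la[keep]     # indices and pseudo-labels

# Kept examples would then join the labeled set for the next
# teacher-student training round.
```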


2020 ◽  
Vol 34 (04) ◽  
pp. 6583-6590
Author(s):  
Yi-Fan Yan ◽  
Sheng-Jun Huang ◽  
Shaoyi Chen ◽  
Meng Liao ◽  
Jin Xu

Labeling a text document is usually time-consuming because it requires the annotator to read the whole document and check its relevance to each possible class label. It thus becomes rather expensive to train an effective model for text classification when a large dataset of long documents is involved. In this paper, we propose an active learning approach for text classification with lower annotation cost. Instead of scanning all the examples in the unlabeled data pool to select the best one to query, the proposed method automatically generates the most informative examples based on the classification model, and thus can be applied to tasks with large-scale or even infinite unlabeled data. Furthermore, we propose to approximate the generated example with a few summary words by sparse reconstruction, which allows annotators to easily assign the class label by reading a few words rather than the long document. Experiments on different datasets demonstrate that the proposed approach can effectively improve classification performance while significantly reducing the annotation cost.
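The sparse-reconstruction step can be sketched in isolation: approximate the generated example's vector as a sparse nonnegative combination of vocabulary word vectors and show the annotator the words with nonzero weight. Lasso is an assumed solver here, and `word_vecs`/`vocab` are hypothetical inputs, not the paper's formulation.

```python
# Sketch: summarising a generated example with a few words via a sparse fit.
import numpy as np
from sklearn.linear_model import Lasso

def summary_words(example_vec, word_vecs, vocab, alpha=0.05, k=5):
    """word_vecs: (V, d) matrix of word vectors; vocab: list of V words."""
    lasso = Lasso(alpha=alpha, positive=True, max_iter=5000)
    lasso.fit(word_vecs.T, example_vec)       # reconstruct from word vectors
    top = np.argsort(lasso.coef_)[::-1][:k]   # largest weights first
    return [vocab[i] for i in top if lasso.coef_[i] > 0]
```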


2021 ◽  
Vol 11 (6) ◽  
pp. 2511
Author(s):  
Julian Hatwell ◽  
Mohamed Medhat Gaber ◽  
R. Muhammad Atif Azad

This research presents Gradient Boosted Tree High Importance Path Snippets (gbt-HIPS), a novel, heuristic method for explaining gradient boosted tree (GBT) classification models by extracting a single classification rule (CR) from the ensemble of decision trees that make up the GBT model. This CR contains the most statistically important boundary values of the input space as antecedent terms. The CR represents a hyper-rectangle of the input space inside which the GBT model is, very reliably, classifying all instances with the same class label as the explanandum instance. In a benchmark test using nine data sets and five competing state-of-the-art methods, gbt-HIPS offered the best trade-off between coverage (0.16–0.75) and precision (0.85–0.98). Unlike competing methods, gbt-HIPS is also demonstrably guarded against under- and over-fitting. A further distinguishing feature of our method is that, unlike much prior work, our explanations also provide counterfactual detail in accordance with widely accepted recommendations for what makes a good explanation.
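The hyper-rectangle idea at the heart of the method can be illustrated (without the paper's importance weighting or snippet scoring) by walking each tree's decision path for the explanandum instance and merging the thresholds met along the way into per-feature bounds.

```python
# Illustrative rule extraction from a GBT (not the full gbt-HIPS procedure).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
gbt = GradientBoostingClassifier(n_estimators=50).fit(X, y)
x = X[0]                                     # explanandum instance

lower = np.full(X.shape[1], -np.inf)
upper = np.full(X.shape[1], np.inf)
for tree in gbt.estimators_.ravel():         # every regression tree
    t, node = tree.tree_, 0
    while t.children_left[node] != -1:       # follow x down to a leaf
        f, thr = t.feature[node], t.threshold[node]
        if x[f] <= thr:
            upper[f] = min(upper[f], thr)    # tighten the upper bound
            node = t.children_left[node]
        else:
            lower[f] = max(lower[f], thr)    # tighten the lower bound
            node = t.children_right[node]

rule = [f"{lower[f]:.3g} < x[{f}] <= {upper[f]:.3g}" for f in range(X.shape[1])
        if np.isfinite(lower[f]) or np.isfinite(upper[f])]
print(" AND ".join(rule))                    # the hyper-rectangle as a rule
```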

