Improving the Classification of Rare Chords With Unlabeled Data

The lack of labeled data is one of the main prohibiting issues on the development of deep learning models, as they rely on large labeled datasets in order to achieve high accuracy in complex tasks. Our objective is to evaluate the performance gain of having additional unlabeled data in the training of a deep learning model when working with medical imaging data. We present a semi-supervised learning algorithm that utilizes a teacher-student paradigm in order to leverage unlabeled data in the classification of chest X-ray images. Using our algorithm on the ChestX-ray14 dataset, we manage to achieve a substantial increase in performance when using small labeled datasets. With our method, a model achieves an AUROC of 0.822 with only 2% labeled data and 0.865 with 5% labeled data, while a fully supervised method achieves an AUROC of 0.807 with 5% labeled data and only 0.845 with 10%.

Download Full-text

Clustering unlabeled data with SOMs improves classification of labeled real-world data

Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290) ◽

10.1109/ijcnn.2002.1007489 ◽

2003 ◽

Cited By ~ 30

Author(s):

R. Dara ◽

S.C. Kremer ◽

D.A. Stacey

Keyword(s):

Real World ◽

Unlabeled Data ◽

Real World Data ◽

World Data

Download Full-text

Classification of Diabetic Retinopathy Using Unlabeled Data and Knowledge Distillation

Artificial Intelligence in Medicine ◽

10.1016/j.artmed.2021.102176 ◽

2021 ◽

pp. 102176

Author(s):

Sajjad Abbasi ◽

Mohsen Hajabdollahi ◽

Pejman Khadivi ◽

Nader Karimi ◽

Roshanak Roshandel ◽

...

Keyword(s):

Diabetic Retinopathy ◽

Unlabeled Data ◽

Knowledge Distillation

Download Full-text

Semi-Supervised Classification with Co-Training for Deep Web

Key Engineering Materials ◽

10.4028/www.scientific.net/kem.439-440.183 ◽

2010 ◽

Vol 439-440 ◽

pp. 183-188 ◽

Cited By ~ 1

Author(s):

Wei Fang ◽

Zhi Ming Cui

Keyword(s):

Supervised Classification ◽

Deep Web ◽

Unlabeled Data ◽

Supervised Machine Learning ◽

Web Pages ◽

The Novel ◽

Training Process ◽

Novel Approach ◽

The Cost

The main problems in Web Pages classification are lack of labeled data, as well as the cost of labeling the unlabeled data. In this paper we discuss the application of semi-supervised machine learning method co-training on classification of Deep Web query interfaces to boost the performance of a classifier. Then, Bayes and Maxim Entropy algorithm are co-operated to incorporate labeled data with unlabeled data in training process incrementally. Our experiment results show the novel approach has a promising performance.

Download Full-text

Co-Training Semi-Supervised Deep Learning for Sentiment Classification of MOOC Forum Posts

Symmetry ◽

10.3390/sym12010008 ◽

2019 ◽

Vol 12 (1) ◽

pp. 8

Author(s):

Jing Chen ◽

Jun Feng ◽

Xia Sun ◽

Yang Liu

Keyword(s):

Deep Learning ◽

Large Scale ◽

Sample Selection ◽

Classification Performance ◽

Sentiment Classification ◽

Unlabeled Data ◽

Learning Performance ◽

Average Accuracy ◽

Complex Features

Sentiment classification of forum posts of massive open online courses is essential for educators to make interventions and for instructors to improve learning performance. Lacking monitoring on learners’ sentiments may lead to high dropout rates of courses. Recently, deep learning has emerged as an outstanding machine learning technique for sentiment classification, which extracts complex features automatically with rich representation capabilities. However, deep neural networks always rely on a large amount of labeled data for supervised training. Constructing large-scale labeled training datasets for sentiment classification is very laborious and time consuming. To address this problem, this paper proposes a co-training, semi-supervised deep learning model for sentiment classification, leveraging limited labeled data and massive unlabeled data simultaneously to achieve performance comparable to those methods trained on massive labeled data. To satisfy the condition of two views of co-training, we encoded texts into vectors from views of word embedding and character-based embedding independently, considering words’ external and internal information. To promote the classification performance with limited data, we propose a double-check strategy sample selection method to select samples with high confidence to augment the training set iteratively. In addition, we propose a mixed loss function both considering the labeled data with asymmetric and unlabeled data. Our proposed method achieved a 89.73% average accuracy and an 93.55% average F1-score, about 2.77% and 3.2% higher than baseline methods. Experimental results demonstrate the effectiveness of the proposed model trained on limited labeled data, which performs much better than those trained on massive labeled data.

Download Full-text

Classification of Few Labeled Images Based on Integrated GMM Clustering

Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University ◽

10.1051/jnwpu/20193730465 ◽

2019 ◽

Vol 37 (3) ◽

pp. 465-470

Author(s):

Pengfei Zhang ◽

Minzhou Dong ◽

Junhong Duan

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Classification Accuracy ◽

Unlabeled Data ◽

Data Sets ◽

Neural Network Training ◽

Model Classification ◽

Network Training ◽

Present Algorithm

In order to improve the classifier classification accuracy of by using convolutional neural network training, a large amount of labeled data is often required, but sometimes labeled data is not easily obtained.This paper proposes a solution based on the idea of integrated GMM clustering and label delivery for classifying images with few labeled samples, assigning tags to unlabeled data through certain rules, and converting unlabeled data into labeled data for training of the model.In this paper, experiments are performed on hand-written digital recognition data sets. The results show that the present algorithm has a great improvement in the accuracy of model classification comparing with the method of using only labeled samples in the case of few labeled samples. The effectiveness of the present algorithm is validated.

Download Full-text

Using unlabeled data to improve classification of emotional states in human computer interaction

Journal on Multimodal User Interfaces ◽

10.1007/s12193-013-0133-0 ◽

2013 ◽

Vol 8 (1) ◽

pp. 5-16 ◽

Cited By ~ 21

Author(s):

Martin Schels ◽

Markus Kächele ◽

Michael Glodek ◽

David Hrabal ◽

Steffen Walter ◽

...

Keyword(s):

Human Computer Interaction ◽

Unlabeled Data ◽

Emotional States ◽

Computer Interaction

Download Full-text

Semi-Supervised Japanese Word Sense Disambiguation Based on Two-Stage Classification of Unlabeled Data and Ensemble Learning

Journal of Natural Language Processing ◽

10.5715/jnlp.18.247 ◽

2011 ◽

Vol 18 (3) ◽

pp. 247-271

Author(s):

Tatsukuni Inoue ◽

Hiroaki Saito

Keyword(s):

Ensemble Learning ◽

Word Sense Disambiguation ◽

Unlabeled Data ◽

Word Sense ◽

Two Stage ◽

Japanese Word ◽

Sense Disambiguation ◽

Stage Classification

Download Full-text

Robust Machine Learning Classification of Unlabeled Biological Data: A case study with herbaria sheets

Biodiversity Information Science and Standards ◽

10.3897/biss.5.73833 ◽

2021 ◽

Vol 5 ◽

Author(s):

Jonathan Koss ◽

Anthony Jiang ◽

Patrick Sweeney ◽

Nelson Rios ◽

Aaron Dollar

Keyword(s):

Biological Data ◽

Unlabeled Data ◽

Flowering Plant ◽

Training Dataset ◽

High Confidence ◽

Deep Convolutional Neural Networks ◽

Machine Learning Classification ◽

Lower Accuracy ◽

Deep Cnn

There is much excitement across a broad range of biological disciplines over the prospect of using deep learning and similar modern statistical methods to label research data. The extensive time, effort, and cost required for humans to label a dataset drastically limits the type and amount of data that can be reasonably utilized, and is currently a major bottleneck to the extensive application of biological datasets such as specimen imagery, video and audio recordings. While a number of researchers have shown how deep convolutional neural networks (CNN) can be trained to classify image data with 80-90% accuracy, that range of accuracy is still too low for most research applications. Furthermore, applying these classifiers to new, unlabeled data from a dataset other than the one used for training the classifier would likely result in even lower accuracy. As a result, these classifiers have still not generally been applied to unlabeled data—which is where they could be most useful. In this talk, we will present a method for determining a confidence metric on predicted classifications (i.e. "labels") from a deep CNN classifier that can inform a user whether to trust a particular automatic label or to discard it, thereby giving a reasonable and straight-forward method to label a previously unlabeled dataset with high confidence. Essentially, it is an approach that allows an imperfect method of classification to be used in a useful way that can save an enormous amount of time and effort and/or greatly increase the amount of data that can be reasonably utilized. In this work, the training dataset consisted of a set of records of flowering plant species that collectively exhibited a range of reproductive morphologies, represented multiple taxonomic groups, and could be easily scored by humans for reproductive condition by examination of specimen images. The records were labeled as reproductive, budding, flowering and/or fruiting. All of the data and images were obtained from the Consortium of Northeastern Herbaria portal (CNH). There were two unscored datasets that were used to evaluate the classifiers. One dataset contained the same taxa that were in the training dataset and the second dataset contained all remaining flowering plant taxa in the CNH portal database that were not included in the other two datasets. Records of families with flowers that are obscure (i.e., they lack petals & sepals or have vestigial structures) were excluded. To label the reproductive state of the plants, we trained one deep CNN classifier using the XCeption architecture for the binary classification of each state (e.g., budding vs. not budding). This method and architecture was chosen because of its success in similar image-classification tasks. Each of these networks takes an image of a herbarium sheet as input, and outputs a value in the interval [0,1]. In these networks, the output is typically thresholded to generate a binary label, but we found it could also be used to approximate a measure of confidence in the network’s classification. By treating this value as a confidence metric, we are able to input a large unlabeled dataset into the classifier and then trust the labels that were assigned a “high confidence” and leave the remainder unlabeled. After training the network, the performance of the four classifiers (reproductive, budding, flowering, fruiting) achieved 85-90% accuracy compared to expert-labeled data. However, as described above, the real value of these approaches comes from their prospects for labelling previously unlabeled data, thus helping to replace expensive and time-consuming human labor. We then applied our confidence-interval-based approach to a collection of 600k images and were able to label 35-70% of the samples with a chosen confidence threshold of 95%. In other words, we could then use the high-confidence labels and simply not automatically label the remaining unclassifiable samples. The data from these samples could then be labeled manually, or, if appropriate, not labeled at all.

Download Full-text