The Use of Unlabeled Data Versus Labeled Data for Stopping Active Learning for Text Classification

Labeling a text document is usually time-consuming because the annotator must read the whole document and check its relevance to each possible class label. Training an effective text classification model therefore becomes expensive when it involves a large dataset of long documents. In this paper, we propose an active learning approach for text classification with lower annotation cost. Instead of scanning all the examples in the unlabeled data pool to select the best one to query, the proposed method automatically generates the most informative examples based on the classification model, and can thus be applied to tasks with large-scale or even infinite unlabeled data. Furthermore, we propose to approximate the generated example with a few summary words by sparse reconstruction, which allows annotators to assign the class label by reading a few words rather than the long document. Experiments on different datasets demonstrate that the proposed approach can effectively improve classification performance while significantly reducing the annotation cost.
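The abstract does not state the reconstruction objective, so the following is only a minimal sketch under a strong simplifying assumption: the generated example is a bag-of-words weight vector and each dictionary atom is a single word. In that special case the best k-sparse least-squares approximation simply keeps the k largest-magnitude weights. The function name `summary_words` and the example weights are hypothetical.

```python
def summary_words(weights, k=5):
    """Return the k words with the largest absolute weight.

    weights: dict mapping word -> weight of a generated example
             (assumed bag-of-words representation, not from the paper).
    Keeping the k largest-magnitude entries is the best k-sparse
    least-squares approximation when every atom is a single word.
    """
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:k])


# Hypothetical generated example over a tiny vocabulary.
generated = {"goal": 0.9, "the": 0.05, "match": 0.7, "vote": -0.8}
print(summary_words(generated, k=2))
```

An annotator would then see only the returned words instead of a full document.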

Rethinking deep active learning: Using unlabeled data at model training

2020 25th International Conference on Pattern Recognition (ICPR), 2021. DOI: 10.1109/icpr48806.2021.9412716

Author(s): Oriane Simeoni, Mateusz Budnik, Yannis Avrithis, Guillaume Gravier

Keyword(s): Active Learning, Unlabeled Data, Model Training

Active Learning for Arabic Text Classification

2021 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE), 2021. DOI: 10.1109/iccike51210.2021.9410758

Author(s): Abdel-Karim Al-Tamimi, Esraa Bani-Isaa, Ahmed Al-Alami

Keyword(s): Active Learning, Text Classification, Arabic Text, Arabic Text Classification

Headnote Prediction Using Machine Learning

The International Arab Journal of Information Technology, 2021, Vol 18 (5). DOI: 10.34028/iajit/18/5/7

Author(s): Sarmad Mahar, Sahar Zafar, Kamran Nishat

Keyword(s): Machine Learning, Feature Extraction, Active Learning, Text Classification, Extraction Methods, Text Summarization, Training Data, Second Step, Support Vector, Classification Algorithms

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write them, and they help the reader quickly determine the issue discussed in a case. A headnote comprises two parts: the first states the topic discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop and evaluate headnote prediction using machine learning, without human involvement. We divide the task into a two-step process. In the first step, we predict the law points used in the judgment with text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To this end, we created a databank by extracting data from different law sources in Pakistan and labelled training data generated from Pakistani law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-gram features and no stemmer. With active learning, the system can continuously improve its accuracy as its users provide more labelled examples.
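To make the "tri-gram, no stemmer" feature setting concrete, here is a minimal sketch of tri-gram counting on unstemmed text. The abstract does not say whether the tri-grams are word-level or character-level; word-level is assumed here, and the function name `word_ngrams` and the sample sentence are hypothetical. The resulting counts are the kind of sparse features a linear classifier could consume.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Count word n-grams in a text.

    Tokens are lowercased and split on whitespace; no stemming is
    applied, so "dismissed" and "dismiss" stay distinct features.
    Word-level n-grams are an assumption, not stated in the abstract.
    """
    tokens = text.lower().split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Hypothetical judgment sentence.
features = word_ngrams("The appeal is dismissed with costs")
print(features)
```

Texts shorter than `n` tokens yield an empty counter rather than an error.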

Deep Active Learning for Text Classification

Proceedings of the 2nd International Conference on Vision, Image and Signal Processing - ICVISP 2018, 2018. DOI: 10.1145/3271553.3271578. Cited By ~ 2

Author(s): Bang An, Wenjun Wu, Huimin Han

Keyword(s): Active Learning, Text Classification

Combination of Active Learning and Semi-Supervised Learning under a Self-Training Scheme

Entropy, 2019, Vol 21 (10), pp. 988. DOI: 10.3390/e21100988. Cited By ~ 4

Author(s): Fazakis, Kanas, Aridas, Karlos, Kotsiantis

Keyword(s): Active Learning, Supervised Learning, Unlabeled Data, Classification Algorithms, Training Phase, Learning Methods, Training Scheme, Wide Range, Benchmark Datasets, Scientific Fields

One of the major factors affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that improves the information available for a given dataset. In this context, a variety of learning methods have been studied in the literature that aim to make efficient use of vast amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to utilize the available unlabeled data efficiently. Entropy and the distribution of predicted probabilities over the unlabeled set are used as effective and robust metrics for selecting the most suitable unlabeled examples to augment the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the base approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.
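One way to picture an entropy-based selection step that serves both active learning and self-training is sketched below. This is not the authors' algorithm: the split rule (most uncertain examples go to a human annotator, most confident ones receive pseudo-labels), the function names, and the sample probabilities are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_unlabeled(pool_probs, n_query, n_pseudo):
    """Rank unlabeled examples by predictive entropy.

    Returns (query_idx, pseudo_labels):
      query_idx     -- indices of the most uncertain examples, to be
                       labeled by a human (active learning step)
      pseudo_labels -- dict index -> predicted class for the most
                       confident examples (self-training step)
    """
    order = sorted(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))
    confident = order[:n_pseudo]
    uncertain = order[max(0, len(order) - n_query):]
    # An example is never both pseudo-labeled and sent for annotation.
    uncertain = [i for i in uncertain if i not in confident]
    pseudo_labels = {
        i: max(range(len(pool_probs[i])), key=pool_probs[i].__getitem__)
        for i in confident
    }
    return uncertain, pseudo_labels


# Hypothetical class probabilities for four unlabeled examples.
pool = [[0.5, 0.5], [0.99, 0.01], [0.6, 0.4], [0.05, 0.95]]
print(split_unlabeled(pool, n_query=1, n_pseudo=2))
```

In a full loop, both the annotated and the pseudo-labeled examples would be added to the labeled set before retraining the classifier.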
