Using Cluster-Based Sampling to Select Initial Training Set for Active Learning in Text Classification

Currently, most researchers use clustering-based algorithms to generate the initial training set for active learning. Because a single clustering run is not stable for such algorithms, we propose an initial training set selection algorithm that combines the results of multiple clustering runs to select samples. Specifically, after each clustering run, the algorithm delimits several representative regions. If a sample falls into its corresponding representative region, the algorithm casts a vote for it, marking it as a potential representative sample. Finally, after several clustering runs, the samples with the most votes are selected. Experimental results show that our algorithm selects informative samples efficiently and gives the classifier more stable performance.
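The voting procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which clustering algorithm is used or how a "representative region" is delimited, so the sketch assumes plain k-means and treats a region as the points lying within a fraction `region` of each cluster's maximum radius. The names `kmeans`, `vote_select`, and `region` are hypothetical.

```python
import math
import random
from collections import Counter


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on tuples; returns (centroids, assignment)."""
    centroids = rng.sample(points, k)  # random initialisation makes runs differ

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, [nearest(p) for p in points]


def vote_select(points, k, n_select, runs=10, region=0.5, seed=0):
    """Cluster several times, vote for samples near their centroid, keep the top n."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(runs):
        centroids, assign = kmeans(points, k, rng)
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            if not members:
                continue
            dists = {i: math.dist(points[i], centroids[c]) for i in members}
            # ASSUMPTION: the representative region is a ball around the centroid.
            radius = region * max(dists.values())
            for i in members:
                if dists[i] <= radius:
                    votes[i] += 1
    # Most votes first; ties are broken by sample index for determinism.
    ranked = sorted(range(len(points)), key=lambda i: (-votes[i], i))
    return ranked[:n_select], votes
```

Calling `vote_select(points, k=2, n_select=2)` on a list of coordinate tuples returns the indices of the selected samples together with the vote counts. Because each run starts from a different random initialisation, samples that fall in a representative region across many runs accumulate the most votes.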

Active Learning for Arabic Text Classification

2021 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE) ◽ 10.1109/iccike51210.2021.9410758 ◽ 2021

Author(s): Abdel-Karim Al-Tamimi ◽ Esraa Bani-Isaa ◽ Ahmed Al-Alami

Keyword(s): Active Learning ◽ Text Classification ◽ Arabic Text ◽ Arabic Text Classification

A Novel Query Strategy-Based Rank Batch-Mode Active Learning Method for High-Resolution Remote Sensing Image Classification

Remote Sensing ◽ 10.3390/rs13112234 ◽ 2021 ◽ Vol 13 (11) ◽ pp. 2234

Author(s): Xin Luo ◽ Huaqiang Du ◽ Guomo Zhou ◽ Xuejian Li ◽ Fangjie Mao ◽ ...

Keyword(s): Land Use ◽ Active Learning ◽ Euclidean Distance ◽ Urban Land Use ◽ Misclassification Rate ◽ Training Set ◽ Batch Mode ◽ Land Use Types ◽ Information Divergence ◽ Uncertainty Score

An informative training set is necessary for robust classification of very-high-resolution remote sensing (VHRRS) images, but labeling is often difficult, expensive, and time-consuming. This makes active learning (AL) an important part of an image analysis framework. AL aims to efficiently build a representative library of the training samples that are most informative for the underlying classification task, thereby minimizing the cost of obtaining labeled data. Based on ranked batch-mode active learning (RBMAL), this paper proposes a novel combined query strategy of spectral information divergence lowest confidence uncertainty sampling (SIDLC), called RBSIDLC. A random forest (RF) base classifier is initialized with a small initial training set, and each unlabeled sample is analyzed to obtain its classification uncertainty score. A spectral information divergence (SID) function is then used to calculate the similarity score, and the unlabeled samples are ranked in descending order of their final score. The most “valuable” samples are selected from the ranked lists and then labeled by the analyst/expert (also called the oracle). Finally, these samples are added to the training set, and the RF is retrained for the next iteration. The whole procedure is repeated until a stopping criterion is met. The results indicate that RBSIDLC achieves high-precision extraction of urban land use information based on VHRRS; the extraction accuracy for each land use type is greater than 90%, and the overall accuracy (OA) is greater than 96%. After SID replaces the Euclidean distance in the RBMAL algorithm, the RBSIDLC method greatly reduces the misclassification rate among different land use types. Therefore, the similarity function based on SID performs better than that based on the Euclidean distance.
In addition, the OA of RF classification is greater than 90%, suggesting that it is feasible to use RF to estimate the uncertainty score. Compared with the three single query strategies of other AL methods, sample labeling with the SIDLC combined query strategy yields a lower cost and higher quality, thus effectively reducing the misclassification rate of different land use types. For example, compared with the Batch_Based_Entropy (BBE) algorithm, RBSIDLC improves the precision of barren land extraction by 37% and that of vegetation by 14%. The 25 characteristics of different land use types screened by RF cross-validation (RFCV) combined with the permutation method exhibit an excellent degree of separation, and the results provide the basis for VHRRS information extraction in urban land use settings based on RBSIDLC.
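The scoring loop in this abstract can be illustrated with a small sketch. The `sid` function follows the usual definition of spectral information divergence (the symmetric Kullback-Leibler divergence between two spectra normalized to sum to one), and `least_confidence` is the standard one-minus-top-probability uncertainty. How the paper combines the two into a final score is not stated in the abstract, so the weighting `alpha`, the squashing of SID into a diversity term, and the greedy loop in `rank_batch` are assumptions made for illustration only.

```python
import math


def sid(x, y, eps=1e-12):
    """Spectral information divergence between two non-negative spectra."""
    sx, sy = sum(x), sum(y)
    p = [v / sx + eps for v in x]  # eps guards against log(0)
    q = [v / sy + eps for v in y]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))


def least_confidence(probs):
    """Uncertainty of one sample: 1 minus the top class probability."""
    return 1.0 - max(probs)


def rank_batch(unlabeled, probs, labeled, batch_size, alpha=0.5):
    """Greedily pick samples that are uncertain and unlike what is already chosen.

    `labeled` must be non-empty (the small initial training set).
    ASSUMPTION: final score = alpha * diversity + (1 - alpha) * uncertainty.
    """
    chosen = []
    reference = list(labeled)
    remaining = set(range(len(unlabeled)))
    while remaining and len(chosen) < batch_size:
        def score(i):
            d = min(sid(unlabeled[i], r) for r in reference)
            diversity = d / (1.0 + d)  # ASSUMPTION: squash SID into [0, 1)
            return alpha * diversity + (1 - alpha) * least_confidence(probs[i])

        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
        reference.append(unlabeled[best])  # later picks must differ from this one
    return chosen
```

Here `probs` stands in for the class probabilities a random forest would return for each unlabeled sample. Unlike the Euclidean distance, SID compares the shape of two spectra rather than their magnitude, so a spectrum and a scaled copy of it have a divergence of zero.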

Headnote Prediction Using Machine Learning

The International Arab Journal of Information Technology ◽ 10.34028/iajit/18/5/7 ◽ 2021 ◽ Vol 18 (5)

Author(s): Sarmad Mahar ◽ Sahar Zafar ◽ Kamran Nishat

Keyword(s): Machine Learning ◽ Feature Extraction ◽ Active Learning ◽ Text Classification ◽ Extraction Methods ◽ Text Summarization ◽ Training Data ◽ Second Step ◽ Support Vector ◽ Classification Algorithms

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issue discussed in the case. Headnotes comprise two parts. The first part gives the topic discussed in the judgment, and the second part contains a summary of that judgment. In this thesis, we design, develop, and evaluate headnote prediction using machine learning, without human involvement. We divide this task into a two-step process. In the first step, we predict the law points used in the judgment by using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this, we created a databank by extracting data from different law sources in Pakistan, and we labelled training data generated from Pakistani law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy by using Linear Support Vector Classification with tri-gram features and no stemmer. With active learning, our system can continuously improve its accuracy as its users provide more labelled examples.
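As a rough illustration of the feature setup mentioned in this abstract, the sketch below extracts n-gram counts up to tri-grams without any stemming. The abstract does not say whether the tri-grams are word-level or character-level, nor how tokens are produced, so word-level n-grams over lowercased, whitespace-split text are an assumption, and `word_ngrams` and `ngram_features` are hypothetical helper names. The resulting counts would then be fed to a linear support vector classifier.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Word n-grams of one text; tokens are lowercased but NOT stemmed."""
    tokens = text.lower().split()  # ASSUMPTION: whitespace tokenisation
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Bag of uni-, bi-, and tri-gram counts for one document."""
    feats = Counter()
    for n in range(1, max_n + 1):
        feats.update(word_ngrams(text, n))
    return feats
```

Skipping the stemmer keeps inflected legal terms distinct, which is consistent with the configuration the abstract reports as performing best.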

Deep Active Learning for Text Classification

Proceedings of the 2nd International Conference on Vision, Image and Signal Processing - ICVISP 2018 ◽ 10.1145/3271553.3271578 ◽ 2018 ◽ Cited By ~ 2

Author(s): Bang An ◽ Wenjun Wu ◽ Huimin Han

Keyword(s): Active Learning ◽ Text Classification

Abstract-concept learning carryover effects from the initial training set in pigeons (Columba livia).

Journal of Comparative Psychology ◽ 10.1037/a0013126 ◽ 2009 ◽ Vol 123 (1) ◽ pp. 79-89 ◽ Cited By ~ 11

Author(s): Tamo Nakamura ◽ Anthony A. Wright ◽ Jeffrey S. Katz ◽ Kent D. Bodily ◽ Bradley R. Sturz

Keyword(s): Concept Learning ◽ Columba Livia ◽ Abstract Concept ◽ Initial Training ◽ Training Set ◽ Carryover Effects

Uncertainty-based active learning with instability estimation for text classification

ACM Transactions on Speech and Language Processing ◽ 10.1145/2093153.2093154 ◽ 2012 ◽ Vol 8 (4) ◽ pp. 1-21 ◽ Cited By ~ 10

Author(s): Jingbo Zhu ◽ Matthew Ma

Keyword(s): Active Learning ◽ Text Classification

The Use of Unlabeled Data Versus Labeled Data for Stopping Active Learning for Text Classification

2019 IEEE 13th International Conference on Semantic Computing (ICSC) ◽ 10.1109/icosc.2019.8665546 ◽ 2019 ◽ Cited By ~ 2

Author(s): Garrett Beatty ◽ Ethan Kochis ◽ Michael Bloodgood

Keyword(s): Active Learning ◽ Text Classification ◽ Unlabeled Data

Stopping Active Learning Based on Predicted Change of F Measure for Text Classification

2019 IEEE 13th International Conference on Semantic Computing (ICSC) ◽ 10.1109/icosc.2019.8665646 ◽ 2019 ◽ Cited By ~ 2

Author(s): Michael Altschuler ◽ Michael Bloodgood

Keyword(s): Active Learning ◽ Text Classification ◽ F Measure

A Low-Cost Named Entity Recognition Research Based on Active Learning

Scientific Programming ◽ 10.1155/2018/1890683 ◽ 2018 ◽ Vol 2018 ◽ pp. 1-10 ◽ Cited By ~ 1

Author(s): Han Huang ◽ Hongyu Wang ◽ Dawei Jin

Keyword(s): Active Learning ◽ Language Processing ◽ Selection Process ◽ Conditional Random Field ◽ Low Cost ◽ Named Entity Recognition ◽ Entity Recognition ◽ Training Set ◽ Processing Technologies ◽ Named Entity

Named entity recognition (NER) is an indispensable part of many natural language processing technologies, such as information extraction, information retrieval, and intelligent Q&A. This paper describes the development of the AL-CRF model, a NER approach based on active learning (AL). The AL-CRF model proceeds as follows. First, the samples are clustered using the k-means approach. Then, stratified sampling is performed on the resulting clusters to obtain the initial samples, which are used to train the basic conditional random field (CRF) classifier. Next, the selection process begins, using entropy as its criterion: the samples with the highest entropy values are added to the training set. Afterwards, the learning process is repeated, and the CRF classifier is retrained on the updated training set. The learning and selection processes run iteratively until the harmonic mean F stabilizes and the final NER model is obtained. Several NER experiments are performed on legislative and medical cases to validate the performance of AL-CRF. The testing data include Chinese judicial documents and Chinese electronic medical records (EMRs). Testing indicates that the proposed algorithm achieves better recognition accuracy and recall than the conventional CRF model. Moreover, the main advantage of our approach is that it requires fewer manually labelled training samples while being more effective. This can result in a more cost-effective and more reliable process.
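Two steps of the AL-CRF sequence, the stratified draw of initial samples from clusters and the entropy-based selection, can be sketched as below. This is an illustration under stated assumptions rather than the paper's code: cluster labels are taken as given (for example from a k-means run), the quota per cluster is assumed to be proportional to cluster size, and entropy is computed over a single per-sample probability distribution, whereas the abstract does not say how entropy is aggregated over the label sequence of a CRF. The function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_initial_sample(cluster_ids, n_samples, seed=0):
    """Draw initial samples from every cluster, roughly in proportion to its size."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, c in enumerate(cluster_ids):
        by_cluster[c].append(idx)
    total = len(cluster_ids)
    chosen = []
    for c in sorted(by_cluster):
        members = by_cluster[c]
        # ASSUMPTION: proportional allocation with at least one sample per
        # cluster; rounding means the total can differ slightly from n_samples.
        quota = max(1, round(n_samples * len(members) / total))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen


def entropy(probs):
    """Shannon entropy (natural log) of one predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_by_entropy(prob_rows, batch_size):
    """Indices of the samples whose predictions have the highest entropy."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]), reverse=True)
    return ranked[:batch_size]
```

In a full loop, the samples returned by `select_by_entropy` would be labelled, added to the training set, and the classifier retrained, repeating until the F score stops changing.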
