Combining Clustering and Voting Scheme to Select Initial Training Set for Active Learning

Currently, most researchers use clustering-based algorithms to generate the initial training set for active learning. Because a single clustering run is not stable for such algorithms, we propose an initial training set selection algorithm that combines the results of multiple clustering runs to select samples. Specifically, after each clustering run, it delimits several representative regions. If a sample falls into its corresponding representative region, the algorithm casts a vote for it to mark it as a potential representative sample. Finally, after several clustering runs, the samples with the most votes are selected. Experimental results show that our algorithm can efficiently select informative samples and gives the classifier more stable performance.
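
The voting idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which clustering algorithm is used or how a representative region is delimited, so the plain k-means routine, the `radius_frac` parameter (a region is taken to be everything within a fraction of the cluster's largest centroid distance), and all function names are assumptions made for the example.

```python
import math
import random
from collections import Counter


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means with random initial centroids (assumed clusterer)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assign(points, centroids)


def select_initial_set(points, k, n_select, runs=10, radius_frac=0.5, seed=0):
    """Vote for samples that fall inside a representative region in each run."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(runs):
        centroids, labels = kmeans(points, k, rng)
        for c in range(k):
            dists = {i: math.dist(points[i], centroids[c])
                     for i, lab in enumerate(labels) if lab == c}
            if not dists:
                continue
            # Hypothetical region: within radius_frac of the farthest member.
            radius = radius_frac * max(dists.values())
            for i, d in dists.items():
                if d <= radius:
                    votes[i] += 1
    # Most-voted samples first; ties are broken by sample index.
    ranked = sorted(range(len(points)), key=lambda i: (-votes[i], i))
    return ranked[:n_select], votes
```

Because every run starts from different random centroids, a sample that keeps landing near a centroid accumulates votes, while a sample that is central in only one unlucky partition does not.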

Using Cluster-Based Sampling to Select Initial Training Set for Active Learning in Text Classification

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science ◽

10.1007/978-3-540-24775-3_46 ◽

2004 ◽

pp. 384-388 ◽

Cited By ~ 14

Author(s):

Jaeho Kang ◽

Kwang Ryel Ryu ◽

Hyuk-Chul Kwon

Keyword(s):

Active Learning ◽

Text Classification ◽

Initial Training ◽

Training Set

Use of Molecular Similarity Indices for QSAR Training Set Selection

SAR and QSAR in Environmental Research ◽

10.1080/10629369508050154 ◽

1995 ◽

Vol 3 (4) ◽

pp. 279-292 ◽

Cited By ~ 1

Author(s):

I. T. Cousins ◽

M. T. D. Cronin ◽

J. C. Dearden ◽

C. D. Watts

Keyword(s):

Molecular Similarity ◽

Training Set ◽

Similarity Indices ◽

Training Set Selection

A Novel Query Strategy-Based Rank Batch-Mode Active Learning Method for High-Resolution Remote Sensing Image Classification

Remote Sensing ◽

10.3390/rs13112234 ◽

2021 ◽

Vol 13 (11) ◽

pp. 2234

Author(s):

Xin Luo ◽

Huaqiang Du ◽

Guomo Zhou ◽

Xuejian Li ◽

Fangjie Mao ◽

...

Keyword(s):

Land Use ◽

Active Learning ◽

Euclidean Distance ◽

Urban Land Use ◽

Misclassification Rate ◽

Training Set ◽

Batch Mode ◽

Land Use Types ◽

Information Divergence ◽

Uncertainty Score

An informative training set is necessary for ensuring the robust performance of the classification of very-high-resolution remote sensing (VHRRS) images, but labeling work is often difficult, expensive, and time-consuming. This makes active learning (AL) an important part of an image analysis framework. AL aims to efficiently build a representative and efficient library of training samples that are most informative for the underlying classification task, thereby minimizing the cost of obtaining labeled data. Based on ranked batch-mode active learning (RBMAL), this paper proposes a novel combined query strategy of spectral information divergence lowest confidence uncertainty sampling (SIDLC), called RBSIDLC. The base classifier of random forest (RF) is initialized by using a small initial training set, and each unlabeled sample is analyzed to obtain the classification uncertainty score. A spectral information divergence (SID) function is then used to calculate the similarity score, and according to the final score, the unlabeled samples are ranked in descending order. The most “valuable” samples are selected according to the ranked lists and then labeled by the analyst/expert (also called the oracle). Finally, these samples are added to the training set, and the RF is retrained for the next iteration. The whole procedure is iteratively implemented until a stopping criterion is met. The results indicate that RBSIDLC achieves high-precision extraction of urban land use information based on VHRRS; the accuracy of extraction for each land-use type is greater than 90%, and the overall accuracy (OA) is greater than 96%. After the SID replaces the Euclidean distance in the RBMAL algorithm, the RBSIDLC method greatly reduces the misclassification rate among different land types. Therefore, the similarity function based on SID performs better than that based on the Euclidean distance.
In addition, the OA of RF classification is greater than 90%, suggesting that it is feasible to use RF to estimate the uncertainty score. Compared with the three single query strategies of other AL methods, sample labeling with the SIDLC combined query strategy yields a lower cost and higher quality, thus effectively reducing the misclassification rate of different land use types. For example, compared with the Batch_Based_Entropy (BBE) algorithm, RBSIDLC improves the precision of barren land extraction by 37% and that of vegetation by 14%. The 25 characteristics of different land use types screened by RF cross-validation (RFCV) combined with the permutation method exhibit an excellent separation degree, and the results provide the basis for VHRRS information extraction in urban land use settings based on RBSIDLC.
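
The two scores named in this abstract can be illustrated with a short sketch. The SID function below follows the standard definition (symmetric Kullback-Leibler divergence between spectra normalised to sum to one), and least confidence is one minus the top class probability. How the paper turns SID into a similarity score and how it weights the two terms is not stated in the abstract, so the `1 / (1 + SID)` mapping, the `alpha` weighting borrowed from ranked batch-mode sampling, and the function names are all assumptions.

```python
import math


def sid(x, y, eps=1e-12):
    """Spectral information divergence between two positive spectra."""
    sx, sy = sum(x), sum(y)
    p = [max(v / sx, eps) for v in x]
    q = [max(v / sy, eps) for v in y]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))


def least_confidence(probs):
    """Uncertainty of one sample: 1 minus its highest class probability."""
    return 1.0 - max(probs)


def rank_batch(unlabeled, class_probs, labeled, batch_size):
    """Greedily rank unlabeled samples by dissimilarity plus uncertainty.

    `labeled` must be non-empty; `class_probs[i]` would come from the
    base classifier (a random forest in the abstract).
    """
    pool = list(range(len(unlabeled)))
    reference = list(labeled)  # labeled spectra plus samples picked so far
    ranked = []
    while pool and len(ranked) < batch_size:
        alpha = len(pool) / (len(pool) + len(reference))  # assumed weighting

        def score(i):
            nearest = min(sid(unlabeled[i], r) for r in reference)
            similarity = 1.0 / (1.0 + nearest)  # assumed SID-to-similarity map
            return (alpha * (1.0 - similarity)
                    + (1.0 - alpha) * least_confidence(class_probs[i]))

        best = max(pool, key=score)
        ranked.append(best)
        pool.remove(best)
        reference.append(unlabeled[best])
    return ranked
```

Appending each pick to `reference` is what keeps a batch from filling up with near-duplicate spectra: once a sample is chosen, its close neighbours score lower in the next round.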

HTS: High-Quality Training Set Selection for Pedestrian Detection

10.1109/icfsp53514.2021.9646422 ◽

2021 ◽

Author(s):

Junjie Li ◽

Kai Shuang ◽

Wentao Zhang

Keyword(s):

Pedestrian Detection ◽

Training Set ◽

High Quality ◽

Selection For ◽

Training Set Selection

A First Attempt on Monotonic Training Set Selection

Lecture Notes in Computer Science - Hybrid Artificial Intelligent Systems ◽

10.1007/978-3-319-92639-1_23 ◽

2018 ◽

pp. 277-288

Author(s):

J.-R. Cano ◽

S. García

Keyword(s):

Training Set ◽

Training Set Selection

Efficient Pronunciation Assessment of Taiwanese-Accented English Based on Unsupervised Model Adaptation and Dynamic Sentence Selection

Multidisciplinary Computational Intelligence Techniques ◽

10.4018/978-1-4666-1830-5.ch002 ◽

2012 ◽

pp. 12-30

Author(s):

Chung-Hsien Wu ◽

Hung-Yu Su ◽

Chao-Hong Liu

Keyword(s):

Mutual Information ◽

English Teachers ◽

Experimental Results ◽

Model Adaptation ◽

Selection Algorithm ◽

Accuracy Improvement ◽

Acoustic Models ◽

Efficient Approach ◽

Accented English ◽

Adaptation Method

This chapter presents an efficient approach to personalized pronunciation assessment of Taiwanese-accented English. The main goal of this study is to detect frequently occurring mispronunciation patterns of Taiwanese-accented English instead of scoring English pronunciations directly. The proposed assessment helps quickly discover a student's personalized mispronunciations, so that English teachers can spend more time on teaching or rectifying students’ pronunciations. In this approach, an unsupervised model adaptation method is performed on the universal acoustic models to recognize the speech of a specific speaker with mispronunciations and a Taiwanese accent. A dynamic sentence selection algorithm, considering the mutual information of the related mispronunciations, is proposed to choose a sentence containing the most undetected mispronunciations in order to quickly extract personalized mispronunciations. The experimental results show that the proposed unsupervised adaptation approach obtains an accuracy improvement of about 2.1% on the recognition of Taiwanese-accented English speech.
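
The sentence selection step can be pictured as a greedy coverage loop. The sketch below only captures the "most undetected mispronunciations" criterion from the abstract. It leaves out the mutual-information term, whose exact form the abstract does not give, and the function name, the pattern strings, and the data layout are hypothetical.

```python
def select_sentences(sentences, patterns, max_rounds):
    """Pick sentences that cover the most still-unchecked patterns.

    sentences: dict mapping a sentence id to the set of mispronunciation
               patterns that sentence can reveal (hypothetical layout).
    patterns:  iterable of all patterns to be checked for this speaker.
    """
    unchecked = set(patterns)
    remaining = dict(sentences)
    order = []
    for _ in range(max_rounds):
        if not unchecked or not remaining:
            break
        # Sorting the ids makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] & unchecked))
        if not remaining[best] & unchecked:
            break  # no remaining sentence reveals anything new
        order.append(best)
        unchecked -= remaining.pop(best)
    return order, unchecked
```

The selection is dynamic in the sense that the set of unchecked patterns shrinks after every prompt, so the ranking of the remaining sentences changes from round to round.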

Hierarchical Classification Based on Label Distribution Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015533 ◽

2019 ◽

Vol 33 ◽

pp. 5533-5540 ◽

Cited By ~ 3

Author(s):

Changdong Xu ◽

Xin Geng

Keyword(s):

Hierarchical Classification ◽

Experimental Results ◽

Challenging Problem ◽

Training Set ◽

True Label ◽

Major Bottleneck ◽

Novel Method ◽

Class Labels ◽

Label Distribution ◽

The Given

Hierarchical classification is a challenging problem where the class labels are organized in a predefined hierarchy. One primary challenge in hierarchical classification is the small training set issue of the local module. The local classifiers in the previous hierarchical classification approaches are prone to over-fitting, which becomes a major bottleneck of hierarchical classification. Fortunately, the labels in the local module are correlated, and the siblings of the true label can provide additional supervision information for the instance. This paper proposes a novel method to deal with the small training set issue. The key idea of the method is to represent the correlation among the labels by the label distribution. It generates a label distribution that contains the supervision information of each label for the given instance, and then learns a mapping from the instance to the label distribution. Experimental results on several hierarchical classification datasets show that our method significantly outperforms other state-of-the-art hierarchical classification approaches.
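
To make the idea of a label distribution concrete, the sketch below builds a soft target in which the true label keeps most of the probability mass and its siblings share the rest, then measures how far a predicted distribution is from it. The abstract does not describe how the paper actually generates the distribution, so the uniform sharing rule, the `true_mass` parameter, and the function names are assumptions for illustration only.

```python
import math


def sibling_label_distribution(true_label, siblings, true_mass=0.8):
    """Soft target over a local module: true label plus its siblings."""
    labels = [true_label] + [s for s in siblings if s != true_label]
    if len(labels) == 1:
        return {true_label: 1.0}
    share = (1.0 - true_mass) / (len(labels) - 1)  # assumed uniform sharing
    return {lab: (true_mass if lab == true_label else share) for lab in labels}


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), a common loss for label distribution learning."""
    return sum(t * math.log(t / max(predicted.get(lab, 0.0), eps))
               for lab, t in target.items() if t > 0)
```

Compared with a one-hot target, every training instance now carries a small amount of supervision for each sibling label, which is the mechanism the abstract credits for easing the small training set issue.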

A novel neural approach for unsupervised change detection using SOM clustering for pseudo-training set selection followed by CSOM classifier

2014 IEEE Geoscience and Remote Sensing Symposium ◽

10.1109/igarss.2014.6946706 ◽

2014 ◽

Cited By ~ 2

Author(s):

Victor Neagoe ◽

Alexandru Ciurea ◽

Lorenzo Bruzzone ◽

Francesca Bovolo

Keyword(s):

Change Detection ◽

Training Set ◽

Som Clustering ◽

Training Set Selection

Abstract-concept learning carryover effects from the initial training set in pigeons (Columba livia).

Journal of Comparative Psychology ◽

10.1037/a0013126 ◽

2009 ◽

Vol 123 (1) ◽

pp. 79-89 ◽

Cited By ~ 11

Author(s):

Tamo Nakamura ◽

Anthony A. Wright ◽

Jeffrey S. Katz ◽

Kent D. Bodily ◽

Bradley R. Sturz

Keyword(s):

Concept Learning ◽

Columba Livia ◽

Abstract Concept ◽

Initial Training ◽

Training Set ◽

Carryover Effects

Input-Aware Implication Selection Scheme Utilizing ATPG for Efficient Concurrent Error Detection

Electronics ◽

10.3390/electronics7100258 ◽

2018 ◽

Vol 7 (10) ◽

pp. 258 ◽

Cited By ~ 4

Author(s):

Abdus Hassan ◽

Umar Afzaal ◽

Tooba Arifeen ◽

Jeong Lee

Keyword(s):

Error Detection ◽

High Probability ◽

State Of The Art ◽

The State ◽

Experimental Results ◽

Concurrent Error Detection ◽

Selection Algorithm ◽

Probability Of Error ◽

Selection Scheme ◽

Selection Strategies

Recently, concurrent error detection enabled through invariant relationships between different wires in a circuit has been proposed. Because there are many such implications in a circuit, selection strategies have been developed to select the most valuable implications for inclusion in the checker hardware such that a sufficiently high probability of error detection ($P_{\mathrm{detection}}$) is achieved. These algorithms, however, due to their heuristic nature cannot guarantee a lossless $P_{\mathrm{detection}}$. In this paper, we develop a new input-aware implication selection algorithm with the help of ATPG which minimizes loss on $P_{\mathrm{detection}}$. In our algorithm, the detectability of errors for each candidate implication is carefully evaluated using error prone vectors. The evaluation results are then utilized to select the most efficient candidates for achieving optimal $P_{\mathrm{detection}}$. The experimental results on 15 representative combinatorial benchmark circuits from the MCNC benchmarks suite show that the implications selected from our algorithm achieve better $P_{\mathrm{detection}}$ in comparison to the state of the art. The proposed method also offers better performance, up to 41.10%, in terms of the proposed impact-level metric, which is the ratio of achieved $P_{\mathrm{detection}}$ to the implication count.
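
The impact-level metric defined at the end of this abstract (achieved detection probability divided by the number of selected implications) is easy to compute once each candidate implication has been evaluated against a set of error-prone vectors. The sketch below pairs that metric with a simple greedy selection. The greedy rule, the set-based data layout, and the function names are assumptions; the paper's ATPG-driven procedure is not described in enough detail here to reproduce.

```python
def p_detection(selected, detects, error_vectors):
    """Fraction of error-prone vectors caught by at least one selected implication."""
    covered = set().union(*(detects[i] for i in selected))
    return len(covered & error_vectors) / len(error_vectors)


def greedy_select(detects, error_vectors, budget):
    """Repeatedly add the implication that catches the most uncovered vectors."""
    selected, covered = [], set()
    while len(selected) < budget:
        candidates = [c for c in sorted(detects) if c not in selected]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda c: len((detects[c] & error_vectors) - covered))
        gain = (detects[best] & error_vectors) - covered
        if not gain:
            break  # nothing left to gain from any candidate
        selected.append(best)
        covered |= gain
    return selected


def impact_level(selected, detects, error_vectors):
    """Achieved detection probability per selected implication."""
    return p_detection(selected, detects, error_vectors) / len(selected)
```

A higher impact level means the checker reaches the same detection probability with fewer implications, and therefore with less added hardware.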
