Active Learning Based on Crowdsourced Data

2022 ◽  
Vol 12 (1) ◽  
pp. 409
Author(s):  
Tomasz Maria Boiński ◽  
Julian Szymański ◽  
Agata Krauzewicz

The paper proposes a crowdsourcing-based approach to acquiring annotated data and a means of supporting the Active Learning training approach. In the proposed solution, aimed at data engineers, the knowledge of the crowd serves as an oracle that is able to judge whether a given sample is informative or not. The proposed solution reduces the amount of work needed to annotate large sets of data. Furthermore, it allows a perpetual increase in the quality of the trained network through the inclusion of new samples gathered after network deployment. The paper also discusses means of limiting network training times, especially in the post-deployment stage, where the size of the training set can increase dramatically. This is done by introducing a fourth set composed of samples gathered during actual network usage.
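
As a rough illustration of the crowd acting as an oracle, the sketch below aggregates individual judgments on whether a sample is informative. The abstract does not say how votes are combined, so the majority rule, the function name `crowd_is_informative`, and the `min_votes` and `min_agreement` parameters are all assumptions made for this example.

```python
from collections import Counter


def crowd_is_informative(votes, min_votes=3, min_agreement=0.6):
    """Aggregate boolean crowd votes on whether a sample is informative.

    Returns True or False when enough voters agree, or None when the
    crowd has not (yet) reached a decision. The thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    if len(votes) < min_votes:
        return None  # too few judgments collected so far
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) < min_agreement:
        return None  # the crowd is split; keep collecting votes
    return label


if __name__ == "__main__":
    print(crowd_is_informative([True, True, False]))
    print(crowd_is_informative([True, False]))
```

Returning `None` for undecided samples keeps them out of the training set until more judgments arrive, which is one possible way to avoid training on ambiguous annotations.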

Author(s):  
Castaño, Mary Caroline N.

ABSTRACT The entry of smartphones into our lives is due to two primary reasons – the rapid advancement in technology and R & D, which makes present technology redundant within weeks, and the drastic drop in smartphone prices, which occurs weekly or monthly. The objectives of this paper are: (1) to provide a more holistic view of smartphone users' preferences; (2) to provide an in-depth analysis of how consumers put a premium on various smartphone features, applications, and tools; (3) to understand how prospective customers appreciate the good features of the product. Three statistical tools were used: Frequency Distribution, to get the profile of the respondents' actual usage of smartphones and the attitudes of consumers; Pearson Correlation; and Conjoint analysis, which was used to analyze the preference of the respondents on smartphone attributes. This study showed a moderately fit conjoint model, Pearson's R = .742, p < .05; Kendall's Tau was .333, p < .05, and .333, p < .05 for the holdouts. From the given set of attributes, price (47.11%) is the most important, followed by the SIM card slot (19.05%) and the phone plan (9.14%). This paper is the first study done in the Philippines about the usage and attitudes of consumers towards smartphones using conjoint analysis. The analysis would help companies to understand which aspects of their products are essential and which are irrelevant. Companies will act upon a certain aspect to ensure higher profitability. Type of Paper: Empirical Keywords: local government hospitals; Philippines; policy direction; quality patient care


2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate cases of train/test sets using the R-value-based sampling method. We evaluated the similarity of distributions of the candidate cases with the whole dataset, and the case with the smallest distribution–difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more proper training and test sets than previous sampling methods, including random and non-random sampling.
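
The selection step described above (generate many candidate splits, keep the one whose distribution is closest to the whole dataset) can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are drawn by plain random shuffling rather than by the R-value-based sampling the authors use, only a single numeric feature is compared, and the histogram distance is a sum of absolute differences. The names `histogram`, `distribution_difference`, and `best_split` are hypothetical.

```python
import random


def histogram(values, bins, lo, hi):
    """Relative frequencies of values over equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distribution_difference(subset, whole, bins=5):
    """Sum of absolute differences between the two normalized histograms."""
    lo, hi = min(whole), max(whole)
    h_sub = histogram(subset, bins, lo, hi)
    h_all = histogram(whole, bins, lo, hi)
    return sum(abs(a - b) for a, b in zip(h_sub, h_all))


def best_split(data, test_fraction=0.3, candidates=50, seed=0):
    """Return (score, train, test) for the candidate closest to the whole set."""
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    best = None
    for _ in range(candidates):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        score = (distribution_difference(train, data)
                 + distribution_difference(test, data))
        if best is None or score < best[0]:
            best = (score, train, test)
    return best


if __name__ == "__main__":
    values = [float(i % 7) + 0.1 * i for i in range(40)]
    score, train, test = best_split(values)
    print(len(train), len(test), round(score, 3))
```

The abstract also mentions feature importance as a second similarity criterion; that part is omitted here.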


2021 ◽  
Vol 13 (11) ◽  
pp. 2234
Author(s):  
Xin Luo ◽  
Huaqiang Du ◽  
Guomo Zhou ◽  
Xuejian Li ◽  
Fangjie Mao ◽  
...  

An informative training set is necessary for ensuring the robust performance of the classification of very-high-resolution remote sensing (VHRRS) images, but labeling work is often difficult, expensive, and time-consuming. This makes active learning (AL) an important part of an image analysis framework. AL aims to efficiently build a representative and efficient library of training samples that are most informative for the underlying classification task, thereby minimizing the cost of obtaining labeled data. Based on ranked batch-mode active learning (RBMAL), this paper proposes a novel combined query strategy of spectral information divergence lowest confidence uncertainty sampling (SIDLC), called RBSIDLC. The base classifier of random forest (RF) is initialized by using a small initial training set, and each unlabeled sample is analyzed to obtain the classification uncertainty score. A spectral information divergence (SID) function is then used to calculate the similarity score, and according to the final score, the unlabeled samples are ranked in descending lists. The most “valuable” samples are selected according to ranked lists and then labeled by the analyst/expert (also called the oracle). Finally, these samples are added to the training set, and the RF is retrained for the next iteration. The whole procedure is iteratively implemented until a stopping criterion is met. The results indicate that RBSIDLC achieves high-precision extraction of urban land use information based on VHRRS; the accuracy of extraction for each land-use type is greater than 90%, and the overall accuracy (OA) is greater than 96%. After the SID replaces the Euclidean distance in the RBMAL algorithm, the RBSIDLC method greatly reduces the misclassification rate among different land types. Therefore, the similarity function based on SID performs better than that based on the Euclidean distance. 
In addition, the OA of RF classification is greater than 90%, suggesting that it is feasible to use RF to estimate the uncertainty score. Compared with the three single query strategies of other AL methods, sample labeling with the SIDLC combined query strategy yields a lower cost and higher quality, thus effectively reducing the misclassification rate of different land use types. For example, compared with the Batch_Based_Entropy (BBE) algorithm, RBSIDLC improves the precision of barren land extraction by 37% and that of vegetation by 14%. The 25 characteristics of different land use types screened by RF cross-validation (RFCV) combined with the permutation method exhibit an excellent separation degree, and the results provide the basis for VHRRS information extraction in urban land use settings based on RBSIDLC.
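
A minimal sketch of the two scores combined in the query strategy: lowest-confidence uncertainty from class probabilities, and spectral information divergence (SID) between spectra, computed as the symmetric Kullback-Leibler divergence of the normalized spectra. How the paper weights and normalizes the two scores is not stated in the abstract, so the `alpha` weight, the squashing of SID into [0, 1), and the function `rank_batch` are assumptions for illustration only.

```python
import math


def least_confidence(class_probs):
    """Uncertainty score: 1 minus the probability of the predicted class."""
    return 1.0 - max(class_probs)


def sid(x, y, eps=1e-12):
    """Spectral information divergence between two non-negative spectra."""
    sx, sy = sum(x), sum(y)
    p = [max(v / sx, eps) for v in x]
    q = [max(v / sy, eps) for v in y]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))


def rank_batch(unlabeled, probs, labeled, batch_size, alpha=0.5):
    """Greedily pick samples that are both uncertain and spectrally novel.

    `labeled` must contain at least one spectrum. `alpha` is a
    hypothetical trade-off weight between novelty and uncertainty.
    """
    reference = list(labeled)
    remaining = list(range(len(unlabeled)))
    selected = []

    def score(i):
        # Divergence from the closest spectrum already covered.
        novelty = min(sid(unlabeled[i], r) for r in reference)
        novelty = novelty / (1.0 + novelty)  # squash into [0, 1)
        return alpha * novelty + (1 - alpha) * least_confidence(probs[i])

    while remaining and len(selected) < batch_size:
        best = max(remaining, key=score)
        selected.append(best)
        reference.append(unlabeled[best])  # later picks must differ from it
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    pool = [[1, 2, 3], [3, 2, 1], [2, 4, 6]]
    pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]
    print(rank_batch(pool, pool_probs, labeled=[[1, 2, 3]], batch_size=2))
```

In the procedure described above, the class probabilities would come from the random forest, and the selected samples would be labeled by the oracle before the forest is retrained.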


Author(s):  
Hengyi Cai ◽  
Hongshen Chen ◽  
Yonghao Song ◽  
Xiaofang Zhao ◽  
Dawei Yin

Humans benefit from previous experiences when taking actions. Similarly, related examples from the training data also provide exemplary information for neural dialogue models when responding to a given input message. However, effectively fusing such exemplary information into dialogue generation is non-trivial: useful exemplars are required to be not only literally similar, but also topically related to the given context. Noisy exemplars impair the neural dialogue model's understanding of the conversation topics and even corrupt the response generation. To address these issues, we propose an exemplar-guided neural dialogue generation model where exemplar responses are retrieved in terms of both the text similarity and the topic proximity through a two-stage exemplar retrieval model. In the first stage, a small subset of conversations is retrieved from a training set given a dialogue context. These candidate exemplars are then finely ranked regarding the topical proximity to choose the best-matched exemplar response. To further induce the neural dialogue generation model to consult the exemplar response and the conversation topics more faithfully, we introduce a multi-source sampling mechanism to provide the dialogue model with both local exemplary semantics and global topical guidance during decoding. Empirical evaluations on a large-scale conversation dataset show that the proposed approach significantly outperforms the state-of-the-art in terms of both the quantitative metrics and human evaluations.


2020 ◽  
Author(s):  
Marcelo Inuzuka ◽  
Hugo Do Nascimento ◽  
Fernando Almeida ◽  
Bruno Barros ◽  
Walid Jradi

This article introduces Doclass, free and open-source software for the Web that aims to assist in labeling and classifying large sets of documents. The research followed a design science research methodology, guided by the real demands of a legal text processing company. The architecture, several design decisions and the current development stage of the software are presented. Preliminary user experiments for evaluating interactive document labeling are described. As a result, the first version of a system with an architecture composed of a mobile frontend that communicates with a backend through a REST API was published, with satisfactory performance evaluation by the applicant. Other results involve the use of active learning techniques to reduce human effort when performing the classification of documents, as well as the Uncertainty strategy to choose the document to be labeled. The effectiveness of the stop criterion for the active learning technique based on confidence level was tested and proved unsatisfactory, and remains future work.


Author(s):  
Changdong Xu ◽  
Xin Geng

Hierarchical classification is a challenging problem where the class labels are organized in a predefined hierarchy. One primary challenge in hierarchical classification is the small training set issue of the local module. The local classifiers in the previous hierarchical classification approaches are prone to over-fitting, which becomes a major bottleneck of hierarchical classification. Fortunately, the labels in the local module are correlated, and the siblings of the true label can provide additional supervision information for the instance. This paper proposes a novel method to deal with the small training set issue. The key idea of the method is to represent the correlation among the labels by the label distribution. It generates a label distribution that contains the supervision information of each label for the given instance, and then learns a mapping from the instance to the label distribution. Experimental results on several hierarchical classification datasets show that our method significantly outperforms other state-of-the-art hierarchical classification approaches.


1994 ◽  
Vol 05 (01) ◽  
pp. 67-75 ◽  
Author(s):  
Byoung-Tak Zhang

Much previous work on training multilayer neural networks has attempted to speed up the backpropagation algorithm using more sophisticated weight modification rules, whereby all the given training examples are used in a random or predetermined sequence. In this paper we investigate an alternative approach in which the learning proceeds on an increasing number of selected training examples, starting with a small training set. We derive a measure of criticality of examples and present an incremental learning algorithm that uses this measure to select a critical subset of given examples for solving the particular task. Our experimental results suggest that the method can significantly improve training speed and generalization performance in many real applications of neural networks. This method can be used in conjunction with other variations of gradient descent algorithms.


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Han Huang ◽  
Hongyu Wang ◽  
Dawei Jin

Named entity recognition (NER) is an indispensable part of many natural language processing technologies, such as information extraction, information retrieval, and intelligent Q & A. This paper describes the development of the AL-CRF model, which is a NER approach based on active learning (AL). The AL-CRF model performs the following sequence of steps: first, the samples are clustered using the k-means approach. Then, stratified sampling is performed on the produced clusters in order to obtain initial samples, which are used to train the basic conditional random field (CRF) classifier. The next step is the selection process, which uses the criterion of entropy. More specifically, samples having the highest entropy values are added to the training set. Afterwards, the learning process is repeated, and the CRF classifier is retrained based on the obtained training set. The learning and selection processes of AL run iteratively until the harmonic mean F stabilizes and the final NER model is obtained. Several NER experiments are performed on legislative and medical cases in order to validate the AL-CRF performance. The testing data include Chinese judicial documents and Chinese electronic medical records (EMRs). Testing indicates that our proposed algorithm has better recognition accuracy and recall rate compared to the conventional CRF model. Moreover, the main advantage of our approach is that it requires fewer manually labelled training samples, and at the same time, it is more effective. This can result in a more cost-effective and more reliable process.
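
The entropy-based selection step and the stopping rule can be illustrated with a short sketch. It treats each sample as having a single categorical output distribution, which is a simplification: a CRF produces distributions over label sequences, and the abstract does not say how entropy is computed for them. The function names and the `window` and `tol` parameters of the stabilization check are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k):
    """Indices of the k pool samples with the highest predictive entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]


def has_stabilized(f_scores, window=3, tol=1e-3):
    """True when the last `window` F-scores differ by at most `tol`."""
    if len(f_scores) < window:
        return False
    recent = f_scores[-window:]
    return max(recent) - min(recent) <= tol


if __name__ == "__main__":
    pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
    print(select_most_uncertain(pool, 2))
    print(has_stabilized([0.70, 0.80, 0.8005, 0.8002]))
```

In a full loop, the selected samples would be labeled, added to the training set, and the classifier retrained before the F-score is measured again.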


2005 ◽  
Vol 293-294 ◽  
pp. 135-142
Author(s):  
Graeme Manson ◽  
Gareth Pierce ◽  
Keith Worden ◽  
Daley Chetwynd

This paper considers the performance of radial basis function neural networks for the purpose of data classification. The methods are illustrated using a simple two-class problem. Two techniques for reducing the rate of misclassifications, via the introduction of an “unable to classify” label, are presented. The first of these considers the imposition of a threshold value on the classifier outputs whilst the second considers the replacement of the crisp network weights with interval ranges. Two network training techniques are investigated and it is found that, although thresholding and uncertain weights give similar results, the level of variability of network performance is dependent upon the training approach.
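
The first technique, thresholding the classifier outputs, might look like the sketch below. The abstract does not give the exact rule, so this assumes the simplest variant: the winning output must reach a threshold, otherwise the sample receives the “unable to classify” label (represented here as `None`). The function name and the default threshold of 0.7 are hypothetical.

```python
def classify_with_reject(outputs, threshold=0.7):
    """Return the index of the winning class, or None if it is too weak.

    `outputs` holds one score per class (for example, network outputs
    in [0, 1]). None stands for the "unable to classify" label.
    """
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    if outputs[best] < threshold:
        return None
    return best


if __name__ == "__main__":
    for scores in ([0.9, 0.1], [0.55, 0.45], [0.2, 0.8]):
        print(scores, classify_with_reject(scores))
```

Raising the threshold trades a lower misclassification rate for more rejected samples. The interval-weight variant mentioned above is not covered by this sketch.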

