PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data

Author(s):  
Ricky Maulana Fajri ◽  
Samaneh Khoshrou ◽  
Robert Peharz ◽  
Mykola Pechenizkiy
Keyword(s):  
2021 ◽  
Vol 13 (11) ◽  
pp. 2234
Author(s):  
Xin Luo ◽  
Huaqiang Du ◽  
Guomo Zhou ◽  
Xuejian Li ◽  
Fangjie Mao ◽  
...  

An informative training set is necessary for ensuring the robust performance of the classification of very-high-resolution remote sensing (VHRRS) images, but labeling work is often difficult, expensive, and time-consuming. This makes active learning (AL) an important part of an image analysis framework. AL aims to efficiently build a representative and efficient library of training samples that are most informative for the underlying classification task, thereby minimizing the cost of obtaining labeled data. Based on ranked batch-mode active learning (RBMAL), this paper proposes a novel combined query strategy of spectral information divergence lowest confidence uncertainty sampling (SIDLC), called RBSIDLC. The base classifier of random forest (RF) is initialized by using a small initial training set, and each unlabeled sample is analyzed to obtain the classification uncertainty score. A spectral information divergence (SID) function is then used to calculate the similarity score, and according to the final score, the unlabeled samples are ranked in descending lists. The most “valuable” samples are selected according to ranked lists and then labeled by the analyst/expert (also called the oracle). Finally, these samples are added to the training set, and the RF is retrained for the next iteration. The whole procedure is iteratively implemented until a stopping criterion is met. The results indicate that RBSIDLC achieves high-precision extraction of urban land use information based on VHRRS; the accuracy of extraction for each land-use type is greater than 90%, and the overall accuracy (OA) is greater than 96%. After the SID replaces the Euclidean distance in the RBMAL algorithm, the RBSIDLC method greatly reduces the misclassification rate among different land types. Therefore, the similarity function based on SID performs better than that based on the Euclidean distance. In addition, the OA of RF classification is greater than 90%, suggesting that it is feasible to use RF to estimate the uncertainty score. Compared with the three single query strategies of other AL methods, sample labeling with the SIDLC combined query strategy yields a lower cost and higher quality, thus effectively reducing the misclassification rate of different land use types. For example, compared with the Batch_Based_Entropy (BBE) algorithm, RBSIDLC improves the precision of barren land extraction by 37% and that of vegetation by 14%. The 25 characteristics of different land use types screened by RF cross-validation (RFCV) combined with the permutation method exhibit an excellent separation degree, and the results provide the basis for VHRRS information extraction in urban land use settings based on RBSIDLC.


Author(s):  
Jujian Lv ◽  
Huimin Zhao ◽  
Rongjun Chen ◽  
Jin Zhan ◽  
Jianhong Li ◽  
...  

2021 ◽  
Author(s):  
Adrian Ahne ◽  
Guy Fagherazzi ◽  
Xavier Tannier ◽  
Thomas Czernichow ◽  
Francisco Orchard

BACKGROUND The amount of available textual health data such as scientific and biomedical literature is constantly growing and it becomes more and more challenging for health professionals to properly summarise those data and in consequence to practice evidence-based clinical decision making. Moreover, the exploration of large unstructured health text data is very challenging for non experts due to limited time, resources and skills. Current tools to explore text data lack ease of use, need high computation efforts and have difficulties to incorporate domain knowledge and focus on topics of interest. OBJECTIVE We developed a methodology which is able to explore and target topics of interest via an interactive user interface for experts and non-experts. We aim to reach near state of the art performance, while reducing memory consumption, increasing scalability and minimizing user interaction effort to improve the clinical decision making process. The performance is evaluated on diabetes-related abstracts from Pubmed. METHODS The methodology consists of four parts: 1) A novel interpretable hierarchical clustering of documents where each node is defined by headwords (describe documents in this node the most); 2) An efficient classification system to target topics; 3) Minimized users interaction effort through active learning; 4) A visual user interface through which a user interacts. We evaluated our approach on 50,911 diabetes-related abstracts from Pubmed which provide a hierarchical Medical Subject Headings (MeSH) structure, a unique identifier for a topic. Hierarchical clustering performance was compared against the implementation in the machine learning library scikit-learn. On a subset of 2000 randomly chosen diabetes abstracts, our active learning strategy was compared against three other strategies: random selection of training instances, uncertainty sampling which chooses instances the model is most uncertain about and an expected gradient length strategy based on convolutional neural networks (CNN). RESULTS For the hierarchical clustering performance, we achieved a F1-Score of 0.73 compared to scikit-learn’s of 0.76. Concerning active learning performance, after 200 chosen training samples based on these strategies, the weighted F1-Score over all MeSH codes resulted in satisfying 0.62 F1-Score of our approach, compared to 0.61 of the uncertainty strategy, 0.61 the CNN and 0.45 the random strategy. Moreover, our methodology showed a constant low memory use with increased number of documents but increased execution time. CONCLUSIONS We proposed an easy to use tool for experts and non-experts being able to combine domain knowledge with topic exploration and target specific topics of interest while improving transparency. Furthermore our approach is very memory efficient and highly parallelizable making it interesting for large Big Data sets. This approach can be used by health professionals to rapidly get deep insights into biomedical literature to ultimately improve the evidence-based clinical decision making process.


Author(s):  
Zengmao Wang ◽  
Bo Du ◽  
Lefei Zhang ◽  
Wenbin Hu ◽  
Dacheng Tao ◽  
...  

Sign in / Sign up

Export Citation Format

Share Document