RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification

2021 ◽  
Author(s):  
Michał Koziarski ◽  
Colin Bellinger ◽  
Michał Woźniak

Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asymmetric misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data space for synthetic oversampling. The category of sub-region used for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our $5\times 2$ cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
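The precision-recall trade-off and the G-mean mentioned in this abstract can be made concrete with a small calculation. The following is a minimal sketch, not code from the paper: the function name `binary_metrics` and the toy labels are illustrative assumptions, and G-mean is taken in its usual form as the geometric mean of recall (sensitivity) and specificity.

```python
import math

def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, specificity and G-mean for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    # Guard against empty denominators on degenerate inputs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "g_mean": math.sqrt(recall * specificity),
    }

# Hypothetical toy labels: 2 minority (positive) and 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because G-mean multiplies the per-class rates, a classifier that ignores the minority class scores zero regardless of its overall accuracy, which is why it is commonly reported for imbalanced data.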

2020 ◽  
Vol 34 (04) ◽  
pp. 6680-6687
Author(s):  
Jian Yin ◽  
Chunjing Gan ◽  
Kaiqi Zhao ◽  
Xuan Lin ◽  
Zhe Quan ◽  
...  

Recently, imbalanced data classification has received much attention due to its wide range of applications. In the literature, existing studies have attempted to improve classification performance by considering various factors such as the imbalanced distribution, cost-sensitive learning, data space improvement, and ensemble learning. Nevertheless, most existing methods focus on only some of these aspects. In this work, we propose a novel imbalanced data classification model that considers all of them. To evaluate the performance of our proposed model, we conducted experiments on 14 public datasets. The results show that our model outperforms the state-of-the-art methods in terms of recall, G-mean, F-measure and AUC.


2013 ◽  
Vol 427-429 ◽  
pp. 2309-2312
Author(s):  
Hai Bin Mei ◽  
Ming Hua Zhang

Alert classifiers built with supervised classification techniques require large amounts of labeled training alerts. Preparing such training data is very difficult and expensive. Thus, the accuracy and feasibility of current classifiers are greatly restricted. This paper employs semi-supervised learning to build an alert classification model that reduces the number of labeled training alerts needed. Alert context properties are also introduced to improve classification performance. Experiments demonstrate the accuracy and feasibility of our approach.


2020 ◽  
Vol 39 (5) ◽  
pp. 7657-7669
Author(s):  
Linyong Zhou ◽  
Shanping You ◽  
Bimo Ren ◽  
Xuhong Yu ◽  
Xiaoyao Xie

Pulsars are highly magnetized, rotating neutron stars with small volume and high density. The discovery of pulsars is of great significance in the fields of physics and astronomy. With the development of artificial intelligence, image recognition models based on deep learning are increasingly utilized for pulsar candidate identification. However, pulsar candidate datasets are characterized by class imbalance and a lack of positive samples, which causes traditional methods to suffer from poor performance and model bias. To this end, a general image recognition model based on adversarial training is proposed. The model includes a generator, a classifier, and two discriminators. Theoretical analysis demonstrates that the model has a unique optimal solution, and that the classifier happens to be the inference network of the generator. Therefore, the samples produced by the generator significantly augment the diversity of the training data. When the model reaches equilibrium, it can not only predict labels for unseen data, but also generate controllable samples. In experiments, we split off part of the MNIST data for training. The results reveal that the model not only achieves better classification performance than a CNN, but also has better controllability than CGAN and ACGAN. Then, the model is applied to the pulsar candidate datasets HTRU and FAST. The results show that, compared with the CNN model, the F-score increased by 1.99% and 3.67%, and the recall increased by 6.28% and 8.59%, respectively.


2013 ◽  
Vol 39 (4) ◽  
pp. 847-884 ◽  
Author(s):  
Emili Sapena ◽  
Lluís Padró ◽  
Jordi Turmo

This work focuses on machine learning research for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that refer to the same entity. The main contributions of this article are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach is able to use an entity-mention classification model with more expressiveness than the pair-based ones, and overcomes the weaknesses of previous state-of-the-art approaches, such as linking contradictions, classifications without context, and a lack of information when evaluating pairs. Furthermore, the approach allows the incorporation of new information by adding constraints, and research has been done on using world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved results at the state-of-the-art level, and participated in international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second place in CoNLL-2011.


2021 ◽  
Vol 15 ◽  
pp. 174830262110449
Author(s):  
Kai-Jun Hu ◽  
He-Feng Yin ◽  
Jun Sun

During the past decade, representation-based classification methods have received considerable attention in the pattern recognition community. The recently proposed non-negative representation-based classifier achieved superb recognition results in diverse pattern classification tasks. Unfortunately, the discriminative information of the training data is not fully exploited in the non-negative representation-based classifier, which undermines its classification performance in practical applications. To address this problem, we introduce a decorrelation regularizer into the formulation of the non-negative representation-based classifier and propose a discriminative non-negative representation-based classifier for pattern classification. The decorrelation regularizer is able to reduce the correlation of representation results of different classes, thus promoting the competition among them. Experimental results on benchmark datasets validate the efficacy of the proposed discriminative non-negative representation-based classifier, and it can outperform some state-of-the-art deep-learning-based methods. The source code of our proposed discriminative non-negative representation-based classifier is accessible at https://github.com/yinhefeng/DNRC .


2017 ◽  
Vol 14 (3) ◽  
pp. 579-595 ◽  
Author(s):  
Lu Cao ◽  
Hong Shen

Imbalanced datasets exist widely in real life. The identification of the minority class in imbalanced datasets tends to be the focus of classification. As a variant of the enhanced support vector machine (SVM), the twin support vector machine (TWSVM) provides an effective technique for data classification. TWSVM relies on a relative balance in the training sample dataset and distribution to improve the classification accuracy of the whole dataset; however, it is not effective in dealing with imbalanced data classification problems. In this paper, we propose to combine a re-sampling technique, which utilizes oversampling and under-sampling to balance the training data, with TWSVM to deal with imbalanced data classification. Experimental results show that our proposed approach outperforms other state-of-the-art methods.
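The idea of balancing training data by combining oversampling and under-sampling can be sketched as follows. This is only a hedged illustration using plain random resampling; the abstract does not specify the authors' re-sampling technique, and the function name `rebalance`, the `target_per_class` parameter, and the toy data are assumptions introduced here.

```python
import random
from collections import Counter

def rebalance(samples, labels, target_per_class, seed=0):
    """Randomly over- or under-sample every class to target_per_class items."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target_per_class:
            # Under-sample: draw without replacement from the larger class.
            chosen = rng.sample(xs, target_per_class)
        else:
            # Over-sample: keep all originals and duplicate random items.
            extra = [rng.choice(xs) for _ in range(target_per_class - len(xs))]
            chosen = xs + extra
        out_x.extend(chosen)
        out_y.extend([y] * target_per_class)
    return out_x, out_y

# Hypothetical toy data: 10 majority samples (label 0), 2 minority samples (label 1).
samples = list(range(12))
labels = [0] * 10 + [1] * 2
bal_x, bal_y = rebalance(samples, labels, target_per_class=5)
counts = Counter(bal_y)
print(counts)
```

The rebalanced set would then be passed to the classifier in place of the original training data; resampling should be applied to the training split only, so that evaluation still reflects the original class distribution.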


Data ◽  
2019 ◽  
Vol 4 (3) ◽  
pp. 121
Author(s):  
Matteo Bodini

Interactions between online users have grown steadily in recent years, due to the latest developments of the web. People share online comments, opinions, and reviews about many topics. Aspect extraction is the automatic process of understanding the topic (the aspect) of such comments, and it has attracted huge interest from both commercial and academic points of view. For instance, reviews available in webshops (like eBay, Amazon, Aliexpress, etc.) can help customers in purchasing products, and automatic analysis of reviews would be useful, as it is sometimes almost impossible to read all the available ones. In recent years, aspect extraction in the Bangla language has been regarded as a task of growing importance. In the previous literature, a few methods have been introduced to classify Bangla texts according to the aspect they focus on. This kind of research is limited mainly by the lack of publicly available datasets for aspect extraction in the Bangla language. We take into account the only two publicly available, recently published datasets collected for the task of aspect extraction in the Bangla language. We then introduce several classification methods based on stacked auto-encoders, which as far as we know have never been applied to aspect extraction in Bangla, and we achieve better aspect classification performance than the state of the art: the experiments show average improvements of 0.17, 0.31 and 0.30 (across the two datasets) in precision, recall and F1-score, respectively, over the results reported in the state-of-the-art works that tackled the problem.


2014 ◽  
Vol 22 (1) ◽  
pp. 143-154 ◽  
Author(s):  
Sameer Pradhan ◽  
Noémie Elhadad ◽  
Brett R South ◽  
David Martinez ◽  
Lee Christensen ◽  
...  

Objective: The ShARe/CLEF eHealth 2013 Evaluation Lab Task 1 was organized to evaluate the state of the art on clinical text in (i) disorder mention identification/recognition based on the Unified Medical Language System (UMLS) definition (Task 1a) and (ii) disorder mention normalization to an ontology (Task 1b). Such a community evaluation has not been previously executed. Task 1a included a total of 22 system submissions, and Task 1b included 17. Most of the systems employed a combination of rules and machine learners. Materials and methods: We used a subset of the Shared Annotated Resources (ShARe) corpus of annotated clinical text: 199 clinical notes for training and 99 for testing (roughly 180K words in total). We provided the community with the annotated gold standard training documents to build systems to identify and normalize disorder mentions. The systems were tested on a held-out gold standard test set to measure their performance. Results: For Task 1a, the best-performing system achieved an F1 score of 0.75 (0.80 precision; 0.71 recall). For Task 1b, another system performed best with an accuracy of 0.59. Discussion: Most of the participating systems used a hybrid approach by supplementing machine-learning algorithms with features generated by rules and gazetteers created from the training data and from external resources. Conclusions: The task of disorder normalization is more challenging than that of identification. The ShARe corpus is available to the community as a reference standard for future studies.
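The Task 1a figures reported above are mutually consistent, which can be checked with the standard definition of F1 as the harmonic mean of precision and recall. The helper below is a small illustrative sketch (the name `f1_score` is chosen here and is not taken from the evaluation lab's tooling).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the best Task 1a system, as quoted in the abstract.
f1 = f1_score(0.80, 0.71)
print(round(f1, 2))  # 0.75
```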


2020 ◽  
Vol 222 (3) ◽  
pp. 1750-1764 ◽  
Author(s):  
Yangkang Chen

Effective and efficient arrival picking plays an important role in microseismic and earthquake data processing and imaging. Widely used arrival picking algorithms based on the short-term-average to long-term-average ratio (STA/LTA) are sensitive to moderate-to-strong random ambient noise. To make the state-of-the-art arrival picking approaches effective, microseismic data need to be first pre-processed, for example by removing a sufficient amount of noise, and then analysed by arrival pickers. To overcome the noise issue in arrival picking for weak microseismic or earthquake events, I leverage machine learning techniques to help recognize seismic waveforms in microseismic or earthquake data. Because supervised machine learning algorithms depend on a large volume of well-designed training data, I utilize an unsupervised machine learning algorithm to cluster the time samples into two groups, that is, waveform points and non-waveform points. The fuzzy clustering algorithm has been demonstrated to be effective for this purpose. A group of synthetic, real microseismic, and earthquake data sets with different levels of complexity shows that the proposed method is much more robust than the state-of-the-art STA/LTA method in picking microseismic events, even in the case of moderately strong background noise.
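For context, the STA/LTA baseline that this abstract compares against can be sketched in a few lines. This is a simplified, assumed variant (trailing windows over squared amplitude, with a fixed trigger threshold), not the author's implementation or the proposed fuzzy clustering method; the function name `sta_lta`, the window lengths, the threshold of 3.0, and the synthetic trace are all illustrative choices.

```python
def sta_lta(signal, n_sta, n_lta):
    """Trailing-window STA/LTA ratio of signal energy.

    Entries before the long window is full are left at 0.0.
    Assumes n_sta <= n_lta.
    """
    energy = [s * s for s in signal]
    # Prefix sums let each window average be computed in O(1).
    prefix = [0.0]
    for e in energy:
        prefix.append(prefix[-1] + e)
    ratio = [0.0] * len(signal)
    for i in range(n_lta - 1, len(signal)):
        sta = (prefix[i + 1] - prefix[i + 1 - n_sta]) / n_sta
        lta = (prefix[i + 1] - prefix[i + 1 - n_lta]) / n_lta
        ratio[i] = sta / lta if lta > 0 else 0.0
    return ratio

# Synthetic trace: weak background, then a stronger arrival starting at index 60.
trace = [0.1 * (-1) ** i for i in range(60)] + [1.0 * (-1) ** i for i in range(20)]
ratio = sta_lta(trace, n_sta=5, n_lta=50)
# Pick the first sample whose ratio exceeds the (assumed) trigger threshold.
pick = next(i for i, r in enumerate(ratio) if r > 3.0)
print(pick)
```

On this noise-free toy trace the ratio jumps at the onset. With stronger background noise the short- and long-window energies become comparable, the jump shrinks, and a fixed threshold becomes unreliable, which is the sensitivity the abstract describes.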


2018 ◽  
Vol 2018 ◽  
pp. 1-9 ◽  
Author(s):  
Mengxi Dai ◽  
Dezhi Zheng ◽  
Shucong Liu ◽  
Pengju Zhang

Motor-imagery-based brain-computer interfaces (BCIs) commonly use the common spatial pattern (CSP) as a preprocessing step before classification. The CSP method is a supervised algorithm. Therefore, a large amount of training data, which is time-consuming to collect, is needed to build the model. To address this issue, one promising approach is transfer learning, in which a learning model extracts discriminative information from other subjects for the target classification task. To this end, we propose a transfer kernel CSP (TKCSP) approach to learn a domain-invariant kernel by directly matching the distributions of source subjects and target subjects. Dataset IVa of BCI Competition III is used to demonstrate the validity of our proposed method. In the experiment, we compare the classification performance of TKCSP against CSP, CSP for subject-to-subject transfer (CSP SJ-to-SJ), regularizing CSP (RCSP), stationary subspace CSP (ssCSP), multitask CSP (mtCSP), and the combined mtCSP and ssCSP (ss + mtCSP) method. The results indicate that TKCSP achieves a superior mean classification performance of 81.14%, especially in the case of source subjects with fewer training samples. Comprehensive experimental evidence on the dataset verifies the effectiveness and efficiency of the proposed TKCSP approach over several state-of-the-art methods.

