A Novel Ensemble Framework Based on K-Means and Resampling for Imbalanced Data

Imbalanced classification is one of the most important problems in machine learning and data mining, and it arises in many real datasets. Basic classifiers such as SVM and KNN have long been applied to imbalanced datasets, in which one class has many more samples than another, but their classification performance is not ideal. Data preprocessing methods have been proposed to reduce the imbalance ratio of datasets and are combined with the basic classifiers to get better performance. To improve overall classification accuracy, we propose a novel classifier ensemble framework based on K-means and resampling (EKR). First, we divide the majority-class samples into several sub-clusters using K-means, with the value of k determined by the Average Silhouette Coefficient. Next, we resample each sub-cluster so that it contains the same number of samples as the minority class. Each adjusted sub-cluster is then combined with the minority class to form a balanced subset, a base classifier is trained on each balanced subset separately, and the base classifiers are finally integrated into a strong ensemble classifier. Extensive experimental results on 16 imbalanced datasets demonstrate the effectiveness and feasibility of the proposed algorithm in terms of multiple evaluation criteria, and EKR achieves better performance than several classical imbalanced classification algorithms that use different data preprocessing methods.
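The subset-construction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmeans` and `balanced_subsets` are hypothetical, k is passed in by hand instead of being chosen by the Average Silhouette Coefficient, and training and combining the base classifiers is left out.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def balanced_subsets(majority, minority, k, seed=0):
    """Cluster the majority class, resample each cluster to the minority
    size, and pair it with the full minority class (label 1)."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    n = len(minority)
    subsets = []
    for c in range(k):
        cluster = [p for p, lab in zip(majority, labels) if lab == c]
        if not cluster:
            continue  # an empty cluster contributes no subset
        if len(cluster) >= n:
            resampled = rng.sample(cluster, n)                   # undersample
        else:
            resampled = [rng.choice(cluster) for _ in range(n)]  # oversample with replacement
        subsets.append([(p, 0) for p in resampled] + [(p, 1) for p in minority])
    return subsets
```

Each returned subset holds as many majority samples (label 0) as minority samples (label 1), so a base classifier could be trained on each one and the predictions combined, for example by voting.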

Evolutionary Undersampling for Classification with Imbalanced Datasets: Proposals and Taxonomy

Evolutionary Computation ◽ 10.1162/evco.2009.17.3.275 ◽ 2009 ◽ Vol 17 (3) ◽ pp. 275-306 ◽ Cited By ~ 194

Author(s): Salvador García ◽ Francisco Herrera

Keyword(s): Fitness Function ◽ Imbalanced Data ◽ Selection Procedure ◽ Prototype Selection ◽ Imbalanced Datasets ◽ Classification Rate ◽ Minority Class ◽ Good Trade ◽ And Performance ◽ Nonparametric Statistical Procedures

Learning with imbalanced data is one of the recent challenges in machine learning. Various solutions have been proposed to treat this problem, such as modifying methods or applying a preprocessing stage. Within the preprocessing focused on balancing data, two tendencies exist: reduce the set of examples (undersampling) or replicate minority class examples (oversampling). Undersampling with imbalanced datasets could be considered as a prototype selection procedure with the purpose of balancing datasets to achieve a high classification rate, avoiding the bias toward majority class examples. Evolutionary algorithms have been used for classical prototype selection, showing good results, where the fitness function is associated with the classification and reduction rates. In this paper, we propose a set of methods called evolutionary undersampling that take into consideration the nature of the problem and use different fitness functions to obtain a good trade-off between the balance of the class distribution and performance. The study includes a taxonomy of the approaches and an overall comparison among our models and state-of-the-art undersampling methods. The results have been contrasted by using nonparametric statistical procedures and show that evolutionary undersampling outperforms the nonevolutionary models when the degree of imbalance is increased.
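To make the trade-off between class balance and performance concrete, here is a hypothetical fitness function for a binary chromosome that marks which majority examples are kept. The abstract does not give the paper's fitness functions, so the form below, the name `balance_fitness`, and the default `penalty=0.2` are all assumptions chosen for illustration.

```python
def balance_fitness(performance, mask, n_minority, penalty=0.2):
    """Score one candidate undersampling.

    performance -- classification score of the reduced set, in [0, 1]
                   (how it is measured is left to the caller)
    mask        -- list of 0/1 flags, 1 = keep this majority example
    n_minority  -- number of minority examples (all are kept)
    penalty     -- assumed weight of the imbalance term
    """
    n_kept = sum(mask)
    if n_kept == 0:
        # Discarding the whole majority class gets the full penalty.
        return performance - penalty
    # 0 when the kept majority matches the minority size, growing as they diverge.
    imbalance = abs(1.0 - n_minority / n_kept)
    return performance - imbalance * penalty
```

An evolutionary search would evaluate this score for every chromosome in the population and prefer masks that keep performance high while leaving the two classes close in size.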

Imbalanced Data Sets Classification Based on SVM for Sand-Dust Storm Warning

Discrete Dynamics in Nature and Society ◽ 10.1155/2015/562724 ◽ 2015 ◽ Vol 2015 ◽ pp. 1-8

Author(s): Yonghua Xie ◽ Yurong Liu ◽ Qingqiu Fu

Keyword(s): Dust Storm ◽ Adaptive Sampling ◽ Imbalanced Data ◽ Real Data ◽ Classification Performance ◽ Selection Strategy ◽ Data Sets ◽ Minority Class ◽ Redundant Data ◽ Sand Dust

For SVM classification of imbalanced sand-dust storm data sets, this paper proposes a hybrid self-adaptive sampling method named the SRU-AIBSMOTE algorithm. The method adaptively adjusts its neighbor selection strategy based on the internal distribution of the sample sets. It produces virtual minority class instances through randomized interpolation in the spherical space formed by minority class instances and their neighbors. Random undersampling is also applied to the majority class instances to remove redundant data from the sample sets. Comparative experimental results on real data sets from the Yanchi and Tongxin districts in Ningxia, China show that the SRU-AIBSMOTE method obtains better classification performance than some traditional classification methods.
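For orientation, the sketch below shows the classical SMOTE-style step that methods of this family build on: a synthetic point is placed at a random position on the segment between a minority instance and one of its nearest minority neighbors. It is not SRU-AIBSMOTE itself; the spherical-space interpolation and the adaptive neighbor selection from the abstract are not reproduced, and `smote_like` and its parameters are hypothetical names.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by segment interpolation.

    Assumes len(minority) >= 2 and points given as tuples of floats.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x, excluding x itself.
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to nb, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because every synthetic point is a convex combination of two minority instances, it always lies inside the bounding box of the minority class.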

Granular Classification for Imbalanced Datasets: A Minkowski Distance-Based Method

Algorithms ◽ 10.3390/a14020054 ◽ 2021 ◽ Vol 14 (2) ◽ pp. 54

Author(s): Chen Fu ◽ Jianhua Yang

Keyword(s): Imbalanced Data ◽ Main Idea ◽ Fuzzy Rule ◽ Classification Performance ◽ Distance Measures ◽ Minkowski Distance ◽ Imbalanced Datasets ◽ Minority Class ◽ Information Granules ◽ Practical Applications

The problem of classification for imbalanced datasets is frequently encountered in practical applications. The data to be classified in this problem are skewed, i.e., the samples of one class (the minority class) are far fewer than those of the other classes (the majority classes). When dealing with imbalanced datasets, most classifiers encounter a common limitation: they often obtain better classification performance on the majority classes than on the minority class. To alleviate this limitation, this study proposes a fuzzy rule-based modeling approach using information granules. Information granules, as entities derived and abstracted from data, can be used to describe and capture the characteristics (distribution and structure) of data from both majority and minority classes. Since the geometric characteristics of information granules depend on the distance measures used in the granulation process, the main idea of this study is to construct information granules on each class of imbalanced data using Minkowski distance measures and then to establish the classification models by using “If-Then” rules. The experimental results on synthetic and publicly available datasets show that the proposed Minkowski distance-based method can produce information granules with a range of geometric shapes and construct granular models with satisfactory classification performance for imbalanced datasets.
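The dependence of granule shape on the distance measure follows from the definition of the Minkowski distance, d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p). The sketch below only computes the distance; the function names are illustrative and nothing here reproduces the paper's granulation procedure.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length vectors."""
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Limit of the Minkowski distance as p goes to infinity."""
    return max(abs(a - b) for a, b in zip(x, y))
```

In two dimensions, the set of points within a fixed distance of a center is a diamond for p = 1, a disc for p = 2, and an axis-aligned square in the Chebyshev limit, which is why varying p yields granules of different geometric shapes.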

Synthetic Oversampling of Instances Using Clustering

International Journal of Artificial Intelligence Tools ◽ 10.1142/s0218213013500085 ◽ 2013 ◽ Vol 22 (02) ◽ pp. 1350008 ◽ Cited By ~ 2

Author(s): Atlántida I. Sánchez ◽ Eduardo F. Morales ◽ Jesus A. Gonzalez

Keyword(s): Imbalanced Data ◽ Data Sets ◽ Minority Class ◽ Imbalanced Data Sets ◽ Tuning Parameters ◽ New Methods ◽ Real World Applications ◽ Noisy Examples ◽ F Measure ◽ Better Than

Class imbalance is common in many real-world applications. As many classifiers tend to perform worse on the minority class, several approaches have been proposed to deal with this problem. In this paper, we propose two new cluster-based oversampling methods, SOI-C and SOI-CJ. The proposed methods create clusters from the minority class instances and generate synthetic instances inside those clusters. In contrast with other oversampling methods, the proposed approaches avoid creating new instances in majority class regions. They are more robust to noisy examples (the number of new instances generated per cluster is proportional to the cluster's size). The clusters are automatically generated. Our new methods do not need tuning parameters, and they can deal with both numerical and nominal attributes. The two methods were tested with twenty artificial datasets and twenty-three datasets from the UCI Machine Learning repository. For our experiments, we used six classifiers, and results were evaluated with recall, precision, F-measure, and AUC measures, which are more suitable for class-imbalanced datasets. We performed ANOVA and paired t-tests to show that the proposed methods are competitive and in many cases significantly better than the rest of the oversampling methods used during the comparison.
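The proportional allocation mentioned in parentheses can be sketched as follows. The abstract does not say how fractional counts are rounded, so the largest-remainder rule used here is an assumption, and `allocate_synthetic` is a hypothetical name, not part of SOI-C or SOI-CJ.

```python
def allocate_synthetic(cluster_sizes, n_new):
    """Split n_new synthetic instances across clusters in proportion to size."""
    total = sum(cluster_sizes)
    exact = [n_new * size / total for size in cluster_sizes]
    counts = [int(e) for e in exact]  # round every share down first
    # Assumed rounding rule: hand the remaining instances to the clusters
    # with the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[: n_new - sum(counts)]:
        counts[i] += 1
    return counts
```

Under this rule a small cluster, which is more likely to consist of noisy examples, receives correspondingly few synthetic instances.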

Oversampling for Imbalanced Data via Optimal Transport

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33015605 ◽ 2019 ◽ Vol 33 ◽ pp. 5605-5612 ◽ Cited By ~ 1

Author(s): Yuguang Yan ◽ Mingkui Tan ◽ Yanwu Xu ◽ Jiezhang Cao ◽ Michael Ng ◽ ...

Keyword(s): Real World ◽ Optimal Transport ◽ Imbalanced Data ◽ Data Sets ◽ Similar Distribution ◽ Real World Data ◽ Geometric Information ◽ Minority Class ◽ Real World Applications ◽ Multiple Metrics

The issue of data imbalance occurs in many real-world applications, especially in medical diagnosis, where normal cases usually far outnumber abnormal cases. To alleviate this issue, one of the most important approaches is oversampling, which seeks to synthesize minority class samples to balance the numbers of different classes. However, existing methods barely consider the global geometric information involved in the distribution of minority class samples, and thus may incur a distribution mismatch between real and synthetic samples. In this paper, relying on optimal transport (Villani 2008), we propose an oversampling method that exploits global geometric information of the data to make synthetic samples follow a distribution similar to that of the minority class samples. Moreover, we introduce a novel regularization based on synthetic samples and shift the distribution of minority class samples according to loss information. Experiments on toy and real-world data sets demonstrate the efficacy of our proposed method in terms of multiple metrics.

Comprehensive Assessment of Imbalanced Data Classification

International Journal of Engineering and Advanced Technology - Regular Issue ◽ 10.35940/ijeat.d7349.049420 ◽ 2020 ◽ Vol 9 (4) ◽ pp. 1426-1431

Keyword(s): Real World ◽ Imbalanced Data ◽ Predictive Modelling ◽ Classification Problem ◽ Minority Class ◽ Unequal Distribution ◽ Imbalanced Classification ◽ Imbalanced Data Classification ◽ Improved Performance ◽ And Performance

This is an attempt to address the challenges, opportunities, and scope for formulating and designing new procedures for the imbalanced classification problem. The problem challenges predictive modelling because many AI, ML, and DL algorithms that are extensively used for classification are designed on the assumption of an equal number of examples per class. This leads to poor efficiency and performance, especially on the minority class, which is crucial, sensitive to classification errors, and of utmost importance in imbalanced classification. This chapter discusses unequal class distributions in training datasets and offers novel, in-depth insights into them. Most real-time and real-world classification tasks involve imbalanced distributions and therefore need specialized techniques that yield more sophisticated models with minimal errors and improved performance.

Imbalanced data classification based on hybrid resampling and twin support vector machine

Computer Science and Information Systems ◽ 10.2298/csis161221017l ◽ 2017 ◽ Vol 14 (3) ◽ pp. 579-595 ◽ Cited By ~ 2

Author(s): Lu Cao ◽ Hong Shen

Keyword(s): Support Vector Machine ◽ Real Life ◽ Imbalanced Data ◽ Data Classification ◽ Training Data ◽ Twin Support Vector Machine ◽ Support Vector ◽ Imbalanced Datasets ◽ Minority Class ◽ Imbalanced Data Classification

Imbalanced datasets exist widely in real life. The identification of the minority class in imbalanced datasets tends to be the focus of classification. As an enhanced variant of the support vector machine (SVM), the twin support vector machine (TWSVM) provides an effective technique for data classification. TWSVM relies on a relative balance in the training sample dataset and distribution to improve the classification accuracy of the whole dataset; however, it is not effective in dealing with imbalanced data classification problems. In this paper, we propose to combine a re-sampling technique, which uses oversampling and under-sampling to balance the training data, with TWSVM to deal with imbalanced data classification. Experimental results show that our proposed approach outperforms other state-of-the-art methods.

A Cost-Sensitive Ensemble Method for Class-Imbalanced Datasets

Abstract and Applied Analysis ◽ 10.1155/2013/196256 ◽ 2013 ◽ Vol 2013 ◽ pp. 1-6 ◽ Cited By ~ 9

Author(s): Yong Zhang ◽ Dapeng Wang

Keyword(s): Imbalanced Data ◽ Ensemble Method ◽ Support Vector ◽ Data Sets ◽ Imbalanced Learning ◽ Imbalanced Datasets ◽ Learning Methods ◽ Training Samples ◽ Imbalanced Data Classification ◽ Area Under Roc Curve

In imbalanced learning, resampling methods modify an imbalanced dataset to form a balanced dataset. Many base classifiers perform better on balanced datasets than on imbalanced ones. This paper proposes a cost-sensitive ensemble method based on cost-sensitive support vector machine (SVM) and query-by-committee (QBC) to solve imbalanced data classification. The proposed method first divides the majority-class dataset into several subdatasets according to the proportion of imbalanced samples and trains subclassifiers using the AdaBoost method. Then, it generates candidate training samples by the QBC active learning method and uses cost-sensitive SVM to learn the training samples. Experimental results on 5 class-imbalanced datasets show that the proposed method achieves higher area under the ROC curve (AUC), F-measure, and G-mean than many existing class-imbalanced learning methods.
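Two of the evaluation measures named above, F-measure and G-mean, can be computed directly from a binary confusion matrix in which the minority class is the positive class. The sketch below uses the standard definitions; the function name `imbalance_metrics` is illustrative, and AUC is omitted because it needs ranked scores and not just counts.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Accuracy, F-measure and G-mean from binary confusion counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0       # minority-class recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0  # majority-class recall
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"accuracy": accuracy, "f_measure": f_measure, "g_mean": g_mean}

# A classifier that labels all 1000 samples as majority, with 50 minority samples:
always_majority = imbalance_metrics(tp=0, fn=50, fp=0, tn=950)
```

In that example accuracy is 0.95 while F-measure and G-mean are both 0, which is why these measures are preferred over accuracy on imbalanced data.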

A Systematic Methodology on Class Imbalanced Problems involved in the Classification of Real-World Datasets

International Journal of Recent Technology and Engineering - 2 ◽ 10.35940/ijrte.c5756.098319 ◽ 2019 ◽ Vol 8 (3) ◽ pp. 7071-7081

Keyword(s): Machine Learning ◽ Real World ◽ Imbalanced Data ◽ Past Research ◽ Future Research ◽ Data Sets ◽ Time Data ◽ Real World Data ◽ The Past ◽ Research Studies

Current-generation real-world data sets processed through machine learning are imbalanced by nature. This imbalanced data presents researchers with a challenging scenario in the context of prediction for both machine learning and data mining algorithms. Past research studies show that most imbalanced data sets consist of major classes and minor classes, and that the major class dominates the minor class. Several standard and hybrid prediction algorithms have been proposed in various application domains, but most of the real-time data sets analyzed in these studies are imbalanced by nature, which affects the accuracy of the prediction. This paper presents a systematic survey of past research studies to analyze the intrinsic data characteristics and the techniques used for handling class-imbalanced data. In addition, this study reveals the research gaps, trends, and patterns in existing studies and briefly discusses future research directions.

A Comparison Study of Cost-Sensitive Learning and Sampling Methods on Imbalanced Data Sets

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.271-273.1291 ◽ 2011 ◽ Vol 271-273 ◽ pp. 1291-1296

Author(s): Jin Wei Zhang ◽ Hui Juan Lu ◽ Wu Tao Chen ◽ Yi Lu

Keyword(s): Sampling Method ◽ Sampling Methods ◽ Imbalanced Data ◽ Distribution Data ◽ Data Sets ◽ Data Set ◽ Minority Class ◽ Misclassification Cost ◽ Cost Sensitive Learning ◽ Class Distribution

A classifier built from a data set with a highly skewed class distribution generally predicts an unknown sample as the majority class much more frequently than as the minority class, because the classifier is designed to achieve the highest classification accuracy. We compare three classification methods for data sets in which the class distribution is imbalanced and the misclassification cost is non-uniform: a cost-sensitive learning method whose misclassification cost is embedded in the algorithm, an over-sampling method, and an under-sampling method. We compare these three methods to determine which one produces the best overall classification under any circumstance. We reach the following conclusions: (1) Cost-sensitive learning is suitable for the classification of imbalanced datasets. It outperforms sampling methods overall and is more stable than sampling methods, except when the data set is quite small. (2) If the dataset is highly skewed or quite small, over-sampling methods may be better.
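One common way to account for non-uniform misclassification cost is the minimum-expected-cost decision rule, sketched below. It is a generic illustration and not necessarily the cost-sensitive learner evaluated in the paper; it assumes that correct predictions cost nothing and that a calibrated minority-class probability is available, and the function names are hypothetical.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting the minority class is cheaper.

    cost_fp -- cost of labeling a majority sample as minority
    cost_fn -- cost of labeling a minority sample as majority
    """
    return cost_fp / (cost_fp + cost_fn)

def predict_min_cost(p_minority, cost_fp, cost_fn):
    """Return 1 (minority) if that prediction has the lower expected cost."""
    expected_cost_if_majority = p_minority * cost_fn        # risk of a missed minority sample
    expected_cost_if_minority = (1 - p_minority) * cost_fp  # risk of a false alarm
    return 1 if expected_cost_if_majority > expected_cost_if_minority else 0
```

With `cost_fp=1` and `cost_fn=9` the threshold drops from 0.5 to 0.1, so a sample with a minority probability of only 0.2 is still assigned to the minority class.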
