Oversampling the minority class in a multi-linear feature space for imbalanced data classification

Many real-world datasets are imbalanced, i.e., one category contains significantly more samples than the others. Traditional classification methods treat all categories equally and are often ineffective on such data. Based on a comprehensive analysis of existing research, we propose a new imbalanced data classification method based on clustering. The method first clusters both the majority class and the minority class. The clustered minority class is then over-sampled with SMOTE, while the clustered majority class is randomly under-sampled. Through clustering, the proposed method can avoid losing useful information during resampling. Experiments on several UCI datasets show that the proposed method can effectively improve classification results on imbalanced data.
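The cluster-then-resample idea in this abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the cluster assignments are taken as given (the toy clusters below are hypothetical), `smote_like` is a simplified SMOTE-style interpolation that needs at least two points per cluster, and the function names, `k`, and `keep_fraction` are illustrative choices.

```python
import math
import random


def smote_like(points, n_new, k=2, seed=0):
    """Return n_new synthetic points, each lying on the segment between
    a sample and one of its k nearest neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        others = [p for p in points if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def undersample_per_cluster(clusters, keep_fraction, seed=0):
    """Randomly keep a fraction of every cluster, so no cluster vanishes."""
    rng = random.Random(seed)
    kept = []
    for members in clusters:
        n_keep = max(1, round(keep_fraction * len(members)))
        kept.extend(rng.sample(members, n_keep))
    return kept


# Hypothetical, already-clustered toy data (each inner list is one cluster).
minority_clusters = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    [(5.0, 5.0), (5.2, 5.1)],
]
majority_clusters = [
    [(2.0 + 0.1 * i, 2.0) for i in range(10)],
    [(8.0, 0.1 * i) for i in range(6)],
]

# Over-sample each minority cluster separately, doubling its size.
synthetic_by_cluster = [smote_like(c, n_new=len(c)) for c in minority_clusters]
new_minority = [
    p for c, s in zip(minority_clusters, synthetic_by_cluster) for p in c + s
]

# Under-sample each majority cluster to half its size.
new_majority = undersample_per_cluster(majority_clusters, keep_fraction=0.5)

print(len(new_minority), len(new_majority))  # 10 8
```

Because interpolation happens only between neighbours inside one cluster, synthetic points never bridge two distant minority groups, and sampling the majority class per cluster keeps every majority region represented.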

Imbalanced data classification based on hybrid resampling and twin support vector machine

Computer Science and Information Systems ◽ 10.2298/csis161221017l ◽ 2017 ◽ Vol 14 (3) ◽ pp. 579-595 ◽ Cited By ~ 2

Author(s): Lu Cao ◽ Hong Shen

Keyword(s): Support Vector Machine ◽ Real Life ◽ Imbalanced Data ◽ Data Classification ◽ Training Data ◽ Twin Support Vector Machine ◽ Support Vector ◽ Imbalanced Datasets ◽ Minority Class ◽ Imbalanced Data Classification

Imbalanced datasets exist widely in real life. The identification of the minority class in imbalanced datasets tends to be the focus of classification. As an enhanced variant of the support vector machine (SVM), the twin support vector machine (TWSVM) provides an effective technique for data classification. TWSVM relies on a relatively balanced training set and distribution to improve the classification accuracy of the whole dataset; however, it is not effective in dealing with imbalanced data classification problems. In this paper, we propose to combine a re-sampling technique, which uses over-sampling and under-sampling to balance the training data, with TWSVM to deal with imbalanced data classification. Experimental results show that our proposed approach outperforms other state-of-the-art methods.
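The abstract does not describe how a trained TWSVM assigns labels. In the standard formulation, TWSVM fits two non-parallel hyperplanes, one per class, and assigns a point to the class whose hyperplane is nearer. The sketch below shows only that decision rule; the plane coefficients are hypothetical values chosen for illustration rather than the result of training, and the function names are assumptions.

```python
import math


def plane_distance(x, w, b):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def twsvm_predict(x, plane_pos, plane_neg):
    """Assign x to the class whose hyperplane lies closer to it."""
    d_pos = plane_distance(x, *plane_pos)
    d_neg = plane_distance(x, *plane_neg)
    return 1 if d_pos <= d_neg else -1


# Hypothetical (w, b) pairs: the positive-class plane is x1 = 1,
# the negative-class plane is x2 = 4.
plane_pos = ((1.0, 0.0), -1.0)
plane_neg = ((0.0, 1.0), -4.0)

label_a = twsvm_predict((1.2, 0.5), plane_pos, plane_neg)  # near x1 = 1
label_b = twsvm_predict((3.0, 3.9), plane_pos, plane_neg)  # near x2 = 4
print(label_a, label_b)  # 1 -1
```

Each plane is fitted to lie close to its own class and far from the other, so a heavily skewed training set can distort both planes; that is one reason balancing the data beforehand is plausible.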

CDSMOTE: class decomposition and synthetic minority class oversampling technique for imbalanced-data classification

Neural Computing and Applications ◽ 10.1007/s00521-020-05130-z ◽ 2020

Author(s): Eyad Elyan ◽ Carlos Francisco Moreno-Garcia ◽ Chrisina Jayne

Keyword(s): Imbalanced Data ◽ Data Classification ◽ Minority Class ◽ Imbalanced Data Classification

Increasing Minority Recall Support Vector Machine Model for Imbalanced Data Classification

Discrete Dynamics in Nature and Society ◽ 10.1155/2021/6647557 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-12

Author(s): Chunye Wu ◽ Nan Wang ◽ Yu Wang

Keyword(s): Support Vector Machine ◽ Medical Diagnosis ◽ Imbalanced Data ◽ Data Classification ◽ Recall Rate ◽ Superior Performance ◽ Support Vector ◽ Minority Class ◽ Imbalanced Data Classification ◽ New Strategy

Imbalanced data classification is gaining importance in data mining and machine learning. The minority class recall rate requires special treatment in fields such as medical diagnosis, information security, industry, and computer vision. This paper proposes a new strategy and algorithm based on a cost-sensitive support vector machine to raise the minority class recall rate to 1, because the misclassification of even a few samples can cause serious losses in some physical problems. The proposed modification employs a margin compensation that makes the margin lopsided, which lets the decision boundary drift. When the boundary reaches a certain position, the minority class samples are generalized enough to meet the requirement of a recall rate of 1. In the experiments, the effects of different parameters on the performance of the algorithm were analyzed, and the optimal parameters for a recall rate of 1 were determined. The experimental results reveal that, for the imbalanced data classification problem, the traditional definite cost classification scheme and models selected using the area under the receiver operating characteristic curve criterion rarely reach a recall rate of 1. The new strategy can yield a minority recall of 1 for imbalanced data, as the loss on the majority class is acceptable; moreover, it improves the g-means index. The proposed algorithm provides superior performance in minority recall compared to conventional methods. The proposed method has important practical significance in credit card fraud detection, medical diagnosis, and other areas.
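The effect of a drifting decision boundary on recall and on the g-means index can be illustrated without an SVM solver. The sketch below is a simplified stand-in, not the paper's margin compensation: it takes hypothetical decision scores from some already-trained classifier and shifts the bias after the fact until every minority sample lands on the positive side. The g-mean is computed as the geometric mean of minority recall and majority specificity; the scores, labels, and function names are assumptions.

```python
import math


def recall_specificity(scores, labels, offset=0.0):
    """Recall on the minority class (+1) and specificity on the majority
    class (-1) when a sample is predicted positive if score + offset >= 0."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s + offset >= 0)
    tn = sum(1 for s, y in zip(scores, labels) if y == -1 and s + offset < 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def g_mean(recall, specificity):
    """Geometric mean of the two per-class accuracies."""
    return math.sqrt(recall * specificity)


# Hypothetical decision scores: 3 minority (+1) and 7 majority (-1) samples.
scores = [1.2, 0.3, -0.4, -0.2, -0.9, -1.5, -2.0, -0.6, -1.1, -3.0]
labels = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]

before = recall_specificity(scores, labels)

# Smallest shift that places the lowest-scoring minority sample on the boundary.
offset = -min(s for s, y in zip(scores, labels) if y == 1)
after = recall_specificity(scores, labels, offset)

print("before:", before, round(g_mean(*before), 3))
print("after: ", after, round(g_mean(*after), 3))
```

On this toy data the shift trades one majority sample for the missed minority sample: recall rises from 2/3 to 1, specificity falls from 1 to 6/7, and the g-mean still increases.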

Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models

Applied Sciences ◽ 10.3390/app11188546 ◽ 2021 ◽ Vol 11 (18) ◽ pp. 8546

Author(s): Mohamed S. Kraiem ◽ Fernando Sánchez-Hernández ◽ María N. Moreno-García

Keyword(s): Imbalanced Data ◽ Data Classification ◽ General Nature ◽ Small Range ◽ Minority Class ◽ Unequal Distribution ◽ Imbalanced Data Classification ◽ Wide Range ◽ The Impact ◽ Range Of Values

In many application domains, such as medicine, information retrieval, cybersecurity, and social media, the datasets used for inducing classification models often have an unequal distribution of instances across classes. This situation, known as imbalanced data classification, causes low predictive performance for minority class examples. Thus, the prediction model is unreliable even though the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies for dealing with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to intrinsic data characteristics, such as imbalance ratio, dataset size and dimensionality, overlap between classes, or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously over a wide range of values, since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies, which focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special-purpose techniques that could be valid for the target data.
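The point that overall accuracy can look acceptable while the model is unreliable for the minority class is easy to reproduce. The sketch below uses a hypothetical 95:5 dataset and a trivial classifier that always predicts the majority class; the imbalance ratio is computed here as the majority count divided by the minority count, which is one common definition and an assumption about what the paper uses.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the `positive` class that was predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


# Hypothetical dataset: 95 majority (0) and 5 minority (1) examples.
y_true = [0] * 95 + [1] * 5
# A degenerate model that ignores the minority class entirely.
y_pred = [0] * 100

ir = imbalance_ratio(y_true)
acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
print(ir, acc, minority_recall)  # 19.0 0.95 0.0
```

A 95% accuracy here carries no information about the minority class, which is why studies of this kind report several metrics rather than accuracy alone.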

A novel imbalanced data classification approach using both under and over sampling

Bulletin of Electrical Engineering and Informatics ◽ 10.11591/eei.v10i5.2785 ◽ 2021 ◽ Vol 10 (5) ◽ pp. 2789-2795

Author(s): Seyyed Mohammad Javadi Moghaddam ◽ Asadollah Noroozi

Keyword(s): Sampling Methods ◽ Data Distribution ◽ Imbalanced Data ◽ Data Classification ◽ Processing Technique ◽ Imbalanced Dataset ◽ Minority Class ◽ Imbalanced Data Classification ◽ Under Sampling ◽ Sampling Algorithms

Classification performance suffers when the data distribution is imbalanced, because classifiers tend to favor the majority class, which contains most of the instances. One popular approach is to balance the dataset using over-sampling and under-sampling methods. This paper presents a novel pre-processing technique that applies both over-sampling and under-sampling algorithms to an imbalanced dataset. The proposed method uses the SMOTE algorithm to enlarge the minority class. In addition, a cluster-based approach shrinks the majority class, taking into account the new size of the minority class. Experimental results on 10 imbalanced datasets show that the suggested algorithm performs better than previous approaches.
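The abstract does not say which cluster-based reduction it uses. One plausible reading, sketched here as an assumption, is centroid-based under-sampling: run k-means on the majority class with k set to the post-SMOTE minority size and keep the centroids as the reduced majority set. The tiny k-means below uses a deterministic initialisation for reproducibility; the data, `target_size`, and function name are hypothetical.

```python
import math


def kmeans_centroids(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) returning the k final centroids."""
    # Deterministic initialisation: k points evenly spaced through the list.
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


# Hypothetical majority class with 12 samples.
majority = [(float(i), float(i % 3)) for i in range(12)]

# Assumed size of the minority class after SMOTE has been applied.
target_size = 4

reduced_majority = kmeans_centroids(majority, k=target_size)
print(len(majority), "->", len(reduced_majority))  # 12 -> 4
```

Unlike random deletion, the centroids summarise every region of the majority class, at the cost of replacing real samples with averaged ones.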

Evolutionary Mahalanobis Distance-Based Oversampling for Multi-Class Imbalanced Data Classification

Sensors ◽ 10.3390/s21196616 ◽ 2021 ◽ Vol 21 (19) ◽ pp. 6616

Author(s): Leehter Yao ◽ Tung-Bin Lin

Keyword(s): Particle Swarm Optimization ◽ Computer Simulations ◽ Mahalanobis Distance ◽ Imbalanced Data ◽ Data Classification ◽ Minority Class ◽ Swarm Optimization ◽ Sensing Data ◽ Imbalanced Data Classification ◽ Effective Remedy

The amount of sensing data is often imbalanced across data classes, for which oversampling the minority class is an effective remedy. In this paper, an effective oversampling method called evolutionary Mahalanobis distance oversampling (EMDO) is proposed for multi-class imbalanced data classification. EMDO uses a set of ellipsoids to approximate the decision regions of the minority class. Furthermore, multi-objective particle swarm optimization (MOPSO) is integrated with the Gustafson–Kessel algorithm in EMDO to learn the size, center, and orientation of every ellipsoid. Synthetic minority samples are generated based on the Mahalanobis distance within every ellipsoid. The number of synthetic minority samples generated by EMDO in each ellipsoid is determined by the density of minority samples in that ellipsoid. The results of computer simulations indicate that EMDO outperforms most of the widely used oversampling schemes.
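To see why the Mahalanobis distance suits ellipsoidal regions, the sketch below computes it for a single 2-D minority cluster and uses it as a membership test for candidate synthetic points. It covers only the distance and an inside-the-ellipsoid check; the MOPSO and Gustafson–Kessel steps that EMDO uses to learn the ellipsoids are not reproduced. The sample points, the `radius_sq` threshold, and the function names are hypothetical.

```python
def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical minority cluster, stretched along the x-axis.
cluster = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
mean, cov = mean_and_cov(cluster)

# Two candidates at the same Euclidean distance (2.0) from the mean.
along_axis = mahalanobis_sq((4.0, 0.0), mean, cov)    # along the long axis
across_axis = mahalanobis_sq((2.0, 2.0), mean, cov)   # across the short axis

radius_sq = 2.0  # assumed ellipsoid size, as a squared Mahalanobis radius
candidates = [(4.0, 0.0), (2.0, 2.0), (3.0, 0.5)]
accepted = [c for c in candidates if mahalanobis_sq(c, mean, cov) <= radius_sq]
print(along_axis, across_axis, accepted)
```

The two candidates are equally far from the mean in Euclidean terms, yet only the one along the cluster's long axis falls inside the ellipsoid, so a Mahalanobis-based generator places synthetic samples where the minority data actually spreads.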
