An Improved Oversampling Algorithm Based on the Samples’ Selection Strategy for Classifying Imbalanced Data

For SVM classification of imbalanced sand-dust storm data sets, this paper proposes a hybrid self-adaptive sampling method, the SRU-AIBSMOTE algorithm. The method adaptively adjusts its neighbor selection strategy based on the internal distribution of the sample set. It produces virtual minority class instances through randomized interpolation in the spherical space formed by minority class instances and their neighbors. Random undersampling is also applied to the majority class instances to remove redundant data from the sample set. Comparative experiments on real data sets from the Yanchi and Tongxin districts of Ningxia, China, show that the SRU-AIBSMOTE method achieves better classification performance than some traditional classification methods.


Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Foundations of Computing and Decision Sciences ◽

10.1515/fcds-2017-0007 ◽

2017 ◽

Vol 42 (2) ◽

pp. 149-176 ◽

Cited By ~ 7

Author(s):

Szymon Wojciechowski ◽

Szymon Wilk

Keyword(s):

Experimental Study ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Performance ◽

Data Sets ◽

Artificial Data ◽

Minority Class ◽

Imbalanced Data Sets ◽

The Impact

In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers, applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors - 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods - SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
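The abstract names four types of minority examples but does not define them. A minimal sketch of one commonly used labelling, assuming each example is typed by how many of its 5 nearest neighbours share its class (the thresholds and the function name `example_type` are assumptions for illustration, not taken from the paper):

```python
import math


def example_type(point, label, data, k=5):
    """Label one example by the class make-up of its k nearest neighbours.

    data is a list of (features, label) pairs; the example itself is
    skipped by object identity. Thresholds assume k = 5.
    """
    others = [(x, y) for x, y in data if x is not point]
    nearest = sorted(others, key=lambda item: math.dist(point, item[0]))[:k]
    same = sum(1 for _, y in nearest if y == label)
    if same >= 4:
        return "safe"        # almost all neighbours share the class
    if same >= 2:
        return "borderline"  # mixed neighbourhood
    if same == 1:
        return "rare"        # a single same-class neighbour
    return "outlier"         # no same-class neighbour at all
```

Counting these labels over the minority class gives a rough picture of the distribution that the study varies in its artificial data sets.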


Improving the classification performance on imbalanced data sets via new hybrid parameterisation model

Journal of King Saud University - Computer and Information Sciences ◽

10.1016/j.jksuci.2019.04.009 ◽

2019 ◽

Author(s):

Masurah Mohamad ◽

Ali Selamat ◽

Imam Much Subroto ◽

Ondrej Krejcar

Keyword(s):

Imbalanced Data ◽

Classification Performance ◽

Data Sets ◽

Imbalanced Data Sets


SYNTHETIC OVERSAMPLING OF INSTANCES USING CLUSTERING

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500085 ◽

2013 ◽

Vol 22 (02) ◽

pp. 1350008 ◽

Cited By ~ 2

Author(s):

ATLÁNTIDA I. SÁNCHEZ ◽

EDUARDO F. MORALES ◽

JESUS A. GONZALEZ

Keyword(s):

Imbalanced Data ◽

Data Sets ◽

Minority Class ◽

Imbalanced Data Sets ◽

Tuning Parameters ◽

New Methods ◽

Real World Applications ◽

Noisy Examples ◽

F Measure ◽

Better Than

Imbalance in the class distribution of data sets is common in many real-world applications. As the performance of many classifiers tends to degrade on the minority class, several approaches have been proposed to deal with this problem. In this paper, we propose two new cluster-based oversampling methods, SOI-C and SOI-CJ. The proposed methods create clusters from the minority class instances and generate synthetic instances inside those clusters. In contrast with other oversampling methods, the proposed approaches avoid creating new instances in majority class regions. They are more robust to noisy examples (the number of new instances generated per cluster is proportional to the cluster's size). The clusters are automatically generated. Our new methods do not need tuning parameters, and they can deal with both numerical and nominal attributes. The two methods were tested with twenty artificial datasets and twenty-three datasets from the UCI Machine Learning repository. For our experiments, we used six classifiers, and results were evaluated with recall, precision, F-measure, and AUC measures, which are more suitable for class imbalanced datasets. We performed ANOVA and paired t-tests to show that the proposed methods are competitive and in many cases significantly better than the rest of the oversampling methods used during the comparison.
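The idea of generating synthetic instances inside minority clusters, with a per-cluster quota proportional to cluster size, can be sketched roughly as follows. This is not the SOI-C or SOI-CJ procedure itself: the clusters are assumed to be given as input (the paper generates them automatically), only numerical attributes are handled, and the name `oversample_in_clusters` is hypothetical.

```python
import random


def oversample_in_clusters(clusters, n_new, seed=0):
    """Create about n_new synthetic points, each inside one minority cluster.

    clusters is a list of lists of numeric tuples. Because quotas are
    rounded, the total can differ slightly from n_new.
    """
    rng = random.Random(seed)
    total = sum(len(cluster) for cluster in clusters)
    synthetic = []
    for cluster in clusters:
        # Quota proportional to cluster size, so small (possibly noisy)
        # clusters contribute few new points.
        quota = round(n_new * len(cluster) / total)
        for _ in range(quota):
            a = rng.choice(cluster)
            b = rng.choice(cluster)
            t = rng.random()
            # Interpolate between two members of the same cluster only.
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Restricting interpolation to pairs from the same cluster is what keeps new points out of the regions between clusters, where majority instances may lie.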


Raking and Relabeling for Imbalanced Data

10.36227/techrxiv.17712122.v1 ◽

2022 ◽

Author(s):

Seunghwan Park ◽

Hae-Wwan Lee ◽

Jongho Im

Keyword(s):

High Dimensional Data ◽

Imbalanced Data ◽

Sampling Strategy ◽

Classification Performance ◽

Mixed Data ◽

Categorical Variables ◽

High Dimensional ◽

Data Generation ◽

Minority Class ◽

Imbalanced Data Classification

We consider the binary classification of imbalanced data. A dataset is imbalanced if the class proportions are heavily skewed. Imbalanced data classification is often challenging, especially for high-dimensional data, because unequal classes deteriorate classifier performance. Undersampling the majority class and oversampling the minority class are popular methods for constructing balanced samples that improve classification performance. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms perform comparably to existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice, considering that it does not require density estimation for synthetic data generation in oversampling and is not affected by mixed-type variables. In addition, the proposed sampling strategy is robust to the choice of classifier, in the sense that classification performance is not sensitive to which classifier is used.


A Novel Selective Ensemble Algorithm for Imbalanced Data Classification Based on Exploratory Undersampling

Mathematical Problems in Engineering ◽

10.1155/2014/358942 ◽

2014 ◽

Vol 2014 ◽

pp. 1-14 ◽

Cited By ~ 5

Author(s):

Qing-Yan Yin ◽

Jiang-She Zhang ◽

Chun-Xia Zhang ◽

Nan-Nan Ji

Keyword(s):

Evaluation Criteria ◽

Class Imbalance ◽

Imbalanced Data ◽

Data Classification ◽

Data Sets ◽

Imbalanced Data Sets ◽

Ensemble Pruning ◽

Imbalanced Data Classification ◽

Selective Ensemble ◽

Nonparametric Statistical

Learning with imbalanced data is one of the emergent challenging tasks in machine learning. Recently, ensemble learning has arisen as an effective solution to class imbalance problems. The combination of bagging and boosting with data preprocessing resampling, namely, the simple and accurate exploratory undersampling, has become the most popular method for imbalanced data classification. In this paper, we propose a novel selective ensemble construction method based on exploratory undersampling, RotEasy, with the advantage of improving storage requirement and computational efficiency by ensemble pruning technology. Our methodology aims to enhance the diversity between individual classifiers through feature extraction and diversity regularized ensemble pruning. We made a comprehensive comparison between our method and some state-of-the-art imbalanced learning methods. Experimental results on 20 real-world imbalanced data sets show that RotEasy possesses a significant increase in performance, as confirmed by a nonparametric statistical test and various evaluation criteria.



Erratum to: Evaluation Measures of the Classification Performance of Imbalanced Data Sets

Communications in Computer and Information Science - Computational Intelligence and Intelligent Systems ◽

10.1007/978-3-642-04962-0_55 ◽

2009 ◽

pp. E1-E1 ◽

Cited By ~ 1

Author(s):

Qiong Gu ◽

Li Zhu ◽

Zhihua Cai

Keyword(s):

Imbalanced Data ◽

Classification Performance ◽

Data Sets ◽

Imbalanced Data Sets ◽

Evaluation Measures


DTO-SMOTE: Delaunay Tessellation Oversampling for Imbalanced Data Sets

Information ◽

10.3390/info11120557 ◽

2020 ◽

Vol 11 (12) ◽

pp. 557

Author(s):

Alexandre M. de Carvalho ◽

Ronaldo C. Prati

Keyword(s):

Machine Learning ◽

Geometric Mean ◽

Imbalanced Data ◽

Sampling Technique ◽

Classification Algorithms ◽

Data Sets ◽

Delaunay Tessellation ◽

Minority Class ◽

Imbalanced Data Sets

One of the significant challenges in machine learning is the classification of imbalanced data. In many situations, standard classifiers cannot learn how to distinguish minority class examples from the others. Since many real problems are unbalanced, this problem has become very relevant and deeply studied today. This paper presents a new preprocessing method based on Delaunay tessellation and the preprocessing algorithm SMOTE (Synthetic Minority Over-sampling Technique), which we call DTO-SMOTE (Delaunay Tessellation Oversampling SMOTE). DTO-SMOTE constructs a mesh of simplices (in this paper, we use tetrahedrons) for creating synthetic examples. We compare results with five preprocessing algorithms (GEOMETRIC-SMOTE, SVM-SMOTE, SMOTE-BORDERLINE-1, SMOTE-BORDERLINE-2, and SMOTE), eight classification algorithms, and 61 binary-class data sets. For some classifiers, DTO-SMOTE has higher performance than others in terms of Area Under the ROC curve (AUC), Geometric Mean (GEO), and Generalized Index of Balanced Accuracy (IBA).
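One step of this kind of approach, drawing a synthetic example inside a single simplex, can be sketched with barycentric weights. The sketch assumes the tetrahedron is already available (building the mesh would need a Delaunay routine such as `scipy.spatial.Delaunay`, which is not done here) and assumes uniform weights over the simplex; the abstract does not say how DTO-SMOTE weights the vertices, and `sample_in_simplex` is a hypothetical name.

```python
import random


def sample_in_simplex(vertices, rng):
    """Return a random convex combination of the simplex's vertices."""
    # Normalised exponential draws give weights that are uniform over
    # the simplex (a Dirichlet(1, ..., 1) sample).
    raw = [rng.expovariate(1.0) for _ in vertices]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(vertices[0])
    return tuple(
        sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim)
    )


# Example: one tetrahedron in 3-D, as in the paper's choice of simplices.
tetrahedron = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
point = sample_in_simplex(tetrahedron, random.Random(0))
```

Unlike interpolating along a line segment between two examples, a convex combination of all four vertices can land anywhere in the tetrahedron's volume.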


SMOTE: Its Potential and Shortcomings in Surveys (SMOTE: POTENSI DAN KEKURANGANNYA PADA SURVEI)

E-Jurnal Matematika ◽

10.24843/mtk.2021.v10.i04.p348 ◽

2021 ◽

Vol 10 (4) ◽

pp. 235

Author(s):

NI PUTU YULIKA TRISNA WIJAYANTI ◽

EKA N. KENCANA ◽

I WAYAN SUMARJAYA

Keyword(s):

Decision Making ◽

Real World ◽

Imbalanced Data ◽

Classification Performance ◽

Focus Of Attention ◽

Minority Class ◽

Algorithm Level ◽

Data Level

Imbalanced data is a problem often found in real-world classification tasks. With imbalanced data, misclassification tends to occur in the minority class. This can lead to errors in decision-making if the minority class carries important information and is the focus of attention in the research. Generally, there are two approaches to dealing with imbalanced data: the data-level approach and the algorithm-level approach. The data-level approach has proven to be very effective in dealing with imbalanced data and is more flexible. Oversampling is one of the data-level approaches and generally gives better results than undersampling. SMOTE is the most popular oversampling method and is used in many applications. In this study, we discuss the SMOTE method in more detail, including its potential and its disadvantages. In general, this method is intended to avoid overfitting and improve classification performance on the minority class. However, it also causes overgeneralization, which tends to produce overlapping between classes.
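For orientation, the core of SMOTE (interpolating between a minority example and one of its nearest minority neighbours) can be sketched in a few lines. This is a simplified illustration for numerical features, not the implementation discussed in the study; the function name `smote`, the default `k=5`, and the fixed seed are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built from minority examples.

    minority is a list of numeric tuples with at least two entries.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, itself excluded.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

Because the neighbours are chosen without looking at the majority class, a segment can cross a majority region, which is one way the overlapping mentioned in the abstract can arise.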
