SMOTE: ITS POTENTIAL AND DRAWBACKS (A SURVEY)

2021, Vol 10 (4), pp. 235
Author(s): NI PUTU YULIKA TRISNA WIJAYANTI, EKA N. KENCANA, I WAYAN SUMARJAYA

Imbalanced data is a problem often found in real-world classification tasks. With imbalanced data, misclassification tends to occur in the minority class. This can lead to errors in decision-making when the minority class carries important information and is the focus of the research. Generally, two approaches can be taken to deal with imbalanced data: the data-level approach and the algorithm-level approach. The data-level approach has proven very effective for imbalanced data and is more flexible. Oversampling is a data-level approach that generally gives better results than undersampling, and SMOTE is the most popular and most widely applied oversampling method. In this study, we discuss the SMOTE method in more detail, together with its potential and its disadvantages. In general, the method is intended to avoid overfitting and improve classification performance on the minority class. However, it can also cause overgeneralization, which tends to produce overlapping classes.
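The interpolation at the heart of SMOTE can be sketched in a few lines of NumPy (a minimal illustration of the usual k-nearest-neighbour formulation; production implementations such as the one in imbalanced-learn handle many more details):

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Minimal SMOTE sketch: interpolate between a minority sample
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # indices of the k nearest neighbours of each minority sample
    nn = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                      # random minority sample
        j = nn[i, rng.integers(min(k, n - 1))]   # one of its neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, the generated data stays inside the minority region, which is also what produces the overlap the abstract warns about when minority samples sit near the majority class.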

2021, Vol 21 (1)
Author(s): Mariem Gandouz, Hajo Holzmann, Dominik Heider

Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, and therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision even when confidence is low. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data, which is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples than symmetric abstention and thus should be preferred for imbalanced data.
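A reject option with asymmetric thresholds can be sketched as follows; the threshold values here are illustrative, not taken from the paper:

```python
import numpy as np

def predict_with_abstention(proba_pos, t_low=0.4, t_high=0.7):
    """Reject-option sketch: abstain when the positive-class probability
    falls inside the (possibly asymmetric) interval (t_low, t_high).
    Returns 1, 0, or -1 for 'abstain'. Thresholds are illustrative."""
    proba_pos = np.asarray(proba_pos)
    out = np.full(proba_pos.shape, -1, dtype=int)   # default: abstain
    out[proba_pos >= t_high] = 1                    # confident positive
    out[proba_pos <= t_low] = 0                     # confident negative
    return out
```

A symmetric interval would set `t_low = 0.5 - w` and `t_high = 0.5 + w`; the asymmetric variant shifts or widens one side of the interval, which lets the rejection region track the skewed class distribution.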


2019, Vol 2019, pp. 1-13
Author(s): Wenhao Xie, Gongqian Liang, Zhonghui Dong, Baoyu Tan, Baosheng Zhang

Imbalanced data refers to data in which at least one class is outnumbered by the others. Imbalanced data sets exist widely in the real world, and their classification has become one of the hottest issues in data mining. At present, classification solutions for imbalanced data sets are mainly algorithm-level or data-level. On the data level, both oversampling and undersampling strategies are used to balance the data via data reconstruction. SMOTE and Random-SMOTE are two classic oversampling algorithms, but they still have drawbacks such as blind interpolation and fuzzy class boundaries. In this paper, an improved oversampling algorithm based on a sample-selection strategy for imbalanced data classification is proposed. On the basis of the Random-SMOTE algorithm, the support vectors (SV) are extracted and treated as the parent samples from which new minority-class examples are synthesized in order to balance the data. Lastly, the imbalanced data sets are classified with the SVM classification algorithm. The F-measure, G-mean, ROC curve, and AUC are selected as performance evaluation indexes. Experimental results show that the improved algorithm achieves good classification performance on imbalanced data sets.
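A hedged sketch of the idea, with my own function name and toy settings: fit an SVM, take the minority-class support vectors as parents, and apply Random-SMOTE-style two-step interpolation around them:

```python
import numpy as np
from sklearn.svm import SVC

def sv_parent_oversample(X, y, minority=1, n_new=50, rng=0):
    """Sketch of SV-guided Random-SMOTE (an approximation of the paper's
    method): minority-class support vectors serve as parent samples."""
    rng = np.random.default_rng(rng)
    svm = SVC(kernel="rbf").fit(X, y)
    # support vectors belonging to the minority class
    sv = svm.support_vectors_[y[svm.support_] == minority]
    X_min = X[y == minority]
    new = []
    for _ in range(n_new):
        p = sv[rng.integers(len(sv))]                  # parent: a support vector
        y1, y2 = X_min[rng.integers(len(X_min), size=2)]
        tmp = y1 + rng.random() * (y2 - y1)            # temporary point (Random-SMOTE step 1)
        new.append(p + rng.random() * (tmp - p))       # interpolate parent toward it (step 2)
    return np.array(new)
```

Using support vectors as parents concentrates synthesis near the class boundary, which is where the SVM decision function is actually shaped, instead of interpolating blindly across the whole minority region.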


2021, Vol 25 (5), pp. 1169-1185
Author(s): Deniu He, Hong Yu, Guoyin Wang, Jie Li

This paper considers the problem of initializing active learning, in particular in an imbalanced-data scenario, referred to as class-imbalance active learning cold start. The proposed method is a two-stage clustering-based active learning cold-start (ALCS) approach. In the first stage, to separate the instances of the minority class from those of the majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, grouping the data into multiple clusters. In the second stage, the initial training instances are selected from each cluster through an adaptive mechanism for determining candidate representative instances and a clusters-cyclic instance query mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on subsequent active learning.
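A simplified sketch of clustering-based cold start (plain k-means stands in for the paper's multi-center clustering, and nearest-to-centroid ordering stands in for its adaptive representative selection):

```python
import numpy as np
from sklearn.cluster import KMeans

def cold_start_queries(X, n_clusters=5, budget=10, rng=0):
    """Sketch of clustering-based cold start: cluster the unlabeled pool,
    then query instances cluster-by-cluster in a cyclic fashion, taking
    the sample closest to each centroid first."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=rng).fit(X)
    # per cluster, order members by distance to the centroid
    order = {}
    for c in range(n_clusters):
        idx = np.flatnonzero(km.labels_ == c)
        d = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        order[c] = list(idx[np.argsort(d)])
    # cyclic query: one instance from each cluster per round
    queries, round_ = [], 0
    while len(queries) < budget and round_ < len(X):
        for c in range(n_clusters):
            if round_ < len(order[c]) and len(queries) < budget:
                queries.append(order[c][round_])
        round_ += 1
    return queries
```

The cyclic order is what gives class coverage without labels: even if the minority class occupies only one cluster, it still contributes an instance in every round.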


2013, Vol 443, pp. 741-745
Author(s): Hu Li, Peng Zou, Wei Hong Han, Rong Ze Xia

Much real-world data is imbalanced, i.e., one category contains significantly more samples than the others. Traditional classification methods treat the categories equally and are therefore often ineffective. Based on a comprehensive analysis of existing research, we propose a new clustering-based classification method for imbalanced data. The method first clusters both the majority class and the minority class. Then the clustered minority class is oversampled by SMOTE, while the clustered majority class is undersampled randomly. Through clustering, the proposed method avoids the loss of useful information during resampling. Experiments on several UCI datasets show that the proposed method can effectively improve classification results on imbalanced data.
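The cluster-then-resample idea can be sketched as follows (k-means plus within-cluster interpolation as a SMOTE stand-in; the cluster counts and the balanced target size are arbitrary choices of mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_resample(X_maj, X_min, k_maj=3, k_min=2, rng=0):
    """Sketch: undersample each majority cluster at random and oversample
    each minority cluster by interpolating within the same cluster."""
    rng = np.random.default_rng(rng)
    target = (len(X_maj) + len(X_min)) // 2          # balanced size (arbitrary)
    # undersample the majority class proportionally per cluster
    lab = KMeans(n_clusters=k_maj, n_init=10, random_state=0).fit_predict(X_maj)
    keep = []
    for c in range(k_maj):
        idx = np.flatnonzero(lab == c)
        n_keep = max(1, int(round(target * len(idx) / len(X_maj))))
        keep.extend(rng.choice(idx, size=min(n_keep, len(idx)), replace=False))
    X_maj_new = X_maj[np.array(keep)]
    # oversample the minority class within clusters (SMOTE-like step)
    lab = KMeans(n_clusters=k_min, n_init=10, random_state=0).fit_predict(X_min)
    new = list(X_min)
    while len(new) < target:
        c = rng.integers(k_min)
        idx = np.flatnonzero(lab == c)
        if len(idx) < 2:
            continue
        a, b = X_min[rng.choice(idx, size=2, replace=False)]
        new.append(a + rng.random() * (b - a))
    return X_maj_new, np.array(new)
```

Resampling per cluster is what preserves information: every majority cluster keeps some representatives, and minority interpolation never crosses cluster boundaries, so small sub-concepts are neither discarded nor blurred together.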


2015, Vol 2015, pp. 1-8
Author(s): Yonghua Xie, Yurong Liu, Qingqiu Fu

For SVM classification of imbalanced sand-dust storm data sets, this paper proposes a hybrid self-adaptive sampling method named the SRU-AIBSMOTE algorithm. The method adaptively adjusts its neighbor-selection strategy based on the internal distribution of the sample sets. It produces virtual minority-class instances through randomized interpolation in the spherical space spanned by minority-class instances and their neighbors. Random undersampling is also applied to the majority-class instances to remove redundant data from the sample sets. Comparative experimental results on real data sets from the Yanchi and Tongxin districts in Ningxia, China, show that the SRU-AIBSMOTE method obtains better classification performance than several traditional classification methods.
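One plausible reading of "randomized interpolation in the spherical space" can be sketched as follows (an assumption on my part, not the paper's exact procedure):

```python
import numpy as np

def spherical_oversample(X_min, n_new, k=5, rng=0):
    """Sketch: for a minority sample, draw a synthetic point uniformly at
    random inside the sphere whose radius is the distance to a randomly
    chosen one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n, dim = X_min.shape
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        r = d[i, nn[i, rng.integers(min(k, n - 1))]]   # radius: distance to a neighbour
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)         # random unit vector
        # radius scaled by u^(1/dim) gives a uniform draw inside the ball
        out.append(X_min[i] + rng.random() ** (1 / dim) * r * direction)
    return np.array(out)
```

Unlike segment-based SMOTE, sampling inside a ball can place synthetic points in any direction around the base sample, which matches the abstract's emphasis on the internal distribution of the sample set.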


Algorithms, 2021, Vol 14 (2), pp. 54
Author(s): Chen Fu, Jianhua Yang

The problem of classification for imbalanced datasets is frequently encountered in practical applications. The data to be classified are skewed, i.e., the samples of one class (the minority class) are far fewer than those of the other classes (the majority classes). When dealing with imbalanced datasets, most classifiers share a common limitation: they obtain better classification performance on the majority classes than on the minority class. To alleviate this limitation, this study proposes a fuzzy rule-based modeling approach using information granules. Information granules, as entities derived and abstracted from data, can describe and capture the characteristics (distribution and structure) of data from both the majority and minority classes. Since the geometric characteristics of information granules depend on the distance measures used in the granulation process, the main idea of this study is to construct information granules on each class of imbalanced data using Minkowski distance measures and then to establish classification models using "If-Then" rules. Experimental results on synthetic and publicly available datasets show that the proposed Minkowski distance-based method can produce information granules with a variety of geometric shapes and construct granular models with satisfactory classification performance for imbalanced datasets.
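A toy stand-in for a Minkowski-distance granular classifier might look like this (one prototype plus a radius per class; the paper's granulation process is considerably richer, and all names here are mine):

```python
import numpy as np

def minkowski_granule_classifier(X_train, y_train, X_test, p=2.0):
    """Sketch of a granule-style rule: represent each class by a prototype
    (its median) and a radius, then classify by Minkowski distance to each
    prototype scaled by that class's granule size."""
    classes = np.unique(y_train)
    protos, radii = [], []
    for c in classes:
        Xc = X_train[y_train == c]
        proto = np.median(Xc, axis=0)
        dist = np.sum(np.abs(Xc - proto) ** p, axis=1) ** (1 / p)
        protos.append(proto)
        radii.append(np.median(dist) + 1e-12)          # granule "size"
    preds = []
    for x in X_test:
        # If distance-to-granule is smallest for class c, Then predict c
        scores = [np.sum(np.abs(x - pr) ** p) ** (1 / p) / r
                  for pr, r in zip(protos, radii)]
        preds.append(classes[int(np.argmin(scores))])
    return np.array(preds)
```

Scaling by the per-class radius is what keeps the minority class competitive: its granule is typically tight, so points near it score well even though the majority granule covers far more volume. Varying `p` changes the granule geometry from diamond-like (p=1) to box-like (large p).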


Classification is a supervised learning task that categorizes instances into groups on the basis of class labels; algorithms are trained with labeled datasets to accomplish it, so the dataset plays an important role. If, in a dataset, instances of one label/class (the majority class) greatly outnumber instances of another label/class (the minority class), so that it becomes hard for a classifier to learn the characteristics of the minority class, the dataset is termed imbalanced. Such datasets cause biased prediction or misclassification in the real world: models trained on them may achieve very high accuracy during training but, being unfamiliar with minority-class instances, fail to predict the minority class. This paper presents a survey of techniques proposed for handling imbalanced data and discusses a comparison of the techniques based on the F-measure.
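The F-measure used for such comparisons combines precision and recall on the minority (positive) class, which plain accuracy hides:

```python
import numpy as np

def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R) for the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

A classifier that always predicts the majority class scores high accuracy but an F-measure of 0 on the minority class, which is why the surveyed techniques are compared on F-measure rather than accuracy.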


Author(s): Yuguang Yan, Mingkui Tan, Yanwu Xu, Jiezhang Cao, Michael Ng, ...

The issue of data imbalance occurs in many real-world applications, especially in medical diagnosis, where normal cases usually far outnumber abnormal cases. To alleviate this issue, one of the most important approaches is oversampling, which synthesizes minority-class samples to balance the class sizes. However, existing methods barely consider the global geometric information in the distribution of minority-class samples and thus may incur a distribution mismatch between real and synthetic samples. In this paper, relying on optimal transport (Villani 2008), we propose an oversampling method that exploits the global geometric information of the data to make synthetic samples follow a distribution similar to that of the minority-class samples. Moreover, we introduce a novel regularization based on the synthetic samples and shift the distribution of the minority-class samples according to the loss information. Experiments on toy and real-world data sets demonstrate the efficacy of the proposed method in terms of multiple metrics.
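The distribution-matching intuition can be illustrated with a discrete optimal-transport step via SciPy's `linear_sum_assignment` (purely illustrative; the paper's formulation is more involved):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_align_synthetic(X_min, X_syn):
    """Sketch: pair each synthetic sample with a real minority sample via a
    minimum-cost assignment (a discrete optimal-transport problem) and pull
    it halfway toward its match, reducing distribution mismatch."""
    cost = np.linalg.norm(X_syn[:, None] - X_min[None, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # global minimum-cost pairing
    X_out = X_syn.copy()
    X_out[rows] = X_syn[rows] + 0.5 * (X_min[cols] - X_syn[rows])
    return X_out
```

Because the assignment minimizes the total cost jointly over all pairs, it uses the global geometry of the minority distribution, unlike per-sample nearest-neighbour moves.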


This chapter addresses the challenges, opportunities, and scope for formulating and designing new procedures for the imbalanced classification problem, which poses a challenge to predictive modeling: many AI, ML, and DL algorithms extensively used for classification are designed under the assumption of an equal number of examples per class. This leads to poor efficiency and performance, especially on the minority class, which is crucial, sensitive to classification errors, and of utmost importance in imbalanced classification. The chapter gives novel and deep insights into unequal class distributions in training datasets. Most real-world classification tasks involve an imbalanced distribution and therefore need specialized techniques to build more sophisticated models with minimal errors and improved performance.


Symmetry, 2020, Vol 12 (8), pp. 1204
Author(s): Wei Hao, Feng Liu

To quickly and effectively identify axle box bearing faults in high-speed electric multiple units (EMUs), an evolutionary online sequential extreme learning machine (OS-ELM) fault diagnosis method for imbalanced data was proposed. In this scheme, the resampling scale is first determined according to an empirical resampling formula; the K-means synthetic minority oversampling technique (SMOTE) is then used to oversample the minority-class samples, a Euclidean distance-based method is applied to undersample the majority-class samples, and complex data features are extracted from the reconstructed dataset. Second, the reconstructed dataset is fed into the diagnosis model. Finally, the artificial bee colony (ABC) algorithm is used to globally optimize the combination of input weights, hidden-layer biases, and the number of hidden-layer nodes of the OS-ELM, allowing the diagnosis model to evolve. The proposed method was tested on axle box bearing monitoring data from high-speed EMUs, in which the position of the axle box bearings was symmetrical. Numerical testing showed that the method detects faults faster and achieves higher classification performance on the minority-class data than other standard and classical algorithms.
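A minimal batch ELM conveys the core of the diagnosis model (the online sequential update and the ABC hyperparameter search are omitted; all settings here are illustrative):

```python
import numpy as np

def elm_train_predict(X_train, y_train, X_test, n_hidden=50, rng=0):
    """Minimal (batch) extreme learning machine sketch: random input
    weights, sigmoid hidden layer, output weights by least squares."""
    rng = np.random.default_rng(rng)
    d = X_train.shape[1]
    W = rng.normal(size=(d, n_hidden))              # random input weights (never trained)
    b = rng.normal(size=n_hidden)                   # random hidden biases (never trained)
    H = 1.0 / (1.0 + np.exp(-(X_train @ W + b)))    # hidden-layer activations
    beta = np.linalg.pinv(H) @ y_train              # least-squares output weights
    H_test = 1.0 / (1.0 + np.exp(-(X_test @ W + b)))
    return H_test @ beta
```

Only `beta` is learned, via a single pseudo-inverse, which is why ELM training is fast; OS-ELM replaces the batch pseudo-inverse with a recursive update so new monitoring samples can be absorbed one chunk at a time, and the ABC search then tunes `W`, `b`, and `n_hidden`.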

