SMOTE: Synthetic Minority Over-sampling Technique

2002 ◽  
Vol 16 ◽  
pp. 321-357 ◽  
Author(s):  
N. V. Chawla ◽  
K. W. Bowyer ◽  
L. O. Hall ◽  
W. P. Kegelmeyer

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class, and better than varying the loss ratios in Ripper or the class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
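
The core of the method, interpolating new synthetic examples between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code; the function name and parameters are our own.

```python
# A minimal sketch of SMOTE's core idea: for each synthetic point, pick a
# random minority sample, pick one of its k nearest minority neighbors, and
# interpolate at a random position between them.
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic samples from minority-class matrix X_min.
    Assumes X_min has more than k rows."""
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest minority neighbors
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_min))            # a random minority sample
        nn = neighbors[j, rng.integers(k)]      # one of its k neighbors
        gap = rng.random()                      # interpolation factor in [0, 1]
        synthetic[i] = X_min[j] + gap * (X_min[nn] - X_min[j])
    return synthetic
```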

2018 ◽  
Vol 10 (11) ◽  
pp. 1689 ◽  
Author(s):  
Min Ji ◽  
Lanfa Liu ◽  
Manfred Buchroithner

Earthquakes are among the most devastating natural disasters threatening human life. It is vital to retrieve the building damage status for planning rescue and reconstruction after an earthquake. In cases when the number of completely collapsed buildings is far less than that of intact or less-affected buildings (e.g., the 2010 Haiti earthquake), it is difficult for the classifier to learn the minority class samples, due to the imbalanced learning problem. In this study, a convolutional neural network (CNN) was utilized to identify collapsed buildings from post-event satellite imagery with the proposed workflow. Producer accuracy (PA), user accuracy (UA), overall accuracy (OA), and Kappa were used as evaluation metrics. To overcome the imbalance problem, random over-sampling, random under-sampling, and cost-sensitive methods were tested on selected test A and test B regions. The results demonstrated that building collapse information can be retrieved from post-event imagery. SqueezeNet performed well in classifying collapsed and non-collapsed buildings, and achieved an average OA of 78.6% for the two test regions. After the balancing steps, the average Kappa value improved from 41.6% to 44.8% with the cost-sensitive approach. Moreover, the cost-sensitive method showed better performance in discriminating collapsed buildings, with a PA value of 51.2% for test A and 61.1% for test B. Therefore, a suitable balancing method should be considered when facing an imbalanced dataset to retrieve the distribution of collapsed buildings.
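
One common realization of the cost-sensitive alternative tested here is to weight the training loss so that errors on the rare "collapsed" class cost more. The Keras-style sketch below is an assumption about the mechanics, not the paper's actual SqueezeNet setup; the toy data, patch size, and weight of 5.0 are placeholders.

```python
# Cost-sensitive training via class weights: the loss on the minority
# "collapsed" class (label 1) is scaled up, so misclassifying it costs more.
import numpy as np
from tensorflow import keras

# Toy stand-in data: 100 image patches, ~20% minority class.
X_train = np.random.rand(100, 64, 64, 3).astype("float32")
y_train = (np.random.rand(100) < 0.2).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 3)),          # placeholder patch size
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# If collapsed buildings are ~1/5 of the data, weight them ~5x (illustrative).
model.fit(X_train, y_train, epochs=5,
          class_weight={0: 1.0, 1: 5.0})
```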


Author(s):  
Liwei Fan ◽  
Kim Leng Poh

A Bayesian Network (BN) couples a graph structure with probability distributions. In the past, BNs were mainly used for knowledge representation and reasoning. Recent years have seen numerous successful applications of BNs in classification, among which the Naïve Bayes classifier was found to be surprisingly effective in spite of its simple mechanism (Langley, Iba & Thompson, 1992). It is built upon the strong assumption that different attributes are independent of each other. Despite its many advantages, a major limitation of the Naïve Bayes classifier is that real-world data may not always satisfy the independence assumption among attributes. This strong assumption can make the prediction accuracy of the Naïve Bayes classifier highly sensitive to correlated attributes. To overcome this limitation, many approaches have been developed to improve the performance of the Naïve Bayes classifier. This article gives a brief introduction to approaches that attempt to relax the independence assumption among attributes, or that use certain pre-processing procedures to make the attributes as independent of each other as possible. Previous theoretical and empirical results have shown that the performance of the Naïve Bayes classifier can be improved significantly by these approaches, although the computational complexity also increases to a certain extent.
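
To make the independence assumption concrete, a minimal Gaussian naive Bayes can be written so that the class-conditional log-likelihood is an explicit sum over attributes. The class below is our own illustration, not from the article; Gaussian per-attribute densities are assumed purely for simplicity.

```python
# The "naive" part: log P(c | x) ∝ log P(c) + Σ_j log P(x_j | c),
# i.e. the joint likelihood factorizes attribute by attribute.
import numpy as np

class TinyGaussianNB:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # Per-attribute Gaussian log-densities, summed over attributes: this
        # sum IS the independence assumption.
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_[:, None, :])
                          + (X[None] - self.mu_[:, None, :]) ** 2
                          / self.var_[:, None, :])
        joint = np.log(self.priors_)[:, None] + log_lik.sum(axis=-1)
        return self.classes_[joint.argmax(axis=0)]
```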


Author(s):  
MURAT KURTCEPHE ◽  
H. ALTAY GÜVENIR

Many machine learning algorithms require the features to be categorical. Hence, they require all numeric-valued data to be discretized into intervals. In this paper, we present a new discretization method based on the area under the receiver operating characteristic (ROC) curve (AUC). Maximum area under ROC curve-based discretization (MAD) is a global, static and supervised discretization method. MAD uses the sorted order of the continuous values of a feature and discretizes the feature in such a way that the AUC based on that feature is maximized. The proposed method is compared with alternative discretization methods such as ChiMerge, Entropy-Minimum Description Length Principle (MDLP), Fixed Frequency Discretization (FFD), and Proportional Discretization (PD). FFD and PD have been proposed recently and are designed for Naïve Bayes learning. ChiMerge, like MAD, is a merging discretization method. Evaluations are performed in terms of the M-Measure, an AUC-based metric for multi-class classification, and accuracy values obtained from the Naïve Bayes and Aggregating One-Dependence Estimators (AODE) algorithms on real-world datasets. Empirical results show that MAD is a strong candidate to be a good alternative to other discretization methods.
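
A simplified, single-cut sketch of the MAD criterion for a binary class might look as follows. The real method selects multiple cut points over the sorted values; this one-cut version, written by us around scikit-learn's roc_auc_score, only illustrates the "pick the split that maximizes AUC" idea.

```python
# Scan candidate cut points (midpoints between sorted unique values) and keep
# the one whose induced two-bin discretization maximizes AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def best_single_cut(x, y):
    xs = np.unique(x)                            # sorted unique feature values
    candidates = (xs[:-1] + xs[1:]) / 2          # midpoints between neighbors
    best_auc, best_cut = 0.5, None
    for cut in candidates:
        binned = (x > cut).astype(int)           # two-interval discretization
        auc = roc_auc_score(y, binned)
        auc = max(auc, 1 - auc)                  # orientation-free AUC
        if auc > best_auc:
            best_auc, best_cut = auc, cut
    return best_cut, best_auc
```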


2017 ◽  
Vol 3 (1) ◽  
pp. 1-6
Author(s):  
Ahmad Ilham

The class-imbalance problem has an adverse effect on the accuracy of data prediction. To handle this problem, much prior research has applied classification algorithms to imbalanced-class data. This study presents under-sampling and over-sampling techniques for handling imbalanced class data. These techniques are applied at the preprocessing stage to balance the class distribution in the data. Experimental results show that the neural network (NN) outperforms the decision tree (DT), linear regression (LR), naïve Bayes (NB), and support vector machine (SVM).
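
For concreteness, the two preprocessing techniques this study compares can be sketched in plain NumPy; the helper functions below are illustrative, not the study's code.

```python
# Random over-sampling duplicates minority rows; random under-sampling drops
# majority rows. Both aim for a balanced class distribution before training.
import numpy as np

def random_oversample(X, y, rng=None):
    """Duplicate rows of smaller classes at random until all classes match
    the largest class."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n_max,
                                     replace=True) for c in classes])
    return X[idx], y[idx]

def random_undersample(X, y, rng=None):
    """Keep a random subset of each class the size of the smallest class."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n_min,
                                     replace=False) for c in classes])
    return X[idx], y[idx]
```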


2020 ◽  
Vol 8 (2) ◽  
pp. 89-93 ◽  
Author(s):  
Hairani Hairani ◽  
Khurniawan Eko Saputro ◽  
Sofiansyah Fadli

An imbalanced class distribution in a dataset biases classification results toward the class with the largest amount of data (the majority class). A sampling method is needed to boost the minority class (positive class) so that the class distribution becomes balanced, leading to better classification results. This study was conducted to overcome the imbalanced class problem on the Pima Indian diabetes dataset using k-means-SMOTE. The dataset has 268 instances of the positive class (minority class) and 500 instances of the negative class (majority class). The classification was done by comparing C4.5, SVM, and naïve Bayes while applying k-means-SMOTE for data sampling. With k-means-SMOTE, the SVM classification method achieved the highest accuracy and sensitivity, 82% and 77% respectively, while the naïve Bayes method produced the highest specificity of 89%.
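
A hedged usage sketch with the imbalanced-learn library's KMeansSMOTE follows; the study's exact implementation and parameters are not stated, so the settings and toy data below are assumptions and typically need tuning per dataset.

```python
# k-means-SMOTE clusters the data first, then applies SMOTE inside clusters
# that contain enough minority samples, avoiding synthetic points in
# majority-dominated regions.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from imblearn.over_sampling import KMeansSMOTE

# Toy stand-in with the Pima class ratio: 500 negative vs. 268 positive.
X, y = make_classification(n_samples=768, weights=[500 / 768], random_state=0)

# cluster_balance_threshold=0.1 is an assumption, not the paper's setting.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_bal, y_bal = sampler.fit_resample(X, y)

clf = SVC().fit(X_bal, y_bal)   # SVM gave the best accuracy/sensitivity here
```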


2018 ◽  
Author(s):  
Ahmad Ilham

Today, real-world data from many sources frequently contain imbalanced classes. The class-imbalance problem can have an adverse effect on a classification method's prediction accuracy. To handle this problem, much prior research has applied classification algorithms to imbalanced-class data. This study presents under-sampling and over-sampling techniques for handling imbalanced class data. These techniques are applied at the preprocessing stage to balance the class distribution in the data. Experimental results show that the neural network (NN) outperforms the decision tree (DT), linear regression (LR), naïve Bayes (NB), and support vector machine (SVM).
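
The classifier comparison described here can be reproduced in outline with scikit-learn stand-ins. "LR" is read as logistic regression, and the toy dataset and F1 scoring are our assumptions, since the abstract does not specify them; in practice the resampling step sketched earlier would be applied to the training folds first.

```python
# Compare the five classifiers from the abstract on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
models = {
    "NN":  MLPClassifier(max_iter=1000, random_state=0),
    "DT":  DecisionTreeClassifier(random_state=0),
    "LR":  LogisticRegression(max_iter=1000),
    "NB":  GaussianNB(),
    "SVM": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```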


Author(s):  
M. Aldiki Febriantono ◽  
Sholeh Hadi Pramono ◽  
Rahmadwati Rahmadwati ◽  
Golshah Naghdy

Multiclass imbalanced data problems are currently an interesting topic of study in data mining. These problems affect the classification process in machine learning. In some cases, the minority class in a dataset carries more important information than the majority class, and when the minority class is misclassified, accuracy and classifier performance suffer. In this research, a cost-sensitive C5.0 decision tree was used to solve multiclass imbalanced data problems. In the first stage, a decision tree model is built with the C5.0 algorithm; cost-sensitive learning then uses the MetaCost method to obtain the minimum-cost model. Test results show the C5.0 algorithm performing better than the C4.5 and ID3 algorithms, with performance percentages of 40.91%, 40.24%, and 19.23% for C5.0, C4.5, and ID3, respectively.
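
The MetaCost stage described here, relabeling training instances to their minimum-expected-cost class using bagged probability estimates and then retraining a single tree, can be sketched as below. scikit-learn's CART stands in for C5.0, and labels are assumed to be encoded 0..C-1; both are our substitutions, not the paper's setup.

```python
# MetaCost in outline: (1) bag an ensemble to estimate P(class | x) for each
# training instance, (2) relabel each instance to the class with minimum
# expected cost, (3) retrain one tree on the relabeled data.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def metacost(X, y, cost_matrix, rng=0):
    """cost_matrix[i, j] = cost of predicting class j when the truth is i.
    Assumes y takes values 0..C-1 matching the cost matrix indices."""
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            random_state=rng).fit(X, y)
    proba = bag.predict_proba(X)             # P(class | x) per training row
    expected_cost = proba @ cost_matrix      # expected cost of each label
    y_relabel = expected_cost.argmin(axis=1) # min-expected-cost relabeling
    return DecisionTreeClassifier(random_state=rng).fit(X, y_relabel)
```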


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 2122-2133 ◽  
Author(s):  
Christos K. Aridas ◽  
Stamatis Karlos ◽  
Vasileios G. Kanas ◽  
Nikos Fazakis ◽  
Sotiris B. Kotsiantis

Author(s):  
LIANGXIAO JIANG ◽  
DIANHONG WANG ◽  
HARRY ZHANG ◽  
ZHIHUA CAI ◽  
BO HUANG

Improving naive Bayes (simply NB) [15, 28] for classification has received significant attention. Related work can be broadly divided into two approaches: eager learning and lazy learning [1]. Different from eager learning, the key idea for extending naive Bayes using lazy learning is to learn an improved naive Bayes for each test instance. In recent years, several lazy extensions of naive Bayes have been proposed, for example LBR [30], SNNB [27], and LWNB [8]. All these algorithms aim to improve naive Bayes' classification performance. Indeed, they achieve significant improvement in terms of classification, measured by accuracy. In many real-world data mining applications, however, an accurate ranking is more desirable than an accurate classification. Thus a natural question is whether they also achieve significant improvement in terms of ranking, measured by AUC (the area under the ROC curve) [2, 11, 17]. Responding to this question, we conduct experiments on the 36 UCI data sets [18] selected by Weka [12] to investigate their ranking performance, and find that they do not significantly improve the ranking performance of naive Bayes. Aiming at scaling up naive Bayes' ranking performance, we present a novel lazy method, ICNB (instance cloned naive Bayes), and develop three ICNB algorithms using different instance cloning strategies. We empirically compare them with naive Bayes. The experimental results show that our algorithms achieve significant improvement in terms of AUC. Our research provides a simple but effective method for applications where an accurate ranking is desirable.
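
One plausible reading of instance cloning, sketched below, is to duplicate the training instances nearest to each test instance before fitting a per-instance naive Bayes. The paper defines three specific cloning strategies that this sketch does not reproduce; the function, its parameters, and the Gaussian NB stand-in are all our own illustration.

```python
# Lazy, per-test-instance NB: clone the test point's nearest training
# neighbors so the local neighborhood dominates the fitted distribution.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def icnb_predict(X_train, y_train, x_test, k=10, n_clones=5):
    # Find the k training instances most similar to the test instance.
    d = np.linalg.norm(X_train - x_test, axis=1)
    nn = np.argsort(d)[:k]
    # Clone those neighbors to locally re-weight the training distribution.
    X_aug = np.vstack([X_train] + [X_train[nn]] * n_clones)
    y_aug = np.concatenate([y_train] + [y_train[nn]] * n_clones)
    nb = GaussianNB().fit(X_aug, y_aug)      # model built per test instance
    return nb.predict_proba(x_test.reshape(1, -1))[0]
```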


Many fraud transactions exist in the online world that affect various financial institutions, but credit card fraud is the most frequent problem worldwide. Credit card fraud is the situation in which fraudsters misuse credit cards for illegal purposes. Hence, detection of fraudulent transactions is essential. Several researchers have worked on detecting fraud transactions and have provided solutions, which are surveyed in this paper. This study makes a major contribution to research on the detection of credit card fraud transactions through machine learning algorithms such as Decision Tree and Naive Bayes. The data were selected from Kaggle and split into training (80%) and testing (20%) sets. The whole experiment was performed in the Jupyter Notebook tool, installed via the Anaconda Navigator. A heatmap is used to visualize the data in color. The main aim of this work is to balance the dataset with the Near-Miss under-sampling method. The information gain method is applied for feature selection. The best algorithm found in this paper is the Decision Tree, with 97% accuracy, compared to Naïve Bayes with 90%. The results are reported in terms of accuracy, recall, precision, and F1-score. We also show the ROC curve and precision-recall curve of the algorithms.
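
In outline, the pipeline described here (Near-Miss under-sampling followed by a decision tree) might be assembled with imbalanced-learn and scikit-learn as follows; the synthetic data merely stands in for the Kaggle credit card dataset, and the 80/20 split matches the abstract.

```python
# Balance only the training split with NearMiss, then train and evaluate a
# decision tree on the untouched test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from imblearn.under_sampling import NearMiss

# Toy stand-in: heavily imbalanced binary data (~2% "fraud").
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

X_bal, y_bal = NearMiss(version=1).fit_resample(X_tr, y_tr)  # under-sample
clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te)))        # P/R/F1 report
```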

