Learning from Imbalanced Data in Classification

Imbalanced data learning is a research area and day by day development is going on. Due to these researchers are motivated to pay attention to find efficient and adaptive methods for real-world problems. Machine learning, as well as data mining, is a field where researchers are finding different methods to solve problems related to imbalanced datasets and also the challenges faced in day to day life. The uneven class distribution in the dataset is the reason behind the degradation of performance in approaches used by data mining as well as machine learning. Continuous advancements of machine learning as well as mining data combining it with big data, a deep insight is required to understand the nature of learning imbalanced data. New challenges are emerging due to this development. Among the two approaches algorithm level and data level, the most popular approach compared to this is the hybrid approach. It is found that there is a bias for the majority class which affects the decision making task and overall accuracy of classification. The ensemble method is an efficient technique to deal with the uneven distribution of data. The aim of the paper is to presents the overview of class imbalance problems, solutions to handle it, open issues and challenges in learning imbalanced datasets. Based on the experiment conducted on one dataset it is found that ensemble technique along with other data-level methods gives good results. This hybrid method can be applied in many real-life applications like software defect prediction, behavior analysis, intrusion detection, medical diagnosis, etc. The paper further provides research directions in learning from the imbalanced dataset.

Download Full-text

Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Mathematics ◽

10.3390/math9090936 ◽

2021 ◽

Vol 9 (9) ◽

pp. 936

Author(s):

Jianli Shao ◽

Xin Liu ◽

Wenqing He

Keyword(s):

Machine Learning ◽

Spatial Association ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Kernel Functions ◽

Support Vector ◽

Classification Problems ◽

Rare Class ◽

Data Adaptive

Imbalanced data exist in many classification problems. The classification of imbalanced data has remarkable challenges in machine learning. The support vector machine (SVM) and its variants are popularly used in machine learning among different classifiers thanks to their flexibility and interpretability. However, the performance of SVMs is impacted when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM by considering class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of the commonly used accuracy measures such as the F-score and G-means, but also successfully detects more than 60% of instances from the rare class in the real data, while the competitors can only detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.

Download Full-text

Integration of synthetic minority oversampling technique for imbalanced class

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i1.pp102-108 ◽

2019 ◽

Vol 13 (1) ◽

pp. 102

Author(s):

Noviyanti Santoso ◽

Wahyu Wibowo ◽

Hilda Hikmawati

Keyword(s):

Machine Learning ◽

Data Mining ◽

Support Vector Machine ◽

Class Imbalance ◽

Original Data ◽

Support Vector ◽

Classification Methods ◽

Problematic Issue ◽

Imbalanced Class ◽

F Measure

In the data mining, a class imbalance is a problematic issue to look for the solutions. It probably because machine learning is constructed by using algorithms with assuming the number of instances in each balanced class, so when using a class imbalance, it is possible that the prediction results are not appropriate. They are solutions offered to solve class imbalance issues, including oversampling, undersampling, and synthetic minority oversampling technique (SMOTE). Both oversampling and undersampling have its disadvantages, so SMOTE is an alternative to overcome it. By integrating SMOTE in the data mining classification method such as Naive Bayes, Support Vector Machine (SVM), and Random Forest (RF) is expected to improve the performance of accuracy. In this research, it was found that the data of SMOTE gave better accuracy than the original data. In addition to the three classification methods used, RF gives the highest average AUC, F-measure, and G-means score.

Download Full-text

Improving Logging Prediction on Imbalanced Datasets

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2016040103 ◽

2016 ◽

Vol 7 (2) ◽

pp. 43-71 ◽

Cited By ~ 3

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Open Source ◽

Class Imbalance ◽

Learning Model ◽

Learning Models ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Machine Learning Model ◽

Machine Learning Models

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code construct. The prediction performances of these models are limited due to the class-imbalance problem since the number of logged code constructs is small as compared to non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performances of J48, RF, and SVM classifiers for catch-blocks and if-blocks logged code constructs prediction on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, LogIm model improves the performance of baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.

Download Full-text

Precision-Recall versus Accuracy and the Role of Large Data Sets

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014039 ◽

2019 ◽

Vol 33 ◽

pp. 4039-4048 ◽

Cited By ~ 8

Author(s):

Brendan Juba ◽

Hai S. Le

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Large Data ◽

Constant Factor ◽

Data Sets ◽

Data Set ◽

Small Constant ◽

Classifier Performance ◽

Necessary And Sufficient

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider the measures of classifier performance in terms of precision and recall, a measure that is widely suggested as more appropriate to the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.

Download Full-text

Exploration of Obesity Status of Indonesia Basic Health Research 2013 With Synthetic Minority Over-Sampling Techniques

Indonesian Journal of Statistics and Its Applications ◽

10.29244/ijsa.v5i1p75-91 ◽

2021 ◽

Vol 5 (1) ◽

pp. 75-91

Author(s):

Sri Astuti Thamrin ◽

Dian Sidik ◽

Hedi Kuswanto ◽

Armin Lawi ◽

Ansariadi Ansariadi

Keyword(s):

Machine Learning ◽

Health Research ◽

Class Imbalance ◽

Imbalanced Data ◽

Sampling Technique ◽

Data Sets ◽

Data Set ◽

Basic Health ◽

Obesity Status ◽

Obese Class

The accuracy of the data class is very important in classification with a machine learning approach. The more accurate the existing data sets and classes, the better the output generated by machine learning. In fact, classification can experience imbalance class data in which each class does not have the same portion of the data set it has. The existence of data imbalance will affect the classification accuracy. One of the easiest ways to correct imbalanced data classes is to balance it. This study aims to explore the problem of data class imbalance in the medium case dataset and to address the imbalance of data classes as well. The Synthetic Minority Over-Sampling Technique (SMOTE) method is used to overcome the problem of class imbalance in obesity status in Indonesia 2013 Basic Health Research (RISKESDAS). The results show that the number of obese class (13.9%) and non-obese class (84.6%). This means that there is an imbalance in the data class with moderate criteria. Moreover, SMOTE with over-sampling 600% can improve the level of minor classes (obesity). As consequence, the classes of obesity status balanced. Therefore, SMOTE technique was better compared to without SMOTE in exploring the obesity status of Indonesia RISKESDAS 2013.

Download Full-text

Arabic Authorship Attribution Using Synthetic Minority Over-sampling Technique and Principal Components Analysis for Imbalanced Documents

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001oa20 ◽

2021 ◽

Vol 15 (4) ◽

pp. 0-0

Keyword(s):

Principal Components Analysis ◽

Principal Components ◽

Average Length ◽

Hybrid Approach ◽

Class Imbalance ◽

Imbalanced Data ◽

Sampling Technique ◽

Authorship Attribution ◽

Svm Classifier ◽

Components Analysis

Nowadays, dealing with imbalanced data represents a great challenge in data mining as well as in machine learning task. In this investigation, we are interested in the problem of class imbalance in Authorship Attribution (AA) task, with specific application on Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and Synthetic Minority Over-sampling Technique (SMOTE), which considerably improve the performances of authorship attribution on imbalanced data. The used dataset contains 7 Arabic books written by 7 different scholars, which are segmented into text segments of the same size, with an average length of 2900 words per text. The obtained results of our experiments show that the proposed approach using the SMO-SVM classifier, presents high performance in terms of authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method appears quite interesting by improving the AA performances in imbalanced datasets, mainly with function words.

Download Full-text

Credit Card Fraud Detection Using Machine Learning Classification Algorithms over Highly Imbalanced Data

Issue 4 - Journal of Science and Technology ◽

10.46243/jst.2020.v5.i3.pp138-146 ◽

2020 ◽

Vol 5 (3) ◽

pp. 138-146 ◽

Cited By ~ 1

Keyword(s):

Machine Learning ◽

Data Mining ◽

Credit Card ◽

Imbalanced Data ◽

Learning Systems ◽

Classification Algorithms ◽

Machine Learning Classification ◽

Information Models ◽

Mining Methods ◽

Do So

:Most online customers use cards to pay for their purchases. As charge cards become the most mainstream strategy for installment, instances of misrepresentation relationship with it too increases. The primary goal of this venture is to be ready to perceive false exchanges from non-fake exchanges. In request to do so,primarily,data mining methods are utilized to examine the examples and attributes of deceitful and non-fake transactions.Then,machine learning systems are utilized to foresee the fake and non-fake exchanges automatically. Algorithms LR (Logistic Regression) is used. Therefore, the blend of AI and information mining procedures are utilized to distinguish the fake and non-fake exchanges by learning the examples of the information. Models are made utilizing these calculations and afterward precision,accuracy,recall are determined and an examination is made.

Download Full-text

Imbalanced data classification based on hybrid resampling and twin support vector machine

Computer Science and Information Systems ◽

10.2298/csis161221017l ◽

2017 ◽

Vol 14 (3) ◽

pp. 579-595 ◽

Cited By ~ 2

Author(s):

Lu Cao ◽

Hong Shen

Keyword(s):

Support Vector Machine ◽

Real Life ◽

Imbalanced Data ◽

Data Classification ◽

Training Data ◽

Twin Support Vector Machine ◽

Support Vector ◽

Imbalanced Datasets ◽

Minority Class ◽

Imbalanced Data Classification

Imbalanced datasets exist widely in real life. The identification of the minority class in imbalanced datasets tends to be the focus of classification. As a variant of enhanced support vector machine (SVM), the twin support vector machine (TWSVM) provides an effective technique for data classification. TWSVM is based on a relative balance in the training sample dataset and distribution to improve the classification accuracy of the whole dataset, however, it is not effective in dealing with imbalanced data classification problems. In this paper, we propose to combine a re-sampling technique, which utilizes oversampling and under-sampling to balance the training data, with TWSVM to deal with imbalanced data classification. Experimental results show that our proposed approach outperforms other state-of-art methods.

Download Full-text

Imbalance class problems in data mining: a review

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v14.i3.pp1552-1563 ◽

2019 ◽

Vol 14 (3) ◽

pp. 1552 ◽

Cited By ~ 3

Author(s):

Haseeb Ali ◽

Mohd Najib Mohd Salleh ◽

Rohmat Saedudin ◽

Kashif Hussain ◽

Muhammad Faheem Mushtaq

Keyword(s):

Machine Learning ◽

Data Mining ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

The Other ◽

Minority Class ◽

Future Directions ◽

Significant Research ◽

Comprehensive Survey ◽

Imbalanced Class

<span>The imbalanced data problems in data mining are common nowadays, which occur due to skewed nature of data. These problems impact the classification process negatively in machine learning process. In such problems, classes have different ratios of specimens in which a large number of specimens belong to one class and the other class has fewer specimens that is usually an essential class, but unfortunately misclassified by many classifiers. So far, significant research is performed to address the imbalanced data problems by implementing different techniques and approaches. In this research, a comprehensive survey is performed to identify the challenges of handling imbalanced class problems during classification process using machine learning algorithms. We discuss the issues of classifiers which endorse bias for majority class and ignore the minority class. Furthermore, the viable solutions and potential future directions are provided to handle the problems<em>.</em></span>

Download Full-text