A Novel Data Representation for Effective Learning in Class Imbalanced Scenarios

Class imbalance refers to the scenario where certain classes are highly under-represented compared to other classes in terms of the availability of training data. This situation hinders the applicability of conventional machine learning algorithms to most of the classification problems where class imbalance is prominent. Most existing methods addressing class imbalance either rely on sampling techniques or cost-sensitive learning methods; thus inheriting their shortcomings. In this paper, we introduce a novel approach that is different from sampling or cost-sensitive learning based techniques, to address the class imbalance problem, where two samples are simultaneously considered to train the classifier. Further, we propose a mechanism to use a single base classifier, instead of an ensemble of classifiers, to obtain the output label of the test sample using majority voting method. Experimental results on several benchmark datasets clearly indicate the usefulness of the proposed approach over the existing state-of-the-art techniques.

Download Full-text

A Novel Minority Cloning Technique for Cost-Sensitive Learning

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001415510040 ◽

2015 ◽

Vol 29 (04) ◽

pp. 1551004 ◽

Cited By ~ 19

Author(s):

Liangxiao Jiang ◽

Chen Qiu ◽

Chaoqun Li

Keyword(s):

Class Imbalance ◽

Training Data ◽

Class Imbalance Problem ◽

Minority Class ◽

Cost Sensitive Learning ◽

Class Distribution ◽

Imbalance Problem ◽

Cloning Technique ◽

Replacement Technique ◽

Better Than

In many real-world applications, it is often the case that the class distribution of instances is imbalanced and the costs of misclassification are different. Thus, the class-imbalanced cost-sensitive learning has attracted much attention from researchers. Sampling is one of the widely used techniques in dealing with the class-imbalance problem, which alters the class distribution of instances so that the minority class is well represented in the training data. In this paper, we propose a novel Minority Cloning Technique (MCT) for class-imbalanced cost-sensitive learning. MCT alters the class distribution of training data by cloning each minority class instance according to the similarity between it and the mode of the minority class. The experimental results on a large number of UCI datasets show that MCT performs much better than Minority Oversampling with Replacement Technique (MORT) and Synthetic Minority Oversampling TEchnique (SMOTE) in terms of the total misclassification costs of the built classifiers.

Download Full-text

A Framework for Homogeneous Cross-Project Defect Prediction

International Journal of Software Innovation ◽

10.4018/ijsi.2021010105 ◽

2021 ◽

Vol 9 (1) ◽

pp. 52-68

Author(s):

Lipika Goel ◽

Mayank Sharma ◽

Sunil Kumar Khatri ◽

D. Damodaran

Keyword(s):

Class Imbalance ◽

The Other ◽

Training Data ◽

Defect Prediction ◽

Rank Test ◽

Ensemble Model ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Proposed Model ◽

Cross Project

Often, the prior defect data of the same project is unavailable; researchers thought whether the defect data of the other projects can be used for prediction. This made cross project defect prediction an open research issue. In this approach, the training data often suffers from class imbalance problem. Here, the work is directed on homogeneous cross-project defect prediction. A novel ensemble model that will perform in dual fold is proposed. Firstly, it will handle the class imbalance problem of the dataset. Secondly, it will perform the prediction of the target class. For handling the imbalance problem, the training dataset is divided into data frames. Each data frame will be balanced. An ensemble model using the maximum voting of all random forest classifiers is implemented. The proposed model shows better performance in comparison to the other baseline models. Wilcoxon signed rank test is performed for validation of the proposed model.

Download Full-text

An Investigation of Imbalanced Ensemble Learning Methods for Cross-Project Defect Prediction

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001419590377 ◽

2019 ◽

Vol 33 (12) ◽

pp. 1959037 ◽

Cited By ~ 5

Author(s):

Shaojian Qiu ◽

Lu Lu ◽

Siyu Jiang ◽

Yang Guo

Keyword(s):

Ensemble Learning ◽

Class Imbalance ◽

Training Data ◽

Defect Prediction ◽

Class Imbalance Problem ◽

Learning Methods ◽

Imbalance Problem ◽

Intelligent Software ◽

Under Sampling ◽

Cross Project

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from the researchers of intelligent software engineering. Most existing SDP methods are performed under a within-project setting. However, there usually is little to no within-project training data to learn an available supervised prediction model for a new SDP task. Therefore, cross-project defect prediction (CPDP), which uses labeled data of source projects to learn a defect predictor for a target project, was proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on performance of the CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, especially for assessing the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods by extensive experiments on 31 open-source projects derived from five datasets. Through analyzing a total of 37504 results, we found that in most cases, the IEL method that combined under-sampling and bagging approaches will be more effective than the other investigated methods.

Download Full-text

A Novel Approach for Handling Outliers in Imbalanced Data

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.1.16783 ◽

2018 ◽

Vol 7 (3.1) ◽

pp. 1 ◽

Cited By ~ 1

Author(s):

Gillala Rekha ◽

V Krishna Reddy

Keyword(s):

Class Imbalance ◽

Training Data ◽

Classification Algorithms ◽

Main Concern ◽

Data Generation ◽

Class Imbalance Problem ◽

Class Distribution ◽

Novel Approach ◽

Real World Datasets ◽

Data Level

Most of the traditional classification algorithms assume their training data to be well-balanced in terms of class distribution. Real-world datasets, however, are imbalanced in nature thus degrade the performance of the traditional classifiers. To solve this problem, many strategies are adopted to balance the class distribution at the data level. The data level methods balance the imbalance distribution between majority and minority classes using either oversampling or under sampling techniques. The main concern of this paper is to remove the outliers that may generate while using oversampling techniques. In this study, we proposed a novel approach for solving the class imbalance problem at data level by using modified SMOTE to remove the outliers that may exist after synthetic data generation using SMOTE oversampling technique. We extensively compare our approach with SMOTE, SMOTE+ENN, SMOTE+Tomek-Link using 9 datasets from keel repository using classification algorithms. The result reveals that our approach improves the prediction performance for most of the classification algorithms and achieves better performance compared to the existing approaches.

Download Full-text

A New Diversity Technique for Imbalance Learning Ensembles

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.11251 ◽

2018 ◽

Vol 7 (2.14) ◽

pp. 478 ◽

Cited By ~ 2

Author(s):

Hartono . ◽

Opim Salim Sitompul ◽

Erna Budhiarti Nababan ◽

Tulus . ◽

Dahlan Abdullah ◽

...

Keyword(s):

Hybrid Approach ◽

Class Imbalance ◽

Machine Learning Techniques ◽

Classifier Ensembles ◽

Classification Problems ◽

Class Imbalance Problem ◽

Weighting Method ◽

Imbalance Problem ◽

Learning Ensembles ◽

Imbalance Learning

Data mining and machine learning techniques designed to solve classification problems require balanced class distribution. However, in reality sometimes the classification of datasets indicates the existence of a class represented by a large number of instances whereas there are classes with far fewer instances. This problem is known as the class imbalance problem. Classifier Ensembles is a method often used in overcoming class imbalance problems. Data Diversity is one of the cornerstones of ensembles. An ideal ensemble system should have accurrate individual classifiers and if there is an error it is expected to occur on different objects or instances. This research will present the results of overview and experimental study using Hybrid Approach Redefinition (HAR) Method in handling class imbalance and at the same time expected to get better data diversity. This research will be conducted using 6 datasets with different imbalanced ratios and will be compared with SMOTEBoost which is one of the Re-Weighting method which is often used in handling class imbalance. This study shows that the data diversity is related to performance in the imbalance learning ensembles and the proposed methods can obtain better data diversity.

Download Full-text

Experimental Study on Class Imbalance Problem Using an Oil Spill Training Data Set

British Journal of Mathematics & Computer Science ◽

10.9734/bjmcs/2017/32860 ◽

2017 ◽

Vol 21 (5) ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Xi Ouyang ◽

Yuan Chen ◽

Bing Wei

Keyword(s):

Experimental Study ◽

Oil Spill ◽

Class Imbalance ◽

Training Data ◽

Class Imbalance Problem ◽

Data Set ◽

Imbalance Problem

Download Full-text

Image Classifying Based on Cost-Sensitive Layered Cascade Learning

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.701-702.453 ◽

2014 ◽

Vol 701-702 ◽

pp. 453-458

Author(s):

Feng Huang ◽

Yun Liang ◽

Li Huang ◽

Ji Ming Yao ◽

Wen Feng Tian

Keyword(s):

Image Classification ◽

Class Imbalance ◽

Classification Performance ◽

Machine Learning Algorithms ◽

Class Imbalance Problem ◽

Data Set ◽

Misclassification Cost ◽

Imbalance Problem ◽

Specific Category ◽

The Cost

Image Classification is an important means of image processing, Traditional research of image classification usually based on following assumptions: aiming for the overall classification accuracy, sample of different category has the same importance in data set and all the misclassification brings same cost. Unfortunately, class imbalance and cost sensitive are ubiquitous in classification in real world process, sample size of specific category in data set may much more than others and misclassification cost is sharp distinction between different categories. High dimension of eigenvector caused by diversity content of images and the big complexity gap between distinguish different categories of images are common problems when dealing with image Classification, therefore, one single machine learning algorithms is not sufficient when dealing with complex image classification contains the above characteristics. To cure the above problems, a layered cascade image classifying method based on cost-sensitive and class-imbalance was proposed, a set of cascading learning was build, and the inner patterns of images of specific category was learned in different stages, also, the cost function was introduced, thus, the method can effectively respond to the cost-sensitive and class-imbalance problem of image classifying. Moreover, the structure of this method is flexible as the layer of cascading and the algorithm in every stage can be readjusted based on business requirements of image classifying. The result of application in sensitive image classifying for smart grid indicates that this image classifying based on cost-sensitive layered cascade learning obtains better image classification performance than the existing methods.

Download Full-text

Few-Shot Learning for Post-Earthquake Urban Damage Detection

Remote Sensing ◽

10.3390/rs14010040 ◽

2021 ◽

Vol 14 (1) ◽

pp. 40

Author(s):

Eftychia Koukouraki ◽

Leonardo Vanneschi ◽

Marco Painho

Keyword(s):

Damage Assessment ◽

Binary Classification ◽

Training Data ◽

Classification Problems ◽

Deep Convolutional Neural Networks ◽

Damage Classification ◽

Emergency Relief ◽

Cost Sensitive Learning ◽

Urban Structures ◽

Address Data

Among natural disasters, earthquakes are recorded to have the highest rates of human loss in the past 20 years. Their unexpected nature has severe consequences on both human lives and material infrastructure, demanding urgent action to be taken. For effective emergency relief, it is necessary to gain awareness about the level of damage in the affected areas. The use of remotely sensed imagery is popular in damage assessment applications; however, it requires a considerable amount of labeled data, which are not always easy to obtain. Taking into consideration the recent developments in the fields of Machine Learning and Computer Vision, this study investigates and employs several Few-Shot Learning (FSL) strategies in order to address data insufficiency and imbalance in post-earthquake urban damage classification. While small datasets have been tested against binary classification problems, which usually divide the urban structures into collapsed and non-collapsed, the potential of limited training data in multi-class classification has not been fully explored. To tackle this gap, four models were created, following different data balancing methods, namely cost-sensitive learning, oversampling, undersampling and Prototypical Networks. After a quantitative comparison among them, the best performing model was found to be the one based on Prototypical Networks, and it was used for the creation of damage assessment maps. The contribution of this work is twofold: we show that oversampling is the most suitable data balancing method for training Deep Convolutional Neural Networks (CNN) when compared to cost-sensitive learning and undersampling, and we demonstrate the appropriateness of Prototypical Networks in the damage classification context.

Download Full-text

Improving Detection of False Data Injection Attacks Using Machine Learning with Feature Selection and Oversampling

Energies ◽

10.3390/en15010212 ◽

2021 ◽

Vol 15 (1) ◽

pp. 212

Author(s):

Ajit Kumar ◽

Neetesh Saxena ◽

Souhwan Jung ◽

Bong Jun Choi

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Machine Learning Algorithms ◽

Skewed Distribution ◽

Critical Infrastructures ◽

Detection Accuracy ◽

Class Imbalance Problem ◽

False Data Injection ◽

Injection Attacks ◽

Imbalance Problem

Critical infrastructures have recently been integrated with digital controls to support intelligent decision making. Although this integration provides various benefits and improvements, it also exposes the system to new cyberattacks. In particular, the injection of false data and commands into communication is one of the most common and fatal cyberattacks in critical infrastructures. Hence, in this paper, we investigate the effectiveness of machine-learning algorithms in detecting False Data Injection Attacks (FDIAs). In particular, we focus on two of the most widely used critical infrastructures, namely power systems and water treatment plants. This study focuses on tackling two key technical issues: (1) finding the set of best features under a different combination of techniques and (2) resolving the class imbalance problem using oversampling methods. We evaluate the performance of each algorithm in terms of time complexity and detection accuracy to meet the time-critical requirements of critical infrastructures. Moreover, we address the inherent skewed distribution problem and the data imbalance problem commonly found in many critical infrastructure datasets. Our results show that the considered minority oversampling techniques can improve the Area Under Curve (AUC) of GradientBoosting, AdaBoost, and kNN by 10–12%.

Download Full-text

Effects of Class Imbalance Using Machine Learning Algorithms

International Journal of Applied Evolutionary Computation ◽

10.4018/ijaec.2021010101 ◽

2021 ◽

Vol 12 (1) ◽

pp. 1-17

Author(s):

Swati V. Narwane ◽

Sudhir D. Sawarkar

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Class Imbalance ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Training Data ◽

Model Accuracy ◽

Data Set ◽

Class Distribution ◽

Imbalance Problem

Class imbalance is the major hurdle for machine learning-based systems. Data set is the backbone of machine learning and must be studied to handle the class imbalance. The purpose of this paper is to investigate the effect of class imbalance on the data sets. The proposed methodology determines the model accuracy for class distribution. To find possible solutions, the behaviour of an imbalanced data set was investigated. The study considers two case studies with data set divided balanced to unbalanced class distribution. Testing of the data set with trained and test data was carried out for standard machine learning algorithms. Model accuracy for class distribution was measured with the training data set. Further, the built model was tested with individual binary class. Results show that, for the improvement of the system performance, it is essential to work on class imbalance problems. The study concludes that the system produces biased results due to the majority class. In the future, the multiclass imbalance problem can be studied using advanced algorithms.

Download Full-text