Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction

Machine learning methods in software engineering are becoming increasingly important as they can improve quality and testing efficiency by constructing models to predict defects in software modules. The existing datasets for software defect prediction suffer from an imbalance of class distribution which makes the learning problem in such a task harder. In this paper, we propose a novel approach by integrating Over-Bagging, static and dynamic ensemble selection strategies. The proposed method utilizes most of ensemble learning approaches called Omni-Ensemble Learning (OEL). This approach exploits a new Over-Bagging method for class imbalance learning in which the effect of three different methods of assigning weight to training samples is investigated. The proposed method first specifies the best classifiers along with their combiner for all test samples through Genetic Algorithm as the static ensemble selection approach. Then, a subset of the selected classifiers is chosen for each test sample as the dynamic ensemble selection. Our experiments confirm that the proposed OEL can provide better overall performance (in terms of G-mean, balance, and AUC measures) comparing with other six related works and six multiple classifier systems over seven NASA datasets. We generally recommend OEL to improve the performance of software defect prediction and the similar problem based on these experimental results.

Download Full-text

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194014400105 ◽

2014 ◽

Vol 24 (09) ◽

pp. 1229-1253 ◽

Cited By ~ 3

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar ◽

Amri Napolitano

Keyword(s):

Feature Selection ◽

Prediction Models ◽

Measurement Data ◽

Class Imbalance ◽

Data Preprocessing ◽

High Dimensionality ◽

Training Dataset ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially-problematic program modules. Effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, there are two problems affecting software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with relatively many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods (Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE)) in the technique, we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique which do not incorporate feature selection, and compare all four techniques (the two different ensemble-based approaches which utilize feature selection and the two versions which use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection do help for getting better prediction than the techniques without feature selection.

Download Full-text

Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem

Information Systems ◽

10.1016/j.is.2015.02.006 ◽

2015 ◽

Vol 51 ◽

pp. 62-71 ◽

Cited By ~ 56

Author(s):

Michael J. Siers ◽

Md Zahidul Islam

Keyword(s):

Class Imbalance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Potential Solution ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Software Defect ◽

Decision Forest

Download Full-text

Integrasi SMOTE pada Naive Bayes dan Logistic Regression Berbasis Particle Swarm Optimization untuk Prediksi Cacat Perangkat Lunak

JURNAL MEDIA INFORMATIKA BUDIDARMA ◽

10.30865/mib.v5i1.2616 ◽

2021 ◽

Vol 5 (1) ◽

pp. 233

Author(s):

Andre Hardoni ◽

Dian Palupi Rini ◽

Sukemi Sukemi

Keyword(s):

Logistic Regression ◽

Particle Swarm Optimization ◽

Naive Bayes ◽

Class Imbalance ◽

Naïve Bayes ◽

Defect Prediction ◽

Software Defect Prediction ◽

Swarm Optimization ◽

Software Defect ◽

Classification Technique

Software defects are one of the main contributors to information technology waste and lead to rework, thus consuming a lot of time and money. Software defect prediction has the objective of defect prevention by classifying certain modules as defective or not defective. Many researchers have conducted research in the field of software defect prediction using NASA MDP public datasets, but these datasets still have shortcomings such as class imbalance and noise attribute. The class imbalance problem can be overcome by utilizing SMOTE (Synthetic Minority Over-sampling Technique) and the noise attribute problem can be solved by selecting features using Particle Swarm Optimization (PSO), So in this research, the integration between SMOTE and PSO is applied to the classification technique machine learning naïve Bayes and logistic regression. From the results of experiments that have been carried out on 8 NASA MDP datasets by dividing the dataset into training and testing data, it is found that the SMOTE + PSO integration in each classification technique can improve classification performance with the highest AUC (Area Under Curve) value on average 0,89 on logistic regression and 0,86 in naïve Bayes in the training and at the same time better than without combining the two.

Download Full-text

COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction

Information and Software Technology ◽

10.1016/j.infsof.2020.106432 ◽

2021 ◽

Vol 129 ◽

pp. 106432 ◽

Cited By ~ 1

Author(s):

Shuo Feng ◽

Jacky Keung ◽

Xiao Yu ◽

Yan Xiao ◽

Kwabena Ebo Bennin ◽

...

Keyword(s):

Class Imbalance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Software Defect

Download Full-text

Software Defect Prediction Based on Cost-Sensitive Dictionary Learning

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194019500384 ◽

2019 ◽

Vol 29 (09) ◽

pp. 1219-1243 ◽

Cited By ~ 1

Author(s):

Hongyan Wan ◽

Guoqing Wu ◽

Mali Yu ◽

Mengting Yuan

Keyword(s):

Sparse Representation ◽

Dictionary Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Prediction Method ◽

Elastic Net ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Software Modules

Software defect prediction technology has been widely used in improving the quality of software system. Most real software defect datasets tend to have fewer defective modules than defective-free modules. Highly class-imbalanced data typically make accurate predictions difficult. The imbalanced nature of software defect datasets makes the prediction model classifying a defective module as a defective-free one easily. As there exists the similarity during the different software modules, one module can be represented by the sparse representation coefficients over the pre-defined dictionary which consists of historical software defect datasets. In this study, we make use of dictionary learning method to predict software defect. We optimize the classifier parameters and the dictionary atoms iteratively, to ensure that the extracted features (sparse representation) are optimal for the trained classifier. We prove the optimal condition of the elastic net which is used to solve the sparse coding coefficients and the regularity of the elastic net solution. Due to the reason that the misclassification of defective modules generally incurs much higher cost risk than the misclassification of defective-free ones, we take the different misclassification costs into account, increasing the punishment on misclassification defective modules in the procedure of dictionary learning, making the classification inclining to classify a module as a defective one. Thus, we propose a cost-sensitive software defect prediction method using dictionary learning (CSDL). Experimental results on the 10 class-imbalance datasets of NASA show that our method is more effective than several typical state-of-the-art defect prediction methods.

Download Full-text

Genetic algorithm-based oversampling approach to prune the class imbalance issue in software defect prediction

Soft Computing ◽

10.1007/s00500-021-06112-6 ◽

2021 ◽

Author(s):

C. Arun ◽

C. Lakshmi

Keyword(s):

Genetic Algorithm ◽

Class Imbalance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Download Full-text