Support Vector based Oversampling Technique for Handling Class Imbalance in Software Defect Prediction

AbstractPredicting the defect-prone modules when the previous defect labels of modules are limited is a challenging problem encountered in the software industry. Supervised classification approaches cannot build high-performance prediction models with few defect data, leading to the need for new methods, techniques, and tools. One solution is to combine labeled data points with unlabeled data points during learning phase. Semi-supervised classification methods use not only labeled data points but also unlabeled ones to improve the generalization capability. In this study, we evaluated four semi-supervised classification methods for semi-supervised defect prediction. Low-density separation (LDS), support vector machine (SVM), expectation-maximization (EM-SEMI), and class mass normalization (CMN) methods have been investigated on NASA data sets, which are CM1, KC1, KC2, and PC1. Experimental results showed that SVM and LDS algorithms outperform CMN and EM-SEMI algorithms. In addition, LDS algorithm performs much better than SVM when the data set is large. In this study, the LDS-based prediction approach is suggested for software defect prediction when there are limited fault data.

Download Full-text

A Hybrid Approach to Coping with High Dimensionality and Class Imbalance for Software Defect Prediction

2012 11th International Conference on Machine Learning and Applications ◽

10.1109/icmla.2012.145 ◽

2012 ◽

Cited By ~ 4

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar ◽

Amri Napolitano

Keyword(s):

Hybrid Approach ◽

Class Imbalance ◽

High Dimensionality ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Download Full-text

A Novel Effort Measure Method for Effort-Aware Just-in-Time Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194021500364 ◽

2021 ◽

Vol 31 (08) ◽

pp. 1145-1169

Author(s):

Liqiong Chen ◽

Shilong Song ◽

Can Wang

Keyword(s):

Prediction Model ◽

Defect Prediction ◽

Software Systems ◽

Support Vector ◽

Just In Time ◽

Software Defect Prediction ◽

Fine Grained ◽

Software Defect ◽

Code Changes ◽

The Cost

Just-in-time software defect prediction (JIT-SDP) is a fine-grained software defect prediction technology, which aims to identify the defective code changes in software systems. Effort-aware software defect prediction is a software defect prediction technology that takes into consideration the cost of code inspection, which can find more defective code changes in limited test resources. The traditional effort-aware defect prediction model mainly measures the effort based on the number of lines of code (LOC) and rarely considers additional factors. This paper proposes a novel effort measure method called Multi-Metric Joint Calculation (MMJC). When measuring the effort, MMJC takes into account not only LOC, but also the distribution of modified code across different files (Entropy), the number of developers that changed the files (NDEV) and the developer experience (EXP). In the simulation experiment, MMJC is combined with Linear Regression, Decision Tree, Random Forest, LightGBM, Support Vector Machine and Neural Network, respectively, to build the software defect prediction model. Several comparative experiments are conducted between the models based on MMJC and baseline models. The results show that indicators ACC and [Formula: see text] of the models based on MMJC are improved by 35.3% and 15.9% on average in the three verification scenarios, respectively, compared with the baseline models.

Download Full-text

Using the Support Vector Machine as a Classification Method for Software Defect Prediction with Static Code Metrics

Engineering Applications of Neural Networks - Communications in Computer and Information Science ◽

10.1007/978-3-642-03969-0_21 ◽

2009 ◽

pp. 223-234 ◽

Cited By ~ 33

Author(s):

David Gray ◽

David Bowes ◽

Neil Davey ◽

Yi Sun ◽

Bruce Christianson

Keyword(s):

Support Vector Machine ◽

Classification Method ◽

Defect Prediction ◽

Support Vector ◽

Software Defect Prediction ◽

Software Defect ◽

Code Metrics

Download Full-text

Omni-Ensemble Learning (OEL): Utilizing Over-Bagging, Static and Dynamic Ensemble Selection Approaches for Software Defect Prediction

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213018500240 ◽

2018 ◽

Vol 27 (06) ◽

pp. 1850024 ◽

Cited By ~ 4

Author(s):

Reza Mousavi ◽

Mahdi Eftekhari ◽

Farhad Rahdari

Keyword(s):

Ensemble Learning ◽

Class Imbalance ◽

Defect Prediction ◽

Learning Approaches ◽

Software Defect Prediction ◽

Multiple Classifier Systems ◽

Classifier Systems ◽

Software Defect ◽

Ensemble Selection ◽

Dynamic Ensemble Selection

Machine learning methods in software engineering are becoming increasingly important as they can improve quality and testing efficiency by constructing models to predict defects in software modules. The existing datasets for software defect prediction suffer from an imbalance of class distribution which makes the learning problem in such a task harder. In this paper, we propose a novel approach by integrating Over-Bagging, static and dynamic ensemble selection strategies. The proposed method utilizes most of ensemble learning approaches called Omni-Ensemble Learning (OEL). This approach exploits a new Over-Bagging method for class imbalance learning in which the effect of three different methods of assigning weight to training samples is investigated. The proposed method first specifies the best classifiers along with their combiner for all test samples through Genetic Algorithm as the static ensemble selection approach. Then, a subset of the selected classifiers is chosen for each test sample as the dynamic ensemble selection. Our experiments confirm that the proposed OEL can provide better overall performance (in terms of G-mean, balance, and AUC measures) comparing with other six related works and six multiple classifier systems over seven NASA datasets. We generally recommend OEL to improve the performance of software defect prediction and the similar problem based on these experimental results.

Download Full-text

Software Defect Prediction Using Dynamic Support Vector Machine

2013 Ninth International Conference on Computational Intelligence and Security ◽

10.1109/cis.2013.61 ◽

2013 ◽

Cited By ~ 6

Author(s):

Bo Shuai ◽

Haifeng Li ◽

Mengjun Li ◽

Quan Zhang ◽

Chaojing Tang

Keyword(s):

Support Vector Machine ◽

Defect Prediction ◽

Support Vector ◽

Software Defect Prediction ◽

Software Defect ◽

Dynamic Support

Download Full-text

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194014400105 ◽

2014 ◽

Vol 24 (09) ◽

pp. 1229-1253 ◽

Cited By ~ 3

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar ◽

Amri Napolitano

Keyword(s):

Feature Selection ◽

Prediction Models ◽

Measurement Data ◽

Class Imbalance ◽

Data Preprocessing ◽

High Dimensionality ◽

Training Dataset ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially-problematic program modules. Effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, there are two problems affecting software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with relatively many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods (Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE)) in the technique, we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique which do not incorporate feature selection, and compare all four techniques (the two different ensemble-based approaches which utilize feature selection and the two versions which use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection do help for getting better prediction than the techniques without feature selection.

Download Full-text

Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem

Information Systems ◽

10.1016/j.is.2015.02.006 ◽

2015 ◽

Vol 51 ◽

pp. 62-71 ◽

Cited By ~ 56

Author(s):

Michael J. Siers ◽

Md Zahidul Islam

Keyword(s):

Class Imbalance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Potential Solution ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Software Defect ◽

Decision Forest

Download Full-text

Support Vector based Oversampling Technique for Handling Class Imbalance in Software Defect Prediction

Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction

Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study

A Comparison of Semi-Supervised Classification Approaches for Software Defect Prediction

A Hybrid Approach to Coping with High Dimensionality and Class Imbalance for Software Defect Prediction

A Novel Effort Measure Method for Effort-Aware Just-in-Time Software Defect Prediction

Using the Support Vector Machine as a Classification Method for Software Defect Prediction with Static Code Metrics

Omni-Ensemble Learning (OEL): Utilizing Over-Bagging, Static and Dynamic Ensemble Selection Approaches for Software Defect Prediction

Software Defect Prediction Using Dynamic Support Vector Machine

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem

Export Citation Format