Software Defect Prediction Using Propositionalization Based Data Preprocessing: An Empirical Study

Different data preprocessing methods and classifiers have been established and evaluated earlier for the software defect prediction (SDP) across projects. These novel approaches have provided relatively acceptable prediction results for different software projects. However, to the best of our knowledge, few researchers have combined data preprocessing and building robust classifier simultaneously to improve prediction performances in SDP. Therefore, this paper presents a new whole framework for predicting fault-prone software modules. The proposed framework consists of instance filtering, feature selection, instance reduction, and establishing a new classifier. Additionally, we find that the 21 main software metrics commonly do follow nonnormal distribution after performing a Kolmogorov-Smirnov test. Therefore, the newly proposed classifier is built on the maximum correntropy criterion (MCC). The MCC is well-known for its effectiveness in handling non-Gaussian noise. To evaluate the new framework, the experimental study is designed with due care using nine open-source software projects with their 32 releases, obtained from the PROMISE data repository. The prediction accuracy is evaluated using F-measure. The state-of-the-art methods for Cross-Project Defect Prediction are also included for comparison. All of the evidences derived from the experimentation verify the effectiveness and robustness of our new framework.

Download Full-text

An Empirical Study of Model-Agnostic Interpretation Technique for Just-in-Time Software Defect Prediction

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering - Collaborative Computing: Networking, Applications and Worksharing ◽

10.1007/978-3-030-92635-9_25 ◽

2021 ◽

pp. 420-438

Author(s):

Xingguang Yang ◽

Huiqun Yu ◽

Guisheng Fan ◽

Zijie Huang ◽

Kang Yang ◽

...

Keyword(s):

Empirical Study ◽

Defect Prediction ◽

Just In Time ◽

Software Defect Prediction ◽

Software Defect

Download Full-text

An empirical study on pareto based multi-objective feature selection for software defect prediction

Journal of Systems and Software ◽

10.1016/j.jss.2019.03.012 ◽

2019 ◽

Vol 152 ◽

pp. 215-238 ◽

Cited By ~ 9

Author(s):

Chao Ni ◽

Xiang Chen ◽

Fangfang Wu ◽

Yuxiang Shen ◽

Qing Gu

Keyword(s):

Feature Selection ◽

Empirical Study ◽

Defect Prediction ◽

Software Defect Prediction ◽

Multi Objective ◽

Software Defect ◽

Selection For

Download Full-text

An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data

Neurocomputing ◽

10.1016/j.neucom.2018.04.090 ◽

2019 ◽

Vol 343 ◽

pp. 120-140 ◽

Cited By ~ 11

Author(s):

Ruchika Malhotra ◽

Shine Kamal

Keyword(s):

Empirical Study ◽

Imbalanced Data ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Download Full-text

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194014400105 ◽

2014 ◽

Vol 24 (09) ◽

pp. 1229-1253 ◽

Cited By ~ 3

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar ◽

Amri Napolitano

Keyword(s):

Feature Selection ◽

Prediction Models ◽

Measurement Data ◽

Class Imbalance ◽

Data Preprocessing ◽

High Dimensionality ◽

Training Dataset ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially-problematic program modules. Effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, there are two problems affecting software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with relatively many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods (Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE)) in the technique, we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique which do not incorporate feature selection, and compare all four techniques (the two different ensemble-based approaches which utilize feature selection and the two versions which use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection do help for getting better prediction than the techniques without feature selection.

Download Full-text

Combining Data Preprocessing Methods With Imputation Techniques for Software Defect Prediction

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2018010101 ◽

2018 ◽

Vol 9 (1) ◽

pp. 1-19 ◽

Cited By ~ 1

Author(s):

Misha Kakkar ◽

Sarika Jain ◽

Abhay Bansal ◽

P.S. Grover

Keyword(s):

Linear Regression ◽

Missing Values ◽

Data Preprocessing ◽

Machine Learning Algorithms ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Combining Data ◽

Feature Selector ◽

Traditional Performance

Software Defect Prediction (SDP) models are used to predict, whether software is clean or buggy using the historical data collected from various software repositories. The data collected from such repositories may contain some missing values. In order to estimate missing values, imputation techniques are used, which utilizes the complete observed values in the dataset. The objective of this study is to identify the best-suited imputation technique for handling missing values in SDP dataset. In addition to identifying the imputation technique, the authors have investigated for the most appropriate combination of imputation technique and data preprocessing method for building SDP model. In this study, four combinations of imputation technique and data preprocessing methods are examined using the improved NASA datasets. These combinations are used along with five different machine-learning algorithms to develop models. The performance of these SDP models are then compared using traditional performance indicators. Experiment results show that among different imputation techniques, linear regression gives the most accurate imputed value. The combination of linear regression with correlation based feature selector outperforms all other combinations. To validate the significance of data preprocessing methods with imputation the findings are applied to open source projects. It was concluded that the result is in consistency with the above conclusion.

Download Full-text

An empirical study on software defect prediction with a simplified metric set

Information and Software Technology ◽

10.1016/j.infsof.2014.11.006 ◽

2015 ◽

Vol 59 ◽

pp. 170-190 ◽

Cited By ~ 104

Author(s):

Peng He ◽

Bing Li ◽

Xiao Liu ◽

Jun Chen ◽

Yutao Ma

Keyword(s):

Empirical Study ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Metric Set

Download Full-text

An Empirical Study on Software Defect Prediction Using CodeBERT Model

Applied Sciences ◽

10.3390/app11114793 ◽

2021 ◽

Vol 11 (11) ◽

pp. 4793

Author(s):

Cong Pan ◽

Minyan Lu ◽

Biao Xu

Keyword(s):

Deep Learning ◽

Software Engineering ◽

Empirical Study ◽

Empirical Studies ◽

Language Model ◽

Prediction Performance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Cross Project

Deep learning-based software defect prediction has been popular these days. Recently, the publishing of the CodeBERT model has made it possible to perform many software engineering tasks. We propose various CodeBERT models targeting software defect prediction, including CodeBERT-NT, CodeBERT-PS, CodeBERT-PK, and CodeBERT-PT. We perform empirical studies using such models in cross-version and cross-project software defect prediction to investigate if using a neural language model like CodeBERT could improve prediction performance. We also investigate the effects of different prediction patterns in software defect prediction using CodeBERT models. The empirical results are further discussed.

Download Full-text