An Improved Method for Cross-Project Defect Prediction by Simplifying Training Data

Cross-project defect prediction (CPDP) on projects with limited historical data has attracted much attention. To the best of our knowledge, however, existing approaches usually perform poorly because of low-quality cross-project training data. The objective of this study is to propose an improved method for CPDP by simplifying training data, named TDSelector, which considers both the similarity and the number of defects that each training instance has (denoted by defects), and to demonstrate the effectiveness of the proposed method. Our work consists of three main steps. First, we constructed TDSelector as a linear weighted function of instances’ similarity and defects. Second, we built the basic defect predictor used in our experiments with the Logistic Regression classification algorithm. Third, we analyzed the impact of different combinations of similarity measures and normalization methods for defects on prediction performance, and then compared TDSelector with two existing methods. We evaluated our method on 14 projects collected from two public repositories. The results suggest that the proposed TDSelector method performs, on average, better than both baseline methods, with AUC values increased by up to 10.6% and 4.3%, respectively. That is, the inclusion of defects is indeed helpful for selecting high-quality training instances for CPDP. Moreover, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. An additional experiment also shows that directly selecting the instances with more bugs as training data can further improve the performance of the bug predictor trained by our method.
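As a rough illustration of the selection idea in this abstract, the sketch below scores each candidate training instance with a linear weighted sum of its similarity to the target project and its linearly (min-max) normalized defect count, then keeps the top-scoring instances. The function names, the weight `alpha`, the cut-off `k`, the use of a single target centroid, and the conversion of Euclidean distance into a similarity via 1 / (1 + d) are all assumptions made for illustration. The abstract does not specify them, and this is not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max(values):
    """Linear (min-max) normalization to [0, 1]; constant input maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_training_data(candidates, target_centroid, alpha=0.5, k=2):
    """Rank (features, defect_count) pairs and return the k best.

    Assumption: similarity is 1 / (1 + distance to a target centroid).
    Assumption: score = alpha * similarity + (1 - alpha) * normalized defects.
    """
    sims = [1.0 / (1.0 + euclidean(f, target_centroid)) for f, _ in candidates]
    defects = min_max([d for _, d in candidates])
    scores = [alpha * s + (1 - alpha) * d for s, d in zip(sims, defects)]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```

With `alpha=1.0` the ranking depends on similarity alone, while lower values give more weight to instances that carry more defects, which is the trade-off the abstract investigates.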

An Improved Method for Training Data Selection for Cross-Project Defect Prediction

Arabian Journal for Science and Engineering ◽ 10.1007/s13369-021-06088-3 ◽ 2021

Author(s): Nayeem Ahmad Bhat ◽ Sheikh Umar Farooq

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Improved Method ◽ Selection For ◽ Training Data Selection ◽ Cross Project

A Framework for Homogeneous Cross-Project Defect Prediction

International Journal of Software Innovation ◽ 10.4018/ijsi.2021010105 ◽ 2021 ◽ Vol 9 (1) ◽ pp. 52-68

Author(s): Lipika Goel ◽ Mayank Sharma ◽ Sunil Kumar Khatri ◽ D. Damodaran

Keyword(s): Class Imbalance ◽ The Other ◽ Training Data ◽ Defect Prediction ◽ Rank Test ◽ Ensemble Model ◽ Class Imbalance Problem ◽ Imbalance Problem ◽ Proposed Model ◽ Cross Project

Prior defect data for a project is often unavailable, which led researchers to ask whether defect data from other projects can be used for prediction. This made cross-project defect prediction an open research issue. In this approach, the training data often suffers from the class imbalance problem. This work focuses on homogeneous cross-project defect prediction. A novel ensemble model that works in two stages is proposed. First, it handles the class imbalance problem of the dataset. Second, it predicts the target class. To handle the imbalance problem, the training dataset is divided into data frames, and each data frame is balanced. An ensemble model using maximum voting over all random forest classifiers is implemented. The proposed model performs better than the other baseline models. The Wilcoxon signed-rank test is used to validate the proposed model.
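The pipeline described in this abstract (split the training data into frames, balance each frame, train one classifier per frame, combine by voting) can be sketched as follows. This is only a minimal sketch under stated assumptions: the abstract does not say how each frame is balanced, so random oversampling of the minority class is assumed here, and a nearest-centroid classifier stands in for the random forests so that the example stays dependency-free. All function names are hypothetical.

```python
import random
from collections import Counter


def split_frames(rows, n_frames):
    """Deal (features, label) rows round-robin into n_frames frames."""
    return [rows[i::n_frames] for i in range(n_frames)]


def balance(frame, rng):
    """Assumption: balance by randomly oversampling the smaller classes."""
    by_label = {}
    for features, label in frame:
        by_label.setdefault(label, []).append((features, label))
    largest = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(largest - len(items)))
    return balanced


def train_centroid(frame):
    """Stand-in base learner (nearest class centroid), NOT a random forest."""
    sums, counts = {}, Counter()
    for features, label in frame:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] += 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(features):
        def sq_dist(lab):
            return sum((x - c) ** 2 for x, c in zip(features, centroids[lab]))
        return min(centroids, key=sq_dist)

    return predict


def majority_vote(models, features):
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]
```

A typical call sequence would be `frames = [balance(f, rng) for f in split_frames(rows, n)]`, then `models = [train_centroid(f) for f in frames]`, then `majority_vote(models, x)` for each target instance `x`. Swapping `train_centroid` for a real random forest trainer leaves the rest of the scaffold unchanged.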

An Investigation of Imbalanced Ensemble Learning Methods for Cross-Project Defect Prediction

International Journal of Pattern Recognition and Artificial Intelligence ◽ 10.1142/s0218001419590377 ◽ 2019 ◽ Vol 33 (12) ◽ pp. 1959037 ◽ Cited By ~ 5

Author(s): Shaojian Qiu ◽ Lu Lu ◽ Siyu Jiang ◽ Yang Guo

Keyword(s): Ensemble Learning ◽ Class Imbalance ◽ Training Data ◽ Defect Prediction ◽ Class Imbalance Problem ◽ Learning Methods ◽ Imbalance Problem ◽ Intelligent Software ◽ Under Sampling ◽ Cross Project

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from researchers in intelligent software engineering. Most existing SDP methods operate under a within-project setting. However, there is usually little or no within-project training data with which to learn a usable supervised prediction model for a new SDP task. Therefore, cross-project defect prediction (CPDP), which uses labeled data from source projects to learn a defect predictor for a target project, was proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on the performance of CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, with particular attention to the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods through extensive experiments on 31 open-source projects derived from five datasets. By analyzing a total of 37,504 results, we found that in most cases the IEL method that combines under-sampling and bagging is more effective than the other investigated methods.
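The combination of under-sampling and bagging highlighted in this abstract can be illustrated by the sampling step alone. The sketch below builds several class-balanced bags by bootstrapping the minority class and drawing an equal number of majority rows without replacement. The function name, the label encoding (1 = buggy, 0 = clean) and the exact sampling scheme are assumptions for illustration, not the procedure evaluated in the study.

```python
import random


def under_sampled_bags(rows, n_bags, seed=0):
    """Build n_bags balanced training bags from (features, label) rows.

    Assumption: labels are 1 (buggy) and 0 (clean), and both classes occur.
    """
    rng = random.Random(seed)
    buggy = [r for r in rows if r[1] == 1]
    clean = [r for r in rows if r[1] == 0]
    minority, majority = sorted((buggy, clean), key=len)
    bags = []
    for _ in range(n_bags):
        # Bootstrap the minority class (sampling with replacement).
        bag = [rng.choice(minority) for _ in minority]
        # Under-sample the majority class down to the same size.
        bag += rng.sample(majority, len(minority))
        rng.shuffle(bag)
        bags.append(bag)
    return bags
```

One base classifier would then be trained per bag and the predictions aggregated, for example by majority vote.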

Training data selection for cross-project defect prediction

Proceedings of the 9th International Conference on Predictive Models in Software Engineering - PROMISE '13 ◽ 10.1145/2499393.2499395 ◽ 2013 ◽ Cited By ~ 50

Author(s): Steffen Herbold

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Selection For ◽ Training Data Selection ◽ Cross Project

Nonlinear Geometric Framework for Software Defect Prediction

International Journal of Decision Support System Technology ◽ 10.4018/ijdsst.2020070105 ◽ 2020 ◽ Vol 12 (3) ◽ pp. 85-100

Author(s): Misha Kakkar ◽ Sarika Jain ◽ Abhay Bansal ◽ P. S. Grover

Keyword(s): Ensemble Learning ◽ Historical Data ◽ Defect Prediction ◽ Software Defect Prediction ◽ Geometric Framework ◽ Minority Class ◽ Software Defect ◽ Proposed Model ◽ Nonlinear Geometric ◽ Better Than

People use software in every walk of life, so it is essential to have software of the best quality. Software defect prediction (SDP) models help identify defect-prone modules using historical data, which in turn improves software quality. Historical data consists of data about modules, files, or classes that are labeled as buggy or clean. Because there are far fewer buggy artifacts than clean artifacts, historical data is imbalanced. This uneven distribution makes it difficult for classification algorithms to build highly effective SDP models. The objective of this study is to propose a new nonlinear geometric framework based on SMOTE and ensemble learning to improve the performance of SDP models. The study combines the traditional SMOTE algorithm with a novel ensemble Support Vector Machine (SVM) to develop the proposed framework, called SMEnsemble. The SMOTE algorithm handles the class imbalance problem by generating synthetic instances of the minority class. Ensemble learning generates multiple classification models to select the best-performing SDP model. For experimentation, the study uses datasets from three different software repositories that contain both open-source and proprietary projects. The results show that SMEnsemble performs better than traditional methods at identifying the minority class, i.e., buggy artifacts. The proposed model also performs better than the latest state-of-the-art SDP model, SMOTUNED. Compared with traditional methods, the proposed model is better able to handle imbalanced classes. In addition, by carefully selecting the number of ensembles, high performance can be achieved in less time.
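To make the SMOTE step mentioned in this abstract concrete, the following minimal sketch generates synthetic minority instances by interpolating between a minority sample and one of its nearest minority neighbours. It covers only the oversampling idea, not the SMEnsemble framework or its SVM ensemble. The function name and the defaults for `k` and `seed` are illustrative assumptions, and at least two minority instances are required.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points on segments between minority neighbours.

    minority: list of feature vectors (lists of floats), length >= 2.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base sample.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two existing buggy instances, the new samples stay inside the region already occupied by the minority class rather than duplicating instances exactly.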

Isolation Forest Filter to Simplify Training Data for Cross-Project Defect Prediction

2019 Prognostics and System Health Management Conference (PHM-Qingdao) ◽ 10.1109/phm-qingdao46334.2019.8942919 ◽ 2019

Author(s): Can Cui ◽ Bin Liu ◽ Shihai Wang

Keyword(s): Training Data ◽ Defect Prediction ◽ Forest Filter ◽ Isolation Forest ◽ Cross Project

Cross-version defect prediction: use historical data, cross-project data, or both?

Empirical Software Engineering ◽ 10.1007/s10664-019-09777-8 ◽ 2020 ◽ Vol 25 (2) ◽ pp. 1573-1595

Author(s): Sousuke Amasaki

Keyword(s): Historical Data ◽ Defect Prediction ◽ Project Data ◽ Cross Project

Training data selection for imbalanced cross-project defect prediction

Computers & Electrical Engineering ◽ 10.1016/j.compeleceng.2021.107370 ◽ 2021 ◽ Vol 94 ◽ pp. 107370

Author(s): Shang Zheng ◽ Jinjing Gai ◽ Hualong Yu ◽ Haitao Zou ◽ Shang Gao

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Selection For ◽ Training Data Selection ◽ Cross Project

Search Based Training Data Selection For Cross Project Defect Prediction

Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering - PROMISE 2016 ◽ 10.1145/2972958.2972964 ◽ 2016 ◽ Cited By ~ 10

Author(s): Seyedrebvar Hosseini ◽ Burak Turhan ◽ Mika Mäntylä

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Selection For ◽ Training Data Selection ◽ Cross Project

complexFuzzy: A novel clustering method for selecting training instances of cross-project defect prediction

Computer Science ◽ 10.7494/csci.2021.22.1.3743 ◽ 2021 ◽ Vol 22 (1)

Author(s): Muhammed Maruf Ozturk

Keyword(s): Area Under The Curve ◽ Prediction Performance ◽ Training Data ◽ Defect Prediction ◽ Data Sets ◽ Clustering Method ◽ Testing Data ◽ Proper Training ◽ Comparison Algorithms ◽ Cross Project

Over the last decade, researchers have investigated the extent to which cross-project defect prediction (CPDP) offers advantages over traditional defect prediction settings. These studies do not draw the training and testing data from the same project; instead, different projects are used. Selecting proper training data plays an important role in the success of CPDP. In this study, a novel clustering method named complexFuzzy is presented for selecting training data for CPDP. The method determines membership values with the help of metrics that can be considered indicators of complexity. First, CPDP combinations are created from 29 different data sets. Subsequently, complexFuzzy is evaluated by considering the cluster centers of the data sets and comparing performance measures, including area under the curve (AUC) and F-measure. The method is superior to five other comparison algorithms in terms of the distance between cluster centers and prediction performance.
