Isolation Forest Filter to Simplify Training Data for Cross-Project Defect Prediction

When prior defect data for a project is unavailable, researchers have asked whether defect data from other projects can be used for prediction instead. This question has made cross-project defect prediction an open research issue. In this setting, the training data often suffers from the class imbalance problem. This work focuses on homogeneous cross-project defect prediction and proposes a novel ensemble model that serves two purposes: first, it handles the class imbalance of the dataset; second, it predicts the target class. To handle the imbalance, the training dataset is divided into data frames, and each data frame is balanced. An ensemble model using maximum (majority) voting over all random forest classifiers is then implemented. The proposed model performs better than the baseline models, and a Wilcoxon signed-rank test is used to validate it.
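
The divide, balance, and vote steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how frames are formed or balanced, so the sketch assumes binary labels, splits the majority class into chunks about the size of the minority class, and pairs each chunk with all minority rows. A tiny nearest-centroid learner stands in for the random forest so that the example runs with the standard library only. The names `make_balanced_frames`, `NearestCentroid`, and `predict_by_vote` are hypothetical.

```python
import random
from collections import Counter


def make_balanced_frames(rows, labels, seed=0):
    """Assumed scheme: split the majority class into chunks about the size
    of the minority class and pair every chunk with all minority rows."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minor = [(r, l) for r, l in zip(rows, labels) if l == minority]
    major = [(r, l) for r, l in zip(rows, labels) if l != minority]
    rng.shuffle(major)
    size = len(minor)
    # The last chunk may be smaller than the others if sizes do not divide evenly.
    return [major[i:i + size] + minor for i in range(0, len(major), size)]


class NearestCentroid:
    """Stand-in base learner (NOT a random forest): predicts the class
    whose feature mean is closest to the query row."""

    def fit(self, frame):
        sums, counts = {}, Counter()
        for row, label in frame:
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, row):
        def dist2(center):
            return sum((a - b) ** 2 for a, b in zip(row, center))
        return min(self.centroids, key=lambda label: dist2(self.centroids[label]))


def predict_by_vote(models, row):
    """Majority (maximum) vote over the per-frame models."""
    votes = Counter(model.predict(row) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: six non-defective rows (label 0) and two defective rows (label 1).
rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 1.2],
        [0.8, 0.9], [1.0, 0.7], [5.0, 5.0], [5.2, 4.8]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

frames = make_balanced_frames(rows, labels)
models = [NearestCentroid().fit(frame) for frame in frames]
print(predict_by_vote(models, [4.9, 5.1]))
```

In practice the stand-in learner would be replaced by a real random forest, for example scikit-learn's `RandomForestClassifier`, trained once per frame.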

An Investigation of Imbalanced Ensemble Learning Methods for Cross-Project Defect Prediction

International Journal of Pattern Recognition and Artificial Intelligence ◽ 10.1142/s0218001419590377 ◽ 2019 ◽ Vol 33 (12) ◽ pp. 1959037 ◽ Cited By ~ 5

Author(s): Shaojian Qiu ◽ Lu Lu ◽ Siyu Jiang ◽ Yang Guo

Keyword(s): Ensemble Learning ◽ Class Imbalance ◽ Training Data ◽ Defect Prediction ◽ Class Imbalance Problem ◽ Learning Methods ◽ Imbalance Problem ◽ Intelligent Software ◽ Under Sampling ◽ Cross Project

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from researchers in intelligent software engineering. Most existing SDP methods operate in a within-project setting. However, a new SDP task usually has little or no within-project training data from which to learn a supervised prediction model. Cross-project defect prediction (CPDP), which uses labeled data from source projects to learn a defect predictor for a target project, was therefore proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on the performance of CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, with particular attention to the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods in extensive experiments on 31 open-source projects drawn from five datasets. Analyzing a total of 37504 results, we found that in most cases the IEL method that combines under-sampling and bagging is more effective than the other investigated methods.
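
As an illustration of what combining under-sampling with bagging can look like, the sketch below builds the training bags only. The abstract does not name the specific method or its sampling details, so everything here is an assumption: each bag bootstraps the minority (defective) class and draws an equally sized random subset of the majority class without replacement. The function name `under_bagging_samples` and its parameters are hypothetical.

```python
import random


def under_bagging_samples(labels, n_bags, minority_label=1, seed=0):
    """Return n_bags lists of row indices, each with equal class counts."""
    rng = random.Random(seed)
    minority = [i for i, l in enumerate(labels) if l == minority_label]
    majority = [i for i, l in enumerate(labels) if l != minority_label]
    bags = []
    for _ in range(n_bags):
        # Bagging step: bootstrap the minority class (sample with replacement).
        minor_pick = [rng.choice(minority) for _ in minority]
        # Under-sampling step: keep only as many majority rows as minority rows.
        major_pick = rng.sample(majority, k=min(len(minority), len(majority)))
        bags.append(minor_pick + major_pick)
    return bags


# Toy labels: 18 non-defective modules (0) and 2 defective modules (1).
labels = [0] * 18 + [1] * 2
bags = under_bagging_samples(labels, n_bags=5)
for bag in bags:
    print(sorted(bag))
```

One base classifier would then be trained per bag and the predictions combined, for example by majority vote. Because every bag sees a different slice of the majority class, less majority-class information is discarded than with a single under-sampled training set.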

An Improved Method for Training Data Selection for Cross-Project Defect Prediction

Arabian Journal for Science and Engineering ◽ 10.1007/s13369-021-06088-3 ◽ 2021

Author(s): Nayeem Ahmad Bhat ◽ Sheikh Umar Farooq

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Improved Method ◽ Selection For ◽ Training Data Selection ◽ Cross Project

Training data selection for cross-project defect prediction

Proceedings of the 9th International Conference on Predictive Models in Software Engineering - PROMISE '13 ◽ 10.1145/2499393.2499395 ◽ 2013 ◽ Cited By ~ 50

Author(s): Steffen Herbold

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Selection For ◽ Training Data Selection ◽ Cross Project

Training data selection for imbalanced cross-project defect prediction

Computers & Electrical Engineering ◽ 10.1016/j.compeleceng.2021.107370 ◽ 2021 ◽ Vol 94 ◽ pp. 107370

Author(s): Shang Zheng ◽ Jinjing Gai ◽ Hualong Yu ◽ Haitao Zou ◽ Shang Gao

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Selection For ◽ Training Data Selection ◽ Cross Project

Search Based Training Data Selection For Cross Project Defect Prediction

Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering - PROMISE 2016 ◽ 10.1145/2972958.2972964 ◽ 2016 ◽ Cited By ~ 10

Author(s): Seyedrebvar Hosseini ◽ Burak Turhan ◽ Mika Mäntylä

Keyword(s): Training Data ◽ Data Selection ◽ Defect Prediction ◽ Selection For ◽ Training Data Selection ◽ Cross Project

complexFuzzy: A novel clustering method for selecting training instances of cross-project defect prediction

Computer Science ◽ 10.7494/csci.2021.22.1.3743 ◽ 2021 ◽ Vol 22 (1)

Author(s): Muhammed Maruf Ozturk

Keyword(s): Area Under The Curve ◽ Prediction Performance ◽ Training Data ◽ Defect Prediction ◽ Data Sets ◽ Clustering Method ◽ Testing Data ◽ Proper Training ◽ Comparison Algorithms ◽ Cross Project

Over the last decade, researchers have investigated to what extent cross-project defect prediction (CPDP) offers advantages over traditional defect prediction settings. These works do not draw the training and testing data from the same project; instead, different projects are used. Selecting proper training data plays an important role in the success of CPDP. In this study, a novel clustering method named complexFuzzy is presented for selecting CPDP training data. The method determines membership values with the help of metrics that can be regarded as indicators of complexity. First, CPDP combinations are created on 29 different data sets. Subsequently, complexFuzzy is evaluated by considering the cluster centers of the data sets and comparing performance measures including area under the curve (AUC) and F-measure. The method is superior to five other comparison algorithms in terms of the distance of cluster centers and prediction performance.

A Three-Level Training Data Filter for Cross-project Defect Prediction

Wireless and Satellite Systems - Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering ◽ 10.1007/978-3-030-69069-4_10 ◽ 2021 ◽ pp. 109-123

Author(s): Cangzhou Yuan ◽ Xiaowei Wang ◽ Xinxin Ke ◽ Panpan Zhan

Keyword(s): Training Data ◽ Defect Prediction ◽ Level Training ◽ Cross Project

An Improved Method for Cross-Project Defect Prediction by Simplifying Training Data

Mathematical Problems in Engineering ◽ 10.1155/2018/2650415 ◽ 2018 ◽ Vol 2018 ◽ pp. 1-18 ◽ Cited By ~ 4

Author(s): Peng He ◽ Yao He ◽ Lvjun Yu ◽ Bing Li

Keyword(s): Euclidean Distance ◽ Historical Data ◽ Training Data ◽ Defect Prediction ◽ Improved Method ◽ Additional Experiment ◽ Weighted Function ◽ Public Repositories ◽ Better Than ◽ Cross Project

Cross-project defect prediction (CPDP) on projects with limited historical data has attracted much attention. To the best of our knowledge, however, the performance of existing approaches is usually poor because of low-quality cross-project training data. The objective of this study is to propose an improved method for CPDP by simplifying training data, labeled as TDSelector, which considers both the similarity and the number of defects that each training instance has (denoted by defects), and to demonstrate the effectiveness of the proposed method. Our work consists of three main steps. First, we constructed TDSelector in terms of a linear weighted function of instances’ similarity and defects. Second, the basic defect predictor used in our experiments was built by using the Logistic Regression classification algorithm. Third, we analyzed the impacts of different combinations of similarity and the normalization of defects on prediction performance and then compared the results with two existing methods. We evaluated our method on 14 projects collected from two public repositories. The results suggest that the proposed TDSelector method performs, on average, better than both baseline methods, and the AUC values are increased by up to 10.6% and 4.3%, respectively. That is, the inclusion of defects is indeed helpful for selecting high-quality training instances for CPDP. In addition, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. An additional experiment also shows that directly selecting instances with more bugs as training data can further improve the performance of the bug predictor trained by our method.
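
The idea of a linear weighted function of similarity and defects can be sketched as below. The abstract does not give the exact formula, so this is an assumed form and not the paper's definition: the score is `alpha * similarity + (1 - alpha) * normalized_defects`, similarity is derived from Euclidean distance as `1 / (1 + distance)`, defects are min-max (linearly) normalized, and each training instance is compared with a single reference vector from the target project. The names `td_scores`, `select_top`, `alpha`, and `target_ref` are hypothetical.

```python
import math


def linear_normalize(values):
    """Min-max normalization to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def euclidean_similarity(a, b):
    """Assumed mapping from Euclidean distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def td_scores(train_rows, defects, target_ref, alpha=0.5):
    """Assumed score: alpha * similarity + (1 - alpha) * normalized defects."""
    sims = [euclidean_similarity(row, target_ref) for row in train_rows]
    norm_defects = linear_normalize(defects)
    return [alpha * s + (1 - alpha) * d for s, d in zip(sims, norm_defects)]


def select_top(train_rows, defects, target_ref, k, alpha=0.5):
    """Return the indices of the k highest-scoring training instances."""
    scores = td_scores(train_rows, defects, target_ref, alpha)
    ranked = sorted(range(len(train_rows)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: three candidate training instances and their defect counts.
train_rows = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
defects = [0, 2, 1]
target_ref = [0.0, 0.0]

print(select_top(train_rows, defects, target_ref, k=1, alpha=1.0))
print(select_top(train_rows, defects, target_ref, k=1, alpha=0.5))
```

With `alpha=1.0` only similarity matters, so the instance identical to the reference ranks first. Lowering `alpha` gives weight to the defect count, which lets a less similar but more defect-prone instance overtake it.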

An Exploratory Study of Search Based Training Data Selection for Cross Project Defect Prediction

2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) ◽ 10.1109/seaa.2018.00048 ◽ 2018

Author(s): Seyedrebvar Hosseini ◽ Burak Turhan

Keyword(s): Exploratory Study ◽ Training Data ◽ Data Selection ◽ Defect Prediction ◽ Selection For ◽ Training Data Selection ◽ Cross Project
