Feature Subset Selection and Instance Filtering for Cross-project Defect Prediction - Classification and Ranking

CLEI electronic journal ◽

10.19153/cleiej.19.3.4 ◽

2016 ◽

Cited By ~ 1

Author(s):

Faimison Porto ◽

Adenilso Da Silva Simao

Keyword(s):

Feature Selection ◽

Prediction Models ◽

Empirical Evaluation ◽

Feature Subset Selection ◽

Training Dataset ◽

Defect Prediction ◽

Feature Subset ◽

Good Tool ◽

Defect Prediction Models ◽

Cross Project

Defect prediction models can be a good tool for organizing a project's test resources. The models can be constructed with two main goals: 1) to classify software parts as defective or not; or 2) to rank the most defective parts in decreasing order. However, not all companies maintain an appropriate set of historical defect data. In this case, a company can build an appropriate dataset from known external projects, an approach called Cross-project Defect Prediction (CPDP). CPDP models, however, show low prediction performance due to the heterogeneity of the data. Recently, Instance Filtering methods were proposed to reduce this heterogeneity by selecting the most similar instances from the training dataset. Originally, the similarity is calculated based on all the available dataset features (or independent variables). We propose that using only the most relevant features in the similarity calculation can result in more accurate filtered datasets and better prediction performance. In this study we extend our previous work. We analyse both prediction goals: Classification and Ranking. We present an empirical evaluation of 41 different methods that associate Instance Filtering methods with Feature Selection methods. We used 36 versions of 11 open source projects in the experiments. The results show similar evidence for both prediction goals. First, the defect prediction performance of CPDP models can be improved by associating Feature Selection and Instance Filtering. Second, no evaluated method performed best in general. Indeed, the most appropriate method can vary according to the characteristics of the project being predicted.
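The idea of computing similarity on a reduced feature set can be illustrated with a minimal sketch. It assumes a simple k-nearest-neighbour filter with Euclidean distance; the function name `knn_filter`, the toy metrics, and the choice of k are hypothetical and are not taken from the paper, which evaluates 41 method combinations.

```python
import math

def knn_filter(train, target, feature_idx, k=2):
    """Keep the k training rows nearest to each target row.

    Distances use only the columns listed in feature_idx,
    i.e. the features chosen by some feature selection step.
    """
    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in feature_idx))

    keep = set()
    for t in target:
        order = sorted(range(len(train)), key=lambda j: dist(train[j], t))
        keep.update(order[:k])
    return [train[j] for j in sorted(keep)]

# Hypothetical toy data: columns are [loc, complexity, noisy_metric].
train = [
    [10.0, 1.0, 900.0],
    [12.0, 2.0, 5.0],
    [200.0, 30.0, 7.0],
    [210.0, 28.0, 880.0],
]
target = [[11.0, 1.5, 6.0]]

# Similarity on all features: the noisy column dominates the distance.
filtered_all = knn_filter(train, target, feature_idx=[0, 1, 2], k=2)
# Similarity on the two (assumed) relevant features only.
filtered_sel = knn_filter(train, target, feature_idx=[0, 1], k=2)
```

In this toy case the full-feature filter keeps a large, complex module because its noisy column happens to match, while the restricted filter keeps the two small modules that resemble the target on the relevant metrics.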

On the time-based conclusion stability of cross-project defect prediction models

Empirical Software Engineering ◽

10.1007/s10664-020-09878-9 ◽

2020 ◽

Vol 25 (6) ◽

pp. 5047-5083

Author(s):

Abdul Ali Bangash ◽

Hareem Sahar ◽

Abram Hindle ◽

Karim Ali

Keyword(s):

Prediction Models ◽

Defect Prediction ◽

Defect Prediction Models ◽

Cross Project

Empirical Evaluation of Mixed-Project Defect Prediction Models

2011 37th EUROMICRO Conference on Software Engineering and Advanced Applications ◽

10.1109/seaa.2011.59 ◽

2011 ◽

Cited By ~ 15

Author(s):

Burak Turhan ◽

Ayse Tosun ◽

Ayse Bener

Keyword(s):

Prediction Models ◽

Empirical Evaluation ◽

Defect Prediction ◽

Defect Prediction Models

Performance Analysis of Feature Selection Methods in Software Defect Prediction: A Search Method Approach

Applied Sciences ◽

10.3390/app9132764 ◽

2019 ◽

Vol 9 (13) ◽

pp. 2764 ◽

Cited By ~ 8

Author(s):

Abdullateef Oluwagbemiga Balogun ◽

Shuib Basri ◽

Said Jadid Abdulkadir ◽

Ahmad Sobri Hashim

Keyword(s):

Software Metrics ◽

Prediction Models ◽

Predictive Performance ◽

Search Method ◽

Feature Subset Selection ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Software Defect

Software Defect Prediction (SDP) models are built using software metrics derived from software systems. The quality of SDP models depends largely on the quality of the software metrics (dataset) used to build them. High dimensionality is one of the data quality problems that affect the performance of SDP models. Feature selection (FS) is a proven method for addressing the dimensionality problem. However, the choice of FS method for SDP is still a problem, as most of the empirical studies on FS methods for SDP produce contradictory and inconsistent quality outcomes. These FS methods behave differently due to different underlying computational characteristics. This could be due to the choice of search method used in FS, because the impact of FS depends on the choice of search method. It is hence imperative to comparatively analyze the performance of FS methods based on different search methods in SDP. In this paper, four filter feature ranking (FFR) and fourteen filter feature subset selection (FSS) methods were evaluated using four different classifiers over five software defect datasets obtained from the National Aeronautics and Space Administration (NASA) repository. The experimental analysis showed that the application of FS improves the predictive performance of classifiers and that the performance of FS methods can vary across datasets and classifiers. Among the FFR methods, Information Gain demonstrated the greatest improvements in the performance of the prediction models. Among the FSS methods, Consistency Feature Subset Selection based on Best First Search had the best influence on the prediction models. However, prediction models based on FFR proved to be more stable than those based on FSS methods. Hence, we conclude that FS methods improve the performance of SDP models, and that there is no single best FS method, as their performance varied according to datasets and the choice of the prediction model. However, we recommend the use of FFR methods, as the prediction models based on FFR are more stable in terms of predictive performance.
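As an illustration of filter feature ranking with Information Gain, the following sketch scores each feature by the reduction in class entropy it yields and sorts the features by that score. It assumes already-discretized metrics; the feature names and the toy data are hypothetical and do not come from the NASA datasets used in the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Entropy of the labels minus their entropy after splitting on column."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = [
        (name, information_gain([r[i] for r in rows], labels))
        for i, name in enumerate(names)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)

# Hypothetical binary metrics for six modules; label 1 means defective.
names = ["high_complexity", "many_authors", "has_tests"]
rows = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 1), (0, 0, 0)]
labels = [1, 1, 1, 0, 0, 0]

ranking = rank_features(rows, labels, names)
```

A ranking method of this kind only orders the features; how many of the top-ranked features to keep is a separate choice that the sketch leaves open.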

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194014400105 ◽

2014 ◽

Vol 24 (09) ◽

pp. 1229-1253 ◽

Cited By ~ 3

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar ◽

Amri Napolitano

Keyword(s):

Feature Selection ◽

Prediction Models ◽

Measurement Data ◽

Class Imbalance ◽

Data Preprocessing ◽

High Dimensionality ◽

Training Dataset ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially problematic program modules. The effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, there are two problems affecting software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with relatively many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods (Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE)) in the technique, we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique which do not incorporate feature selection, and compare all four techniques (the two different ensemble-based approaches which utilize feature selection and the two versions which use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection yield better predictions than the techniques without feature selection.
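The sampling step mentioned in the abstract can be sketched in isolation. The example below shows plain Random Undersampling only, assuming every class is reduced to the size of the smallest one; it is not selectRUSBoost, which additionally embeds sampling and feature selection inside a boosting loop. The function name and the toy data are hypothetical.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so that every class has as many rows as the rarest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)

    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels

# Hypothetical imbalanced data: 8 clean modules (0) and 2 defective ones (1).
rows = [[float(i), float(i % 3)] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_undersample(rows, labels)
```

Undersampling discards majority-class rows, which is one reason it is often paired with an ensemble: each boosting round can draw a different sample so that less information is lost overall.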

Empirical Evaluation of Cross-Release Effort-Aware Defect Prediction Models

2016 IEEE International Conference on Software Quality, Reliability and Security (QRS) ◽

10.1109/qrs.2016.33 ◽

2016 ◽

Cited By ~ 10

Author(s):

Kwabena Ebo Bennin ◽

Koji Toda ◽

Yasutaka Kamei ◽

Jacky Keung ◽

Akito Monden ◽

...

Keyword(s):

Prediction Models ◽

Empirical Evaluation ◽

Defect Prediction ◽

Defect Prediction Models

Boosted Relief Feature Subset Selection and Heterogeneous Cross Project Defect Prediction using Firefly Particle Swarm Optimization

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.e6333.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 2605-2613

Keyword(s):

Particle Swarm Optimization ◽

Particle Swarm ◽

Subset Selection ◽

Feature Subset Selection ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Swarm Optimization ◽

Software Defect ◽

Cross Project

With the exponential growth in the field of information technology, quality-based software development is in high demand. An important factor to focus on during software development is detecting software defects in the early stages. Failure to detect hidden faults will affect the effectiveness and quality of the software's usage and its maintenance. In traditional software defect prediction models, projects with the same metrics are involved in the prediction process. In recent years, an active topic has been Cross Project Defect Prediction (CPDP), which predicts defects in a software project from the datasets of other software projects. Still, traditional cross project defect prediction approaches also require common metrics among the datasets of the two projects to construct the defect prediction techniques. If a cross project dataset with different metrics has to be used for defect prediction, these methods become infeasible. To overcome the issues in software defect prediction using heterogeneous cross project datasets, this paper introduces Boosted Relief Feature Subset Selection (BRFSS) to handle two different projects with heterogeneous feature sets. BRFSS employs a mapping approach to embed the data from two different domains into a comparable feature space with a lower dimension. Based on a similarity measure, the difference among the mapped domains of the datasets is used for the prediction process. This work used five different software groups with six different datasets to perform heterogeneous cross project defect prediction using firefly particle swarm optimization. To produce optimal defect prediction in the heterogeneous environment, particle swarm optimization is combined with the firefly algorithm. The simulation result is compared with other standard models, and the outcome demonstrates the efficiency of the prediction process when using firefly-enabled particle swarm optimization.

Incremental Feature Selection Method for Software Defect Prediction

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1252.0782s319 ◽

2019 ◽

Vol 8 (2S3) ◽

pp. 1345-1353 ◽

Cited By ~ 1

Keyword(s):

Feature Selection ◽

Software Metrics ◽

Prediction Models ◽

Search Algorithm ◽

Feature Selection Method ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Defect Prediction Models ◽

Selection Of

Software defect prediction models are essential for understanding the quality attributes that help a software organization deliver better software reliability. This paper focuses mainly on the selection of attributes from the perspective of software quality estimation for incremental databases. A new dimensionality reduction method, Wilk’s Lambda Average Threshold (WLAT), is presented for selecting optimal features, which are used to classify modules as fault prone or not. This paper uses software metrics and defect data collected from benchmark data sets. The comparative results confirm that the statistical search algorithm (WLAT) outperforms the other relevant feature selection methods for most classifiers. The main advantage of the proposed WLAT method is that the selected features can be reused when the database size increases or decreases, without the need to extract features afresh. In addition, the performance of the defect prediction models either remains unchanged or improves even after eliminating 85% of the software metrics.
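The abstract does not spell out how WLAT is computed, so the following is only one plausible reading, offered as a sketch. It assumes a univariate Wilks' lambda per feature (within-class sum of squares divided by total sum of squares, where smaller values indicate better class separation) and assumes that features whose lambda falls below the average lambda are kept. The function names, the thresholding rule, and the toy data are all hypothetical.

```python
def wilks_lambda(column, labels):
    """Univariate Wilks' lambda: within-class SS divided by total SS."""
    grand_mean = sum(column) / len(column)
    ss_total = sum((x - grand_mean) ** 2 for x in column)
    ss_within = 0.0
    for cls in set(labels):
        group = [x for x, y in zip(column, labels) if y == cls]
        mean = sum(group) / len(group)
        ss_within += sum((x - mean) ** 2 for x in group)
    # A constant column cannot separate classes; treat it as lambda = 1.
    return ss_within / ss_total if ss_total else 1.0

def select_below_average(rows, labels, names):
    """Keep features whose lambda is below the mean lambda (assumed rule)."""
    lambdas = {
        name: wilks_lambda([r[i] for r in rows], labels)
        for i, name in enumerate(names)
    }
    threshold = sum(lambdas.values()) / len(lambdas)
    selected = [name for name in names if lambdas[name] < threshold]
    return selected, lambdas

# Hypothetical metrics for four modules; label 1 means fault prone.
names = ["loc", "comment_ratio"]
rows = [(10, 0.5), (12, 0.1), (100, 0.4), (110, 0.2)]
labels = [0, 0, 1, 1]

selected, lambdas = select_below_average(rows, labels, names)
```

Here `loc` separates the two classes almost perfectly and gets a lambda near zero, while `comment_ratio` has identical class means and a lambda of about one, so only `loc` survives the assumed threshold.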

Empirical assessment of feature selection techniques in defect prediction models using web applications

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-18473 ◽

2019 ◽

Vol 36 (6) ◽

pp. 6567-6578

Author(s):

Ruchika Malhotra ◽

Anjali Sharma

Keyword(s):

Feature Selection ◽

Web Applications ◽

Prediction Models ◽

Defect Prediction ◽

Empirical Assessment ◽

Defect Prediction Models ◽

Feature Selection Techniques

A Novel Rank Aggregation-Based Hybrid Multifilter Wrapper Feature Selection Method in Software Defect Prediction

Computational Intelligence and Neuroscience ◽

10.1155/2021/5069016 ◽

2021 ◽

Vol 2021 ◽

pp. 1-19

Author(s):

Abdullateef O. Balogun ◽

Shuib Basri ◽

Saipunidzam Mahamad ◽

Luiz Fernando Capretz ◽

Abdullahi Abubakar Imam ◽

...

Keyword(s):

Feature Selection ◽

Rank Aggregation ◽

Feature Subset Selection ◽

Selection Problem ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Local Optima ◽

Software Defect ◽

Wrapper Feature Selection

The high dimensionality of software metric features has long been noted as a data quality problem that affects the performance of software defect prediction (SDP) models. This drawback makes it necessary to apply feature selection (FS) algorithm(s) in SDP processes. FS approaches can be categorized into three types, namely, filter FS (FFS), wrapper FS (WFS), and hybrid FS (HFS). HFS has been established as superior because it combines the strengths of both FFS and WFS methods. However, selecting the most appropriate FFS (the filter rank selection problem) for HFS is a challenge because the performance of FFS methods depends on the choice of datasets and classifiers. In addition, the local optima stagnation and high computational costs of WFS due to large search spaces are inherited by the HFS method. Therefore, as a solution, this study proposes a novel rank aggregation-based hybrid multifilter wrapper feature selection (RAHMFWFS) method for the selection of relevant and non-redundant features from software defect datasets. The proposed RAHMFWFS is divided into two stepwise stages. The first stage involves a rank aggregation-based multifilter feature selection (RMFFS) method that addresses the filter rank selection problem by aggregating individual rank lists from multiple filter methods, using a novel rank aggregation method to generate a single, robust, and non-disjoint rank list. In the second stage, the aggregated ranked features are further preprocessed by an enhanced wrapper feature selection (EWFS) method based on a dynamic reranking strategy that is used to guide the feature subset selection process of the HFS method. This, in turn, reduces the number of evaluation cycles while amplifying or maintaining its prediction performance. The feasibility of the proposed RAHMFWFS was demonstrated on benchmarked software defect datasets with Naïve Bayes and Decision Tree classifiers, based on accuracy, the area under the curve (AUC), and F-measure values. The experimental results showed the effectiveness of RAHMFWFS in addressing filter rank selection and local optima stagnation problems in HFS, as well as the ability to select optimal features from SDP datasets while maintaining or enhancing the performance of SDP models. To conclude, the proposed RAHMFWFS achieved good performance by improving the prediction performance of SDP models across the selected datasets, compared to existing state-of-the-art HFS methods.
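The first stage, merging rank lists from several filter methods into one, can be illustrated with a simple stand-in. The sketch below uses a Borda-style sum of rank positions, which is a common baseline and not the novel aggregation method proposed in the paper. It assumes every list ranks the same set of features; the filter names and metric names are hypothetical.

```python
def aggregate_ranks(rank_lists):
    """Merge several best-first rankings by summing each feature's positions.

    Lower total position means the filters agree the feature is important.
    Ties are broken alphabetically so the result is deterministic.
    """
    features = rank_lists[0]
    totals = {f: 0 for f in features}
    for ranking in rank_lists:
        for position, f in enumerate(ranking):
            totals[f] += position
    return sorted(features, key=lambda f: (totals[f], f))

# Hypothetical rankings of four metrics produced by three filter methods.
info_gain_rank = ["cbo", "loc", "wmc", "dit"]
chi_square_rank = ["loc", "cbo", "dit", "wmc"]
relief_rank = ["cbo", "wmc", "loc", "dit"]

aggregated = aggregate_ranks([info_gain_rank, chi_square_rank, relief_rank])
```

A wrapper stage would then evaluate candidate subsets drawn from the front of the aggregated list rather than searching over all features.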

The Effect of the Dataset Size on the Accuracy of Software Defect Prediction Models: An Empirical Study

INTELIGENCIA ARTIFICIAL ◽

10.4114/intartif.vol24iss68pp72-88 ◽

2021 ◽

Vol 24 (68) ◽

pp. 72-88

Author(s):

Mohammad Alshayeb ◽

Mashaan A. Alshammari

Keyword(s):

Feature Selection ◽

Prediction Model ◽

Prediction Models ◽

Fault Prediction ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Dataset Size ◽

Defect Prediction Models ◽

Selection Algorithms

The ongoing development of computer systems requires massive software projects. Running the components of these huge projects for testing purposes might be a costly process; therefore, parameter estimation can be used instead. Software defect prediction models are crucial for software quality assurance. This study investigates the impact of dataset size and feature selection algorithms on software defect prediction models. We use two approaches to build software defect prediction models: a statistical approach and a machine learning approach with support vector machines (SVMs). The fault prediction model was built based on four datasets of different sizes. Additionally, four feature selection algorithms were used. We found that applying the SVM defect prediction model to datasets with a reduced number of measures as features may enhance the accuracy of the fault prediction model. Also, it directs the test effort to maintain the most influential set of metrics. We also found that the running time of the SVM fault prediction model is not consistent with dataset size. Therefore, having fewer metrics does not guarantee a shorter execution time. From the experiments, we found that dataset size has a direct influence on the SVM fault prediction model. However, reduced datasets performed the same as or slightly worse than the original datasets.
