Performance Analysis of Feature Selection Methods in Software Defect Prediction: A Search Method Approach

Software Defect Prediction (SDP) models are built using software metrics derived from software systems. The quality of SDP models depends largely on the quality of software metrics (dataset) used to build the SDP models. High dimensionality is one of the data quality problems that affect the performance of SDP models. Feature selection (FS) is a proven method for addressing the dimensionality problem. However, the choice of FS method for SDP is still a problem, as most of the empirical studies on FS methods for SDP produce contradictory and inconsistent quality outcomes. Those FS methods behave differently due to different underlining computational characteristics. This could be due to the choices of search methods used in FS because the impact of FS depends on the choice of search method. It is hence imperative to comparatively analyze the FS methods performance based on different search methods in SDP. In this paper, four filter feature ranking (FFR) and fourteen filter feature subset selection (FSS) methods were evaluated using four different classifiers over five software defect datasets obtained from the National Aeronautics and Space Administration (NASA) repository. The experimental analysis showed that the application of FS improves the predictive performance of classifiers and the performance of FS methods can vary across datasets and classifiers. In the FFR methods, Information Gain demonstrated the greatest improvements in the performance of the prediction models. In FSS methods, Consistency Feature Subset Selection based on Best First Search had the best influence on the prediction models. However, prediction models based on FFR proved to be more stable than those based on FSS methods. Hence, we conclude that FS methods improve the performance of SDP models, and that there is no single best FS method, as their performance varied according to datasets and the choice of the prediction model. However, we recommend the use of FFR methods as the prediction models based on FFR are more stable in terms of predictive performance.

Download Full-text

Boosted Relief Feature Subset Selection and Heterogeneous Cross Project Defect Prediction using Firefly Particle Swarm Optimization

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.e6333.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 2605-2613

Keyword(s):

Particle Swarm Optimization ◽

Particle Swarm ◽

Subset Selection ◽

Feature Subset Selection ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Swarm Optimization ◽

Software Defect ◽

Cross Project

The exponential growth in the field of information technology, need for quality-based software development is highly demanded. The important factor to be focused during the software development is software defect detection in earlier stages. Failure to detect hidden faults will affect the effectiveness and quality of the software usage and its maintenance. In traditional software defect prediction models, projects with same metrics are involved in prediction process. In recent years, active topic is dealing with Cross Project Defect Prediction (CPDP) to predict defects on software project from other software projects dataset. Still, traditional cross project defect prediction approaches also require common metrics among the dataset of two projects for constructing the defect prediction techniques. Suppose if cross project dataset with different metrics has to be used for defect prediction then these methods become infeasible. To overcome the issues in software defect prediction using Heterogeneous cross projects dataset, this paper introduced a Boosted Relief Feature Subset Selection (BRFSS) to handle the two different projects with Heterogeneous feature sets. BRFSS employs the mapping approach to embed the data from two different domains into a comparable feature space with a lower dimension. Based on the similarity measure the difference among the mapped domains of dataset are used for prediction process. This work used five different software groups with six different datasets to perform heterogeneous cross project defect prediction using firefly particle swarm optimization. To produce optimal defect prediction in the Heterogeneous environment, the knowledge of particle swarm optimization by inducing firefly algorithm. The simulation result is compared with other standard models, the outcome of the result proved the efficiency of the prediction process while using firefly enabled particle swarm optimization.

Download Full-text

EMPIRICAL ASSESSMENT OF MACHINE LEARNING BASED SOFTWARE DEFECT PREDICTION TECHNIQUES

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213008003947 ◽

2008 ◽

Vol 17 (02) ◽

pp. 389-400 ◽

Cited By ~ 41

Author(s):

VENKATA UDAYA B. CHALLAGULLA ◽

FAROKH B. BASTANI ◽

I-LING YEN ◽

RAYMOND A. PAUL

Keyword(s):

Machine Learning ◽

Feature Subset Selection ◽

Defect Prediction ◽

Data Sets ◽

Feature Subset ◽

Software Defect Prediction ◽

Software Defect ◽

Intelligent Software ◽

Instance Based Learning ◽

Prediction Techniques

Automated reliability assessment is essential for systems that entail dynamic adaptation based on runtime mission-specific requirements. One approach along this direction is to monitor and assess the system using machine learning-based software defect prediction techniques. Due to the dynamic nature of software data collected, Instance-based learning algorithms are proposed for the above purposes. To evaluate the accuracy of these methods, the paper presents an empirical analysis of four different real-time software defect data sets using different predictor models. The results show that a combination of 1R and Instance-based learning along with Consistency-based subset evaluation technique provides a relatively better consistency in achieving accurate predictions as compared with other models. No direct relationship is observed between the skewness present in the data sets and the prediction accuracy of these models. Principal Component Analysis (PCA) does not show a consistent advantage in improving the accuracy of the predictions. While random reduction of attributes gave poor accuracy results, simple Feature Subset Selection methods performed better than PCA for most prediction models. Based on these results, the paper presents a high-level design of an Intelligent Software Defect Analysis tool (ISDAT) for dynamic monitoring and defect assessment of software modules.

Download Full-text

Incremental Feature Selection Method for Software Defect Prediction

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1252.0782s319 ◽

2019 ◽

Vol 8 (2S3) ◽

pp. 1345-1353 ◽

Cited By ~ 1

Keyword(s):

Feature Selection ◽

Software Metrics ◽

Prediction Models ◽

Search Algorithm ◽

Feature Selection Method ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Defect Prediction Models ◽

Selection Of

Software defect prediction models are essential for understanding quality attributes relevant for software organization to deliver better software reliability. This paper focuses mainly based on the selection of attributes in the perspective of software quality estimation for incremental database. A new dimensionality reduction method Wilk’s Lambda Average Threshold (WLAT) is presented for selection of optimal features which are used for classifying modules as fault prone or not. This paper uses software metrics and defect data collected from benchmark data sets. The comparative results confirm that the statistical search algorithm (WLAT) outperforms the other relevant feature selection methods for most classifiers. The main advantage of the proposed WLAT method is: The selected features can be reused when there is increase or decrease in database size, without the need of extracting features afresh. In addition, performances of the defect prediction models either remains unchanged or improved even after eliminating 85% of the software metrics.

Download Full-text

Lessons Learned from the Assessment of Software Defect Prediction on WLCG Software: A Study with Unlabelled Datasets and Machine Learning Techniques

EPJ Web of Conferences ◽

10.1051/epjconf/202024505041 ◽

2020 ◽

Vol 245 ◽

pp. 05041

Author(s):

Elisabetta Ronchieri ◽

Marco Canaparo ◽

Mauro Belgiovine ◽

Davide Salomoni ◽

Barbara Martelli

Keyword(s):

Machine Learning ◽

Software Metrics ◽

Prediction Models ◽

Lessons Learned ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques ◽

Software Modules

Software defect prediction is an activity that aims at narrowing down the most likely defect-prone software modules and helping developers and testers to prioritize inspection and testing. This activity can be addressed by using Machine Learning techniques applied to software metrics datasets that are usually unlabelled, i.e. they lack modules classification in terms of defectiveness. To overcome this limitation, in addition to the usual data pre-processing operations to manage mission values and/or to remove inconsistencies, researches have to adopt an approach to label their unlabelled software datasets. The extraction of defectiveness data to label all the instances of the datasets is an extremely time and effort consuming operation. In literature, many studies have introduced approaches to build a defect prediction models on unlabelled datasets. In this paper, we describe the analysis of new unlabelled datasets from WLCG software, coming from HEP-related experiments and middleware, by using Machine Learning techniques. We have experimented new approaches to label the various modules due to the heterogeneity of software metrics distribution. We discuss a number of lessons learned from conducting these activities, what has worked, what has not and how our research can be improved.

Download Full-text

A Novel Rank Aggregation-Based Hybrid Multifilter Wrapper Feature Selection Method in Software Defect Prediction

Computational Intelligence and Neuroscience ◽

10.1155/2021/5069016 ◽

2021 ◽

Vol 2021 ◽

pp. 1-19

Author(s):

Abdullateef O. Balogun ◽

Shuib Basri ◽

Saipunidzam Mahamad ◽

Luiz Fernando Capretz ◽

Abdullahi Abubakar Imam ◽

...

Keyword(s):

Feature Selection ◽

Rank Aggregation ◽

Feature Subset Selection ◽

Selection Problem ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Local Optima ◽

Software Defect ◽

Wrapper Feature Selection

The high dimensionality of software metric features has long been noted as a data quality problem that affects the performance of software defect prediction (SDP) models. This drawback makes it necessary to apply feature selection (FS) algorithm(s) in SDP processes. FS approaches can be categorized into three types, namely, filter FS (FFS), wrapper FS (WFS), and hybrid FS (HFS). HFS has been established as superior because it combines the strength of both FFS and WFS methods. However, selecting the most appropriate FFS (filter rank selection problem) for HFS is a challenge because the performance of FFS methods depends on the choice of datasets and classifiers. In addition, the local optima stagnation and high computational costs of WFS due to large search spaces are inherited by the HFS method. Therefore, as a solution, this study proposes a novel rank aggregation-based hybrid multifilter wrapper feature selection (RAHMFWFS) method for the selection of relevant and irredundant features from software defect datasets. The proposed RAHMFWFS is divided into two stepwise stages. The first stage involves a rank aggregation-based multifilter feature selection (RMFFS) method that addresses the filter rank selection problem by aggregating individual rank lists from multiple filter methods, using a novel rank aggregation method to generate a single, robust, and non-disjoint rank list. In the second stage, the aggregated ranked features are further preprocessed by an enhanced wrapper feature selection (EWFS) method based on a dynamic reranking strategy that is used to guide the feature subset selection process of the HFS method. This, in turn, reduces the number of evaluation cycles while amplifying or maintaining its prediction performance. The feasibility of the proposed RAHMFWFS was demonstrated on benchmarked software defect datasets with Naïve Bayes and Decision Tree classifiers, based on accuracy, the area under the curve (AUC), and F-measure values. The experimental results showed the effectiveness of RAHMFWFS in addressing filter rank selection and local optima stagnation problems in HFS, as well as the ability to select optimal features from SDP datasets while maintaining or enhancing the performance of SDP models. To conclude, the proposed RAHMFWFS achieved good performance by improving the prediction performances of SDP models across the selected datasets, compared to existing state-of-the-arts HFS methods.

Download Full-text

A Novel Feature Subset Selection Algorithm for Software Defect Prediction

International Journal of Computer Applications ◽

10.5120/17618-8315 ◽

2014 ◽

Vol 100 (17) ◽

pp. 39-43

Author(s):

Reena P ◽

Binu Rajan

Keyword(s):

Subset Selection ◽

Feature Subset Selection ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Selection Algorithm ◽

Software Defect

Download Full-text

An Evidential approach on Feature Subset Selection in Software Defect Prediction

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v5i12.4149 ◽

2017 ◽

Vol 5 (12) ◽

pp. 41-49

Author(s):

M. Jaikumar ◽

◽

V. Kathiresan

Keyword(s):

Subset Selection ◽

Feature Subset Selection ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Software Defect

Download Full-text

Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study

10.1109/icscc51209.2021.9528170 ◽

2021 ◽

Author(s):

Sushant Kumar Pandey ◽

Anil Kumar Tripathi

Keyword(s):

Machine Learning ◽

Empirical Study ◽

Prediction Models ◽

Class Imbalance ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques ◽

Defect Prediction Models

Download Full-text

Impact of the Distribution Parameter of Data Sampling Approaches on Software Defect Prediction Models

2017 24th Asia-Pacific Software Engineering Conference (APSEC) ◽

10.1109/apsec.2017.76 ◽

2017 ◽

Author(s):

Kwabena Ebo Bennin ◽

Jacky Keung ◽

Akito Monden

Keyword(s):

Prediction Models ◽

Distribution Parameter ◽

Defect Prediction ◽

Software Defect Prediction ◽

Data Sampling ◽

Software Defect ◽

Defect Prediction Models

Download Full-text

A New Framework Consisted of Data Preprocessing and Classifier Modelling for Software Defect Prediction

Mathematical Problems in Engineering ◽

10.1155/2018/9616938 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 1

Author(s):

Haijin Ji ◽

Song Huang

Keyword(s):

Software Metrics ◽

Data Preprocessing ◽

Defect Prediction ◽

Data Repository ◽

Software Defect Prediction ◽

Software Projects ◽

Software Defect ◽

Nonnormal Distribution ◽

Non Gaussian ◽

New Framework

Different data preprocessing methods and classifiers have been established and evaluated earlier for the software defect prediction (SDP) across projects. These novel approaches have provided relatively acceptable prediction results for different software projects. However, to the best of our knowledge, few researchers have combined data preprocessing and building robust classifier simultaneously to improve prediction performances in SDP. Therefore, this paper presents a new whole framework for predicting fault-prone software modules. The proposed framework consists of instance filtering, feature selection, instance reduction, and establishing a new classifier. Additionally, we find that the 21 main software metrics commonly do follow nonnormal distribution after performing a Kolmogorov-Smirnov test. Therefore, the newly proposed classifier is built on the maximum correntropy criterion (MCC). The MCC is well-known for its effectiveness in handling non-Gaussian noise. To evaluate the new framework, the experimental study is designed with due care using nine open-source software projects with their 32 releases, obtained from the PROMISE data repository. The prediction accuracy is evaluated using F-measure. The state-of-the-art methods for Cross-Project Defect Prediction are also included for comparison. All of the evidences derived from the experimentation verify the effectiveness and robustness of our new framework.

Download Full-text