Impact of feature selection on classification via clustering techniques in software defect prediction

Author(s):  
F.E. Usman-Hamza ◽  
A.F. Atte ◽  
A.O. Balogun ◽  
H.A. Mojeed ◽  
A.O. Bajeh ◽  
...  

Software testing using software defect prediction aims to detect as many defects as possible before a software release and plays an important role in ensuring quality and reliability. Software defect prediction can be modeled as a classification problem that assigns software modules to one of two classes, defective or non-defective, using classification algorithms. This study investigated the impact of feature selection methods on classification via clustering techniques for software defect prediction. Three clustering techniques (Farthest First Clusterer, K-Means, and Make-Density Clusterer) and three feature selection methods (Chi-Square, Clustering Variation, and Information Gain) were applied to software defect datasets from the NASA repository. The best software defect prediction model was Farthest First with the Information Gain feature selection method, achieving an accuracy of 78.69%, a precision of 0.804, and a recall of 0.788. The experimental results showed that clustering techniques used as classifiers gave good predictive performance and that feature selection methods further enhanced it. This indicates that classification via clustering can give results competitive with standard classification methods, with the advantage of not requiring a model to be trained on a labeled dataset, since it can be applied to unlabeled data.

Keywords: Classification, Clustering, Feature Selection, Software Defect Prediction

Vol. 26, No. 1, June 2019
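As a rough illustration of the classification-via-clustering setup described above, the sketch below selects features by mutual information (a stand-in for information gain), clusters the modules with K-Means (scikit-learn provides no Farthest First clusterer), and maps each cluster to its majority class. The file name kc1.csv and the defective column are hypothetical placeholders for a NASA-style dataset, not artifacts of the study.

```python
# Sketch only: K-Means stands in for Farthest First, mutual information for
# information gain; "kc1.csv" and the "defective" column are hypothetical.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("kc1.csv")                         # hypothetical NASA-style dataset
y = df["defective"].astype(int).values              # 1 = defective, 0 = non-defective
X = MinMaxScaler().fit_transform(df.drop(columns=["defective"]).values)

# Keep the 10 metrics with the highest mutual information with the label.
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Two clusters, one intended per class.
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X_sel)

# Map each cluster to the majority class of its members, the usual way
# "classification via clustering" is evaluated against known labels.
mapping = {c: np.bincount(y[clusters == c]).argmax() for c in np.unique(clusters)}
y_pred = np.array([mapping[c] for c in clusters])

print("accuracy ", accuracy_score(y, y_pred))
print("precision", precision_score(y, y_pred))
print("recall   ", recall_score(y, y_pred))
```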

2016 ◽  
Vol 46 (9) ◽  
pp. 1298-1320 ◽  
Author(s):  
Qing GU ◽  
Shulong LIU ◽  
Wangshu LIU ◽  
Daoxu CHEN ◽  
Xiang CHEN

Symmetry ◽  
2020 ◽  
Vol 12 (7) ◽  
pp. 1147 ◽  
Author(s):  
Abdullateef O. Balogun ◽  
Shuib Basri ◽  
Saipunidzam Mahamad ◽  
Said J. Abdulkadir ◽  
Malek A. Almomani ◽  
...  

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies of the impact and effectiveness of FS methods on SDP models often lead to contradictory experimental results and inconsistent findings. These contradictions can be attributed to study limitations such as small datasets, limited FS search methods, and unsuitable prediction models in the respective scope of the studies. It is hence critical to conduct an extensive empirical study that addresses these contradictions, guides researchers, and strengthens the validity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The ensuing prediction models were evaluated on accuracy and AUC. The Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used to statistically rank the studied FS methods. The experimental results showed that there is no single best FS method, as performance depends on the choice of classifier, performance evaluation metric, and dataset. However, we recommend the use of statistics-based, probability-based, and classifier-based filter feature ranking (FFR) methods in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended, as it outperforms the conventional SFS- and LHS-based WFS methods.
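A minimal sketch of the kind of experimental unit such a study repeats many times: one filter feature ranking method (chi-square here) paired with Naïve Bayes and Decision Tree learners, scored by accuracy and AUC under cross-validation. The synthetic, imbalanced data and the log2(n) feature budget are assumptions for illustration, not the paper's protocol.

```python
# Sketch: chi-square filter ranking + NB/DT, scored by accuracy and AUC.
# Synthetic, imbalanced data stands in for a defect dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

k = int(np.log2(X.shape[1]))                     # log2(n) top-ranked features (assumed budget)
for name, clf in [("NB", GaussianNB()), ("DT", DecisionTreeClassifier(random_state=0))]:
    pipe = make_pipeline(MinMaxScaler(),         # chi2 requires non-negative inputs
                         SelectKBest(chi2, k=k), clf)
    scores = cross_validate(pipe, X, y, cv=10, scoring=("accuracy", "roc_auc"))
    print(name, "acc %.3f  auc %.3f" % (scores["test_accuracy"].mean(),
                                        scores["test_roc_auc"].mean()))
```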


2019 ◽  
Vol 8 (2S3) ◽  
pp. 1345-1353 ◽  

Software defect prediction models are essential for understanding the quality attributes a software organization needs in order to deliver more reliable software. This paper focuses on the selection of attributes from the perspective of software quality estimation for an incremental database. A new dimensionality reduction method, Wilk's Lambda Average Threshold (WLAT), is presented for selecting the optimal features used to classify modules as fault prone or not. The paper uses software metrics and defect data collected from benchmark datasets. The comparative results confirm that the statistical search algorithm (WLAT) outperforms the other relevant feature selection methods for most classifiers. The main advantage of the proposed WLAT method is that the selected features can be reused when the database size increases or decreases, without the need to extract features afresh. In addition, the performance of the defect prediction models either remains unchanged or improves even after eliminating 85% of the software metrics.
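The abstract does not spell out WLAT, so the following is a hedged sketch of one plausible reading: compute a per-feature Wilks' Lambda (within-class sum of squares over total sum of squares, smaller meaning better class separation) and keep the features whose Lambda falls below the average Lambda across all features. The synthetic data and helper names are illustrative only, not the authors' exact algorithm.

```python
# Hedged sketch of a Wilks'-Lambda-style filter in the spirit of WLAT:
# per feature, Lambda = within-class SS / total SS, and features whose
# Lambda falls below the average Lambda are retained. One plausible
# reading of the abstract, not the published WLAT procedure.
import numpy as np

def wilks_lambda_per_feature(X, y):
    """Lambda in (0, 1]; smaller values indicate stronger class discrimination."""
    ss_total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    ss_within = np.zeros(X.shape[1])
    for cls in np.unique(y):
        Xc = X[y == cls]
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return ss_within / np.maximum(ss_total, 1e-12)

def wlat_select(X, y):
    lam = wilks_lambda_per_feature(X, y)
    return np.where(lam < lam.mean())[0]          # average-threshold rule

# Example with synthetic metric data (two classes, 20 software metrics).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=300) > 0).astype(int)
X[:, :3] += y[:, None]                            # make the first 3 metrics informative
print("selected feature indices:", wlat_select(X, y))
```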


Author(s):  
Waheeda Almayyan

The purpose of software defect prediction is to improve the quality of a software project by building a predictive model that decides whether a software module is fault prone or not. In recent years, much research has been performed on applying machine learning techniques to this task. Our aim was to evaluate the performance of clustering techniques combined with feature selection schemes to address the software defect prediction problem. We analysed the National Aeronautics and Space Administration (NASA) benchmark datasets using three clustering algorithms: (1) Farthest First, (2) X-Means, and (3) self-organizing map (SOM). To evaluate different feature selection algorithms, this article presents a comparative analysis of software defect prediction based on the Bat, Cuckoo, Grey Wolf Optimizer (GWO), and Particle Swarm Optimization (PSO) algorithms. The results obtained with the proposed clustering models enabled us to build an efficient predictive model with a satisfactory detection rate and an acceptable number of features.
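To make the search side of such a comparison concrete, here is a rough sketch of metaheuristic feature selection on defect-style data: a binary PSO explores feature subsets, and each subset is scored by the class purity of a two-cluster K-Means partition (a stand-in for Farthest First, X-Means, or SOM, which scikit-learn does not provide). The swarm size, iteration count, and synthetic dataset are illustrative assumptions.

```python
# Sketch: binary PSO wrapper FS with a clustering-purity fitness (K-Means as
# a stand-in evaluator). Swarm size, iterations, and data are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def purity(X, y, mask):
    """Fraction of modules whose cluster's majority class matches their own label."""
    if mask.sum() == 0:
        return 0.0
    labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X[:, mask])
    return sum(np.bincount(y[labels == c]).max() for c in np.unique(labels)) / len(y)

X, y = make_classification(n_samples=400, n_features=30, n_informative=6, random_state=0)

rng = np.random.default_rng(0)
n_particles, n_feat, w, c1, c2 = 15, X.shape[1], 0.7, 1.5, 1.5
pos = rng.random((n_particles, n_feat)) > 0.5                # binary feature masks
vel = rng.normal(scale=0.1, size=(n_particles, n_feat))
pbest, pbest_fit = pos.copy(), np.array([purity(X, y, p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(20):
    r1, r2 = rng.random((2, n_particles, n_feat))
    vel = (w * vel + c1 * r1 * (pbest.astype(float) - pos)
                   + c2 * r2 * (gbest.astype(float) - pos))
    pos = rng.random((n_particles, n_feat)) < 1 / (1 + np.exp(-vel))  # sigmoid transfer
    fit = np.array([purity(X, y, p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("best purity %.3f with %d of %d features" % (pbest_fit.max(), gbest.sum(), n_feat))
```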


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1274
Author(s):  
Abdullateef O. Balogun ◽  
Shuib Basri ◽  
Luiz Fernando Capretz ◽  
Saipunidzam Mahamad ◽  
Abdullahi A. Imam ◽  
...  

Feature selection is known to be an applicable solution to the problem of high dimensionality in software defect prediction (SDP). However, choosing an appropriate filter feature selection (FFS) method that will generate and guarantee optimal features in SDP is an open research issue, known as the filter rank selection problem. Combining multiple filter methods can alleviate this problem. In this study, a novel adaptive rank aggregation-based ensemble multi-filter feature selection (AREMFFS) method is proposed to resolve the high dimensionality and filter rank selection problems in SDP. Specifically, the proposed AREMFFS method assesses and combines the strengths of individual FFS methods by aggregating multiple rank lists in the generation and subsequent selection of the top-ranked features to be used in the SDP process. The efficacy of the proposed AREMFFS method is evaluated with decision tree (DT) and naïve Bayes (NB) models on defect datasets from different repositories with diverse defect granularities. The experimental results indicated the superiority of AREMFFS over the baseline FFS methods evaluated, existing rank aggregation-based multi-filter FS methods, and the variants of AREMFFS developed in this study: the proposed method not only improved the prediction performance of SDP models but also outperformed these baselines. This study therefore recommends combining multiple FFS methods to exploit the strengths of the respective FFS methods and take advantage of filter–filter relationships when selecting optimal features for SDP processes.
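As a generic illustration of multi-filter rank aggregation (not the exact AREMFFS procedure), the sketch below has three filters (chi-square, ANOVA F, and mutual information) each rank all features, aggregates the rank lists by mean rank, and feeds the top-ranked features to NB and DT models evaluated by AUC. The simulated data and the log2(n) feature budget are assumptions.

```python
# Generic rank-aggregation multi-filter FS sketch (not the published AREMFFS):
# three filter scorers -> per-filter rank lists -> mean-rank aggregation ->
# top-k features -> NB/DT evaluated by cross-validated AUC.
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           weights=[0.85, 0.15], random_state=42)
Xs = MinMaxScaler().fit_transform(X)              # chi2 requires non-negative values

# Three filter scorers; higher score = more relevant, so rank descending.
scores = [chi2(Xs, y)[0], f_classif(Xs, y)[0],
          mutual_info_classif(Xs, y, random_state=42)]
ranks = np.array([rankdata(-s) for s in scores])  # rank 1 = best within each filter
agg = ranks.mean(axis=0)                          # aggregate by mean rank

k = int(np.log2(X.shape[1]))                      # assumed log2(n) feature budget
top = np.argsort(agg)[:k]                         # final top-ranked features
for name, clf in [("NB", GaussianNB()), ("DT", DecisionTreeClassifier(random_state=42))]:
    auc = cross_val_score(clf, Xs[:, top], y, cv=10, scoring="roc_auc").mean()
    print(name, "AUC %.3f" % auc)
```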

