Impact of feature selection on classification via clustering techniques in software defect prediction

Author(s):  
F.E. Usman-Hamza ◽  
A.F. Atte ◽  
A.O. Balogun ◽  
H.A. Mojeed ◽  
A.O. Bajeh ◽  
...  

Software testing using software defect prediction aims to detect as many defects as possible before a software release and plays an important role in ensuring quality and reliability. Software defect prediction can be modeled as a classification problem that assigns software modules to one of two classes, defective or non-defective, using classification algorithms. This study investigated the impact of feature selection methods on classification via clustering techniques for software defect prediction. Three clustering techniques (Farthest First Clusterer, K-Means, and Make-Density Clusterer) and three feature selection methods (Chi-Square, Clustering Variation, and Information Gain) were applied to software defect datasets from the NASA repository. The best software defect prediction model was Farthest First with the Information Gain feature selection method, achieving an accuracy of 78.69%, a precision of 0.804, and a recall of 0.788. The experimental results showed that clustering techniques used as classifiers gave good predictive performance and that feature selection methods further enhanced it. This indicates that classification via clustering can give results competitive with standard classification methods, with the advantage of not requiring a model to be trained on a labeled dataset, since it can be applied to unlabeled data.

Keywords: Classification, Clustering, Feature Selection, Software Defect Prediction

Vol. 26, No. 1, June 2019
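As a rough illustration of the classification-via-clustering setup described above, the sketch below selects features by mutual information (a stand-in for information gain), clusters the modules with K-Means (scikit-learn provides no Farthest First clusterer), and maps each cluster to its majority class. The file name kc1.csv and the defective column are hypothetical placeholders for a NASA-style dataset, not artifacts of the study.

```python
# Sketch only: K-Means stands in for Farthest First, mutual information for
# information gain; "kc1.csv" and the "defective" column are hypothetical.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("kc1.csv")                         # hypothetical NASA-style dataset
y = df["defective"].astype(int).values              # 1 = defective, 0 = non-defective
X = MinMaxScaler().fit_transform(df.drop(columns=["defective"]).values)

# Keep the 10 metrics with the highest mutual information with the label.
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Two clusters, one intended per class.
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X_sel)

# Map each cluster to the majority class of its members, the usual way
# "classification via clustering" is evaluated against known labels.
mapping = {c: np.bincount(y[clusters == c]).argmax() for c in np.unique(clusters)}
y_pred = np.array([mapping[c] for c in clusters])

print("accuracy ", accuracy_score(y, y_pred))
print("precision", precision_score(y, y_pred))
print("recall   ", recall_score(y, y_pred))
```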

2016 ◽  
Vol 46 (9) ◽  
pp. 1298-1320 ◽  
Author(s):  
Qing GU ◽  
Shulong LIU ◽  
Wangshu LIU ◽  
Daoxu CHEN ◽  
Xiang CHEN

Symmetry ◽  
2020 ◽  
Vol 12 (7) ◽  
pp. 1147 ◽  
Author(s):  
Abdullateef O. Balogun ◽  
Shuib Basri ◽  
Saipunidzam Mahamad ◽  
Said J. Abdulkadir ◽  
Malek A. Almomani ◽  
...  

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies of the impact and effectiveness of FS methods on SDP models often lead to contradictory experimental results and inconsistent findings. These contradictions can be attributed to study limitations such as small datasets, limited FS search methods, and unsuitable prediction models in the respective scope of the studies. It is hence critical to conduct an extensive empirical study that addresses these contradictions, guides researchers, and strengthens the validity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The ensuing prediction models were evaluated on accuracy and AUC. The Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used to statistically rank the studied FS methods. The experimental results showed that there is no single best FS method, as performance depends on the choice of classifier, performance evaluation metric, and dataset. However, we recommend the use of statistics-based, probability-based, and classifier-based filter feature ranking (FFR) methods in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended, as it outperforms the conventional SFS- and LHS-based WFS methods.
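A minimal sketch of the kind of experimental unit such a study repeats many times: one filter feature ranking method (chi-square here) paired with Naïve Bayes and Decision Tree learners, scored by accuracy and AUC under cross-validation. The synthetic, imbalanced data and the log2(n) feature budget are assumptions for illustration, not the paper's protocol.

```python
# Sketch: chi-square filter ranking + NB/DT, scored by accuracy and AUC.
# Synthetic, imbalanced data stands in for a defect dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

k = int(np.log2(X.shape[1]))                     # log2(n) top-ranked features (assumed budget)
for name, clf in [("NB", GaussianNB()), ("DT", DecisionTreeClassifier(random_state=0))]:
    pipe = make_pipeline(MinMaxScaler(),         # chi2 requires non-negative inputs
                         SelectKBest(chi2, k=k), clf)
    scores = cross_validate(pipe, X, y, cv=10, scoring=("accuracy", "roc_auc"))
    print(name, "acc %.3f  auc %.3f" % (scores["test_accuracy"].mean(),
                                        scores["test_roc_auc"].mean()))
```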


2019 ◽  
Vol 8 (2S3) ◽  
pp. 1345-1353 ◽  

Software defect prediction models are essential for understanding the quality attributes a software organization needs in order to deliver more reliable software. This paper focuses on the selection of attributes from the perspective of software quality estimation for an incremental database. A new dimensionality reduction method, Wilk's Lambda Average Threshold (WLAT), is presented for selecting the optimal features used to classify modules as fault prone or not. The paper uses software metrics and defect data collected from benchmark datasets. The comparative results confirm that the statistical search algorithm (WLAT) outperforms the other relevant feature selection methods for most classifiers. The main advantage of the proposed WLAT method is that the selected features can be reused when the database size increases or decreases, without the need to extract features afresh. In addition, the performance of the defect prediction models either remains unchanged or improves even after eliminating 85% of the software metrics.
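The abstract does not spell out WLAT, so the following is a hedged sketch of one plausible reading: compute a per-feature Wilks' Lambda (within-class sum of squares over total sum of squares, smaller meaning better class separation) and keep the features whose Lambda falls below the average Lambda across all features. The synthetic data and helper names are illustrative only, not the authors' exact algorithm.

```python
# Hedged sketch of a Wilks'-Lambda-style filter in the spirit of WLAT:
# per feature, Lambda = within-class SS / total SS, and features whose
# Lambda falls below the average Lambda are retained. One plausible
# reading of the abstract, not the published WLAT procedure.
import numpy as np

def wilks_lambda_per_feature(X, y):
    """Lambda in (0, 1]; smaller values indicate stronger class discrimination."""
    ss_total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    ss_within = np.zeros(X.shape[1])
    for cls in np.unique(y):
        Xc = X[y == cls]
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return ss_within / np.maximum(ss_total, 1e-12)

def wlat_select(X, y):
    lam = wilks_lambda_per_feature(X, y)
    return np.where(lam < lam.mean())[0]          # average-threshold rule

# Example with synthetic metric data (two classes, 20 software metrics).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=300) > 0).astype(int)
X[:, :3] += y[:, None]                            # make the first 3 metrics informative
print("selected feature indices:", wlat_select(X, y))
```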


Author(s):  
Waheeda Almayyan

The purpose of software defect prediction is to improve the quality of a software project by building a predictive model that decides whether a software module is fault prone or not. In recent years, much research has been performed on applying machine learning techniques to this task. Our aim was to evaluate the performance of clustering techniques combined with feature selection schemes to address the software defect prediction problem. We analysed the National Aeronautics and Space Administration (NASA) benchmark datasets using three clustering algorithms: (1) Farthest First, (2) X-Means, and (3) self-organizing map (SOM). To evaluate different feature selection algorithms, this article presents a comparative analysis of software defect prediction based on the Bat, Cuckoo, Grey Wolf Optimizer (GWO), and Particle Swarm Optimization (PSO) algorithms. The results obtained with the proposed clustering models enabled us to build an efficient predictive model with a satisfactory detection rate and an acceptable number of features.
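To make the search side of such a comparison concrete, here is a rough sketch of metaheuristic feature selection on defect-style data: a binary PSO explores feature subsets, and each subset is scored by the class purity of a two-cluster K-Means partition (a stand-in for Farthest First, X-Means, or SOM, which scikit-learn does not provide). The swarm size, iteration count, and synthetic dataset are illustrative assumptions.

```python
# Sketch: binary PSO wrapper FS with a clustering-purity fitness (K-Means as
# a stand-in evaluator). Swarm size, iterations, and data are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def purity(X, y, mask):
    """Fraction of modules whose cluster's majority class matches their own label."""
    if mask.sum() == 0:
        return 0.0
    labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X[:, mask])
    return sum(np.bincount(y[labels == c]).max() for c in np.unique(labels)) / len(y)

X, y = make_classification(n_samples=400, n_features=30, n_informative=6, random_state=0)

rng = np.random.default_rng(0)
n_particles, n_feat, w, c1, c2 = 15, X.shape[1], 0.7, 1.5, 1.5
pos = rng.random((n_particles, n_feat)) > 0.5                # binary feature masks
vel = rng.normal(scale=0.1, size=(n_particles, n_feat))
pbest, pbest_fit = pos.copy(), np.array([purity(X, y, p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(20):
    r1, r2 = rng.random((2, n_particles, n_feat))
    vel = (w * vel + c1 * r1 * (pbest.astype(float) - pos)
                   + c2 * r2 * (gbest.astype(float) - pos))
    pos = rng.random((n_particles, n_feat)) < 1 / (1 + np.exp(-vel))  # sigmoid transfer
    fit = np.array([purity(X, y, p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("best purity %.3f with %d of %d features" % (pbest_fit.max(), gbest.sum(), n_feat))
```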


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1274
Author(s):  
Abdullateef O. Balogun ◽  
Shuib Basri ◽  
Luiz Fernando Capretz ◽  
Saipunidzam Mahamad ◽  
Abdullahi A. Imam ◽  
...  

Feature selection is known to be an applicable solution to the problem of high dimensionality in software defect prediction (SDP). However, choosing an appropriate filter feature selection (FFS) method that will generate and guarantee optimal features in SDP is an open research issue, known as the filter rank selection problem. Combining multiple filter methods can alleviate this problem. In this study, a novel adaptive rank aggregation-based ensemble multi-filter feature selection (AREMFFS) method is proposed to resolve the high dimensionality and filter rank selection problems in SDP. Specifically, the proposed AREMFFS method assesses and combines the strengths of individual FFS methods by aggregating multiple rank lists in the generation and subsequent selection of the top-ranked features to be used in the SDP process. The efficacy of the proposed AREMFFS method is evaluated with decision tree (DT) and naïve Bayes (NB) models on defect datasets from different repositories with diverse defect granularities. The experimental results indicated the superiority of AREMFFS over the baseline FFS methods evaluated, existing rank aggregation-based multi-filter FS methods, and the variants of AREMFFS developed in this study: the proposed method not only improved the prediction performance of SDP models but also outperformed these baselines. This study therefore recommends combining multiple FFS methods to exploit the strengths of the respective FFS methods and take advantage of filter–filter relationships when selecting optimal features for SDP processes.
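As a generic illustration of multi-filter rank aggregation (not the exact AREMFFS procedure), the sketch below has three filters (chi-square, ANOVA F, and mutual information) each rank all features, aggregates the rank lists by mean rank, and feeds the top-ranked features to NB and DT models evaluated by AUC. The simulated data and the log2(n) feature budget are assumptions.

```python
# Generic rank-aggregation multi-filter FS sketch (not the published AREMFFS):
# three filter scorers -> per-filter rank lists -> mean-rank aggregation ->
# top-k features -> NB/DT evaluated by cross-validated AUC.
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           weights=[0.85, 0.15], random_state=42)
Xs = MinMaxScaler().fit_transform(X)              # chi2 requires non-negative values

# Three filter scorers; higher score = more relevant, so rank descending.
scores = [chi2(Xs, y)[0], f_classif(Xs, y)[0],
          mutual_info_classif(Xs, y, random_state=42)]
ranks = np.array([rankdata(-s) for s in scores])  # rank 1 = best within each filter
agg = ranks.mean(axis=0)                          # aggregate by mean rank

k = int(np.log2(X.shape[1]))                      # assumed log2(n) feature budget
top = np.argsort(agg)[:k]                         # final top-ranked features
for name, clf in [("NB", GaussianNB()), ("DT", DecisionTreeClassifier(random_state=42))]:
    auc = cross_val_score(clf, Xs[:, top], y, cv=10, scoring="roc_auc").mean()
    print(name, "AUC %.3f" % auc)
```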

