An Empirical Study on The Impact of The Interaction between Feature Selection and Sampling in Defect Prediction

Author(s):  
Shuyue Fan ◽  
Chun Liu ◽  
Zheng Li

Symmetry ◽  
2020 ◽  
Vol 12 (7) ◽  
pp. 1147 ◽  
Author(s):  
Abdullateef O. Balogun ◽  
Shuib Basri ◽  
Saipunidzam Mahamad ◽  
Said J. Abdulkadir ◽  
Malek A. Almomani ◽  
...  

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies on the impact and effectiveness of FS methods on SDP models often report contradictory experimental results and inconsistent findings. These contradictions can be attributed to limitations of the respective studies, such as small datasets, limited FS search methods, and unsuitable prediction models. It is hence critical to conduct an extensive empirical study that addresses these contradictions, to guide researchers and strengthen the scientific validity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The resulting prediction models were evaluated based on accuracy and AUC values. The Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used to rank the studied FS methods statistically. The experimental results showed that there is no single best FS method, as the performance of each depends on the choice of classifier, performance evaluation metric, and dataset. However, we recommend the use of statistical-based, probability-based, and classifier-based filter feature ranking (FFR) methods, respectively, in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended, as it outperforms the conventional SFS- and LHS-based WFS methods.
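A statistical-based filter feature ranking (FFR) method, one of the families recommended above, scores each feature independently of any classifier. A minimal sketch, using the absolute Pearson correlation between each software metric and the defect label as the statistical score; the dataset, metric names, and values here are illustrative, not taken from the study:

```python
def pearson_corr(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(X, y, names):
    """Rank feature names by |correlation with the defect label|, best first."""
    scores = [abs(pearson_corr(col, y)) for col in zip(*X)]
    return [name for _, name in sorted(zip(scores, names), reverse=True)]

# Toy module metrics: lines of code, cyclomatic complexity, comment ratio.
X = [[120, 9, 0.10], [300, 21, 0.05], [80, 4, 0.30], [250, 18, 0.07]]
y = [0, 1, 0, 1]  # 1 = defective module
print(rank_features(X, y, ["loc", "complexity", "comment_ratio"]))
```

A prediction model would then keep only the top-k features of this ranking before training, which is what distinguishes filter ranking from the subset-selection (FSS) and wrapper (WFS) families discussed above.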


2021 ◽  
Vol 113 ◽  
pp. 107870
Author(s):  
Md Alamgir Kabir ◽  
Jacky Keung ◽  
Burak Turhan ◽  
Kwabena Ebo Bennin

Author(s):  
F.E. Usman-Hamza ◽  
A.F. Atte ◽  
A.O. Balogun ◽  
H.A. Mojeed ◽  
A.O. Bajeh ◽  
...  

Software defect prediction aims to detect as many defects as possible before a software release, and thus plays an important role in ensuring quality and reliability. Defect prediction can be modeled as a classification problem that assigns software modules to one of two classes, defective or non-defective, using classification algorithms. This study investigated the impact of feature selection methods on classification via clustering techniques for software defect prediction. Three clustering techniques (Farthest First Clusterer, K-Means, and Make-Density Clusterer) and three feature selection methods (Chi-Square, Clustering Variation, and Information Gain) were applied to software defect datasets from the NASA repository. The best software defect prediction model was the Farthest First Clusterer with the Information Gain feature selection method, achieving an accuracy of 78.69%, a precision of 0.804, and a recall of 0.788. The experimental results showed that clustering techniques used as classifiers give good predictive performance, and that feature selection methods further enhance it. This indicates that classification via clustering can give competitive results against standard classification methods, with the advantage of not having to train any model on a labeled dataset, so it can be used on unlabeled datasets.
Keywords: Classification, Clustering, Feature Selection, Software Defect Prediction
Vol. 26, No 1, June, 2019
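The "classification via clustering" idea above can be sketched in a few lines: an unsupervised clusterer partitions the modules, and the cluster whose centroid has the larger metric value is read as the defective class, so no labeled training data is needed. A minimal 1-D K-means sketch with toy complexity values (not the NASA data used in the study), and deterministic min/max seeding instead of the random initialization typical of K-means:

```python
def kmeans_2(points, iters=20):
    """Plain 1-D K-means with k=2: returns final centroids and cluster labels."""
    c0, c1 = min(points), max(points)  # deterministic seeding for the sketch
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[abs(p - c1) < abs(p - c0)].append(p)  # index 1 = closer to c1
        c0 = sum(groups[0]) / len(groups[0])
        c1 = sum(groups[1]) / len(groups[1])
    return c0, c1, [int(abs(p - c1) < abs(p - c0)) for p in points]

# Toy cyclomatic-complexity values per module; the cluster with the higher
# centroid is interpreted as the "defective" class.
complexity = [2, 3, 4, 3, 18, 22, 20]
c0, c1, labels = kmeans_2(complexity)
print(labels)  # modules in the high-complexity cluster are flagged defective
```

A feature selection step such as Information Gain would run before this, shrinking the metric set the clusterer sees, which is the combination the study evaluates.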


Author(s):  
Amit Saxena ◽  
John Wang ◽  
Wutiphol Sintunavarat

One of the main problems in K-means clustering is the setting of the initial centroids, which can cause patterns to be misclustered and thus reduce clustering accuracy. Recently, a density- and distance-based technique for determining the initial centroids has been reported to yield faster convergence of the clusters. Motivated by this key idea, the authors study the impact of initial centroids on clustering accuracy for unsupervised feature selection. Three metrics are used to rank the features of a dataset. The centroids of the clusters, to be used in K-means clustering, are initialized both randomly and by the density- and distance-based approach. Extensive experiments are performed on 15 datasets. The main finding of the paper is that K-means clustering yields higher accuracy on the majority of these datasets with the proposed density- and distance-based approach. As a practical impact, good clustering accuracy can be achieved with fewer features, which is useful in mining datasets with thousands of features.
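One plausible reading of the density- and distance-based seeding described above can be sketched as follows: the first centroid is the densest point (the one with the most neighbours within a radius), and each further centroid is the point farthest from those already chosen. This is an illustrative heuristic under that assumption, not the authors' exact algorithm; the points and radius are toy values:

```python
def dist(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def density_distance_seeds(points, k, radius):
    """Pick k initial centroids: densest point first, then farthest points."""
    density = [sum(dist(p, q) <= radius for q in points) for p in points]
    seeds = [points[density.index(max(density))]]  # densest point first
    while len(seeds) < k:
        # next seed: the point maximising its distance to the nearest seed
        far = max(points, key=lambda p: min(dist(p, s) for s in seeds))
        seeds.append(far)
    return seeds

# Two well-separated toy clusters; the seeds land one in each cluster,
# instead of possibly both in one cluster under random initialization.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (10, 9), (9, 10)]
print(density_distance_seeds(pts, 2, radius=1.5))
```

Seeding one centroid per true cluster is what gives the faster convergence claimed for the technique: K-means then needs few iterations to stabilise.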


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 35710-35718 ◽  
Author(s):  
Qiao Yu ◽  
Junyan Qian ◽  
Shujuan Jiang ◽  
Zhenhua Wu ◽  
Gongjie Zhang
