scholarly journals Feature Subset Selection and Instance Filtering for Cross-project Defect Prediction - Classification and Ranking

Author(s):  
Faimison Porto ◽  
Adenilso Da Silva Simao

The defect prediction models can be a good tool on organizing the project´s test resources. The models can be constructed with two main goals: 1) to classify the software parts - defective or not; or 2) to rank the most defective parts in a decreasing order. However, not all companies maintain an appropriate set of historical defect data. In this case, a company can build an appropriate dataset from known external projects - called Cross-project Defect Prediction (CPDP).The CPDP models, however, present low prediction performances due to the heterogeneity of data. Recently, Instance Filtering methods were proposed in order to reduce this heterogeneity by selecting the most similar instances from the training dataset. Originally, the similarity is calculated based on all the available dataset features (or independent variables).We propose that using only the most relevant features on the similarity calculation can result in more accurate filtered datasets and better prediction performances. In this study we extend our previous work. We analyse both prediction goals - Classification and Ranking. We present an empirical evaluation of 41 different methods by associating Instance Filtering methods with Feature Selection methods. We used 36 versions of 11 open source projects on experiments.The results show similar evidences for both prediction goals. First, the defect prediction performance of CPDP models can be improved by associating Feature Selection and Instance Filtering. Second, no evaluated method presented general better performances. Indeed, the most appropriate method can vary according to the characteristics of the project being predicted.

2020 ◽  
Vol 25 (6) ◽  
pp. 5047-5083
Author(s):  
Abdul Ali Bangash ◽  
Hareem Sahar ◽  
Abram Hindle ◽  
Karim Ali

2019 ◽  
Vol 9 (13) ◽  
pp. 2764 ◽  
Author(s):  
Abdullateef Oluwagbemiga Balogun ◽  
Shuib Basri ◽  
Said Jadid Abdulkadir ◽  
Ahmad Sobri Hashim

Software Defect Prediction (SDP) models are built using software metrics derived from software systems. The quality of SDP models depends largely on the quality of software metrics (dataset) used to build the SDP models. High dimensionality is one of the data quality problems that affect the performance of SDP models. Feature selection (FS) is a proven method for addressing the dimensionality problem. However, the choice of FS method for SDP is still a problem, as most of the empirical studies on FS methods for SDP produce contradictory and inconsistent quality outcomes. Those FS methods behave differently due to different underlining computational characteristics. This could be due to the choices of search methods used in FS because the impact of FS depends on the choice of search method. It is hence imperative to comparatively analyze the FS methods performance based on different search methods in SDP. In this paper, four filter feature ranking (FFR) and fourteen filter feature subset selection (FSS) methods were evaluated using four different classifiers over five software defect datasets obtained from the National Aeronautics and Space Administration (NASA) repository. The experimental analysis showed that the application of FS improves the predictive performance of classifiers and the performance of FS methods can vary across datasets and classifiers. In the FFR methods, Information Gain demonstrated the greatest improvements in the performance of the prediction models. In FSS methods, Consistency Feature Subset Selection based on Best First Search had the best influence on the prediction models. However, prediction models based on FFR proved to be more stable than those based on FSS methods. Hence, we conclude that FS methods improve the performance of SDP models, and that there is no single best FS method, as their performance varied according to datasets and the choice of the prediction model. However, we recommend the use of FFR methods as the prediction models based on FFR are more stable in terms of predictive performance.


Author(s):  
Kehan Gao ◽  
Taghi M. Khoshgoftaar ◽  
Amri Napolitano

Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially-problematic program modules. Effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, there are two problems affecting software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with relatively many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods (Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE)) in the technique, we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique which do not incorporate feature selection, and compare all four techniques (the two different ensemble-based approaches which utilize feature selection and the two versions which use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection do help for getting better prediction than the techniques without feature selection.


2020 ◽  
Vol 8 (5) ◽  
pp. 2605-2613

The exponential growth in the field of information technology, need for quality-based software development is highly demanded. The important factor to be focused during the software development is software defect detection in earlier stages. Failure to detect hidden faults will affect the effectiveness and quality of the software usage and its maintenance. In traditional software defect prediction models, projects with same metrics are involved in prediction process. In recent years, active topic is dealing with Cross Project Defect Prediction (CPDP) to predict defects on software project from other software projects dataset. Still, traditional cross project defect prediction approaches also require common metrics among the dataset of two projects for constructing the defect prediction techniques. Suppose if cross project dataset with different metrics has to be used for defect prediction then these methods become infeasible. To overcome the issues in software defect prediction using Heterogeneous cross projects dataset, this paper introduced a Boosted Relief Feature Subset Selection (BRFSS) to handle the two different projects with Heterogeneous feature sets. BRFSS employs the mapping approach to embed the data from two different domains into a comparable feature space with a lower dimension. Based on the similarity measure the difference among the mapped domains of dataset are used for prediction process. This work used five different software groups with six different datasets to perform heterogeneous cross project defect prediction using firefly particle swarm optimization. To produce optimal defect prediction in the Heterogeneous environment, the knowledge of particle swarm optimization by inducing firefly algorithm. The simulation result is compared with other standard models, the outcome of the result proved the efficiency of the prediction process while using firefly enabled particle swarm optimization.


2019 ◽  
Vol 8 (2S3) ◽  
pp. 1345-1353 ◽  

Software defect prediction models are essential for understanding quality attributes relevant for software organization to deliver better software reliability. This paper focuses mainly based on the selection of attributes in the perspective of software quality estimation for incremental database. A new dimensionality reduction method Wilk’s Lambda Average Threshold (WLAT) is presented for selection of optimal features which are used for classifying modules as fault prone or not. This paper uses software metrics and defect data collected from benchmark data sets. The comparative results confirm that the statistical search algorithm (WLAT) outperforms the other relevant feature selection methods for most classifiers. The main advantage of the proposed WLAT method is: The selected features can be reused when there is increase or decrease in database size, without the need of extracting features afresh. In addition, performances of the defect prediction models either remains unchanged or improved even after eliminating 85% of the software metrics.


2021 ◽  
Vol 2021 ◽  
pp. 1-19
Author(s):  
Abdullateef O. Balogun ◽  
Shuib Basri ◽  
Saipunidzam Mahamad ◽  
Luiz Fernando Capretz ◽  
Abdullahi Abubakar Imam ◽  
...  

The high dimensionality of software metric features has long been noted as a data quality problem that affects the performance of software defect prediction (SDP) models. This drawback makes it necessary to apply feature selection (FS) algorithm(s) in SDP processes. FS approaches can be categorized into three types, namely, filter FS (FFS), wrapper FS (WFS), and hybrid FS (HFS). HFS has been established as superior because it combines the strength of both FFS and WFS methods. However, selecting the most appropriate FFS (filter rank selection problem) for HFS is a challenge because the performance of FFS methods depends on the choice of datasets and classifiers. In addition, the local optima stagnation and high computational costs of WFS due to large search spaces are inherited by the HFS method. Therefore, as a solution, this study proposes a novel rank aggregation-based hybrid multifilter wrapper feature selection (RAHMFWFS) method for the selection of relevant and irredundant features from software defect datasets. The proposed RAHMFWFS is divided into two stepwise stages. The first stage involves a rank aggregation-based multifilter feature selection (RMFFS) method that addresses the filter rank selection problem by aggregating individual rank lists from multiple filter methods, using a novel rank aggregation method to generate a single, robust, and non-disjoint rank list. In the second stage, the aggregated ranked features are further preprocessed by an enhanced wrapper feature selection (EWFS) method based on a dynamic reranking strategy that is used to guide the feature subset selection process of the HFS method. This, in turn, reduces the number of evaluation cycles while amplifying or maintaining its prediction performance. The feasibility of the proposed RAHMFWFS was demonstrated on benchmarked software defect datasets with Naïve Bayes and Decision Tree classifiers, based on accuracy, the area under the curve (AUC), and F-measure values. The experimental results showed the effectiveness of RAHMFWFS in addressing filter rank selection and local optima stagnation problems in HFS, as well as the ability to select optimal features from SDP datasets while maintaining or enhancing the performance of SDP models. To conclude, the proposed RAHMFWFS achieved good performance by improving the prediction performances of SDP models across the selected datasets, compared to existing state-of-the-arts HFS methods.


2021 ◽  
Vol 24 (68) ◽  
pp. 72-88
Author(s):  
Mohammad Alshayeb ◽  
Mashaan A. Alshammari

The ongoing development of computer systems requires massive software projects. Running the components of these huge projects for testing purposes might be a costly process; therefore, parameter estimation can be used instead. Software defect prediction models are crucial for software quality assurance. This study investigates the impact of dataset size and feature selection algorithms on software defect prediction models. We use two approaches to build software defect prediction models: a statistical approach and a machine learning approach with support vector machines (SVMs). The fault prediction model was built based on four datasets of different sizes. Additionally, four feature selection algorithms were used. We found that applying the SVM defect prediction model on datasets with a reduced number of measures as features may enhance the accuracy of the fault prediction model. Also, it directs the test effort to maintain the most influential set of metrics. We also found that the running time of the SVM fault prediction model is not consistent with dataset size. Therefore, having fewer metrics does not guarantee a shorter execution time. From the experiments, we found that dataset size has a direct influence on the SVM fault prediction model. However, reduced datasets performed the same or slightly lower than the original datasets.


Sign in / Sign up

Export Citation Format

Share Document