Impact of imbalanced data on the performance of software defect prediction classifiers

2019 ◽  
Vol 1345 ◽  
pp. 022026
Author(s):  
Lichao Wang ◽  
Wei Wang ◽  
Bingyou Liu ◽  
Shuqiao Geng
2017 ◽  
Vol 102 (2) ◽  
pp. 937-950 ◽  
Author(s):  
Lijuan Zhou ◽  
Ran Li ◽  
Shudong Zhang ◽  
Hua Wang

Author(s):  
Hongyan Wan ◽  
Guoqing Wu ◽  
Mali Yu ◽  
Mengting Yuan

Software defect prediction technology has been widely used to improve the quality of software systems. Most real software defect datasets contain far fewer defective modules than defect-free modules, and such highly class-imbalanced data typically make accurate prediction difficult: the imbalance makes the prediction model prone to classifying a defective module as a defect-free one. Because different software modules are similar to one another, a module can be represented by sparse coefficients over a pre-defined dictionary built from historical software defect datasets. In this study, we use dictionary learning to predict software defects. We optimize the classifier parameters and the dictionary atoms iteratively, so that the extracted features (the sparse representations) are optimal for the trained classifier. We prove the optimality condition of the elastic net, which is used to solve for the sparse coding coefficients, and the regularity of the elastic net solution. Because misclassifying a defective module generally incurs a much higher cost than misclassifying a defect-free one, we take the different misclassification costs into account by increasing the penalty on misclassified defective modules during dictionary learning, which inclines the classifier toward labeling a module as defective. We therefore propose a cost-sensitive software defect prediction method based on dictionary learning (CSDL). Experimental results on 10 class-imbalanced NASA datasets show that our method is more effective than several typical state-of-the-art defect prediction methods.
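The following is a minimal sketch of the general idea behind dictionary-learning-based defect prediction, not the authors' CSDL implementation: it learns a dictionary from training features, encodes each module as a sparse code, and trains a cost-weighted classifier on those codes. The synthetic data, lasso coding (in place of the paper's elastic net), and the 10:1 cost ratio are illustrative assumptions.

```python
# Sketch only: dictionary learning + cost-weighted classification on sparse codes.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Assumed stand-in for a NASA-style defect dataset: 500 modules, 20 static-code
# metrics, ~10% defective (class 1).
X = rng.randn(500, 20)
y = (rng.rand(500) < 0.10).astype(int)
X[y == 1] += 0.75  # give defective modules a shifted distribution

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Learn dictionary atoms; the sparse codes of the modules act as features.
# transform_alpha plays the role of the sparsity penalty (the paper uses an
# elastic-net formulation; lasso coding is a simpler stand-in here).
dico = DictionaryLearning(n_components=15, transform_algorithm="lasso_lars",
                          transform_alpha=0.1, random_state=0)
Z_tr = dico.fit_transform(X_tr)
Z_te = dico.transform(X_te)

# Cost sensitivity: penalize missed defective modules more heavily via class
# weights (the 10:1 ratio is an assumption, not taken from the paper).
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
clf.fit(Z_tr, y_tr)
print("recall on defective class:",
      (clf.predict(Z_te)[y_te == 1] == 1).mean())
```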


2019 ◽  
Vol 8 (3) ◽  
pp. 8683-8687

Software defect prediction is a heavily researched and important domain because of its cost-saving advantages in software development. Various classification methods based on static code attributes have been used to predict software defects. However, the number of defective instances is very small compared to the number of non-defective instances, which leads to imbalanced data in which the class ratio is unequal. Conventional machine learning techniques perform poorly on such data. Although different strategies exist to address this issue, the usual oversampling methods are variants of the SMOTE algorithm; these approaches rely on local information rather than on the complete distribution of the minority class. Here, GANs are used to approximate the true data distribution of the minority class for software defect prediction.
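For context, the sketch below shows the SMOTE-style oversampling baseline that GAN-based approaches aim to improve on; the GAN itself is not shown. The synthetic feature matrix and classifier choice are assumptions for illustration.

```python
# Sketch only: SMOTE baseline for an imbalanced defect dataset.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                      # assumed static-code metrics
y = (rng.rand(1000) < 0.08).astype(int)      # ~8% defective: imbalanced
X[y == 1] += 0.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE synthesizes minority samples by interpolating between a minority
# instance and its nearest minority neighbours -- purely local information,
# which is the limitation the abstract attributes to SMOTE-style methods.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```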


Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 569
Author(s):  
Timing Li ◽  
Lei Yang ◽  
Kewen Li ◽  
Jiannan Zhai

Imbalanced data and feature redundancies are common problems in many fields, especially in software defect prediction, data mining, machine learning, and industrial big data applications. To resolve these problems, we propose an intelligent fusion algorithm, SMPSO-HS-AdaBoost, which combines particle swarm optimization based on subgroup migration with adaptive boosting based on hybrid sampling. In this paper, we apply the proposed intelligent fusion algorithm to software defect prediction to improve prediction efficiency and accuracy by addressing the issues caused by imbalanced data and feature redundancies. The results show that the proposed algorithm resolves the coexisting problems of imbalanced data and feature redundancies, and ensures the efficiency and accuracy of software defect prediction.
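Below is a minimal sketch, not the authors' SMPSO-HS-AdaBoost: it pairs a simple hybrid-sampling step (SMOTE oversampling plus random undersampling) with AdaBoost, and uses plain univariate feature selection in place of the particle-swarm search. The pipeline structure, parameter values, and synthetic data are assumptions.

```python
# Sketch only: feature selection + hybrid sampling + AdaBoost.
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(800, 30)                      # assumed: 30 partly redundant metrics
y = (rng.rand(800) < 0.12).astype(int)      # ~12% defective modules
X[y == 1] += 0.4

pipe = Pipeline(steps=[
    ("select", SelectKBest(f_classif, k=15)),             # trims redundant features
    ("over", SMOTE(sampling_strategy=0.5, random_state=0)),
    ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=0)),
    ("boost", AdaBoostClassifier(n_estimators=200, random_state=0)),
])

# F1 on the defective class reflects both imbalance handling and feature quality.
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print("mean F1 (defective class):", scores.mean().round(3))
```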

