The necessity of assuring quality in software measurement data

Author(s): T.M. Khoshgoftaar

Author(s): Kehan Gao, Taghi M. Khoshgoftaar

In software defect prediction, a classification model is first built from software metrics and fault data gathered from a past software development project. The model is then applied to data from a similar project, or from a new release of the same project, to classify new program modules as either fault-prone (fp) or not-fault-prone (nfp). Such a model helps direct limited financial and human resources for software testing and inspection to where they are most needed. The predictive power of a classification model constructed from a given data set is affected by many factors. In this paper, we focus on two problems that often arise in software measurement data: high dimensionality and class imbalance (e.g., a data set containing many more nfp modules than fp modules). Both problems lengthen learning time and degrade the predictive performance of classification models. We consider using data sampling followed by feature selection (FS) to deal with them. Six data sampling strategies (three sampling techniques, each combined with two post-sampling proportion ratios) and six commonly used feature ranking approaches are employed in this study. We evaluate the FS techniques in two ways: (1) a general method, i.e., assessing classification performance after the training data is modified, and (2) studying the stability of an FS method, specifically to understand how data sampling affects the stability of FS when it is applied to the sampled data. The experiments were performed on nine data sets from a real-world software project. The results demonstrate that the FS techniques that most enhance the models' classification performance do not also show the best stability, and vice versa. In addition, classification performance is affected more by the sampling techniques themselves than by the post-sampling proportions, whereas the opposite holds for stability.
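The pipeline described in the abstract (sample the training data, rank features on the sampled data, then compare the chosen features with those chosen from the original data) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: random undersampling stands in for the unnamed sampling techniques, the 35:65 and 50:50 fp:nfp ratios stand in for the two post-sampling proportions, a signal-to-noise style score stands in for the feature rankers, Jaccard overlap stands in for the stability measure, and the data are synthetic.

```python
import random
from statistics import mean, pstdev


def random_undersample(X, y, fp_share, seed=0):
    """Randomly drop nfp rows (label 0) until fp rows (label 1) make up
    roughly `fp_share` of the data. Hypothetical helper, not from the paper."""
    rng = random.Random(seed)
    fp = [i for i, label in enumerate(y) if label == 1]
    nfp = [i for i, label in enumerate(y) if label == 0]
    # Number of nfp rows to keep so that len(fp) / total is about fp_share.
    keep = min(len(nfp), round(len(fp) * (1 - fp_share) / fp_share))
    chosen = sorted(fp + rng.sample(nfp, keep))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def rank_features(X, y, k):
    """Return the indices of the k features with the highest
    signal-to-noise style score |mean_fp - mean_nfp| / (sd_fp + sd_nfp)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg)
        score = abs(mean(pos) - mean(neg)) / spread if spread else 0.0
        scores.append((score, j))
    top = sorted(scores, key=lambda s: (-s[0], s[1]))[:k]
    return {j for _, j in top}


def stability(subset_a, subset_b):
    """Jaccard overlap between two feature subsets (1.0 means identical)."""
    union = subset_a | subset_b
    return len(subset_a & subset_b) / len(union) if union else 1.0


# Synthetic, imbalanced data: 200 modules, 10% fp, 5 metrics.
# Only metric 0 is informative; the rest are noise.
rng = random.Random(42)
y = [1 if i % 10 == 0 else 0 for i in range(200)]
X = [[rng.gauss(5.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
     for label in y]

full_top = rank_features(X, y, k=2)
results = {}
for share in (0.35, 0.50):  # assumed post-sampling fp proportions
    X_s, y_s = random_undersample(X, y, share, seed=1)
    sampled_top = rank_features(X_s, y_s, k=2)
    results[share] = (len(y_s), sum(y_s), stability(full_top, sampled_top))
    print(share, results[share])
```

Comparing the feature subset from the sampled data against the subset from the original data is one simple way to quantify how much sampling perturbs feature selection. Classification performance would be assessed separately by training a learner on the selected features.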


2008, Vol 16 (4), pp. 563-600. Author(s): Taghi M. Khoshgoftaar, Jason Van Hulse

2004, Vol 19 (2), pp. 20-27. Author(s): S. Zhong, T.M. Khoshgoftaar, N. Seliya

2001, Vol 27 (9), pp. 788-804. Author(s): B.A. Kitchenham, R.T. Hughes, S.G. Linkman
