Research on Cross-Company Defect Prediction Method to Improve Software Security

2021 ◽  
Vol 2021 ◽  
pp. 1-19
Author(s):  
Yanli Shao ◽  
Jingru Zhao ◽  
Xingqi Wang ◽  
Weiwei Wu ◽  
Jinglong Fang

As the scale and complexity of software increase, software security has become a widespread concern. Software defect prediction (SDP) is an important means of helping developers discover and repair, in advance, potential defects that may endanger software security, thereby improving software security and reliability. Currently, cross-project defect prediction (CPDP) and cross-company defect prediction (CCDP) are widely studied to improve defect prediction performance, but problems remain, such as inconsistent metrics and large differences in data distribution between source and target projects. Therefore, this study proposes a new CCDP method based on metric matching and sample weight setting. First, a clustering-based metric matching method is proposed. A multigranularity metric feature vector is extracted to unify the metric dimension while retaining as much of the information contained in the metrics as possible. Metric clustering is then used to eliminate metric redundancy, and representative metrics are extracted through principal component analysis (PCA) to support one-to-one metric matching. This strategy not only solves the metric inconsistency and redundancy problems but also transforms the cross-company heterogeneous defect prediction problem into a homogeneous one. Second, a sample weight setting method is proposed to transform the source data distribution. Statistical frequency information about the source samples is used as an impact factor to increase the weight of source samples that are more similar to the target samples, which improves the distribution similarity between the source and target projects and thereby yields a more accurate prediction model. Finally, after this two-step processing, several classical machine learning methods are applied to build the prediction model, and 12 project datasets from NASA and PROMISE are used for performance comparison. Experimental results show that the proposed method achieves better prediction performance than other mainstream CCDP methods.
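
The idea of giving more weight to source samples that resemble the target data can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the rule below (the fraction of a sample's metric values that fall inside the range observed in the target data) is an assumption chosen for illustration, not the authors' method; the function name and the sample values are hypothetical.

```python
from typing import List, Sequence


def similarity_weights(source: Sequence[Sequence[float]],
                       target: Sequence[Sequence[float]]) -> List[float]:
    """Weight each source sample by how many of its metric values fall
    inside the [min, max] range observed for that metric in the target data.

    Assumed rule for illustration only; the abstract does not state one.
    """
    n_metrics = len(target[0])
    # Per-metric range observed in the (unlabeled) target project.
    lows = [min(row[j] for row in target) for j in range(n_metrics)]
    highs = [max(row[j] for row in target) for j in range(n_metrics)]

    weights = []
    for row in source:
        hits = sum(1 for j in range(n_metrics) if lows[j] <= row[j] <= highs[j])
        # Add-one smoothing keeps every weight above zero, so no source
        # sample is discarded outright.
        weights.append((hits + 1) / (n_metrics + 1))
    return weights


# Hypothetical data: rows are modules, columns are three matched metrics.
source = [[10.0, 2.0, 0.5], [300.0, 40.0, 0.9], [12.0, 3.0, 5.0]]
target = [[8.0, 1.0, 0.2], [15.0, 4.0, 0.7]]
weights = similarity_weights(source, target)
print(weights)
```

Weights of this kind could then be passed to any classifier that accepts per-sample weights during training.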

2021 ◽  
Vol 13 (8) ◽  
pp. 216
Author(s):  
Yu Zhao ◽  
Yi Zhu ◽  
Qiao Yu ◽  
Xiaoying Chen

Traditional research methods in software defect prediction use part of the data from a project to train the defect prediction model and then predict the defect labels of the remaining data from the same project. In practical software development, however, the project to be predicted is generally a brand-new one without enough labeled data to build a defect prediction model, so traditional methods are no longer applicable. Cross-project defect prediction uses the labeled data of a project of the same type, similar to the target project, to build the defect prediction model, which addresses the lack of labeled data in traditional methods. However, the difference in data distribution between the same-type project and the target project reduces defect prediction performance. To solve this problem, this paper proposes a cross-project defect prediction method based on manifold feature transformation. The method transforms the original feature space of the project into a manifold space, then reduces the difference in data distribution between the transformed source project and the transformed target project in the manifold space, and finally uses the transformed source project to train a naive Bayes prediction model with better performance. A comparative experiment was carried out on the Relink and AEEEM datasets. The experimental results show that, compared with the benchmark method and several cross-project defect prediction methods, the proposed method effectively reduces the difference in data distribution between the source and target projects and obtains a higher F1 value, an indicator commonly used to measure the performance of binary classification models.
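
For readers unfamiliar with the F1 value mentioned above, it is the harmonic mean of precision and recall for the positive (defective) class. The sketch below computes it from scratch; the label vectors are hypothetical and are not taken from the paper's experiments.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 value (harmonic mean of precision and recall)
    for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = defective module, 0 = clean module.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(round(score, 3))
```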


2011 ◽  
Vol 34 (6) ◽  
pp. 1148-1154 ◽  
Author(s):  
Hui-Yan JIANG ◽  
Mao ZONG ◽  
Xiang-Ying LIU

2006 ◽  
Vol 1 (1) ◽  
Author(s):  
K. Katayama ◽  
K. Kimijima ◽  
O. Yamanaka ◽  
A. Nagaiwa ◽  
Y. Ono

This paper proposes a method of stormwater inflow prediction that uses radar rainfall data as the input of a prediction model constructed by system identification. The aim is to construct a compact system by reducing the dimension of the input data. Principal Component Analysis (PCA), which is widely used as a statistical method for data analysis and compression, is applied to pre-process the radar rainfall data. The proposed method is then evaluated using radar rainfall data and inflow data acquired in a combined sewer system. The study shows that a few principal components of the radar rainfall data are appropriate input variables for the stormwater inflow prediction model. Consequently, a procedure is established for stormwater inflow prediction using a few principal components of radar rainfall data.
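
The dimension-reduction step described above can be sketched as follows. This is a minimal illustration that extracts only the first principal component by power iteration on the sample covariance matrix; the paper does not state how its PCA is computed, and the function names, the grid-cell layout, and the rainfall values are all hypothetical.

```python
import math
from typing import List


def column_means(data: List[List[float]]) -> List[float]:
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]


def first_principal_component(data: List[List[float]], iters: int = 200) -> List[float]:
    """Estimate the leading eigenvector of the sample covariance matrix
    by power iteration (assumes the start vector is not orthogonal to it)."""
    n, d = len(data), len(data[0])
    means = column_means(data)
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance; keep the current vector
        v = [x / norm for x in w]
    return v


def project(data: List[List[float]], component: List[float]) -> List[float]:
    """Project each mean-centered row onto the component (one score per row)."""
    means = column_means(data)
    return [sum((row[j] - means[j]) * component[j] for j in range(len(component)))
            for row in data]


# Hypothetical radar rainfall: rows are time steps, columns are grid cells.
rainfall = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
component = first_principal_component(rainfall)
scores = project(rainfall, component)  # one input value per time step
print(component, scores)
```

In this toy example the three grid cells are perfectly correlated, so a single score per time step carries all of the variation and could replace the three original inputs.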


2020 ◽  
Vol 10 (23) ◽  
pp. 8324
Author(s):  
Yumei Wu ◽  
Jingxiu Yao ◽  
Shuo Chang ◽  
Bin Liu

Software defect prediction (SDP) is an effective technique for lowering software module testing costs. However, an imbalanced class distribution exists in almost all SDP datasets and restricts the accuracy of defect prediction. To balance the data distribution reasonably, we propose a novel resampling method, LIMCR, based on Naïve Bayes, to improve SDP performance. The main idea of LIMCR is to evaluate how informative each majority-class sample is and then remove the less informative majority samples to rebalance the data distribution. We employ 29 SDP datasets from the PROMISE and NASA repositories and divide them into two groups: small datasets (fewer than 1100 samples) and large datasets (more than 1100 samples). We then conduct experiments comparing combinations of classifiers and imbalance learning methods on the small and large datasets, respectively. The results show the effectiveness of LIMCR: LIMCR+GNB performs better than other methods on small datasets, although it is less competitive on large datasets.


2018 ◽  
Vol 8 (12) ◽  
pp. 2383 ◽  
Author(s):  
Zhehan Chen ◽  
Xianhui Zong ◽  
Jing Shi ◽  
Xiaohua Zhang

Selective laser sintering (SLS) is an additive manufacturing technology that can work with a variety of metal materials, and has been widely employed in many applications. The establishment of a data correlation model through the analysis of temperature field images is a recognized research method to realize the monitoring and quality control of the SLS process. In this paper, the key features of the temperature field in the process are extracted from three levels, and the mathematical model and data structure of the key features are constructed. Feature extraction, dimensional reduction, and parameter optimization are realized based on principal component analysis (PCA) and support vector machine (SVM), and the prediction model is built and optimized. Finally, the feasibility of the proposed algorithms and model is verified by experiments.


Author(s):  
Chakkrit Tantithamthavorn ◽  
Shane McIntosh ◽  
Ahmed E Hassan ◽  
Kenichi Matsumoto

Shepperd et al. (2014) find that the reported performance of a defect prediction model shares a strong relationship with the group of researchers who construct the model. In this paper, we perform an alternative investigation of the data of Shepperd et al. (2014). We observe that (a) the researcher group shares a strong association with the dataset and metric families that are used to build a model; (b) the strong association among the explanatory variables introduces a large amount of interference when interpreting the impact of the researcher group on model performance; and (c) after mitigating the interference, the researcher group has a smaller impact than the metric family. These observations lead us to conclude that the relationship between the researcher group and the performance of a defect prediction model may have more to do with the tendency of researchers to reuse experimental components (e.g., datasets and metrics). We recommend that researchers experiment with a broader selection of datasets and metrics to combat potential bias in their results.

