Research on Cross-Company Defect Prediction Method to Improve Software Security

Security and Communication Networks ◽ 10.1155/2021/5558561 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-19

Author(s): Yanli Shao ◽ Jingru Zhao ◽ Xingqi Wang ◽ Weiwei Wu ◽ Jinglong Fang

Keyword(s): Prediction Model ◽ Software Security ◽ Homogeneous Problem ◽ Data Distribution ◽ Principal Component ◽ Prediction Performance ◽ Defect Prediction ◽ Metric Dimension ◽ Sample Weight ◽ Metric Matching

As the scale and complexity of software increase, software security has become a major public concern. Software defect prediction (SDP) helps developers discover and repair, in advance, potential defects that may endanger software security, thereby improving security and reliability. Cross-project defect prediction (CPDP) and cross-company defect prediction (CCDP) are widely studied as ways to improve defect prediction performance, but problems remain, such as inconsistent metrics and large differences in data distribution between source and target projects. This study therefore proposes a new CCDP method based on metric matching and sample weight setting. First, a clustering-based metric matching method is proposed. A multigranularity metric feature vector is extracted to unify the metric dimension while retaining as much of the information contained in the metrics as possible. Metric clustering is then used to eliminate metric redundancy, and representative metrics are extracted through principal component analysis (PCA) to support one-to-one metric matching. This strategy not only solves the metric inconsistency and redundancy problems but also transforms the cross-company heterogeneous defect prediction problem into a homogeneous one. Second, a sample weight setting method is proposed to transform the source data distribution. The statistical frequency information of the source samples is used as an impact factor to increase the weight of source samples that are more similar to the target samples, which improves the similarity in data distribution between the source and target projects and thereby yields a more accurate prediction model. Finally, after these two steps, several classical machine learning methods are applied to build the prediction model, and 12 project datasets from NASA and PROMISE are used for performance comparison. Experimental results show that the proposed method achieves better prediction performance than other mainstream CCDP methods.
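The sample weight setting step described above can be illustrated with a minimal sketch. The abstract does not say how the source sample frequency is computed, so the rule below is an assumption: a source sample's frequency is the number of times it appears among the `k` nearest source neighbours of a target sample, and its weight is 1 plus its share of all such appearances. The function name `source_sample_weights`, the parameter `k`, and the toy data are all hypothetical.

```python
import math


def source_sample_weights(source, target, k=1):
    """Weight source rows by how often they are near a target row.

    Assumed rule (not taken from the paper): count how many times each
    source row is among the k nearest source rows of some target row,
    then add the normalised count to a base weight of 1.
    """
    counts = [0] * len(source)
    for t in target:
        # Rank source rows by Euclidean distance to this target row.
        ranked = sorted(range(len(source)),
                        key=lambda i: math.dist(source[i], t))
        for i in ranked[:k]:
            counts[i] += 1
    total = sum(counts) or 1  # avoid division by zero for an empty target
    # The base weight of 1 keeps every source row in the training set.
    return [1.0 + c / total for c in counts]


# Toy metric vectors: the second source row resembles both target rows.
source = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
target = [[0.9, 1.1], [1.2, 0.8]]
weights = source_sample_weights(source, target, k=1)
print(weights)
```

Weights of this kind could be passed to any classifier that accepts per-sample weights during training, so source samples close to the target data influence the model more.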

Cross-Project Defect Prediction Method Based on Manifold Feature Transformation

Future Internet ◽ 10.3390/fi13080216 ◽ 2021 ◽ Vol 13 (8) ◽ pp. 216

Author(s): Yu Zhao ◽ Yi Zhu ◽ Qiao Yu ◽ Xiaoying Chen

Keyword(s): Prediction Model ◽ Data Distribution ◽ Prediction Method ◽ Feature Space ◽ Defect Prediction ◽ Software Project ◽ Feature Transformation ◽ Traditional Methods ◽ The Difference ◽ Cross Project

Traditional research methods in software defect prediction use part of the data from a project to train the defect prediction model and predict the defect labels of the remaining data in the same project. In practical software development, however, the project to be predicted is generally brand new, and there is not enough labeled data to build a defect prediction model, so traditional methods are no longer applicable. Cross-project defect prediction uses labeled data from projects of the same type that are similar to the target project to build the defect prediction model, which addresses the shortage of labeled data faced by traditional methods. However, the difference in data distribution between these source projects and the target project reduces defect prediction performance. To solve this problem, this paper proposes a cross-project defect prediction method based on manifold feature transformation. The method transforms the original feature space of the projects into a manifold space, reduces the difference in data distribution between the transformed source project and the transformed target project in that space, and finally uses the transformed source project to train a naive Bayes prediction model with better performance. A comparative experiment was carried out using the Relink and AEEEM datasets. The experimental results show that, compared with the benchmark method and several cross-project defect prediction methods, the proposed method effectively reduces the difference in data distribution between the source and target projects and obtains a higher F1 value, an indicator commonly used to measure the performance of two-class models.
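For readers unfamiliar with the F1 value mentioned above, the following sketch computes it from true and predicted labels using the standard definition (the harmonic mean of precision and recall). The label vectors are made-up illustration data, not results from the paper, and the helper name `f1_value` is hypothetical.

```python
def f1_value(y_true, y_pred, positive=1):
    """Return F1 = 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 1 = defective module, 0 = clean module (illustrative labels only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_value(y_true, y_pred)
print(round(score, 4))
```

Because defect datasets usually contain far more clean modules than defective ones, F1 is often preferred over plain accuracy: it ignores true negatives and so is not inflated by the majority class.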

Empirical Evaluation of Defect Prediction Model - ODC in a Portal Server

i-manager’s Journal on Software Engineering ◽ 10.26634/jse.4.2.1068 ◽ 2009 ◽ Vol 4 (2) ◽ pp. 7-15

Author(s): P. Kabilan ◽ K. Iyakutti

Keyword(s): Prediction Model ◽ Empirical Evaluation ◽ Defect Prediction

Research of Software Defect Prediction Model Based on ACO-SVM

Chinese Journal of Computers ◽ 10.3724/sp.j.1016.2011.01148 ◽ 2011 ◽ Vol 34 (6) ◽ pp. 1148-1154 ◽ Cited By ~ 13

Author(s): Hui-Yan JIANG ◽ Mao ZONG ◽ Xiang-Ying LIU

Keyword(s): Prediction Model ◽ Defect Prediction ◽ Software Defect Prediction ◽ Model Based ◽ Software Defect

Stormwater inflow prediction using radar rainfall data compressed by principal component analysis

Water Practice & Technology ◽ 10.2166/wpt.2006.017 ◽ 2006 ◽ Vol 1 (1)

Author(s): K. Katayama ◽ K. Kimijima ◽ O. Yamanaka ◽ A. Nagaiwa ◽ Y. Ono

Keyword(s): Principal Component Analysis ◽ Prediction Model ◽ Principal Components ◽ Prediction Method ◽ Principal Component ◽ Component Analysis ◽ Rainfall Data ◽ Radar Rainfall ◽ Input Variables ◽ Inflow Prediction

This paper proposes a method of stormwater inflow prediction that uses radar rainfall data as the input of a prediction model constructed by system identification. The aim is to construct a compact system by reducing the dimension of the input data. Principal Component Analysis (PCA), which is widely used as a statistical method for data analysis and compression, is applied to preprocess the radar rainfall data. The proposed method is then evaluated using radar rainfall data and inflow data acquired in a combined sewer system. The study shows that a few principal components of the radar rainfall data are appropriate input variables for the stormwater inflow prediction model. Consequently, we have established a procedure for stormwater inflow prediction using a few principal components of radar rainfall data.
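The compression idea can be sketched as follows: each time step of gridded rainfall values is projected onto a leading principal component, and the resulting score replaces the full grid as the model input. This is a minimal illustration under stated assumptions, not the authors' procedure. The grid values are synthetic, only the first component is computed, and power iteration stands in for a full eigendecomposition; the function name `first_principal_component` is hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) of the leading principal component.

    rows: one list of grid-cell rainfall values per time step.
    Uses power iteration on the sample covariance matrix; assumes the
    data are not constant and the start vector is not orthogonal to
    the leading direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One score per time step: the compressed model input.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Synthetic rainfall over 3 grid cells at 4 time steps (cells move together).
rainfall = [[1.0, 2.0, 3.0],
            [2.0, 4.0, 6.0],
            [3.0, 6.0, 9.0],
            [4.0, 8.0, 12.0]]
component, scores = first_principal_component(rainfall)
print(component, scores)
```

In this toy case the three cells are perfectly correlated, so a single score per time step carries all of the variation; real radar grids would need a few components, chosen by the variance they explain.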

Establishing a software defect prediction model via effective dimension reduction

Information Sciences ◽ 10.1016/j.ins.2018.10.056 ◽ 2019 ◽ Vol 477 ◽ pp. 399-409 ◽ Cited By ~ 7

Author(s): Hua Wei ◽ Changzhen Hu ◽ Shiyou Chen ◽ Yuan Xue ◽ Quanxin Zhang

Keyword(s): Prediction Model ◽ Dimension Reduction ◽ Defect Prediction ◽ Software Defect Prediction ◽ Effective Dimension ◽ Software Defect ◽ Effective Dimension Reduction

Defect prediction model of static code features for cross-company and cross-project software

International Journal of Information Technology ◽ 10.1007/s41870-018-0262-5 ◽ 2018

Author(s): Satwinder Singh ◽ Rozy Singla

Keyword(s): Prediction Model ◽ Defect Prediction ◽ Cross Project

LIMCR: Less-Informative Majorities Cleaning Rule Based on Naïve Bayes for Imbalance Learning in Software Defect Prediction

Applied Sciences ◽ 10.3390/app10238324 ◽ 2020 ◽ Vol 10 (23) ◽ pp. 8324

Author(s): Yumei Wu ◽ Jingxiu Yao ◽ Shuo Chang ◽ Bin Liu

Keyword(s): Sample Size ◽ Naive Bayes ◽ Data Distribution ◽ Naïve Bayes ◽ Large Datasets ◽ Small Sample ◽ Defect Prediction ◽ Software Defect Prediction ◽ Software Defect ◽ Imbalance Learning

Software defect prediction (SDP) is an effective technique for lowering software module testing costs. However, an imbalanced class distribution exists in almost all SDP datasets and restricts the accuracy of defect prediction. To balance the data distribution reasonably, we propose LIMCR, a novel resampling method based on Naïve Bayes, to improve SDP performance. The main idea of LIMCR is to evaluate how informative each majority-class sample is and then remove the less-informative majority samples to rebalance the data distribution. We employ 29 SDP datasets from the PROMISE and NASA repositories and divide them into two groups: small sample size (fewer than 1100 samples) and large sample size (more than 1100 samples). We then conduct experiments comparing combinations of classifiers and imbalance learning methods on the small and large datasets, respectively. The results show the effectiveness of LIMCR: LIMCR+GNB performs better than other methods on small datasets but does not stand out on large datasets.
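The general shape of a Naïve Bayes based cleaning rule can be sketched as below. The abstract does not define how LIMCR measures informativeness, so this sketch substitutes an assumed proxy: a Gaussian naive Bayes log-odds score for the majority class, where majority samples with the highest score (those the model already classifies most confidently) are treated as less informative and removed first. All function names, the `keep` parameter, and the toy data are hypothetical.

```python
import math


def fit_gaussian(rows, eps=1e-9):
    """Per-feature mean and variance for one class."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    var = [sum((r[j] - mean[j]) ** 2 for r in rows) / n + eps
           for j in range(d)]
    return mean, var


def log_likelihood(x, mean, var):
    """Log density of x under independent per-feature Gaussians."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))


def clean_majority(majority, minority, keep):
    """Keep the `keep` majority rows closest to the class boundary.

    Assumed proxy for informativeness (not the LIMCR definition): rows
    with low majority-vs-minority log-odds are ambiguous and are kept.
    """
    maj_stats, min_stats = fit_gaussian(majority), fit_gaussian(minority)
    total = len(majority) + len(minority)
    prior_maj = math.log(len(majority) / total)
    prior_min = math.log(len(minority) / total)

    def log_odds(x):
        return ((prior_maj + log_likelihood(x, *maj_stats))
                - (prior_min + log_likelihood(x, *min_stats)))

    return sorted(majority, key=log_odds)[:keep]


# One-feature toy data: clean modules (majority) vs defective (minority).
majority = [[0.0], [0.5], [1.0], [1.5], [3.0]]
minority = [[3.5], [4.5]]
kept = clean_majority(majority, minority, keep=3)
print(kept)
```

Here the two majority samples farthest from the minority class are dropped, which reduces the imbalance from 5:2 to 3:2 while keeping the samples near the decision boundary.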

PANK-A financial time series prediction model integrating principal component analysis, affinity propagation clustering and nested k-nearest neighbor regression

Journal of Interdisciplinary Mathematics ◽ 10.1080/09720502.2018.1456825 ◽ 2018 ◽ Vol 21 (3) ◽ pp. 717-728 ◽ Cited By ~ 5

Author(s): Li Tang ◽ Heping Pan ◽ Yiyong Yao

Keyword(s): Principal Component Analysis ◽ Time Series ◽ Prediction Model ◽ Nearest Neighbor ◽ Financial Time Series ◽ Time Series Prediction ◽ Principal Component ◽ K Nearest Neighbor ◽ Financial Time ◽ Affinity Propagation Clustering

Online Monitoring Based on Temperature Field Features and Prediction Model for Selective Laser Sintering Process

Applied Sciences ◽ 10.3390/app8122383 ◽ 2018 ◽ Vol 8 (12) ◽ pp. 2383 ◽ Cited By ~ 4

Author(s): Zhehan Chen ◽ Xianhui Zong ◽ Jing Shi ◽ Xiaohua Zhang

Keyword(s): Temperature Field ◽ Prediction Model ◽ Selective Laser Sintering ◽ Principal Component ◽ Laser Sintering ◽ Support Vector ◽ Sintering Process ◽ Metal Materials ◽ Key Features ◽ The Mathematical Model

Selective laser sintering (SLS) is an additive manufacturing technology that can work with a variety of metal materials, and has been widely employed in many applications. The establishment of a data correlation model through the analysis of temperature field images is a recognized research method to realize the monitoring and quality control of the SLS process. In this paper, the key features of the temperature field in the process are extracted from three levels, and the mathematical model and data structure of the key features are constructed. Feature extraction, dimensional reduction, and parameter optimization are realized based on principal component analysis (PCA) and support vector machine (SVM), and the prediction model is built and optimized. Finally, the feasibility of the proposed algorithms and model is verified by experiments.

Comments on "Researcher bias: The use of machine learning in software defect prediction"

10.7287/peerj.preprints.1260v1 ◽ 2015 ◽ Cited By ~ 1

Author(s): Chakkrit Tantithamthavorn ◽ Shane McIntosh ◽ Ahmed E Hassan ◽ Kenichi Matsumoto

Keyword(s): Prediction Model ◽ Strong Association ◽ Model Performance ◽ Strong Relationship ◽ Defect Prediction ◽ Explanatory Variables ◽ Software Defect ◽ The Impact ◽ The Relationship ◽ Selection Of

Shepperd et al. (2014) find that the reported performance of a defect prediction model shares a strong relationship with the group of researchers who construct the model. In this paper, we perform an alternative investigation of the data of Shepperd et al. (2014). We observe that (a) the researcher group shares a strong association with the dataset and metric families that are used to build a model; (b) the strong association among the explanatory variables introduces a large amount of interference when interpreting the impact of the researcher group on model performance; and (c) after mitigating the interference, the researcher group has a smaller impact than the metric family. These observations lead us to conclude that the relationship between the researcher group and the performance of a defect prediction model may have more to do with the tendency of researchers to reuse experimental components (e.g., datasets and metrics). We recommend that researchers experiment with a broader selection of datasets and metrics to combat potential bias in their results.