INCREMENTAL DEVELOPMENT OF FAULT PREDICTION MODELS

Author(s):  
Yue Jiang ◽  
Bojan Cukic ◽  
Tim Menzies ◽  
Jie Lin

The identification of fault-prone modules has a significant impact on software quality assurance. In addition to prediction accuracy, one of the most important goals is to detect fault-prone modules as early as possible in the development lifecycle. Requirements, design, and code metrics have been used successfully to predict fault-prone modules. In this paper, we investigate the benefits of the incremental development of software fault prediction models. We compare the performance of these models as the volume of data and their life-cycle origin (design, code, or their combination) evolve during project development. We analyze 14 data sets from publicly available software engineering data repositories; these data sets offer both design and code metrics. Using a number of modeling techniques and statistical significance tests, we confirm that increasing the volume of training data improves model performance. Further, models built from code metrics typically outperform those built from design metrics only. However, both types of models prove useful, as they can be constructed in different phases of the life cycle. Code-based models can be used to increase the effectiveness of assigning verification and validation activities late in the development life cycle. We also conclude that models utilizing a combination of design- and code-level metrics outperform models that use either metric set exclusively.
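The incremental evaluation protocol the abstract describes can be sketched as follows. Everything here is an invented stand-in for the paper's actual setup: a nearest-centroid classifier replaces the paper's modeling techniques, and the module metrics are synthetic rather than drawn from the 14 public data sets.

```python
import random

# Hypothetical sketch: train a simple nearest-centroid classifier on growing
# fractions of the training data and evaluate on a fixed test set, mimicking
# the "increasing volume of training data" comparison. All data is synthetic.

random.seed(42)

def make_module(faulty):
    # Faulty modules tend to have higher complexity/size metric values.
    base = 5.0 if faulty else 2.0
    return ([random.gauss(base, 1.0), random.gauss(base, 1.0)], faulty)

data = [make_module(i % 2 == 0) for i in range(400)]
train, test = data[:300], data[300:]

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]

def train_centroids(rows):
    faulty = [x for x, y in rows if y]
    clean = [x for x, y in rows if not y]
    return centroid(faulty), centroid(clean)

def accuracy(model, rows):
    c_faulty, c_clean = model
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    hits = sum((dist2(x, c_faulty) < dist2(x, c_clean)) == y for x, y in rows)
    return hits / len(rows)

for frac in (0.1, 0.5, 1.0):
    n = int(len(train) * frac)
    model = train_centroids(train[:n])
    print(f"{n:3d} training modules -> accuracy {accuracy(model, test):.2f}")
```

The loop over growing prefixes is the core of the protocol; the paper additionally varies the metric origin (design, code, combined) and applies statistical significance tests across many models.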

2021 ◽  
Vol 7 ◽  
pp. e722
Author(s):  
Syed Rashid Aziz ◽  
Tamim Ahmed Khan ◽  
Aamer Nadeem

Fault prediction is a necessity for delivering high-quality software. The absence of training data, and of a mechanism for labeling a cluster as faulty or fault-free, is a topic of concern in software fault prediction (SFP). Inheritance is an important feature of object-oriented development, and its metrics measure the complexity, depth, and breadth of software. In this paper, we aim to validate experimentally how helpful inheritance metrics are for classifying unlabeled data sets, and we conceive a novel mechanism for labeling a cluster as faulty or fault-free. We collected ten public data sets that contain inheritance and C&K metrics. These base datasets are then split into two datasets, labeled C&K with inheritance and C&K, for evaluation. K-means clustering is applied with the Euclidean distance formula, and clusters are then labeled through the average mechanism. Finally, TPR, Recall, Precision, F1 measures, and ROC are computed to measure performance, which showed an adequate impact of inheritance metrics in SFP, specifically in classifying unlabeled datasets and in the correct classification of instances. The experiment also reveals that the average mechanism is suitable for labeling clusters in SFP. Quality assurance practitioners can benefit from using inheritance metrics to label datasets and clusters.
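A minimal sketch of the clustering-and-labeling scheme described above, assuming a two-cluster k-means with Euclidean distance and an "average mechanism" that labels the cluster whose metric means exceed the overall means as faulty. The tiny k-means and the synthetic metric rows are illustrative stand-ins, not the paper's implementation or data.

```python
import random

random.seed(7)

def dist2(a, b):
    # Squared Euclidean distance (the Euclidean formula mentioned above).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(rows):
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]

def kmeans2(rows, iters=20):
    # Two-cluster k-means, seeded with the first and last rows.
    centers = [rows[0], rows[-1]]
    for _ in range(iters):
        groups = [[], []]
        for r in rows:
            groups[dist2(r, centers[0]) > dist2(r, centers[1])].append(r)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Synthetic rows loosely shaped like [DIT, NOC, WMC] inheritance/C&K metrics.
low = [[random.gauss(1, 0.3), random.gauss(1, 0.3), random.gauss(5, 1)] for _ in range(30)]
high = [[random.gauss(4, 0.3), random.gauss(3, 0.3), random.gauss(20, 2)] for _ in range(30)]
rows = low + high

centers = kmeans2(rows)
overall = mean(rows)
labels = []
for c in centers:
    # Average mechanism: a cluster is "faulty" when most of its metric
    # means lie above the overall dataset means.
    above = sum(ci > oi for ci, oi in zip(c, overall))
    labels.append("faulty" if above > len(c) / 2 else "fault-free")
print(labels)
```

On this synthetic data one cluster lands on the low-metric modules and one on the high-metric modules, so the mechanism assigns one faulty and one fault-free label.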


2012 ◽  
pp. 371-387 ◽  
Author(s):  
Cagatay Catal ◽  
Soumya Banerjee

Artificial Immune Systems, a biologically inspired computing paradigm like Artificial Neural Networks, Genetic Algorithms, and Swarm Intelligence, embody the principles and advantages of vertebrate immune systems. The paradigm has been applied to solve complex problems in areas such as data mining, computer security, robotics, aircraft control, scheduling, optimization, and pattern recognition. Interest in the paradigm is increasing, and it is widely used in conjunction with other methods such as Artificial Neural Networks, Swarm Intelligence, and Fuzzy Logic. In this chapter, we demonstrate how to apply this bio-inspired paradigm to develop software fault prediction models. The aim of fault prediction is to identify the modules of a large software system that are likely to contain faults in the next release. Software metrics and fault data belonging to a previous software version are used to build the model; fault-prone modules of the next release are then predicted using this model and the current software metrics. From a machine learning perspective, this type of modeling approach is called supervised learning. A sample fault dataset is used to illustrate the working of the Artificial Immune Recognition System (AIRS).
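The supervised workflow in the abstract (previous-version metrics and fault labels train a model; current-version metrics are classified) can be sketched as below. This is not AIRS itself: AIRS evolves a population of immune "memory cells" and classifies by nearest memory cells; here a plain nearest-neighbour rule stands in for that final classification step, and all metric values and labels are invented.

```python
# Training data from the "previous version": metric vectors plus fault labels.
previous_version = [
    # ([lines of code, cyclomatic complexity], had_fault)
    ([120, 4], False),
    ([950, 31], True),
    ([200, 7], False),
    ([700, 22], True),
]

def predict(metrics, training):
    # Classify a current-version module by its nearest previous-version
    # module (stand-in for AIRS's nearest-memory-cell classification).
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda t: dist2(t[0], metrics))
    return label

# Metrics of two modules from the current version.
current_modules = [[880, 28], [150, 5]]
print([predict(m, previous_version) for m in current_modules])  # -> [True, False]
```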


Author(s):  
Golnoush Abaei ◽  
Ali Selamat

Quality assurance tasks such as testing, verification and validation, fault tolerance, and fault prediction play a major role in software engineering activities. Fault prediction approaches are used when a software company needs to deliver a finished product but has limited time and budget for testing it. In such cases, identifying and testing the parts of the system that are more defect prone is reasonable. In fact, prediction models are mainly used to improve software quality and exploit available resources. This chapter studies software fault prediction along the different criteria that matter in this research field: the choice among machine-learning techniques and artificial intelligence classifiers, the variety of software metrics, the distinctive performance evaluation metrics, and the accompanying statistical analysis. The authors present a roadmap for researchers interested in working in this area, illustrating the problems and objectives related to each of these criteria, which can assist researchers in building the finest software fault prediction model.


Author(s):  
Wasiur Rhmann ◽  
Gufran Ahmad Ansari

Software engineering repositories have attracted researchers seeking to mine useful information about the different quality attributes of software. These repositories have helped software professionals efficiently allocate various resources in the software development life cycle. Software fault prediction is a quality assurance activity in which software faults are predicted before actual software testing. As exhaustive software testing is impossible, software fault prediction models can help allocate testing resources properly. Various machine learning techniques have been applied to create software fault prediction models. In this study, ensemble models are used for software fault prediction. Change-metrics-based data are collected for an open-source Android project from its Git repository, and code-based metrics data are obtained from the PROMISE data repository; the datasets kc1, kc2, cm1, and pc1 are used for the experiments. Results showed that ensemble models performed better than machine learning and hybrid search-based algorithms. The bagging ensemble was found to be more effective at predicting faults than soft and hard voting.
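The bagging idea the study relies on can be sketched as follows: base learners are trained on bootstrap samples of the training data and their predictions combined by majority (hard) vote. The one-feature threshold "stumps" and the synthetic metric data below are illustrative stand-ins, not the study's actual classifiers or datasets.

```python
import random

random.seed(0)

def train_stump(rows):
    # Pick the threshold on feature 0 that best separates this sample.
    best_t, best_acc = None, 0.0
    for t in sorted({x[0] for x, _ in rows}):
        acc = sum((x[0] > t) == y for x, y in rows) / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bagging_predict(stumps, x):
    # Hard majority vote over the ensemble members.
    votes = sum(x[0] > t for t in stumps)
    return votes > len(stumps) / 2

# Synthetic one-metric modules: fault-free around 2, faulty around 5.
data = [([random.gauss(2, 1)], False) for _ in range(50)] + \
       [([random.gauss(5, 1)], True) for _ in range(50)]

stumps = []
for _ in range(15):
    sample = [random.choice(data) for _ in data]   # bootstrap resample
    stumps.append(train_stump(sample))

acc = sum(bagging_predict(stumps, x) == y for x, y in data) / len(data)
print(f"bagging accuracy on the synthetic metrics: {acc:.2f}")
```

Soft voting would instead average per-class probability estimates from the members; with probability-free stumps like these, only hard voting applies.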


Author(s):  
Baojun Ma ◽  
Huaping Zhang ◽  
Guoqing Chen ◽  
Yanping Zhao ◽  
Bart Baesens

It is a recurrent finding that software development is often troubled by considerable delays as well as budget overruns, and several solutions have been proposed in answer to this observation, software fault prediction being a prime example. Drawing upon machine learning techniques, software fault prediction tries to identify upfront the software modules that are most likely to contain faults, thereby streamlining testing efforts and improving overall software quality. When deploying fault prediction models in a production environment, both prediction performance and model comprehensibility are typically taken into consideration, although the latter is commonly overlooked in the academic literature. Many classification methods have been suggested to conduct fault prediction, yet associative classification methods remain uninvestigated in this context. This paper proposes an associative classification (AC)-based fault prediction method, building upon the CBA2 algorithm. In an empirical comparison on 12 real-world datasets, the AC-based classifier is shown to achieve predictive performance competitive with models induced by five other tree/rule-based classification techniques. In addition, our findings highlight the comprehensibility of the AC-based models, which achieve similar prediction performance. Furthermore, the possibilities of cross-project prediction are investigated, strengthening earlier findings on the feasibility of such an approach when insufficient data on the target project are available.
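The associative-classification idea can be shown with a toy CBA-style sketch (not the CBA2 algorithm itself, which adds multiple-support mining and rule pruning): mine "metric-bucket => class" rules above support and confidence thresholds, then classify with the highest-confidence matching rule. The hand-discretized items and thresholds below are invented for illustration.

```python
from itertools import combinations

# Each "transaction" is a set of discretized metric items plus a class label.
transactions = [
    ({"loc=high", "cc=high"}, "faulty"),
    ({"loc=high", "cc=high"}, "faulty"),
    ({"loc=high", "cc=low"}, "faulty"),
    ({"loc=low", "cc=low"}, "clean"),
    ({"loc=low", "cc=low"}, "clean"),
    ({"loc=low", "cc=high"}, "clean"),
]

def mine_rules(rows, min_sup=2, min_conf=0.7):
    # Enumerate small itemsets; keep "itemset => class" rules that clear
    # the minimum support and confidence thresholds.
    items = sorted({i for itemset, _ in rows for i in itemset})
    rules = []
    for k in (1, 2):
        for lhs in combinations(items, k):
            lhs = set(lhs)
            covered = [cls for itemset, cls in rows if lhs <= itemset]
            if len(covered) < min_sup:
                continue
            for cls in set(covered):
                conf = covered.count(cls) / len(covered)
                if conf >= min_conf:
                    rules.append((frozenset(lhs), cls, conf))
    return sorted(rules, key=lambda r: -r[2])

def classify(rules, itemset, default="clean"):
    # First (highest-confidence) matching rule wins.
    for lhs, cls, _ in rules:
        if lhs <= itemset:
            return cls
    return default

rules = mine_rules(transactions)
print(classify(rules, {"loc=high", "cc=high"}))  # -> faulty
```

The comprehensibility claim in the abstract rests on exactly this rule form: each prediction is justified by one readable rule such as "loc=high => faulty".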


Forests ◽  
2018 ◽  
Vol 10 (1) ◽  
pp. 20 ◽  
Author(s):  
Philipp Kilham ◽  
Christoph Hartebrodt ◽  
Gerald Kändler

Wood supply predictions from forest inventories involve two steps. First, it is predicted whether harvests occur on a plot in a given time period. Second, for plots on which harvests are predicted to occur, the harvested volume is predicted. This research addresses the second step. For forests with more than one species and/or forests with trees of varying dimensions, overall harvested volume predictions are not satisfactory and more detailed predictions are required. The study focuses on southwest Germany, where diverse forest types are found. Predictions are conducted for plots on which harvests occurred in the 2002–2012 period. For each plot, harvest probabilities of sample trees are predicted and used to derive the harvested volume (m³ over bark in 10 years) per hectare. Random forests (RFs) have become popular prediction models as they capture the interactions and relationships of variables in an automated way. However, their suitability for predicting harvest probabilities for inventory sample trees is questionable and has not yet been examined. Generalized linear mixed models (GLMMs) are suitable in this context as they can account for the nested structure of tree-level data sets (trees nested in plots). It is unclear whether RFs can cope with this data structure. This research aims to clarify this question by comparing two RFs—an RF based on conditional inference trees (CTree-RF) and an RF based on classification and regression trees (CART-RF)—with a GLMM. For this purpose, the models were fitted on training data and evaluated on an independent test set. Both RFs achieved better prediction results than the GLMM. Regarding plot-level harvested volumes per ha, they achieved higher variances explained (VEs) and significantly (p < 0.05) lower mean absolute residuals compared to the GLMM. VEs were 0.38 (CTree-RF), 0.37 (CART-RF), and 0.31 (GLMM). Root mean squared errors were 138.3, 139.9, and 145.5, respectively.
The research demonstrates the suitability and advantages of RFs for predicting harvest decisions at the level of inventory sample trees. RFs can become important components in the generation of business-as-usual wood supply scenarios worldwide, as they are able to learn and predict harvest decisions from national forest inventories (NFIs) in an automated and self-adapting way. The applied approach is not restricted to specific forests or harvest regimes and delivers detailed species and dimension information for the harvested volumes.
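The aggregation step the abstract describes (tree-level harvest probabilities rolled up to a plot-level harvested volume per hectare) amounts to a probability-weighted sum. The probabilities, stem volumes, and per-hectare expansion factors below are assumed numbers for illustration, however they were estimated (e.g. by one of the RFs):

```python
trees = [
    # (harvest probability, stem volume in m^3 over bark, trees represented per ha)
    (0.80, 1.2, 50.0),
    (0.35, 0.6, 120.0),
    (0.10, 0.2, 300.0),
]

# Expected harvested volume per hectare for the plot over the period:
# each sample tree contributes its volume, scaled by its harvest
# probability and by how many trees per hectare it represents.
volume_per_ha = sum(p * v * n_ha for p, v, n_ha in trees)
print(f"expected harvested volume: {volume_per_ha:.1f} m^3/ha over 10 years")
```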

