A public unified bug dataset for Java and its assessment regarding metrics and bug prediction

2020 ◽  
Vol 28 (4) ◽  
pp. 1447-1506 ◽  
Author(s):  
Rudolf Ferenc ◽  
Zoltán Tóth ◽  
Gergely Ladányi ◽  
István Siket ◽  
Tibor Gyimóthy

Abstract
Bug datasets have been created and used by many researchers to build and validate novel bug prediction models. In this work, our aim is to collect existing public source code metric-based bug datasets and unify their contents. Furthermore, we wish to assess the plethora of collected metrics and the capabilities of the unified bug dataset in bug prediction. We considered 5 public datasets, downloaded the corresponding source code for each system in the datasets, and performed source code analysis to obtain a common set of source code metrics. This way, we produced a unified bug dataset at both class and file level. We investigated the divergence of metric definitions and values across the different bug datasets. Finally, we used a decision tree algorithm to show the capabilities of the dataset in bug prediction. We found that there are statistically significant differences in the values of the original and the newly calculated metrics; furthermore, notations and definitions can differ severely. We compared the bug prediction capabilities of the original and the extended metric suites (within-project learning). Afterwards, we merged all classes (and files) into one large dataset which consists of 47,618 elements (43,744 for files) and evaluated the bug prediction model built on this large dataset as well. Finally, we also investigated cross-project capabilities of the bug prediction models and datasets. We made the unified dataset publicly available for everyone. By using a public unified dataset as an input for different bug prediction related investigations, researchers can make their studies reproducible, and thus able to be validated and verified.
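The decision-tree learning the study applies to the unified dataset can be illustrated in its simplest form: a single-metric decision stump that picks the threshold minimizing misclassification. This is a hedged sketch, not the paper's actual model; the metric values and bug labels below are invented illustration data.

```python
# Hedged sketch: a single-metric decision stump, the simplest case of the
# decision-tree learning applied to a metrics-based bug dataset.

def best_stump(values, labels):
    """Find the threshold on one metric that minimizes misclassification.

    Predicts "buggy" (1) when value > threshold.
    Returns (threshold, error_rate).
    """
    best = (None, float("inf"))
    for t in sorted(set(values)):
        errors = sum(1 for v, y in zip(values, labels) if (v > t) != bool(y))
        if errors < best[1]:
            best = (t, errors)
    threshold, errors = best
    return threshold, errors / len(values)

# Toy data: a size-like metric (e.g. logical lines of code) per class.
loc = [12, 30, 45, 200, 350, 500]
buggy = [0, 0, 0, 1, 1, 1]

t, err = best_stump(loc, buggy)
print(t, err)  # the threshold 45 separates the toy classes perfectly
```

A full decision tree repeats this split search recursively on each partition and over all metrics, which is what off-the-shelf learners do on such datasets.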

2019 ◽  
Vol 8 (2) ◽  
pp. 5888-5895

Software artifacts subjected to natural language processing usually contain high-dimensional, noisy, and irrelevant features, which lead to inaccurate and poor contextual similarity between the project source code and its API documentation. Most traditional source code analysis models do not find and extract the features relevant to contextual similarity. As the size of the project source code and its related API documentation increases, these models struggle to capture the contextual similarity between the source code and the API documentation. One solution to this problem is finding the essential features using the source code dependency graph. In this paper, the dependency graph is used to compute the contextual similarity between the source code metrics and the API documents. A novel contextual similarity measure is used to find the relationship between the project source code metrics and the API documents. The proposed model is evaluated on different project source codes and API documents in terms of pre-processing, contextual similarity, and runtime. Experimental results show that the proposed model has high computational efficiency compared to existing models on large datasets.
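A common baseline for code-to-documentation contextual similarity is cosine similarity over shared term frequencies. The sketch below shows only that baseline; the paper's dependency-graph weighting is omitted, and the code identifiers and documentation text are invented examples.

```python
# Hedged sketch (not the paper's algorithm): cosine similarity between
# term-frequency vectors of code identifiers and API documentation text.

import math
import re
from collections import Counter

def term_vector(text):
    """Tokenize to lowercase alphabetic terms and count occurrences."""
    return Counter(re.findall(r"[a-zA-Z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

code_ids = "parseRequest read request header parse body"
api_doc = "Parses an incoming request: reads the header, then parses the body."

score = cosine(term_vector(code_ids), term_vector(api_doc))
print(round(score, 2))
```

Dependency-graph approaches refine this by weighting terms from strongly connected code elements more heavily, rather than treating all identifiers equally.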


2015 ◽  
Author(s):  
Muthukumaran Kasinathan ◽  
Lalita Bhanu Murthy Neti

Several change metrics and source code metrics have been introduced and proved to be effective in bug prediction. Researchers have performed comparative studies of bug prediction models built using the individual metrics as well as combinations of these metrics. In this paper, we investigate the impact of feature selection in bug prediction models by analyzing the misclassification rates of these models with and without feature selection in place. We conduct our experiments on five open source projects, considering numerous change metrics and source code metrics. This study also aims to identify a reliable subset of metrics that is common across all projects.
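One simple form such feature selection can take is a filter that ranks metrics by correlation with the bug label and keeps the strongest. This is a hedged sketch of that general idea, not the study's method; the metric values and labels are invented.

```python
# Hedged sketch: filter-style feature selection that keeps the k metrics
# most correlated (in absolute value) with the bug label.

import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(metrics, labels, k):
    ranked = sorted(metrics,
                    key=lambda m: abs(pearson(metrics[m], labels)),
                    reverse=True)
    return ranked[:k]

metrics = {
    "churn": [1, 5, 9, 2, 8, 7],        # change metric
    "loc":   [10, 50, 90, 20, 80, 70],  # size metric, same ordering
    "noise": [3, 1, 4, 1, 5, 9],        # unrelated metric
}
buggy = [0, 1, 1, 0, 1, 1]

selected = select_features(metrics, buggy, 2)
print(selected)
```

Wrapper-based alternatives instead re-train the classifier on candidate subsets and compare misclassification rates directly, which is closer to the evaluation the abstract describes.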


2021 ◽  
pp. 026-035
Author(s):  
A.M. Pokrovskyi

With the rapid development of software quality measurement methods, the need for efficient and versatile reengineering automation tools grows ever more pressing. This becomes even more apparent when the programming language and the respective coding practices develop slowly alongside each other over a long period of time, while the legacy code base grows larger and remains highly relevant. In this paper, a source code metrics measurement tool for Fortran program quality evaluation is developed. It is implemented as a code module for the Photran integrated development environment and is based on a set of syntax tree walking algorithms. The module utilizes the built-in Photran syntax analysis engine and the tree data structure it builds from the source code. The developed tool is also compared to existing source code analysis instruments. The results show that the developed tool is most effective when used in combination with Photran's built-in refactoring system, and that Photran's application programming interface facilitates easy scaling of the existing infrastructure by introducing other code analysis methods.
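The module's core approach, computing metrics by walking a parsed syntax tree, is language-agnostic. Below is a hedged sketch of the same pattern using Python's stdlib `ast` module in place of Photran's Java-based tree API; the two metrics (function count and maximum nesting depth) are illustrative choices, not the paper's metric set.

```python
# Hedged sketch: metric computation by syntax tree walking, shown with
# Python's stdlib `ast` instead of Photran's Fortran syntax tree.

import ast

class MetricVisitor(ast.NodeVisitor):
    """Counts function definitions and tracks maximum control nesting depth."""

    def __init__(self):
        self.functions = 0
        self.max_depth = 0
        self._depth = 0

    def _nested(self, node):
        # Entering a control structure increases the current nesting depth.
        self._depth += 1
        self.max_depth = max(self.max_depth, self._depth)
        self.generic_visit(node)
        self._depth -= 1

    visit_If = visit_For = visit_While = _nested

    def visit_FunctionDef(self, node):
        self.functions += 1
        self.generic_visit(node)

source = """
def f(xs):
    for x in xs:
        if x > 0:
            print(x)
"""

v = MetricVisitor()
v.visit(ast.parse(source))
print(v.functions, v.max_depth)  # 1 function, nesting depth 2
```

The visitor pattern used here mirrors the tree-walking algorithms the abstract describes: each metric is an accumulator updated as the walk enters and leaves nodes of interest.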


Technologies ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 3
Author(s):  
Gábor Antal ◽  
Zoltán Tóth ◽  
Péter Hegedűs ◽  
Rudolf Ferenc

Bug prediction aims at finding source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited amount of testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. In this paper, we propose a function level JavaScript bug prediction model based on static source code metrics with the addition of a hybrid (static and dynamic) code analysis based metric of the number of incoming and outgoing function calls (HNII and HNOI). Our motivation for this is that JavaScript is a highly dynamic scripting language for which static code analysis might be very imprecise; therefore, using purely static source code features for bug prediction might not be enough. Based on a study where we extracted 824 buggy and 1943 non-buggy functions from the publicly available BugsJS dataset for the ESLint JavaScript project, we can confirm the positive impact of hybrid code metrics on the prediction performance of the ML models. Depending on the ML algorithm, applied hyper-parameters, and target measures we consider, hybrid invocation metrics bring a 2–10% increase in model performances (i.e., precision, recall, F-measure). Interestingly, replacing the static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performances; however, using them all together yields the best results.
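The 2–10% gains reported above are measured in precision, recall, and F-measure, all derived from a confusion matrix. The sketch below shows how such a comparison is computed; the two prediction lists are invented to illustrate a static-only model against a hybrid one, not the paper's actual results.

```python
# Hedged sketch: precision, recall, and F-measure from confusion counts,
# used to compare a static-metrics model against a hybrid-metrics model.

def prf(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

actual       = [1, 1, 1, 1, 0, 0, 0, 0]
static_model = [1, 1, 0, 0, 1, 0, 0, 0]  # e.g. NOI/NII features only
hybrid_model = [1, 1, 1, 0, 1, 0, 0, 0]  # e.g. HNOI/HNII added

s = prf(actual, static_model)
h = prf(actual, hybrid_model)
print("static:", s)
print("hybrid:", h)
```

F-measure is the harmonic mean of precision and recall, so it rewards models that improve both at once rather than trading one for the other.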


Author(s):  
Oscar E. Perez-Cham ◽  
Carlos Soubervielle-Montalvo ◽  
Alberto S. Nunez-Varela ◽  
Cesar Puente ◽  
Luis J. Ontanon-Garcia

Author(s):  
Paulo Meirelles ◽  
Carlos Santos Jr. ◽  
Joao Miranda ◽  
Fabio Kon ◽  
Antonio Terceiro ◽  
...  

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Dejun Jiang ◽  
Zhenxing Wu ◽  
Chang-Yu Hsieh ◽  
Guangyong Chen ◽  
Ben Liao ◽  
...  

Abstract
Graph neural networks (GNNs) have been considered an attractive modelling method for molecular property prediction, and numerous studies have shown that GNNs could yield more promising results than traditional descriptor-based methods. In this study, based on 11 public datasets covering various property endpoints, the predictive capacity and computational efficiency of the prediction models developed by eight machine learning (ML) algorithms, including four descriptor-based models (SVM, XGBoost, RF and DNN) and four graph-based models (GCN, GAT, MPNN and Attentive FP), were extensively tested and compared. The results demonstrate that on average the descriptor-based models outperform the graph-based models in terms of prediction accuracy and computational efficiency. SVM generally achieves the best predictions for the regression tasks. Both RF and XGBoost can achieve reliable predictions for the classification tasks, and some of the graph-based models, such as Attentive FP and GCN, can yield outstanding performance for a few of the larger or multi-task datasets. In terms of computational cost, XGBoost and RF are the two most efficient algorithms and only need a few seconds to train a model even for a large dataset. The model interpretations by the SHAP method can effectively explore the established domain knowledge for the descriptor-based models. Finally, we explored the use of these models for virtual screening (VS) towards HIV and demonstrated that different ML algorithms offer diverse VS profiles. All in all, we believe that the off-the-shelf descriptor-based models still can be directly employed to accurately predict various chemical endpoints with excellent computability and interpretability.
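The comparison the study performs follows a standard benchmarking pattern: train every model on the same split, then record predictive accuracy and wall-clock training time. The sketch below shows only that harness pattern with trivial stand-in models (a majority-class baseline and 1-nearest-neighbour), not SVM, XGBoost, or any GNN; all data is invented.

```python
# Hedged sketch of a model-comparison harness: same split, record
# accuracy and training time per model. Models here are toy stand-ins.

import time

def train_majority(X, y):
    """Baseline: always predict the most frequent training label."""
    label = max(set(y), key=y.count)
    return lambda x: label

def train_nearest(X, y):
    """1-nearest-neighbour on a single numeric feature."""
    def predict(x):
        i = min(range(len(X)), key=lambda j: abs(X[j] - x))
        return y[i]
    return predict

def benchmark(models, X, y, X_test, y_test):
    results = {}
    for name, trainer in models.items():
        start = time.perf_counter()
        predict = trainer(X, y)
        elapsed = time.perf_counter() - start
        acc = sum(predict(x) == t
                  for x, t in zip(X_test, y_test)) / len(y_test)
        results[name] = (acc, elapsed)
    return results

X, y = [0.1, 0.2, 0.3, 1.0], [0, 0, 0, 1]
X_test, y_test = [0.15, 0.95], [0, 1]

results = benchmark({"majority": train_majority, "1-nn": train_nearest},
                    X, y, X_test, y_test)
print(results)
```

Holding the split fixed across models, as above, is what makes the accuracy and training-time numbers directly comparable.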


2019 ◽  
Vol 2019 (2) ◽  
pp. 117-126
Author(s):  
Chinmay Hota ◽  
Lov Kumar ◽  
Lalita Bhanu Murthy Neti
