A public unified bug dataset for Java and its assessment regarding metrics and bug prediction

2020 ◽  
Vol 28 (4) ◽  
pp. 1447-1506 ◽  
Author(s):  
Rudolf Ferenc ◽  
Zoltán Tóth ◽  
Gergely Ladányi ◽  
István Siket ◽  
Tibor Gyimóthy

Abstract
Bug datasets have been created and used by many researchers to build and validate novel bug prediction models. In this work, our aim is to collect existing public source code metric-based bug datasets and unify their contents. Furthermore, we wish to assess the plethora of collected metrics and the capabilities of the unified bug dataset in bug prediction. We considered 5 public datasets, downloaded the corresponding source code for each system in the datasets, and performed source code analysis to obtain a common set of source code metrics. This way, we produced a unified bug dataset at both class and file level. We investigated the divergence of metric definitions and values across the different bug datasets. Finally, we used a decision tree algorithm to show the capabilities of the dataset in bug prediction. We found that there are statistically significant differences in the values of the original and the newly calculated metrics; furthermore, notations and definitions can differ severely. We compared the bug prediction capabilities of the original and the extended metric suites (within-project learning). Afterwards, we merged all classes (and files) into one large dataset which consists of 47,618 elements (43,744 for files) and evaluated the bug prediction model built on this large dataset as well. Finally, we also investigated cross-project capabilities of the bug prediction models and datasets. We made the unified dataset publicly available for everyone. By using a public unified dataset as an input for different bug prediction related investigations, researchers can make their studies reproducible, and thus able to be validated and verified.
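The decision-tree learning the study applies to the unified dataset can be illustrated in its simplest form: a single-metric decision stump that picks the threshold minimizing misclassification. This is a hedged sketch, not the paper's actual model; the metric values and bug labels below are invented illustration data.

```python
# Hedged sketch: a single-metric decision stump, the simplest case of the
# decision-tree learning applied to a metrics-based bug dataset.

def best_stump(values, labels):
    """Find the threshold on one metric that minimizes misclassification.

    Predicts "buggy" (1) when value > threshold.
    Returns (threshold, error_rate).
    """
    best = (None, float("inf"))
    for t in sorted(set(values)):
        errors = sum(1 for v, y in zip(values, labels) if (v > t) != bool(y))
        if errors < best[1]:
            best = (t, errors)
    threshold, errors = best
    return threshold, errors / len(values)

# Toy data: a size-like metric (e.g. logical lines of code) per class.
loc = [12, 30, 45, 200, 350, 500]
buggy = [0, 0, 0, 1, 1, 1]

t, err = best_stump(loc, buggy)
print(t, err)  # the threshold 45 separates the toy classes perfectly
```

A full decision tree repeats this split search recursively on each partition and over all metrics, which is what off-the-shelf learners do on such datasets.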

2019 ◽  
Vol 8 (2) ◽  
pp. 5888-5895

Software artifacts subjected to natural language processing usually contain high-dimensional, noisy, and irrelevant features, which lead to inaccurate and poor contextual similarity between the project source code and its API documentation. Most traditional source code analysis models do not find and extract the features relevant to contextual similarity. As the size of the project source code and its related API documentation increases, these models struggle to capture the contextual similarity between the source code and the API documentation. One solution to this problem is finding the essential features using the source code dependency graph. In this paper, the dependency graph is used to compute the contextual similarity between the source code metrics and the API documents. A novel contextual similarity measure is used to find the relationship between the project source code metrics and the API documents. The proposed model is evaluated on different project source codes and API documents in terms of pre-processing, contextual similarity, and runtime. Experimental results show that the proposed model has high computational efficiency compared to existing models on large datasets.
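A common baseline for code-to-documentation contextual similarity is cosine similarity over shared term frequencies. The sketch below shows only that baseline; the paper's dependency-graph weighting is omitted, and the code identifiers and documentation text are invented examples.

```python
# Hedged sketch (not the paper's algorithm): cosine similarity between
# term-frequency vectors of code identifiers and API documentation text.

import math
import re
from collections import Counter

def term_vector(text):
    """Tokenize to lowercase alphabetic terms and count occurrences."""
    return Counter(re.findall(r"[a-zA-Z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

code_ids = "parseRequest read request header parse body"
api_doc = "Parses an incoming request: reads the header, then parses the body."

score = cosine(term_vector(code_ids), term_vector(api_doc))
print(round(score, 2))
```

Dependency-graph approaches refine this by weighting terms from strongly connected code elements more heavily, rather than treating all identifiers equally.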


2015 ◽  
Author(s):  
Muthukumaran Kasinathan ◽  
Lalita Bhanu Murthy Neti

Several change metrics and source code metrics have been introduced and proved to be effective in bug prediction. Researchers have performed comparative studies of bug prediction models built using the individual metrics as well as combinations of these metrics. In this paper, we investigate the impact of feature selection in bug prediction models by analyzing the misclassification rates of these models with and without feature selection in place. We conduct our experiments on five open source projects, considering numerous change metrics and source code metrics. This study also aims to identify a reliable subset of metrics that is common across all projects.
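One simple form such feature selection can take is a filter that ranks metrics by correlation with the bug label and keeps the strongest. This is a hedged sketch of that general idea, not the study's method; the metric values and labels are invented.

```python
# Hedged sketch: filter-style feature selection that keeps the k metrics
# most correlated (in absolute value) with the bug label.

import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(metrics, labels, k):
    ranked = sorted(metrics,
                    key=lambda m: abs(pearson(metrics[m], labels)),
                    reverse=True)
    return ranked[:k]

metrics = {
    "churn": [1, 5, 9, 2, 8, 7],        # change metric
    "loc":   [10, 50, 90, 20, 80, 70],  # size metric, same ordering
    "noise": [3, 1, 4, 1, 5, 9],        # unrelated metric
}
buggy = [0, 1, 1, 0, 1, 1]

selected = select_features(metrics, buggy, 2)
print(selected)
```

Wrapper-based alternatives instead re-train the classifier on candidate subsets and compare misclassification rates directly, which is closer to the evaluation the abstract describes.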


2021 ◽  
pp. 026-035
Author(s):  
A.M. Pokrovskyi

With the rapid development of software quality measurement methods, the need for efficient and versatile reengineering automation tools grows ever more pressing. This becomes even more apparent when the programming language and the respective coding practices develop slowly alongside each other over a long period of time, while the legacy code base grows larger and remains highly relevant. In this paper, a source code metrics measurement tool for Fortran program quality evaluation is developed. It is implemented as a code module for the Photran integrated development environment and is based on a set of syntax tree walking algorithms. The module utilizes the built-in Photran syntax analysis engine and the tree data structure it builds from the source code. The developed tool is also compared to existing source code analysis instruments. The results show that the developed tool is most effective when used in combination with Photran's built-in refactoring system, and that Photran's application programming interface facilitates easy scaling of the existing infrastructure by introducing other code analysis methods.
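The module's core approach, computing metrics by walking a parsed syntax tree, is language-agnostic. Below is a hedged sketch of the same pattern using Python's stdlib `ast` module in place of Photran's Java-based tree API; the two metrics (function count and maximum nesting depth) are illustrative choices, not the paper's metric set.

```python
# Hedged sketch: metric computation by syntax tree walking, shown with
# Python's stdlib `ast` instead of Photran's Fortran syntax tree.

import ast

class MetricVisitor(ast.NodeVisitor):
    """Counts function definitions and tracks maximum control nesting depth."""

    def __init__(self):
        self.functions = 0
        self.max_depth = 0
        self._depth = 0

    def _nested(self, node):
        # Entering a control structure increases the current nesting depth.
        self._depth += 1
        self.max_depth = max(self.max_depth, self._depth)
        self.generic_visit(node)
        self._depth -= 1

    visit_If = visit_For = visit_While = _nested

    def visit_FunctionDef(self, node):
        self.functions += 1
        self.generic_visit(node)

source = """
def f(xs):
    for x in xs:
        if x > 0:
            print(x)
"""

v = MetricVisitor()
v.visit(ast.parse(source))
print(v.functions, v.max_depth)  # 1 function, nesting depth 2
```

The visitor pattern used here mirrors the tree-walking algorithms the abstract describes: each metric is an accumulator updated as the walk enters and leaves nodes of interest.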


Technologies ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 3
Author(s):  
Gábor Antal ◽  
Zoltán Tóth ◽  
Péter Hegedűs ◽  
Rudolf Ferenc

Bug prediction aims at finding source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited amount of testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. In this paper, we propose a function level JavaScript bug prediction model based on static source code metrics with the addition of a hybrid (static and dynamic) code analysis based metric of the number of incoming and outgoing function calls (HNII and HNOI). Our motivation for this is that JavaScript is a highly dynamic scripting language for which static code analysis might be very imprecise; therefore, using purely static source code features for bug prediction might not be enough. Based on a study where we extracted 824 buggy and 1943 non-buggy functions from the publicly available BugsJS dataset for the ESLint JavaScript project, we can confirm the positive impact of hybrid code metrics on the prediction performance of the ML models. Depending on the ML algorithm, applied hyper-parameters, and target measures we consider, hybrid invocation metrics bring a 2–10% increase in model performances (i.e., precision, recall, F-measure). Interestingly, replacing the static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performances; however, using them all together yields the best results.
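The 2–10% gains reported above are measured in precision, recall, and F-measure, all derived from a confusion matrix. The sketch below shows how such a comparison is computed; the two prediction lists are invented to illustrate a static-only model against a hybrid one, not the paper's actual results.

```python
# Hedged sketch: precision, recall, and F-measure from confusion counts,
# used to compare a static-metrics model against a hybrid-metrics model.

def prf(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

actual       = [1, 1, 1, 1, 0, 0, 0, 0]
static_model = [1, 1, 0, 0, 1, 0, 0, 0]  # e.g. NOI/NII features only
hybrid_model = [1, 1, 1, 0, 1, 0, 0, 0]  # e.g. HNOI/HNII added

s = prf(actual, static_model)
h = prf(actual, hybrid_model)
print("static:", s)
print("hybrid:", h)
```

F-measure is the harmonic mean of precision and recall, so it rewards models that improve both at once rather than trading one for the other.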


Author(s):  
Oscar E. Perez-Cham ◽  
Carlos Soubervielle-Montalvo ◽  
Alberto S. Nunez-Varela ◽  
Cesar Puente ◽  
Luis J. Ontanon-Garcia

Author(s):  
Paulo Meirelles ◽  
Carlos Santos Jr. ◽  
Joao Miranda ◽  
Fabio Kon ◽  
Antonio Terceiro ◽  
...  

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Dejun Jiang ◽  
Zhenxing Wu ◽  
Chang-Yu Hsieh ◽  
Guangyong Chen ◽  
Ben Liao ◽  
...  

Abstract
Graph neural networks (GNNs) have been considered an attractive modelling method for molecular property prediction, and numerous studies have shown that GNNs could yield more promising results than traditional descriptor-based methods. In this study, based on 11 public datasets covering various property endpoints, the predictive capacity and computational efficiency of the prediction models developed by eight machine learning (ML) algorithms, including four descriptor-based models (SVM, XGBoost, RF and DNN) and four graph-based models (GCN, GAT, MPNN and Attentive FP), were extensively tested and compared. The results demonstrate that on average the descriptor-based models outperform the graph-based models in terms of prediction accuracy and computational efficiency. SVM generally achieves the best predictions for the regression tasks. Both RF and XGBoost can achieve reliable predictions for the classification tasks, and some of the graph-based models, such as Attentive FP and GCN, can yield outstanding performance for a few of the larger or multi-task datasets. In terms of computational cost, XGBoost and RF are the two most efficient algorithms and only need a few seconds to train a model even for a large dataset. The model interpretations by the SHAP method can effectively explore the established domain knowledge for the descriptor-based models. Finally, we explored the use of these models for virtual screening (VS) towards HIV and demonstrated that different ML algorithms offer diverse VS profiles. All in all, we believe that the off-the-shelf descriptor-based models still can be directly employed to accurately predict various chemical endpoints with excellent computability and interpretability.
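The comparison the study performs follows a standard benchmarking pattern: train every model on the same split, then record predictive accuracy and wall-clock training time. The sketch below shows only that harness pattern with trivial stand-in models (a majority-class baseline and 1-nearest-neighbour), not SVM, XGBoost, or any GNN; all data is invented.

```python
# Hedged sketch of a model-comparison harness: same split, record
# accuracy and training time per model. Models here are toy stand-ins.

import time

def train_majority(X, y):
    """Baseline: always predict the most frequent training label."""
    label = max(set(y), key=y.count)
    return lambda x: label

def train_nearest(X, y):
    """1-nearest-neighbour on a single numeric feature."""
    def predict(x):
        i = min(range(len(X)), key=lambda j: abs(X[j] - x))
        return y[i]
    return predict

def benchmark(models, X, y, X_test, y_test):
    results = {}
    for name, trainer in models.items():
        start = time.perf_counter()
        predict = trainer(X, y)
        elapsed = time.perf_counter() - start
        acc = sum(predict(x) == t
                  for x, t in zip(X_test, y_test)) / len(y_test)
        results[name] = (acc, elapsed)
    return results

X, y = [0.1, 0.2, 0.3, 1.0], [0, 0, 0, 1]
X_test, y_test = [0.15, 0.95], [0, 1]

results = benchmark({"majority": train_majority, "1-nn": train_nearest},
                    X, y, X_test, y_test)
print(results)
```

Holding the split fixed across models, as above, is what makes the accuracy and training-time numbers directly comparable.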


2019 ◽  
Vol 2019 (2) ◽  
pp. 117-126
Author(s):  
Chinmay Hota ◽  
Lov Kumar ◽  
Lalita Bhanu Murthy Neti
