Software Defect Density Analysis

10.29007/rh9l ◽  
2019 ◽  
Author(s):  
Cuauhtémoc López-Martín

Defect density (DD) is a measure used to determine the effectiveness of software processes. DD is defined as the total number of defects divided by the size of the software. Software prediction is an activity of software planning. This study analyzes the attributes of data sets commonly used for building DD prediction models. The software project data sets were selected from the International Software Benchmarking Standards Group (ISBSG) Release 2018. The selection criteria were based on attributes such as type of development, development platform, and programming language generation, as suggested by the ISBSG. Since applying these criteria reduces the size of the resulting data sets, it hinders good generalization by the models. Therefore, in this study, a statistical analysis of the data sets was performed to determine whether they could be pooled rather than used as separate data sets. Results showed that there was no difference among the DD of new projects nor among the DD of enhancement projects, but there was a difference between the DD of new and enhancement projects. These results suggest that prediction models can be constructed separately for new projects and for enhancement projects, but not by pooling new and enhancement ones.
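A minimal sketch of the DD measure defined above, together with the kind of statistical comparison that could decide whether two project groups can be pooled. The abstract does not name the test used in the study; the Mann-Whitney U test and the sample values below are illustrative assumptions.

```python
# Defect density and a pooling check (hypothetical data, illustrative test).
from scipy.stats import mannwhitneyu

def defect_density(total_defects, size_fp):
    """DD = total number of defects / software size (e.g., function points)."""
    return total_defects / size_fp

# Hypothetical DD samples for new-development and enhancement projects.
dd_new = [defect_density(d, s) for d, s in [(12, 400), (7, 250), (20, 610)]]
dd_enh = [defect_density(d, s) for d, s in [(30, 420), (25, 300), (41, 550)]]

# A significant p-value would argue against pooling the two groups.
stat, p = mannwhitneyu(dd_new, dd_enh, alternative="two-sided")
print(f"U={stat:.1f}, p={p:.3f} -> pool only if distributions do not differ")
```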


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset is to use it to build prediction models (in the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Owing to its single parameter, the proposed clustering method is shown to be orders of magnitude more efficient than SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied to data extraction are identified, and the proposed method is shown to be a solution to them.
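The abstract does not detail the proposed algorithm, so the sketch below is only a generic stand-in for the idea of single-parameter representative-subset extraction: a greedy "leader" clustering whose one parameter is a radius r.

```python
# Single-parameter representative-subset extraction (illustrative stand-in).
import numpy as np

def extract_representatives(X, r):
    """Keep a point as a representative if it lies farther than r
    from every representative selected so far."""
    reps = []
    for x in X:
        if all(np.linalg.norm(x - c) > r for c in reps):
            reps.append(x)
    return np.array(reps)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))             # hypothetical large data set
subset = extract_representatives(X, r=0.8)
print(f"{len(subset)} representatives extracted from {len(X)} points")
```

A smaller r yields a denser, larger subset; a method to choose r would play the role of the parameter-selection procedure the paper proposes.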



2020 ◽  
Vol 45 (4) ◽  
pp. 737-763 ◽  
Author(s):  
Anirban Laha ◽  
Parag Jain ◽  
Abhijit Mishra ◽  
Karthik Sankaranarayanan

We present a framework for generating natural language descriptions from structured data such as tables; the problem falls under the category of data-to-text natural language generation (NLG). Modern data-to-text NLG systems typically use end-to-end statistical and neural architectures that learn from a limited amount of task-specific labeled data, and therefore exhibit limited scalability, domain-adaptability, and interpretability. Unlike these systems, ours is a modular, pipeline-based approach, and does not require task-specific parallel data. Rather, it relies on monolingual corpora and basic off-the-shelf NLP tools. This makes our system more scalable and more easily adaptable to newer domains. Our system uses a three-stage pipeline that: (i) converts entries in the structured data to canonical form, (ii) generates simple sentences for each atomic entry in the canonicalized representation, and (iii) combines the sentences to produce a coherent, fluent, and adequate paragraph description through sentence compounding and co-reference replacement modules. Experiments on a benchmark mixed-domain data set curated for paragraph description from tables reveal the superiority of our system over existing data-to-text approaches. We also demonstrate the robustness of our system in accepting other popular data sets covering diverse data types such as knowledge graphs and key-value maps.
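A schematic sketch of the three-stage pipeline described above. All function names and the toy template realizations are assumptions; the actual system builds these stages from monolingual corpora and off-the-shelf NLP tools.

```python
# Three-stage table-to-text pipeline skeleton (illustrative stand-ins only).

def canonicalize(table):
    """Stage (i): flatten structured entries into canonical triples."""
    return [(table["subject"], k, v) for k, v in table["facts"].items()]

def to_simple_sentence(triple):
    """Stage (ii): one simple sentence per atomic entry (template stand-in)."""
    subj, attr, val = triple
    return f"{subj}'s {attr} is {val}."

def fuse(sentences):
    """Stage (iii): trivial stand-in for the sentence-compounding and
    co-reference replacement modules that produce the final paragraph."""
    return " ".join(sentences)

table = {"subject": "Marie Curie",
         "facts": {"birth year": "1867", "field": "physics and chemistry"}}
print(fuse([to_simple_sentence(t) for t in canonicalize(table)]))
```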



2021 ◽  
Author(s):  
Alessandra Toniato ◽  
Philippe Schwaller ◽  
Antonio Cardinale ◽  
Joppe Geluykens ◽  
Teodoro Laino

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (> 90% for Natural Language Processing-based ones). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. While human curation is prohibitively expensive, unaided approaches to removing chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the Pistachio collection of chemical reactions and to an open data set, both extracted from USPTO (United States Patent Office) patents. Our results show improved prediction quality for models trained on the cleaned and balanced data sets. For the retrosynthetic models, the round-trip accuracy metric grows by 13 percentage points and the value of the cumulative Jensen-Shannon divergence decreases by 30% compared to its original record. Coverage remains high at 97%, and the class diversity is not affected by the cleaning. The proposed strategy is the first unassisted, rule-free technique to address automatic noise reduction in chemical data sets.
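The abstract does not give the cleaning procedure itself, so the sketch below only illustrates the general pattern of unassisted, model-based noise removal: drop entries that a cross-validated model contradicts with high confidence. A generic classifier stands in for the reaction model; all parameters are assumptions.

```python
# Model-based removal of suspect entries from a labeled collection.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def remove_noisy_entries(X, y, threshold=0.9):
    """Drop entries whose recorded label is contradicted with high
    confidence by out-of-fold model predictions."""
    X, y = np.asarray(X), np.asarray(y)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    classes = np.unique(y)                      # matches sklearn column order
    predicted = classes[proba.argmax(axis=1)]
    suspect = (predicted != y) & (proba.max(axis=1) > threshold)
    return X[~suspect], y[~suspect]
```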



2021 ◽  
Vol 143 (11) ◽  
Author(s):  
Mohsen Faramarzi-Palangar ◽  
Behnam Sedaee ◽  
Mohammad Emami Niri

Abstract The correct definition of rock types plays a critical role in reservoir characterization, simulation, and field development planning. In this study, we use the critical pore size (linf) as an approach to reservoir rock typing. Two linf relations were derived separately from two permeability prediction models and then merged to derive a generalized linf relation. The proposed rock typing methodology consists of two main parts: in the first part, we determine an appropriate constant coefficient, and in the second part, we perform reservoir rock typing based on two different scenarios. The first scenario forms groups of rocks using statistical analysis, and the second forms groups of rocks with similar capillary pressure curves. This approach was applied to three data sets: two data sets were used to determine the constant coefficient, and one data set was used to show the applicability of the linf method for rock typing in comparison with FZI.
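The paper's two permeability-based linf relations and its fitted constant coefficient are not reproduced in this abstract, so the sketch below assumes a commonly used Katz-Thompson-style form, linf = c * sqrt(k / phi), purely to illustrate the workflow of computing linf and grouping rocks statistically (the first scenario).

```python
# Rock typing by a critical pore size proxy (assumed functional form).
import numpy as np

def l_inf(k_md, phi, c=1.0):
    """Critical pore size proxy from permeability (mD) and porosity (frac);
    the form and the constant c are assumptions, not the paper's relation."""
    return c * np.sqrt(k_md / phi)

k = np.array([0.5, 12.0, 150.0, 900.0])     # hypothetical core permeabilities
phi = np.array([0.08, 0.14, 0.21, 0.27])    # hypothetical porosities

values = l_inf(k, phi, c=1.0)
# Scenario 1: form rock-type groups statistically, here by quantile binning
# of l_inf (the binning choice is illustrative).
bins = np.quantile(values, [0.25, 0.5, 0.75])
rock_type = np.digitize(values, bins)
print(dict(zip(values.round(1), rock_type)))
```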



2018 ◽  
Vol 232 ◽  
pp. 03017
Author(s):  
Jie Zhang ◽  
Gang Wang ◽  
Haobo Jiang ◽  
Fangzheng Zhao ◽  
Guilin Tian

Software defect prediction has been an important part of software engineering research since the 1970s. The technique uses the measurement and defect information of historical software modules to predict defects in new software modules. Currently, most software defect prediction models are built on data from a single software project: the training data sets used to construct the model and the test data sets used to validate it come from the same project. In practice, however, for projects with little historical data or for entirely new projects, traditional prediction methods show poor forecast performance. When historical data are insufficient, a traditional software defect prediction model cannot be fully trained, and high prediction accuracy is difficult to achieve. In cross-project prediction, the problem we face instead is differences in data distribution. To address these problems, this paper presents a software defect prediction model that combines transfer learning with a traditional software defect prediction model, using existing project data sets to predict software defects across projects. The main work of this article includes: 1) Data preprocessing. This part includes data feature correlation analysis, noise reduction, and so on, which effectively avoids the interference of over-fitting and noisy data on prediction results. 2) Transfer learning. This part analyzes two different but related project data sets and reduces the impact of differences in data distribution. 3) Artificial neural networks. To address the class imbalance of the data sets, an artificial neural network with dynamic selection of training samples is used to reduce the influence of the imbalance between positive and negative samples on the prediction results. The Relink and AEEEM project data sets are used to evaluate performance via the F-measure, ROC curves, and AUC. Experiments show that the model achieves high predictive performance.
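A schematic sketch of the three parts listed above, with generic stand-ins of my own: a correlation-based feature filter, a simple instance-weighting form of transfer learning, and a neural network classifier. The paper's exact methods are not specified in this abstract.

```python
# Cross-project defect prediction: preprocessing, transfer step, ANN.
import numpy as np
from sklearn.neural_network import MLPClassifier

def drop_correlated_features(X, threshold=0.95):
    """1) Preprocessing: remove one of each pair of near-duplicate features."""
    corr = np.corrcoef(X, rowvar=False)
    keep = [i for i in range(X.shape[1])
            if not any(abs(corr[i, j]) > threshold for j in range(i))]
    return X[:, keep], keep

def source_weights(X_source, X_target):
    """2) Transfer step: up-weight source-project instances that lie close
    to the target-project distribution (a simple distance heuristic)."""
    d = np.linalg.norm(X_source[:, None] - X_target[None, :], axis=2).min(axis=1)
    return 1.0 / (1.0 + d)

def train_cross_project(X_source, y_source, X_target):
    """3) ANN trained on source data resampled by the transfer weights
    (resampling also crudely rebalances the classes)."""
    Xs, keep = drop_correlated_features(X_source)
    w = source_weights(Xs, X_target[:, keep])
    idx = np.random.default_rng(0).choice(len(Xs), size=len(Xs), p=w / w.sum())
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    return model.fit(Xs[idx], y_source[idx])
```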



2020 ◽  
Vol 10 (8) ◽  
pp. 2725-2739 ◽  
Author(s):  
Diego Jarquin ◽  
Reka Howard ◽  
Jose Crossa ◽  
Yoseph Beyene ◽  
Manje Gowda ◽  
...  

“Sparse testing” refers to reduced multi-environment breeding trials in which not all genotypes of interest are grown in each environment. Using genomic-enabled prediction and a model embracing genotype × environment interaction (GE), the non-observed genotype-in-environment combinations can be predicted. Consequently, the overall costs can be reduced and the testing capacities can be increased. The accuracy of predicting the unobserved data depends on different factors including (1) how many genotypes overlap between environments, (2) in how many environments each genotype is grown, and (3) which prediction method is used. In this research, we studied the predictive ability obtained when using a fixed number of plots and different sparse testing designs. The considered designs included the extreme cases of (1) no overlap of genotypes between environments, and (2) complete overlap of the genotypes between environments; in the latter case, the prediction set consists entirely of genotypes that have not been tested at all. Moreover, we gradually move from one extreme to the other by considering (3) intermediate cases with varying numbers of non-overlapping (NO) and overlapping (O) genotypes. The empirical study is built upon two different maize hybrid data sets consisting of different genotypes crossed to two different testers (T1 and T2), and each data set was analyzed separately. For each set, phenotypic records on yield from three different environments are available. Three different prediction models were implemented: two main-effects models (M1 and M2), and a model (M3) including GE. The results showed that the genome-based model including GE (M3) captured more phenotypic variation than the models that did not include this component. M3 also provided higher prediction accuracy than models M1 and M2 for the different allocation scenarios. Reducing the size of the calibration sets decreased the prediction accuracy under all allocation designs, with M3 the least affected model; however, with the genome-enabled models (i.e., M2 and M3), predictive ability is recovered when more genotypes are tested across environments. Our results indicate that a substantial part of the testing resources can be saved when using genome-based models including GE to optimize sparse testing designs.
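A minimal sketch of generating sparse-testing allocations between the two extremes discussed above: no overlap and complete overlap of genotypes across environments. The allocation rule and all sizes are illustrative, not the paper's design generator.

```python
# Allocate genotypes to environments with a tunable amount of overlap.
import numpy as np

def allocate(genotypes, n_env, n_overlap, n_per_env, seed=0):
    """Assign `n_overlap` genotypes to every environment, then fill each
    environment's remaining plots with distinct, non-overlapping genotypes."""
    rng = np.random.default_rng(seed)
    g = list(rng.permutation(genotypes))
    shared, rest = g[:n_overlap], g[n_overlap:]
    envs = []
    for e in range(n_env):
        unique = rest[e * (n_per_env - n_overlap):(e + 1) * (n_per_env - n_overlap)]
        envs.append(shared + unique)
    return envs

genos = [f"G{i:03d}" for i in range(300)]     # hypothetical hybrids
design = allocate(genos, n_env=3, n_overlap=20, n_per_env=100)
print([len(set(e)) for e in design])          # 100 genotypes per environment
```

Setting n_overlap = n_per_env reproduces the complete-overlap extreme; n_overlap = 0 reproduces the no-overlap extreme.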



2014 ◽  
Vol 687-691 ◽  
pp. 2182-2185 ◽  
Author(s):  
Wei Zhang ◽  
Zhen Yu Ma ◽  
Qing Ling Lu ◽  
Xiao Bing Nie ◽  
Juan Liu

This paper analyzed 44 metrics at the application, file, class, and function levels and performed correlation analysis against the number of software defects and the defect density. The results show that the software metrics have little correlation with the number of software defects but are correlated with defect density. Through correlation analysis, we selected the five metrics most strongly correlated with defect density. On the basis of this feature selection, we predicted defect density with 16 machine learning models for 33 actual software projects. The results show that the Spearman rank correlation coefficient (SRCC) between the predicted and actual defect density is 0.6727 for the SVR model, higher than for the other 15 machine learning models; the model with the second-largest absolute SRCC is the IBk model, with an SRCC of only -0.3557. These results indicate that the SVR-based method has the highest prediction accuracy.
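A minimal sketch of the evaluation described above: predict defect density with support vector regression and score the predictions with Spearman's rank correlation coefficient (SRCC). The data and feature construction below are hypothetical stand-ins.

```python
# SVR defect-density prediction scored by SRCC (synthetic data).
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(33, 5))                  # 5 selected metrics, 33 projects
y = np.abs(X @ rng.normal(size=5)) + 0.1      # stand-in defect densities

pred = cross_val_predict(SVR(kernel="rbf"), X, y, cv=5)
srcc, _ = spearmanr(y, pred)
print(f"SRCC between predicted and actual DD: {srcc:.4f}")
```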



2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Dipendra Jha ◽  
Kamal Choudhary ◽  
Francesca Tavazza ◽  
Wei-keng Liao ◽  
Alok Choudhary ◽  
...  

Abstract The current predictive modeling techniques applied to Density Functional Theory (DFT) computations have helped accelerate the process of materials discovery by providing significantly faster methods to screen materials candidates, thereby reducing the search space for future DFT computations and experiments. However, in addition to prediction error against DFT-computed properties, such predictive models also inherit the DFT-computation discrepancies against experimentally measured properties. To address this challenge, we demonstrate that using deep transfer learning, existing large DFT-computational data sets (such as the Open Quantum Materials Database (OQMD)) can be leveraged together with other smaller DFT-computed data sets as well as available experimental observations to build robust prediction models. We build a highly accurate model for predicting the formation energy of materials from their compositions; using an experimental data set of 1,643 observations, the proposed approach yields a mean absolute error (MAE) of 0.07 eV/atom, which is significantly better than existing machine learning (ML) prediction models based on DFT computations and is comparable to the MAE of the DFT computation itself.
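A minimal sketch of the deep transfer learning recipe described above: pre-train a network on a large DFT-computed data set, then fine-tune the same network on the small experimental set. The data are synthetic stand-ins and the small scikit-learn network is an assumption; the actual model is a much larger composition-based deep network.

```python
# Pre-train on DFT-like data, then fine-tune on experimental-like data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_dft, y_dft = rng.normal(size=(20000, 20)), rng.normal(size=20000)  # OQMD-like
X_exp, y_exp = rng.normal(size=(1643, 20)), rng.normal(size=1643)    # experimental

model = MLPRegressor(hidden_layer_sizes=(64, 64), warm_start=True,
                     max_iter=50, random_state=0)
model.fit(X_dft, y_dft)          # pre-training on DFT formation energies
model.set_params(max_iter=20, learning_rate_init=1e-4)
model.fit(X_exp, y_exp)          # warm_start=True: fine-tune the same weights

mae = np.mean(np.abs(model.predict(X_exp) - y_exp))
print(f"MAE on experimental set: {mae:.3f} eV/atom (synthetic data)")
```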



2021 ◽  
Vol 10 (02) ◽  
pp. 170-186
Author(s):  
Normadiah Mahiddin ◽  
Zulaiha Ali Othman ◽  
Nur Arzuar Abdul Rahim

Diabetes is one of the growing chronic diseases, and proper treatment is needed to manage its effects. Past studies have proposed an Interrelated Decision-making Model (IDM) as an intelligent decision support system (IDSS) solution for healthcare. This model can provide accurate results in determining the treatment of a particular patient. The purpose of this study is therefore to develop a diabetic IDM and assess the gain in decision-making accuracy from the IDM concept. The IDM concept allows the amount of data to grow through the addition of data records at the same level of care, and through the addition of data records and attributes from the previous or subsequent levels of care. The more data or information available, the more accurate the decision that can be made. Data were developed to make diagnostic predictions for each stage of care in the development of type 2 diabetes, and the data design for each stage of care was confirmed by specialist doctors. The experiments, however, were performed using simulation data for two stages of care only. Four data sets of different sizes were prepared to observe changes in forecast accuracy. Each data set contained two data sets at the primary and secondary care levels, with the number of attributes varied four times from 25 to 58 and the number of records from 300 to 11,000. The experimental results showed that, on average, the J48 algorithm produced the best model (99%), followed by Logistic (98%), RandomTree (95%), NaiveBayesUpdateable (93%), BayesNet (84%), and AdaBoostM1 (67%). Ratio analysis also showed that the accuracy of the forecast model increased by up to 49%. The MAPKB model for diabetes care is designed with dynamically changing data criteria and is able to develop up-to-date dynamic prediction models effectively.
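A minimal sketch of the experiment pattern described above: compare classifier accuracy as the interrelated data sets grow in records and attributes. scikit-learn stand-ins replace the Weka learners (a decision tree approximates J48), and the data below are synthetic.

```python
# Accuracy vs. data-set size/width for two stand-in classifiers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
for n_records, n_attrs in [(300, 25), (1000, 35), (5000, 45), (11000, 58)]:
    X = rng.normal(size=(n_records, n_attrs))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in diagnosis label
    for name, clf in [("J48-like tree", DecisionTreeClassifier(random_state=0)),
                      ("NaiveBayes", GaussianNB())]:
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{n_records:>5} records, {n_attrs} attrs, {name}: {acc:.2f}")
```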



Author(s):  
WEN ZHANG ◽  
YE YANG ◽  
QING WANG

Software effort data contain a large number of missing values of project attributes. The problem of absent features, which has recently arisen in machine learning, is often neglected by software engineering researchers when handling the missingness in software effort data. In essence, absent features (structural missingness) and unobserved values (unstructured missingness) are different cases of missingness, although their appearance in the data set is the same. This paper attempts to clarify the root cause of missingness in software effort data. When regarding missingness as absent features, we develop max-margin regression to predict the real effort of software projects. When regarding missingness as unobserved values, we use existing imputation techniques to impute missing values; then, ε-SVR is used to predict the real effort of software projects from the resulting data sets. Experiments on the ISBSG (International Software Benchmarking Standards Group) and CSBSG (Chinese Software Benchmarking Standard Group) data sets demonstrate that, for effort prediction tasks, treating missingness in software effort data as unobserved values produces more desirable performance than treating it as absent features. This paper is the first to introduce the concept of absent features to deal with missingness in software effort data.
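A minimal sketch of the "unobserved values" treatment described above: impute the missing effort-data attributes, then predict effort with ε-SVR. The mean imputer is only one illustrative choice; the paper compares several imputation techniques, and the data below are hypothetical.

```python
# Imputation followed by epsilon-SVR effort prediction (synthetic data).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                # 8 project attributes
X[rng.random(X.shape) < 0.2] = np.nan        # 20% values missing at random
y = np.abs(rng.normal(size=200)) * 1000      # hypothetical effort (person-hours)

model = make_pipeline(SimpleImputer(strategy="mean"),
                      StandardScaler(),
                      SVR(kernel="rbf", epsilon=0.1))   # the epsilon in ε-SVR
model.fit(X, y)
print(f"predicted effort of first project: {model.predict(X[:1])[0]:.0f}")
```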


