Empirical Studies of a Kernel Density Estimation Based Naive Bayes Method for Software Defect Prediction

2019 ◽  
Vol E102.D (1) ◽  
pp. 75-84 ◽  
Author(s):  
Haijin JI ◽  
Song HUANG ◽  
Xuewei LV ◽  
Yaning WU ◽  
Yuntian FENG

2019 ◽  
Vol 27 (3) ◽  
pp. 923-968
Author(s):  
Haijin Ji ◽  
Song Huang ◽  
Yaning Wu ◽  
Zhanwei Hui ◽  
Changyou Zheng

2020 ◽  
Vol 10 (23) ◽  
pp. 8324
Author(s):  
Yumei Wu ◽  
Jingxiu Yao ◽  
Shuo Chang ◽  
Bin Liu

Software defect prediction (SDP) is an effective technique for lowering software module testing costs. However, imbalanced class distributions exist in almost all SDP datasets and restrict the accuracy of defect prediction. To rebalance the data distribution reasonably, we propose a novel resampling method, LIMCR, built on Naïve Bayes, to optimize and improve SDP performance. The main idea of LIMCR is to evaluate how informative each sample from the majority class is, and then remove the less-informative majority samples to rebalance the data distribution. We employ 29 SDP datasets from the PROMISE and NASA repositories and divide them into two groups: small datasets (fewer than 1,100 samples) and large datasets (1,100 samples or more). We then conduct experiments comparing combinations of classifiers and imbalance-learning methods on the small and large datasets, respectively. The results show the effectiveness of LIMCR: LIMCR combined with Gaussian Naïve Bayes (GNB) outperforms the other methods on small datasets, though it is less competitive on large datasets.
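The abstract above describes LIMCR as removing "less-informative" majority samples after scoring each one, but does not spell out the informativeness measure here. A minimal sketch of that idea, assuming (as a plausible proxy) that informativeness is measured by the Gaussian Naïve Bayes log-likelihood margin, so that majority samples far from the decision boundary are dropped first:

```python
import numpy as np

def limcr_undersample(X, y, majority_label=0, target_ratio=1.0):
    """Sketch of LIMCR-style undersampling: score each majority-class
    sample under a per-class Gaussian model (diagonal covariance, as in
    Gaussian Naive Bayes), then keep only the most informative majority
    samples until the classes are balanced."""
    maj = np.where(y == majority_label)[0]
    mino = np.where(y != majority_label)[0]

    def fit(idx):
        # Per-class mean and variance; small floor avoids zero variance.
        return X[idx].mean(axis=0), X[idx].var(axis=0) + 1e-9

    mu0, var0 = fit(maj)
    mu1, var1 = fit(mino)

    def loglik(x, mu, var):
        # Diagonal-Gaussian log-likelihood of each row of x.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)

    # Informativeness proxy (an assumption, not the paper's definition):
    # samples near the decision boundary (small likelihood margin)
    # carry more information for the classifier.
    margin = np.abs(loglik(X[maj], mu0, var0) - loglik(X[maj], mu1, var1))
    n_keep = min(len(maj), int(target_ratio * len(mino)))
    keep_maj = maj[np.argsort(margin)[:n_keep]]  # most informative first
    keep = np.concatenate([keep_maj, mino])
    return X[keep], y[keep]
```

With `target_ratio=1.0` the returned set is balanced; larger ratios keep proportionally more majority samples.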


2021 ◽  
Vol 5 (1) ◽  
pp. 233
Author(s):  
Andre Hardoni ◽  
Dian Palupi Rini ◽  
Sukemi Sukemi

Software defects are one of the main contributors to information technology waste and lead to rework, consuming considerable time and money. Software defect prediction aims at defect prevention by classifying modules as defective or not defective. Many researchers have studied software defect prediction using the public NASA MDP datasets, but these datasets still have shortcomings such as class imbalance and noisy attributes. The class-imbalance problem can be addressed with SMOTE (Synthetic Minority Over-sampling Technique), and the noisy-attribute problem can be mitigated by selecting features with Particle Swarm Optimization (PSO). In this research, the integration of SMOTE and PSO is therefore applied to two machine-learning classifiers, naïve Bayes and logistic regression. Experiments on 8 NASA MDP datasets, each split into training and testing data, show that integrating SMOTE and PSO improves the performance of both classifiers, with average AUC (Area Under Curve) values of 0.89 for logistic regression and 0.86 for naïve Bayes on the training data, in both cases better than without combining the two techniques.
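The core of the SMOTE step above is synthesizing new minority-class samples by interpolating between a minority sample and one of its k nearest minority-class neighbours. A minimal NumPy sketch of that idea (the abstract does not give parameter settings, so k and the interpolation scheme here are the standard defaults from Chawla et al.'s formulation, not the paper's):

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples.
    Each synthetic point lies on the line segment between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]    # indices of k nearest neighbours
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                       # random minority sample
        b = nn[a, rng.integers(min(k, n - 1))]    # one of its neighbours
        out[i] = X_min[a] + rng.random() * (X_min[b] - X_min[a])
    return out
```

The synthetic samples are appended to the training set before fitting the classifier; in practice a library implementation such as imbalanced-learn's `SMOTE` would be used instead of this sketch.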


Author(s):  
Joko Suntoro ◽  
Febrian Wahyu Christanto ◽  
Henny Indriyawati

One of the most important tasks in software engineering is software defect prediction: the process of predicting which software modules contain errors, failures, or faults. Researchers apply machine-learning methods to predict software defects, including estimation, association, classification, clustering, and dataset analysis. The NASA Metrics Data Program (NASA MDP) datasets are among the software-metrics datasets researchers use to predict software defects. The NASA MDP datasets contain imbalanced classes and high-dimensional data, which drive classification evaluation results down. In this research, class imbalance is handled with the AdaCost method and high dimensionality with the Average Weight Information Gain (AWEIG) feature-selection method, while the classifier is the Naïve Bayes algorithm. The proposed method is named AWEIG + AdaCost Bayesian. In the experiments, the AWEIG + AdaCost Bayesian algorithm is compared to the plain Naïve Bayes algorithm. The results show that AWEIG + AdaCost Bayesian yields a better mean Area Under the Curve (AUC) than Naïve Bayes alone, with mean AUC values of 0.752 and 0.696, respectively.
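AWEIG builds on information gain, which scores a feature by how much knowing its value reduces uncertainty about the class label. The abstract does not define AWEIG's weighting here, so the sketch below shows only the underlying information-gain component, with continuous features discretised into quantile bins (a common convention, assumed rather than taken from the paper):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y, bins=4):
    """Information gain of one feature with respect to the labels:
    H(y) minus the weighted entropy of y within each feature bin."""
    # Discretise the continuous feature into quantile bins.
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    groups = np.digitize(x, edges)
    h_cond = 0.0
    for g in np.unique(groups):
        mask = groups == g
        h_cond += mask.mean() * entropy(y[mask])  # P(bin) * H(y | bin)
    return entropy(y) - h_cond
```

Features would then be ranked by this score and the lowest-ranked ones dropped before training the Naïve Bayes classifier; AWEIG additionally weights the scores, in a manner not detailed in this abstract.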

