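Mengatasi Imbalanced Class Pada Software Defect Prediction Menggunakan Two-Step Clustering-Based Undersampling dan Bagging Technique (Handling Imbalanced Classes in Software Defect Prediction Using Two-Step Clustering-Based Undersampling and the Bagging Technique)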

2019 ◽  
Vol 6 (1) ◽  
pp. 107-113
Author(s):  
Muhammad Faittullah Akbar ◽  
Ilham Kurniawan ◽  
Ahmad Fauzi

Class imbalance is a common problem in real-world datasets, where one class (the minority class) contains a small number of data points and the other (the majority class) contains a large number. It is very difficult to develop an effective model with data mining and machine learning algorithms without preprocessing the data to balance the imbalanced dataset. Random undersampling and oversampling have been used in many studies to ensure that the different classes contain the same number of data points. In this study, we propose a combination of two-step clustering-based random undersampling and the bagging technique to improve the accuracy of software defect prediction (SDP). The proposed method is evaluated on five datasets from the NASA Metrics Data Program repository, with area under the curve (AUC) as the main evaluation metric. The results show that the proposed method yields excellent performance on all datasets (AUC > 0.9). In terms of SN, the second experiment outperforms the first on most datasets (3 of 5). In terms of SP, the first experiment does not outperform the second on any dataset. Overall, the second experiment is better than the first, because AUC is the main evaluation metric for imbalanced classification tasks such as SDP. It can therefore be concluded that the proposed method yields optimal performance on both small-scale and large-scale datasets.
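The undersampling idea in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it swaps in plain k-means for the two-step clustering algorithm, omits the bagging stage, and the function names, the cluster count `k`, and the sample data are all assumptions made for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (a stand-in for two-step clustering); returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay in place.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Randomly keep n_keep majority points, spreading the draws across clusters."""
    rng = random.Random(seed)
    n_keep = min(n_keep, len(majority))
    labels = kmeans(majority, k, seed=seed)
    clusters = [[p for p, l in zip(majority, labels) if l == c] for c in range(k)]
    for members in clusters:
        rng.shuffle(members)
    kept = []
    # Round-robin over clusters so each region of the majority class is represented.
    while len(kept) < n_keep:
        for members in clusters:
            if members and len(kept) < n_keep:
                kept.append(members.pop())
    return kept

# Made-up data: 40 defect-free modules (majority) and 6 defective ones (minority).
majority = [(float(i), float(i % 5)) for i in range(40)]
minority = [(100.0 + i, 1.0) for i in range(6)]
balanced = cluster_undersample(majority, len(minority)) + minority
print(len(balanced))
```

Drawing from every cluster, rather than from the majority class as a whole, is one way to keep the reduced sample representative of the different regions of the majority class.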

2017 ◽  
Vol 79 (7-2) ◽  
Author(s):  
Adi Wijaya ◽  
Romi Satria Wahono

The cost of finding and correcting software defects is high and increases exponentially during software development. Software defect prediction (SDP) can be used in the early phases to reduce testing and maintenance time, cost and effort, and thus improve software quality. SDP performance suffers from class imbalance in the datasets, where defective modules are a minority compared to defect-free ones. In this study, we propose combining random undersampling based on two-step clustering with the stacking technique to improve the accuracy of SDP. In the stacking technique, Decision Tree, Logistic Regression and k-Nearest Neighbor are used as base learners, while Naive Bayes is used as the stacking model learner. The proposed method is evaluated on nine datasets from the NASA Metrics Data Program repository, with area under the curve (AUC) as the main evaluation metric. The results indicate that the proposed method yields excellent performance on 5 of 9 datasets (AUC > 0.9). Compared with prior research, the proposed method ranks first on 3 datasets, second on 5 datasets and third on only 1 dataset in terms of AUC. It can therefore be concluded that the proposed method delivers impressive and promising prediction performance on most datasets compared with prior research.
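Both abstracts above use AUC as the main evaluation metric. As a rough illustration of what that number measures, the sketch below computes AUC as the probability that a randomly chosen defective module receives a higher score than a randomly chosen defect-free one, with ties counted as one half. The function name and the sample labels and scores are assumptions for the example, not data from either study.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A positive scored above a negative counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = defective module, 0 = defect-free module.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

Because this formulation depends only on how the scores rank the two classes against each other, it is not dominated by the majority class the way plain accuracy is, which is one reason it suits imbalanced problems such as SDP.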


Author(s):  
Md Nasir Uddin ◽  
Bixin Li ◽  
Md Naim Mondol ◽  
Md Mostafizur Rahman ◽  
Md Suman Mia ◽  
...  

2020 ◽  
Vol 15 (1) ◽  
pp. 35-42
Author(s):  
A.O. Balogun ◽  
A.O. Bajeh ◽  
H.A. Mojeed ◽  
A.G. Akintola

Failures of software systems as a result of software testing are rampant, as modern software systems are large and complex. Software testing, an integral part of the software development life cycle (SDLC), consumes both human and capital resources. Software defect prediction (SDP) mechanisms are therefore deployed to strengthen the testing phase of the SDLC by predicting defect-prone modules or components in software systems. Machine learning models have been used to develop SDP models with great success. Moreover, some studies have highlighted that combining machine learning models into an ensemble gives better prediction accuracy than single SDP models. However, the measured performance of machine learning models can change depending on the evaluation metric used. Thus, more studies are needed to establish the effectiveness of ensemble SDP models over single SDP models. This study proposes using Multi-Criteria Decision Method (MCDM) techniques to rank machine learning models. Two MCDM techniques, the Analytic Network Process (ANP) and the Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE), are applied to 9 machine learning models with 11 performance evaluation metrics and 11 software defect datasets. The experimental results show that ensemble SDP models are the most appropriate SDP models, as Boosted SMO and Boosted PART ranked highest under each of the MCDM techniques. The results also support the position that accuracy should not be the only performance evaluation metric for SDP models. In conclusion, performance metrics other than predictive accuracy should be considered when ranking and evaluating machine learning models.
Keywords: Ensemble; Multi-Criteria Decision Method; Software Defect Prediction
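To make the ranking step more concrete, here is a minimal sketch of one PROMETHEE variant. The abstract does not say which variant, preference function, or weights the study used, so this assumes PROMETHEE II net flows with the "usual" (strict) preference function, equal weights, and criteria where higher is better. The model names and metric values are hypothetical.

```python
def promethee_ii(scores, weights):
    """Return the PROMETHEE II net outranking flow for each alternative.

    scores:  {alternative: [value per criterion]}, higher values are better (assumption).
    weights: one non-negative weight per criterion.
    """
    names = list(scores)
    n = len(names)
    total_w = sum(weights)

    def pi(a, b):
        # Usual preference function: a is fully preferred on a criterion as soon as it beats b.
        return sum(w for w, x, y in zip(weights, scores[a], scores[b]) if x > y) / total_w

    flows = {}
    for a in names:
        others = [b for b in names if b != a]
        positive = sum(pi(a, b) for b in others) / (n - 1)  # how strongly a outranks the rest
        negative = sum(pi(b, a) for b in others) / (n - 1)  # how strongly the rest outrank a
        flows[a] = positive - negative
    return flows

# Hypothetical models scored on two metrics (for example AUC and F-measure).
scores = {"model_a": [0.9, 0.8], "model_b": [0.7, 0.85], "model_c": [0.6, 0.6]}
flows = promethee_ii(scores, weights=[1.0, 1.0])
ranking = sorted(flows, key=flows.get, reverse=True)
print(flows, ranking)
```

Alternatives are ranked by descending net flow. Ranking on several metrics at once in this way reflects the abstract's point that accuracy alone should not decide which model is best.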


2020 ◽  
Vol 17 (5) ◽  
pp. 721-730
Author(s):  
Kamal Bashir ◽  
Tianrui Li ◽  
Mahama Yahaya

The most frequently used machine learning feature ranking approaches fail to produce an optimal feature subset for accurate prediction of defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), RelieF (RF) and Symmetric Uncertainty (SU) perform relatively poorly at prediction, even after the class distribution in the training data has been balanced. In this study, we propose a novel FS method based on Maximum Likelihood Logistic Regression (MLLR). We apply this method to six software defect datasets, in both their sampled and unsampled forms, to select useful features for classification in the context of Software Defect Prediction (SDP). Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied to the feature subsets selected from the sampled and unsampled datasets. The performance of the models, measured by the Area Under the Receiver Operating Characteristic Curve (AUC), is compared across all FS methods considered. Analysis of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the other FS techniques, on both sampled and unsampled data. The results confirm that MLLR can be useful in selecting an optimal feature subset for more accurate prediction of defective modules in the software development process.

