Tackling Imbalanced Class in Software Defect Prediction Using Two-Step Clustering-Based Undersampling and Bagging Technique

Class imbalance is a frequent problem in real-world datasets, where one class (the minority class) contains a small number of data points and the other (the majority class) contains a large number of data points. It is very difficult to develop an effective model with data mining and machine learning algorithms without preprocessing the data to balance the imbalanced dataset. Random undersampling and oversampling have been used in many studies to ensure that the different classes contain the same number of data points. In this study, we propose a combination of two-step clustering-based random undersampling and the bagging technique to improve the accuracy of software defect prediction (SDP). The proposed method is evaluated using five datasets from the NASA metrics data program repository, with area under the curve (AUC) as the main evaluation measure. The results show that the proposed method yields excellent performance on all datasets (AUC > 0.9). In terms of SN, the second experiment outperforms the first on most datasets (3 of 5). Meanwhile, in terms of SP, the first experiment did not outperform the second on all datasets. Overall, the second experiment outperforms the first, because the main evaluation measure in imbalanced-class classification such as SDP is AUC. Therefore, it can be concluded that the proposed method yields optimal performance for both small-scale and large-scale datasets.
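As a rough illustration of the undersampling idea in the abstract above, the following Python sketch keeps a random subset of majority-class rows, drawn cluster by cluster. It is not the authors' implementation: the function name `cluster_undersample`, the proportional per-cluster quota, and the assumption that cluster labels have already been produced by some clustering step are all hypothetical choices made for this example. The bagging stage is not shown.

```python
import random
from collections import defaultdict


def cluster_undersample(majority, cluster_ids, n_target, seed=0):
    """Randomly keep about n_target majority rows, spread across clusters.

    majority    -- list of majority-class rows
    cluster_ids -- cluster label of each row (from any clustering step)
    n_target    -- desired number of majority rows, e.g. the minority count
    """
    rng = random.Random(seed)

    # Group the majority rows by the cluster they were assigned to.
    by_cluster = defaultdict(list)
    for row, cid in zip(majority, cluster_ids):
        by_cluster[cid].append(row)

    kept = []
    total = len(majority)
    for cid in sorted(by_cluster):
        rows = by_cluster[cid]
        # Assumed allocation rule: each cluster keeps its proportional
        # share of n_target, and at least one row so no cluster vanishes.
        quota = max(1, round(n_target * len(rows) / total))
        kept.extend(rng.sample(rows, min(quota, len(rows))))
    return kept
```

Because of rounding and the one-row minimum, the returned list can differ slightly from `n_target` when cluster sizes do not divide evenly. The reduced majority set would then be combined with the untouched minority rows before training.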

Tackling Imbalanced Class in Software Defect Prediction Using Two-Step Cluster Based Random Undersampling and Stacking Technique

Jurnal Teknologi ◽

10.11113/jt.v79.11874 ◽

2017 ◽

Vol 79 (7-2) ◽

Cited By ~ 1

Author(s):

Adi Wijaya ◽

Romi Satria Wahono

Keyword(s):

Nearest Neighbor ◽

Promising Result ◽

Research Performance ◽

Defect Prediction ◽

Software Defect Prediction ◽

K Nearest Neighbor ◽

Software Defect ◽

Random Undersampling ◽

Imbalanced Class ◽

Data Program

The cost of finding and correcting software defects is high and increases exponentially during software development. Software defect prediction (SDP) can be used in the early phases to reduce testing and maintenance time, cost and effort, and thus to improve the quality of the software. SDP performance suffers from class imbalance in the datasets, where defective modules are a minority compared to defect-free ones. In this study, we propose a combination of random undersampling based on two-step clustering and the stacking technique to improve the accuracy of SDP. In the stacking technique, Decision Tree, Logistic Regression and k-Nearest Neighbor are used as base learners, while Naive Bayes is used as the stacking model learner. The proposed method is evaluated using nine datasets from the NASA metrics data program repository, with area under the curve (AUC) as the main evaluation measure. The results indicate that the proposed method yields excellent performance for 5 of 9 datasets (AUC > 0.9). Compared with prior research, the proposed method ranks first on 3 datasets, second on 5 datasets and third on only 1 dataset in terms of AUC. Therefore, it can be concluded that the proposed method delivers impressive and promising prediction performance for most datasets compared with prior research.
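Since AUC is the main evaluation measure here, a small worked example may help. The sketch below computes AUC in plain Python as the fraction of (defective, defect-free) pairs in which the defective module receives the higher score, with ties counted as half. The function name `auc_score` and the 0/1 label encoding are assumptions for illustration, not part of the paper.

```python
def auc_score(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    labels -- 1 for defective (positive), 0 for defect-free (negative)
    scores -- predicted score per module, higher means more likely defective
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of modules, which is fine for small datasets. Note that the result depends only on how the scores are ordered, not on the class ratio, which is one reason AUC is preferred over plain accuracy for imbalanced data.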

A Study on Software Metrics based Software Defect Prediction using Data Mining and Machine Learning Techniques

International Journal of Database Theory and Application ◽

10.14257/ijdta.2015.8.3.15 ◽

2015 ◽

Vol 8 (3) ◽

pp. 179-190 ◽

Cited By ~ 15

Author(s):

Manjula.C.M. Prasad ◽

Lilly Florence ◽

Arti Arya

Keyword(s):

Machine Learning ◽

Data Mining ◽

Software Metrics ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques ◽

Using Data

Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study

10.1109/icscc51209.2021.9528170 ◽

2021 ◽

Author(s):

Sushant Kumar Pandey ◽

Anil Kumar Tripathi

Keyword(s):

Machine Learning ◽

Empirical Study ◽

Prediction Models ◽

Class Imbalance ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques ◽

Defect Prediction Models

SDP-ML: An Automated Approach of Software Defect Prediction employing Machine Learning Techniques

10.1109/icecit54077.2021.9641218 ◽

2021 ◽

Author(s):

Md Nasir Uddin ◽

Bixin Li ◽

Md Naim Mondol ◽

Md Mostafizur Rahman ◽

Md Suman Mia ◽

...

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques

Software Defect Prediction in Class Level Metric Aggregation Using Data Mining Techniques

Research Journal of Applied Sciences Engineering and Technology ◽

10.19026/rjaset.13.3014 ◽

2016 ◽

Vol 13 (7) ◽

pp. 544-568

Author(s):

Reddi Kiran Kumar ◽

S.V. Achuta Rao

Keyword(s):

Data Mining ◽

Defect Prediction ◽

Software Defect Prediction ◽

Data Mining Techniques ◽

Software Defect ◽

Class Level ◽

Using Data

Software Defect Prediction based on Machine Learning Algorithms

2019 IEEE 5th International Conference on Computer and Communications (ICCC) ◽

10.1109/iccc47050.2019.9064412 ◽

2019 ◽

Author(s):

Zhang Tian ◽

Jing Xiang ◽

Sun Zhenxiao ◽

Zhang Yi ◽

Yan Yunqiang

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software defect prediction: A multi-criteria decision-making approach

Nigerian Journal of Technological Research ◽

10.4314/njtr.v15i1.7 ◽

2020 ◽

Vol 15 (1) ◽

pp. 35-42

Author(s):

A.O. Balogun ◽

A.O. Bajeh ◽

H.A. Mojeed ◽

A.G. Akintola

Keyword(s):

Machine Learning ◽

Software Testing ◽

Evaluation Metrics ◽

Defect Prediction ◽

Software Systems ◽

Software Defect Prediction ◽

Learning Models ◽

Decision Method ◽

Software Defect ◽

Machine Learning Models

Failure of software systems as a result of software testing is widespread, as modern software systems are large and complex. Software testing, which is an integral part of the software development life cycle (SDLC), consumes both human and capital resources. As such, software defect prediction (SDP) mechanisms are deployed to strengthen the software testing phase of the SDLC by predicting defect-prone modules or components in software systems. Machine learning models have been used to develop SDP models with great success. Moreover, some studies have highlighted that a combination of machine learning models in the form of an ensemble is better than single SDP models in terms of prediction accuracy. However, the efficiency of machine learning models can change across different predictive evaluation metrics. Thus, more studies are needed to establish the effectiveness of ensemble SDP models over single SDP models. This study proposes the use of Multi-Criteria Decision Method (MCDM) techniques to rank machine learning models. Analytic Network Process (ANP) and Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE), which are types of MCDM techniques, are applied to 9 machine learning models with 11 performance evaluation metrics and 11 software defect datasets. The experimental results show that ensemble SDP models are the most appropriate SDP models, as Boosted SMO and Boosted PART ranked highest for each of the MCDM techniques. The experimental results also support the stance that accuracy should not be the only performance evaluation metric for SDP models. In conclusion, performance metrics other than predictive accuracy should be considered when ranking and evaluating machine learning models.
Keywords: Ensemble; Multi-Criteria Decision Method; Software Defect Prediction
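To make the ranking idea concrete, here is a minimal Python sketch of PROMETHEE II net-flow ranking. The abstract does not say which PROMETHEE variant, preference function or criterion weights the study used, so this example assumes PROMETHEE II with the simplest ("usual") preference function and caller-supplied weights. The function name `promethee_ii` and the data layout are hypothetical.

```python
def promethee_ii(scores, weights):
    """Return the net outranking flow of each alternative.

    scores  -- {name: [criterion values]}, higher is better on every criterion
    weights -- one weight per criterion, summing to 1
    """
    names = list(scores)
    n = len(names)

    def pref(a, b):
        # "Usual" preference function (an assumption): a criterion votes
        # for a over b only when a is strictly better on that criterion.
        return sum(w for w, x, y in zip(weights, scores[a], scores[b]) if x > y)

    flows = {}
    for a in names:
        others = [b for b in names if b != a]
        leaving = sum(pref(a, b) for b in others) / (n - 1)   # how much a dominates
        entering = sum(pref(b, a) for b in others) / (n - 1)  # how much a is dominated
        flows[a] = leaving - entering
    return flows
```

Sorting the alternatives by descending net flow gives the ranking. Each alternative would be a machine learning model and each criterion an evaluation metric; metrics where lower is better would need to be negated first.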

The Effects of Parameter Tuning on Machine Learning Performance in a Software Defect Prediction Context

10.33915/etd.6457 ◽

2015 ◽

Author(s):

Benjamin N. Province

Keyword(s):

Machine Learning ◽

Parameter Tuning ◽

Learning Performance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software Defect Prediction Using Supervised Machine Learning Techniques: A Systematic Literature Review

Intelligent Automation & Soft Computing ◽

10.32604/iasc.2021.017562 ◽

2021 ◽

Vol 29 (2) ◽

pp. 403-421

Author(s):

Faseeha Matloob ◽

Shabib Aftab ◽

Munir Ahmad ◽

Muhammad Adnan Khan ◽

Areej Fatima ◽

...

Keyword(s):

Machine Learning ◽

Literature Review ◽

Systematic Literature Review ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques

A Novel Feature Selection Method Based on Maximum Likelihood Logistic Regression for Imbalanced Learning in Software Defect Prediction

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/5/5 ◽

2020 ◽

Vol 17 (5) ◽

pp. 721-730

Author(s):

Kamal Bashir ◽

Tianrui Li ◽

Mahama Yahaya

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Feature Selection ◽

Maximum Likelihood ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Software Defect ◽

Optimal Feature Subset ◽

Optimal Feature

The most frequently used machine learning feature ranking approaches fail to produce an optimal feature subset for accurate prediction of defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), RelieF (RF) and Symmetric Uncertainty (SU) perform relatively poorly at prediction, even after the class distribution in the training data has been balanced. In this study, we propose a novel FS method based on Maximum Likelihood Logistic Regression (MLLR). We apply this method to six software defect datasets in their sampled and unsampled forms to select useful features for classification in the context of Software Defect Prediction (SDP). The Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied to the FS subsets derived from the sampled and unsampled datasets. The performance of the models, measured using the Area Under the Receiver Operating Characteristic Curve (AUC), is compared across all FS methods considered. The Analysis of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the other FS techniques, on both sampled and unsampled data. The results confirm that MLLR can be useful in selecting an optimal feature subset for more accurate prediction of defective modules in the software development process.
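For readers unfamiliar with the ANOVA F-test mentioned above, the following Python sketch computes a one-way F statistic over several groups of measurements. It assumes each group holds the AUC values obtained with one FS method. How the study actually grouped its results is not stated in the abstract, and the function name `anova_f` is hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers.

    F = (between-group variance) / (within-group variance)
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

A large F value indicates that the group means differ by more than the within-group scatter would explain. Turning F into a p-value additionally requires the F distribution with (k - 1, N - k) degrees of freedom, which is not implemented here.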
