Analysis of SMOTE

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code construct. The prediction performances of these models are limited due to the class-imbalance problem since the number of logged code constructs is small as compared to non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performances of J48, RF, and SVM classifiers for catch-blocks and if-blocks logged code constructs prediction on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, LogIm model improves the performance of baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.

Download Full-text

Improving Logging Prediction on Imbalanced Datasets

Cognitive Analytics ◽

10.4018/978-1-7998-2460-2.ch039 ◽

2020 ◽

pp. 740-772

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Open Source ◽

Class Imbalance ◽

Learning Model ◽

Learning Models ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Machine Learning Model ◽

Machine Learning Models

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code construct. The prediction performances of these models are limited due to the class-imbalance problem since the number of logged code constructs is small as compared to non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performances of J48, RF, and SVM classifiers for catch-blocks and if-blocks logged code constructs prediction on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, LogIm model improves the performance of baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.

Download Full-text

Identifying student behavior in MOOCs using Machine Learning

International Journal for Innovation Education and Research ◽

10.31686/ijier.vol7.iss3.1318 ◽

2019 ◽

Vol 7 (3) ◽

pp. 30-39 ◽

Cited By ~ 1

Author(s):

Vanessa Faria De Souza ◽

Gabriela Perry

Keyword(s):

Machine Learning ◽

Literature Review ◽

Student Behavior ◽

Class Imbalance ◽

External Factors ◽

Class Imbalance Problem ◽

Data Manipulation ◽

Imbalance Problem ◽

Student Classification

This paper presents the results literature review, carried out with the objective of identifying prevalent research goals and challenges in the prediction of student behavior in MOOCs, using Machine Learning. The results allowed recognizingthree goals: 1. Student Classification and 2. Dropout prediction. Regarding the challenges, five items were identified: 1. Incompatibility of AVAs, 2. Complexity of data manipulation, 3. Class Imbalance Problem, 4. Influence of External Factors and 5. Difficulty in manipulating data by untrained personnel.

Download Full-text

Decision tree induction based on minority entropy for the class imbalance problem

Pattern Analysis and Applications ◽

10.1007/s10044-016-0533-3 ◽

2016 ◽

Vol 20 (3) ◽

pp. 769-782 ◽

Cited By ~ 15

Author(s):

Kesinee Boonchuay ◽

Krung Sinapiromsaran ◽

Chidchanok Lursinsap

Keyword(s):

Decision Tree ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Decision Tree Induction ◽

Imbalance Problem

Download Full-text

Using Ensembles for Class-Imbalance Problem to Predict Maintainability of Open Source Software

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539320400112 ◽

2020 ◽

Vol 27 (05) ◽

pp. 2040011

Author(s):

Ruchika Malhotra ◽

Kusum Lata

Keyword(s):

Software Maintenance ◽

Software Metrics ◽

Performance Metrics ◽

Class Imbalance ◽

Maintenance Cost ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Minority Class ◽

Imbalance Problem ◽

Ensemble Techniques

To facilitate software maintenance and save the maintenance cost, numerous machine learning (ML) techniques have been studied to predict the maintainability of software modules or classes. An abundant amount of effort has been put by the research community to develop software maintainability prediction (SMP) models by relating software metrics to the maintainability of modules or classes. When software classes demanding the high maintainability effort (HME) are less as compared to the low maintainability effort (LME) classes, the situation leads to imbalanced datasets for training the SMP models. The imbalanced class distribution in SMP datasets could be a dilemma for various ML techniques because, in the case of an imbalanced dataset, minority class instances are either misclassified by the ML techniques or get discarded as noise. The recent development in predictive modeling has ascertained that ensemble techniques can boost the performance of ML techniques by collating their predictions. Ensembles themselves do not solve the class-imbalance problem much. However, aggregation of ensemble techniques with the certain techniques to handle class-imbalance problem (e.g., data resampling) has led to several proposals in research. This paper evaluates the performance of ensembles for the class-imbalance in the domain of SMP. The ensembles for class-imbalance problem (ECIP) are the modification of ensembles which pre-process the imbalanced data using data resampling before the learning process. This study experimentally compares the performance of several ECIP using performance metrics Balance and g-Mean over eight Apache software datasets. The results of the study advocate that for imbalanced datasets, ECIP improves the performance of SMP models as compared to classic ensembles.

Download Full-text

Inter-node Hellinger Distance based Decision Tree

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/272 ◽

2019 ◽

Cited By ~ 1

Author(s):

Pritom Saha Akash ◽

Md. Eusha Kadir ◽

Amin Ahsan Ali ◽

Mohammad Shoyaib

Keyword(s):

Decision Trees ◽

State Of The Art ◽

Class Imbalance ◽

Hellinger Distance ◽

Child Node ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Weighted Version ◽

Imbalance Problem ◽

Splitting Criterion

This paper introduces a new splitting criterion called Inter-node Hellinger Distance (iHD) and a weighted version of it (iHDw) for constructing decision trees. iHD measures the distance between the parent and each of the child nodes in a split using Hellinger distance. We prove that this ensures the mutual exclusiveness between the child nodes. The weight term in iHDw is concerned with the purity of individual child node considering the class imbalance problem. The combination of the distance and weight term in iHDw thus favors a partition where child nodes are purer and mutually exclusive, and skew insensitive. We perform an experiment over twenty balanced and twenty imbalanced datasets. The results show that decision trees based on iHD win against six other state-of-the-art methods on at least 14 balanced and 10 imbalanced datasets. We also observe that adding the weight to iHD improves the performance of decision trees on imbalanced datasets. Moreover, according to the result of the Friedman test, this improvement is statistically significant compared to other methods.

Download Full-text

Biased support vector machine and weighted-smote in handling class imbalance problem

International Journal of Advances in Intelligent Informatics ◽

10.26555/ijain.v4i1.146 ◽

2018 ◽

Vol 4 (1) ◽

pp. 21 ◽

Cited By ~ 21

Author(s):

Hartono Hartono ◽

Opim Salim Sitompul ◽

Tulus Tulus ◽

Erna Budhiarti Nababan

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Data Distribution ◽

Class Imbalance ◽

Support Vector ◽

Class Imbalance Problem ◽

Precise Method ◽

Imbalance Problem

Class imbalance occurs when instances in a class are much higher than in other classes. This machine learning major problem can affect the predicted accuracy. Support Vector Machine (SVM) is robust and precise method in handling class imbalance problem but weak in the bias data distribution, Biased Support Vector Machine (BSVM) became popular choice to solve the problem. BSVM provide better control sensitivity yet lack accuracy compared to general SVM. This study proposes the integration of BSVM and SMOTEBoost to handle class imbalance problem. Non Support Vector (NSV) sets from negative samples and Support Vector (SV) sets from positive samples will undergo a Weighted-SMOTE process. The results indicate that implementation of Biased Support Vector Machine and Weighted-SMOTE achieve better accuracy and sensitivity.

Download Full-text

An Insight on the Class Imbalance Problem and Its Solutions in Big Data

Large-Scale Data Streaming, Processing, and Blockchain Security - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-7998-3444-1.ch002 ◽

2021 ◽

pp. 39-49

Author(s):

Khyati Ahlawat ◽

Anuradha Chug ◽

Amit Prakash Singh

Keyword(s):

Machine Learning ◽

Big Data ◽

Class Imbalance ◽

Classification Problem ◽

Correct Classification ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Methods And Techniques ◽

Conventional Machine ◽

Work Done

Expansion of data in the dimensions of volume, variety, or velocity is leading to big data. Learning from this big data is challenging and beyond capacity of conventional machine learning methods and techniques. Generally, big data getting generated from real-time scenarios is imbalance in nature with uneven distribution of classes. This imparts additional complexity in learning from big data since the class that is underrepresented is more influential and its correct classification becomes critical than that of overrepresented class. This chapter addresses the imbalance problem and its solutions in context of big data along with a detailed survey of work done in this area. Subsequently, it also presents an experimental view for solving imbalance classification problem and a comparative analysis between different methodologies afterwards.

Download Full-text

Robust bounded logistic regression in the class imbalance problem

2016 International Joint Conference on Neural Networks (IJCNN) ◽

10.1109/ijcnn.2016.7727367 ◽

2016 ◽

Cited By ~ 2

Author(s):

Guibiao Xu ◽

Bao-Gang Hu ◽

Jose C. Principe

Keyword(s):

Logistic Regression ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalance Problem

Download Full-text

IProCAD: Intelligent Prognosis of Coronary Artery Disease Excluding Angiogram in Patient with Stable Angina

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.e3101.039520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 2032-2040

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Heart Disease ◽

Random Forest ◽

Decision Tree ◽

Stable Angina ◽

Naive Bayes ◽

Feature Vector ◽

Naïve Bayes ◽

The Other

Cardiovascular diseases are one of the main causes of mortality in the world. A proper prediction mechanism system with reasonable cost can significantly reduce this death toll in the low-income countries like Bangladesh. For those countries we propose machine learning backed embedded system that can predict possible cardiac attack effectively by excluding the high cost angiogram and incorporating only twelve (12) low cost features which are age, sex, chest pain, blood pressure, cholesterol, blood sugar, ECG results, heart rate, exercise induced angina, old peak, slope, and history of heart disease. Here, two heart disease datasets of own built NICVD (National Institute of Cardiovascular Disease, Bangladesh) patients’, and UCI (University of California Irvin) are used. The overall process comprises into four phases: Comprehensive literature review, collection of stable angina patients’ data through survey questionnaires from NICVD, feature vector dimensionality is reduced manually (from 14 to 12 dimensions), and the reduced feature vector is fed to machine learning based classifiers to obtain a prediction model for the heart disease. From the experiments, it is observed that the proposed investigation using NICVD patient’s data with 12 features without incorporating angiographic disease status to Artificial Neural Network (ANN) shows better classification accuracy of 92.80% compared to the other classifiers Decision Tree (82.50%), Naïve Bayes (85%), Support Vector Machine (SVM) (75%), Logistic Regression (77.50%), and Random Forest (75%) using the 10-fold cross validation. To accommodate small scale training and test data in our experimental environment we have observed the accuracy of ANN, Decision Tree, Naïve Bayes, SVM, Logistic Regression and Random Forest using Jackknife method, which are 84.80%, 71%, 75.10%, 75%, 75.33% and 71.42% respectively. On the other hand, the classification accuracies of the corresponding classifiers are 91.7%, 76.90%, 86.50%, 76.3%, 67.0% and 67.3%, respectively for the UCI dataset with 12 attributes. Whereas the same dataset with 14 attributes including angiographic status shows the accuracies 93.5%, 76.7%, 86.50%, 76.8%, 67.7% and 69.6% for the respective classifiers

Download Full-text