Inter-node Hellinger Distance based Decision Tree

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code construct. The prediction performances of these models are limited due to the class-imbalance problem since the number of logged code constructs is small as compared to non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performances of J48, RF, and SVM classifiers for catch-blocks and if-blocks logged code constructs prediction on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, LogIm model improves the performance of baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.

Download Full-text

Using Ensembles for Class-Imbalance Problem to Predict Maintainability of Open Source Software

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539320400112 ◽

2020 ◽

Vol 27 (05) ◽

pp. 2040011

Author(s):

Ruchika Malhotra ◽

Kusum Lata

Keyword(s):

Software Maintenance ◽

Software Metrics ◽

Performance Metrics ◽

Class Imbalance ◽

Maintenance Cost ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Minority Class ◽

Imbalance Problem ◽

Ensemble Techniques

To facilitate software maintenance and save the maintenance cost, numerous machine learning (ML) techniques have been studied to predict the maintainability of software modules or classes. An abundant amount of effort has been put by the research community to develop software maintainability prediction (SMP) models by relating software metrics to the maintainability of modules or classes. When software classes demanding the high maintainability effort (HME) are less as compared to the low maintainability effort (LME) classes, the situation leads to imbalanced datasets for training the SMP models. The imbalanced class distribution in SMP datasets could be a dilemma for various ML techniques because, in the case of an imbalanced dataset, minority class instances are either misclassified by the ML techniques or get discarded as noise. The recent development in predictive modeling has ascertained that ensemble techniques can boost the performance of ML techniques by collating their predictions. Ensembles themselves do not solve the class-imbalance problem much. However, aggregation of ensemble techniques with the certain techniques to handle class-imbalance problem (e.g., data resampling) has led to several proposals in research. This paper evaluates the performance of ensembles for the class-imbalance in the domain of SMP. The ensembles for class-imbalance problem (ECIP) are the modification of ensembles which pre-process the imbalanced data using data resampling before the learning process. This study experimentally compares the performance of several ECIP using performance metrics Balance and g-Mean over eight Apache software datasets. The results of the study advocate that for imbalanced datasets, ECIP improves the performance of SMP models as compared to classic ensembles.

Download Full-text

Hellinger distance based oversampling method to solve multi-class imbalance problem

2017 7th International Conference on Communication Systems and Network Technologies (CSNT) ◽

10.1109/csnt.2017.8418525 ◽

2017 ◽

Cited By ~ 1

Author(s):

Amisha Kumari ◽

Urjita Thakar

Keyword(s):

Class Imbalance ◽

Hellinger Distance ◽

Class Imbalance Problem ◽

Imbalance Problem

Download Full-text

A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems

Applied Sciences ◽

10.3390/app11146310 ◽

2021 ◽

Vol 11 (14) ◽

pp. 6310

Author(s):

Ismael Lin ◽

Octavio Loyola-González ◽

Raúl Monroy ◽

Miguel Angel Medina-Pérez

Keyword(s):

Pattern Recognition ◽

Real World ◽

State Of The Art ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Real World Data ◽

Minority Class ◽

Research Directions ◽

Imbalance Problem ◽

Medical Diagnostic

The usage of imbalanced databases is a recurrent problem in real-world data such as medical diagnostic, fraud detection, and pattern recognition. Nevertheless, in class imbalance problems, the classifiers are commonly biased by the class with more objects (majority class) and ignore the class with fewer objects (minority class). There are different ways to solve the class imbalance problem, and there has been a trend towards the usage of patterns and fuzzy approaches due to the favorable results. In this paper, we provide an in-depth review of popular methods for imbalanced databases related to patterns and fuzzy approaches. The reviewed papers include classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions according to the analysis of the reviewed papers and the trend of the state of the art.

Download Full-text

Analysis of SMOTE

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2021040102 ◽

2021 ◽

Vol 11 (2) ◽

pp. 15-37

Author(s):

Ankita Bansal ◽

Makul Saini ◽

Rakshit Singh ◽

Jai Kumar Yadav

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Decision Tree ◽

Naive Bayes ◽

Class Imbalance ◽

Nearest Neighbour ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Tremendous Amount

The tremendous amount of data generated through IoT can be imbalanced causing class imbalance problem (CIP). CIP is one of the major issues in machine learning where most of the samples belong to one of the classes, thus producing biased classifiers. The authors in this paper are working on four imbalanced datasets belonging to diverse domains. The objective of this study is to deal with CIP using oversampling techniques. One of the commonly used oversampling approaches is synthetic minority oversampling technique (SMOTE). In this paper, the authors have suggested modifications in SMOTE and proposed their own algorithm, SMOTE-modified (SMOTE-M). To provide a fair evaluation, it is compared with three oversampling approaches, SMOTE, adaptive synthetic oversampling (ADASYN), and SMOTE-Adaboost. To evaluate the performances of sampling approaches, models are constructed using four classifiers (K-nearest neighbour, decision tree, naive Bayes, logistic regression) on balanced and imbalanced datasets. The study shows that the results of SMOTE-M are comparable to that of ADASYN and SMOTE-Adaboost.

Download Full-text

Hybrid ensembles of decision trees and Bayesian network for class imbalance problem

2016 8th International Conference on Knowledge and Smart Technology (KST) ◽

10.1109/kst.2016.7440523 ◽

2016 ◽

Cited By ~ 2

Author(s):

Pumitara Ruangthong ◽

Saichon Jaiyen

Keyword(s):

Bayesian Network ◽

Decision Trees ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalance Problem

Download Full-text

Improving Logging Prediction on Imbalanced Datasets

Cognitive Analytics ◽

10.4018/978-1-7998-2460-2.ch039 ◽

2020 ◽

pp. 740-772

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Open Source ◽

Class Imbalance ◽

Learning Model ◽

Learning Models ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Machine Learning Model ◽

Machine Learning Models

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code construct. The prediction performances of these models are limited due to the class-imbalance problem since the number of logged code constructs is small as compared to non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performances of J48, RF, and SVM classifiers for catch-blocks and if-blocks logged code constructs prediction on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, LogIm model improves the performance of baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.

Download Full-text

SYMBOLIC ONE-CLASS LEARNING FROM IMBALANCED DATASETS: APPLICATION IN MEDICAL DIAGNOSIS

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213009000135 ◽

2009 ◽

Vol 18 (02) ◽

pp. 273-309 ◽

Cited By ~ 25

Author(s):

LUIS MENA ◽

JESUS A. GONZALEZ

Keyword(s):

Medical Diagnosis ◽

Characteristic Curve ◽

Geometric Mean ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Minority Class ◽

Imbalance Problem ◽

Quantitative Metrics ◽

Medical Dataset

When working with real-world applications we often find imbalanced datasets, those for which there exists a majority class with normal data and a minority class with abnormal or important data. In this work, we make an overview of the class imbalance problem; we review consequences, possible causes and existing strategies to cope with the inconveniences associated to this problem. As an effort to contribute to the solution of this problem, we propose a new rule induction algorithm named Rule Extraction for MEdical Diagnosis (REMED), as a symbolic one-class learning approach. For the evaluation of the proposed method, we use different medical diagnosis datasets taking into account quantitative metrics, comprehensibility, and reliability. We performed a comparison of REMED versus C4.5 and RIPPER combined with over-sampling and cost-sensitive strategies. This empirical analysis of the REMED algorithm showed it to be quantitatively competitive with C4.5 and RIPPER in terms of the area under the Receiver Operating Characteristic curve (AUC) and the geometric mean, but overcame them in terms of comprehensibility and reliability. Results of our experiments show that REMED generated rules systems with a larger degree of abstraction and patterns closer to well-known abnormal values associated to each considered medical dataset.

Download Full-text

Detection of Myocardial Infarction Using ECG and Multi-Scale Feature Concatenate

Sensors ◽

10.3390/s21051906 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1906

Author(s):

Jia-Zheng Jian ◽

Tzong-Rong Ger ◽

Han-Hua Lai ◽

Chi-Ming Ku ◽

Chiung-An Chen ◽

...

Keyword(s):

Myocardial Infarction ◽

Network Structure ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Multi Scale ◽

Imbalance Problem ◽

Average Accuracy ◽

Significant Difference ◽

Electrocardiogram Ecg

Diverse computer-aided diagnosis systems based on convolutional neural networks were applied to automate the detection of myocardial infarction (MI) found in electrocardiogram (ECG) for early diagnosis and prevention. However, issues, particularly overfitting and underfitting, were not being taken into account. In other words, it is unclear whether the network structure is too simple or complex. Toward this end, the proposed models were developed by starting with the simplest structure: a multi-lead features-concatenate narrow network (N-Net) in which only two convolutional layers were included in each lead branch. Additionally, multi-scale features-concatenate networks (MSN-Net) were also implemented where larger features were being extracted through pooling the signals. The best structure was obtained via tuning both the number of filters in the convolutional layers and the number of inputting signal scales. As a result, the N-Net reached a 95.76% accuracy in the MI detection task, whereas the MSN-Net reached an accuracy of 61.82% in the MI locating task. Both networks give a higher average accuracy and a significant difference of p < 0.001 evaluated by the U test compared with the state-of-the-art. The models are also smaller in size thus are suitable to fit in wearable devices for offline monitoring. In conclusion, testing throughout the simple and complex network structure is indispensable. However, the way of dealing with the class imbalance problem and the quality of the extracted features are yet to be discussed.

Download Full-text

Inter-node Hellinger Distance based Decision Tree

Building Decision Trees for the Multi-class Imbalance Problem

Improving Logging Prediction on Imbalanced Datasets

Using Ensembles for Class-Imbalance Problem to Predict Maintainability of Open Source Software

Hellinger distance based oversampling method to solve multi-class imbalance problem

A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems

Analysis of SMOTE

Hybrid ensembles of decision trees and Bayesian network for class imbalance problem

Improving Logging Prediction on Imbalanced Datasets

SYMBOLIC ONE-CLASS LEARNING FROM IMBALANCED DATASETS: APPLICATION IN MEDICAL DIAGNOSIS

Detection of Myocardial Infarction Using ECG and Multi-Scale Feature Concatenate

Export Citation Format