ImbTreeEntropy and ImbTreeAUC: Novel R Packages for Decision Tree Learning on the Imbalanced Datasets

Electronics ◽  
2021 ◽  
Vol 10 (6) ◽  
pp. 657
Author(s):  
Krzysztof Gajowniczek ◽  
Tomasz Ząbkowski

This paper presents two R packages, ImbTreeEntropy and ImbTreeAUC, to handle imbalanced data problems. ImbTreeEntropy's functionality includes the application of generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure the impurity of a node. ImbTreeAUC provides non-standard measures to choose an optimal split point for an attribute (as well as the optimal attribute for splitting) by employing local, semi-global and global AUC (Area Under the ROC Curve) measures. Both packages are applicable to binary and multiclass problems, and they support cost-sensitive learning, by defining a misclassification cost matrix, as well as weight-sensitive learning. The packages accept all types of attributes, including continuous, ordered and nominal, where the latter type is simplified for multiclass problems to reduce the computational overhead. Both applications enable optimization of the thresholds at which posterior probabilities determine the final class labels, so that misclassification costs are minimized. Model overfitting can be managed either during the growing phase or at the end using post-pruning. The packages are mainly implemented in R; however, some computationally demanding functions are written in plain C++. To speed up learning, parallel processing is supported as well.
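
For illustration, the Rényi and Tsallis entropies named above can serve as node impurity measures as sketched below; this is a rendering of the standard formulas, not the packages' code, and the order parameter q = 2 is an arbitrary choice:

```python
import numpy as np

def renyi_entropy(p, q=2.0):
    """Rényi entropy of order q (q > 0, q != 1); tends to Shannon entropy as q -> 1."""
    p = p[p > 0]                                # drop empty classes
    return np.log(np.sum(p ** q)) / (1.0 - q)

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of order q (q != 1); also recovers Shannon entropy as q -> 1."""
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

# Class proportions in a candidate node: lower entropy means a purer node,
# so a split is chosen to maximize the reduction in impurity.
node = np.array([0.7, 0.2, 0.1])
print(renyi_entropy(node), tsallis_entropy(node))
```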

Processes ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 1107
Author(s):  
Krzysztof Gajowniczek ◽  
Tomasz Ząbkowski

This paper presents two new R packages, ImbTreeEntropy and ImbTreeAUC, for building decision trees, including their interactive construction and analysis, a highly regarded feature for field experts who want to be involved in the learning process. ImbTreeEntropy's functionality includes the application of generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure the impurity of a node. ImbTreeAUC provides non-standard measures to choose an optimal split point for an attribute (as well as the optimal attribute for splitting) by employing local, semi-global and global AUC measures. The contribution of both packages is that, thanks to interactive learning, the user can construct a new tree from scratch or, if required, decide on the optimal split in ambiguous situations during the learning phase, taking into account each attribute and its cut-off. The main difference from existing solutions is that our packages provide mechanisms for analyzing the structures of several trees simultaneously after growing and/or pruning. Both packages support cost-sensitive learning, by defining a misclassification cost matrix, as well as weight-sensitive learning. Additionally, the tree structure of the model can be represented as a rule-based model, along with various quality measures, such as support, confidence, lift, conviction, addedValue, cosine, Jaccard and Laplace.
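
For orientation, the rule quality measures listed above typically follow the standard association-rule definitions; the sketch below computes a few of them from hypothetical leaf counts (the counts and the rule are invented for illustration):

```python
# Hypothetical counts for a rule "IF conditions THEN class = c" read off a tree leaf.
n_total = 1000        # observations in the training set
n_antecedent = 120    # observations satisfying the rule's conditions
n_consequent = 300    # observations of class c overall
n_both = 90           # observations satisfying the conditions AND of class c

support = n_both / n_total                     # P(A and C)
confidence = n_both / n_antecedent             # P(C | A)
lift = confidence / (n_consequent / n_total)   # P(C | A) / P(C)
conviction = (1 - n_consequent / n_total) / (1 - confidence)  # implication strength

print(f"support={support:.3f} confidence={confidence:.3f} "
      f"lift={lift:.2f} conviction={conviction:.2f}")
```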


Author(s):  
Yuan Lan ◽  
Xiaohong Han ◽  
Weiwei Zong ◽  
Xiaojian Ding ◽  
Xiaoyan Xiong ◽  
...  

Rolling element bearings are key parts of rotating machinery, and their fault diagnosis is of great importance. In many real bearing fault diagnosis applications, the number of fault samples is much smaller than the number of normal samples, i.e. the data are imbalanced. Many traditional diagnosis methods achieve low accuracy because they have a natural tendency to favor the majority class, assuming a balanced class distribution or equal misclassification costs. To deal with imbalanced data, this article proposes a novel two-step fault diagnosis framework for rolling element bearings. Step 1 uses a weighted extreme learning machine to classify samples as normal or abnormal, and Step 2 diagnoses the underlying anomaly in detail using a preliminary extreme learning machine. In addition, a gravitational search algorithm is applied to extract the significant features and determine the optimal parameters of the weighted extreme learning machine and extreme learning machine classifiers. The effectiveness of our approach is verified on raw data collected from rolling element bearing experiments conducted in our Institute; the empirical results show that our approach is fast and achieves diagnosis accuracies above 96%.
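
A minimal sketch of the weighted ELM idea for the binary normal/abnormal step, assuming inverse-class-frequency weights; the hidden size and regularization constant are illustrative, and the paper's gravitational-search tuning is omitted:

```python
import numpy as np

def weighted_elm_train(X, y, n_hidden=50, C=1.0, seed=0):
    """Minimal weighted ELM sketch for binary labels y in {0, 1}: minority
    samples get weights inversely proportional to their class frequency."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_in = rng.normal(size=(d, n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)           # random hidden biases
    H = np.tanh(X @ W_in + b)               # hidden-layer output matrix
    T = np.where(y == 1, 1.0, -1.0)         # bipolar targets
    n_pos = max(int(y.sum()), 1)
    w = np.where(y == 1, n / (2.0 * n_pos), n / (2.0 * max(n - n_pos, 1)))
    # Regularized weighted least squares for the output weights:
    # beta = (H' W H + I/C)^-1 H' W T
    Wd = np.diag(w)
    beta = np.linalg.solve(H.T @ Wd @ H + np.eye(n_hidden) / C, H.T @ Wd @ T)
    return W_in, b, beta

def weighted_elm_predict(X, W_in, b, beta):
    return (np.tanh(X @ W_in + b) @ beta > 0).astype(int)
```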


Classification is a supervised learning task that assigns items to groups on the basis of class labels; algorithms are trained on labeled datasets to accomplish it. In this process the dataset plays an important role. If instances of one class (the majority class) greatly outnumber instances of another class (the minority class), so that a classifier struggles to learn the characteristics of the minority class, the dataset is termed imbalanced. Such datasets lead to biased prediction or misclassification in the real world: models trained on them may achieve very high accuracy during training yet, being unfamiliar with minority-class instances, fail to predict the minority class. This paper presents a survey of techniques proposed for handling imbalanced data and compares them on the basis of the F-measure.
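
For reference, the F-measure used for the comparison combines precision and recall; a small sketch with the standard definition (the counts in the example are invented):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from confusion-matrix counts; beta weights recall vs. precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# A classifier that never predicts the minority class scores 0 on it,
# no matter how high its overall accuracy is.
print(f_measure(tp=0, fp=0, fn=50))    # 0.0
print(f_measure(tp=40, fp=10, fn=10))  # 0.8
```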


2018 ◽  
Vol 173 ◽  
pp. 01009 ◽  
Author(s):  
Gennady Ososkov ◽  
Pavel Goncharov

The paper demonstrates the advantages of deep learning networks over ordinary neural networks in image classification. An autoassociative neural network is used as a standalone autoencoder to extract the most informative features of the input data before the networks are compared as classifiers. Most of the effort in working with deep networks goes into the painstaking optimization of their structures and components, such as activation functions and weights, as well as the procedures for minimizing the loss function, in order to improve performance and shorten learning time. It is also shown that deep autoencoders, after special training, develop a remarkable ability to denoise images. Convolutional Neural Networks are also used to solve a topical problem in protein genetics, using durum wheat classification as an example. The results of our comparative study demonstrate the clear advantage of deep networks, as well as the denoising power of autoencoders. In our work we use both GPU and cloud services to speed up the calculations.
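
As a toy illustration of the denoising idea, not the paper's architecture: a one-hidden-layer autoencoder trained to reconstruct clean inputs from noise-corrupted ones, with all sizes and rates chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_denoising_autoencoder(X, n_hidden=32, lr=0.1, epochs=200, noise=0.3):
    """Tiny denoising autoencoder trained with plain gradient descent:
    the input is corrupted with Gaussian noise, the target is the clean input."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        Xn = X + noise * rng.normal(size=X.shape)  # corrupt the input
        H = np.tanh(Xn @ W1 + b1)                  # encoder
        R = H @ W2 + b2                            # linear decoder (reconstruction)
        G = 2 * (R - X) / n                        # gradient of the squared error
        gW2 = H.T @ G; gb2 = G.sum(0)
        Gh = (G @ W2.T) * (1 - H**2)               # backprop through tanh
        gW1 = Xn.T @ Gh; gb1 = Gh.sum(0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1
    return W1, b1, W2, b2

# Usage: X holds flattened images scaled to [0, 1]; the hidden activations
# tanh(X @ W1 + b1) then serve as compact features for a downstream classifier.
```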


Author(s):  
YANMIN SUN ◽  
ANDREW K. C. WONG ◽  
MOHAMED S. KAMEL

Classification of data with an imbalanced class distribution suffers a significant drop in the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with respect to: the application domains; the nature of the problem; the learning difficulties of standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Qing-Yan Yin ◽  
Jiang-She Zhang ◽  
Chun-Xia Zhang ◽  
Sheng-Cai Liu

Cost-sensitive boosting algorithms have proven successful for solving difficult class imbalance problems. However, the influence of misclassification costs and imbalance level on algorithm performance is still not clear. This paper conducts an empirical comparison of six representative cost-sensitive boosting algorithms: AdaCost, CSB1, CSB2, AdaC1, AdaC2 and AdaC3. The algorithms are thoroughly evaluated by a comprehensive suite of experiments in which nearly fifty thousand classification models are trained on 17 real-world imbalanced data sets. Experimental results show that the AdaC series of algorithms generally outperforms AdaCost and CSB across data sets of different imbalance levels. Furthermore, the AdaC2 algorithm stands out around the misclassification cost setting CN = 0.7, CP = 1, especially on strongly imbalanced data sets. For data sets with a low level of imbalance, there is no significant difference between the AdaC series algorithms. In addition, the results indicate that AdaC1 is comparatively insensitive to the misclassification costs, which is consistent with the findings of preceding research.
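
For concreteness, one plausible rendering of a single AdaC2 boosting round, in which the cost item multiplies the sample weight outside the exponent (labels are in {-1, +1}; the example mirrors the CN = 0.7, CP = 1 setting highlighted above):

```python
import numpy as np

def adac2_update(D, C, y, h, eps=1e-12):
    """One AdaC2 round: compute alpha and the re-weighted distribution.
    D: current sample weights; C: per-sample misclassification costs;
    y, h: true and predicted labels in {-1, +1}."""
    correct = (y == h)
    num = np.sum(C[correct] * D[correct])            # cost-weighted correct mass
    den = np.sum(C[~correct] * D[~correct]) + eps    # cost-weighted error mass
    alpha = 0.5 * np.log(num / den)                  # classifier vote
    D_new = C * D * np.exp(-alpha * y * h)           # cost scales the weight (AdaC2)
    return alpha, D_new / D_new.sum()                # renormalize to a distribution

# Toy round: one false negative and one false positive, with CP = 1, CN = 0.7.
y = np.array([+1, +1, -1, -1, -1])
h = np.array([+1, -1, -1, -1, +1])
C = np.where(y == +1, 1.0, 0.7)
D = np.full(5, 0.2)
alpha, D = adac2_update(D, C, y, h)
print(alpha, D)
```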


2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Yanqiu Liu ◽  
Huijuan Lu ◽  
Ke Yan ◽  
Haixia Xia ◽  
Chunlin An

Embedding cost-sensitive factors into classifiers increases classification stability and reduces classification costs on large-scale, redundant and imbalanced datasets, such as gene expression data. In this study, we extend our previous work, the Dissimilar ELM (D-ELM), by introducing misclassification costs into the classifier; we name the proposed algorithm cost-sensitive D-ELM (CS-D-ELM). Furthermore, we embed a rejection cost into the CS-D-ELM to increase the classification stability of the algorithm. Experimental results show that the CS-D-ELM algorithm with embedded rejection cost effectively reduces the average and overall cost of the classification process, while classification accuracy remains competitive. The proposed method can be extended to classification problems on other redundant and imbalanced data.
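
A generic sketch of a cost-based decision rule with a reject option, not the exact CS-D-ELM rule: a sample is rejected whenever the rejection cost undercuts every class's expected misclassification cost (the posteriors and costs below are invented):

```python
import numpy as np

def decide_with_rejection(posteriors, cost_matrix, reject_cost):
    """Cost-minimizing decision with a reject option.
    posteriors: (n_samples, n_classes); cost_matrix[j, k] is the cost of
    predicting class k when the true class is j; -1 marks "reject"."""
    expected = posteriors @ cost_matrix        # expected cost of each prediction
    best = expected.argmin(axis=1)
    best_cost = expected.min(axis=1)
    return np.where(best_cost <= reject_cost, best, -1)

# Example: missing class 1 is five times worse, and rejection costs 0.4.
P = np.array([[0.55, 0.45],    # ambiguous sample -> rejected
              [0.95, 0.05]])   # confident sample -> classified
Cm = np.array([[0.0, 1.0],
               [5.0, 0.0]])
print(decide_with_rejection(P, Cm, reject_cost=0.4))   # [-1  0]
```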


2011 ◽  
Vol 271-273 ◽  
pp. 1291-1296
Author(s):  
Jin Wei Zhang ◽  
Hui Juan Lu ◽  
Wu Tao Chen ◽  
Yi Lu

A classifier built from a highly skewed class distribution generally predicts an unknown sample as the majority class much more frequently than as the minority class, because the classifier is designed to maximize overall classification accuracy. We compare three classification methods for data sets whose class distribution is imbalanced and whose misclassification costs are non-uniform: cost-sensitive learning, in which the misclassification cost is embedded in the algorithm; over-sampling; and under-sampling. We compare these three methods to determine which produces the best overall classification under various circumstances, and reach the following conclusions: (1) cost-sensitive learning is suitable for classifying imbalanced data sets; it outperforms the sampling methods overall and is more stable than they are, except when the data set is quite small; (2) if the data set is highly skewed or quite small, over-sampling methods may be better.
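
For reference, the two sampling baselines compared above can be sketched as follows; both simply equalize the class counts, by duplicating minority samples or discarding majority ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(X, y, minority=1, method="over"):
    """Random over-sampling duplicates minority samples with replacement;
    random under-sampling discards majority samples without replacement."""
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)
    if method == "over":   # grow the minority class to the majority size
        extra = rng.choice(idx_min, size=len(idx_maj), replace=True)
        keep = np.concatenate([idx_maj, extra])
    else:                  # shrink the majority class to the minority size
        keep = np.concatenate(
            [idx_min, rng.choice(idx_maj, size=len(idx_min), replace=False)])
    return X[keep], y[keep]
```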


Author(s):  
Parag C. Pendharkar ◽  
Sudhir Nanda ◽  
James A. Rodger ◽  
Rahul Bhaskar

This chapter illustrates how a misclassification cost matrix can be incorporated into an evolutionary classification system for medical diagnosis. Most classification systems for medical diagnosis attempt to minimize misclassifications (or maximize the number of correctly classified cases). This approach assumes that the costs of Type I and Type II errors are equal; there is evidence that they are not, and that incorporating costs into classification systems can lead to superior outcomes. We use principles of evolution to develop and test a genetic algorithm (GA) based approach that incorporates asymmetric Type I and Type II error costs. Using simulated and real-life medical data, we show that the proposed approach, incorporating Type I and Type II misclassification costs, results in lower misclassification costs than linear discriminant analysis (LDA) and GA approaches that do not incorporate these costs.
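
A hedged sketch of the kind of fitness function such a GA could minimize, with asymmetric Type I/Type II costs; the chromosome encoding and cost values are assumptions, not the chapter's exact formulation:

```python
import numpy as np

def misclassification_cost(y_true, y_pred, c_type1=1.0, c_type2=5.0):
    """Total asymmetric cost. Type I: false positive (healthy case flagged);
    Type II: false negative (condition missed). Cost values are illustrative."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return c_type1 * fp + c_type2 * fn

def fitness(chromosome, X, y_true):
    """Score a candidate linear classifier encoded as [weights..., threshold];
    the GA evolves chromosomes so as to minimize this cost."""
    w, t = chromosome[:-1], chromosome[-1]
    y_pred = (X @ w > t).astype(int)
    return misclassification_cost(y_true, y_pred)
```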


2011 ◽  
Vol 41 ◽  
pp. 69-95 ◽  
Author(s):  
M. Bilgic ◽  
L. Getoor

We address the cost-sensitive feature acquisition problem, where misclassifying an instance is costly but the expected misclassification cost can be reduced by acquiring the values of missing features. Because acquiring features is costly as well, the objective is to acquire the right set of features so that the sum of the feature acquisition cost and the misclassification cost is minimized. We describe the Value of Information Lattice (VOILA), an optimal and efficient feature subset acquisition framework. Unlike the common practice of acquiring features greedily, VOILA can reason about subsets of features. It efficiently searches the space of possible feature subsets by discovering and exploiting conditional independence properties between the features, and it reuses probabilistic inference computations to further speed up the process. Through empirical evaluation on five medical datasets, we show that the greedy strategy is often reluctant to acquire features, as it cannot forecast the benefit of acquiring multiple features in combination.
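
A naive sketch of the objective such a framework minimizes, not VOILA's actual algorithm: it enumerates subsets exhaustively, whereas VOILA prunes the lattice via conditional independencies; all names and the cost estimator below are hypothetical:

```python
import itertools

def expected_total_cost(subset, acq_cost, exp_misclf_cost):
    """Acquisition cost of the subset plus the expected misclassification cost
    after observing it; exp_misclf_cost is an assumed callable standing in for
    probabilistic inference over the unobserved features."""
    return sum(acq_cost[f] for f in subset) + exp_misclf_cost(subset)

def best_subset(features, acq_cost, exp_misclf_cost):
    """Brute-force search over all feature subsets for the minimum total cost."""
    candidates = itertools.chain.from_iterable(
        itertools.combinations(features, r) for r in range(len(features) + 1))
    return min(candidates,
               key=lambda s: expected_total_cost(s, acq_cost, exp_misclf_cost))

# Toy example: expected misclassification cost shrinks as features are added.
acq = {"blood_test": 2.0, "x_ray": 5.0, "mri": 9.0}
est = lambda s: 10.0 / (1 + len(s))   # hypothetical estimator, not a real model
print(best_subset(list(acq), acq, est))   # ('blood_test',)
```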

