ImbTreeEntropy and ImbTreeAUC: Novel R Packages for Decision Tree Learning on the Imbalanced Datasets

Electronics ◽  
2021 ◽  
Vol 10 (6) ◽  
pp. 657
Author(s):  
Krzysztof Gajowniczek ◽  
Tomasz Ząbkowski

This paper presents two R packages, ImbTreeEntropy and ImbTreeAUC, to handle imbalanced data problems. ImbTreeEntropy's functionality includes the application of generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure the impurity of a node. ImbTreeAUC provides non-standard measures to choose an optimal split point for an attribute (as well as the optimal attribute for splitting) by employing local, semi-global and global AUC (Area Under the ROC Curve) measures. Both packages are applicable to binary and multiclass problems, and they support cost-sensitive learning, by defining a misclassification cost matrix, as well as weight-sensitive learning. The packages accept all types of attributes, including continuous, ordered and nominal, where the latter type is simplified for multiclass problems to reduce the computational overhead. Both applications enable optimization of the thresholds at which posterior probabilities determine the final class labels, so that misclassification costs are minimized. Model overfitting can be managed either during the growing phase or at the end using post-pruning. The packages are mainly implemented in R; however, some computationally demanding functions are written in plain C++. To speed up learning, parallel processing is supported as well.
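
For illustration, the Rényi and Tsallis entropies named above can serve as node impurity measures as sketched below; this is a rendering of the standard formulas, not the packages' code, and the order parameter q = 2 is an arbitrary choice:

```python
import numpy as np

def renyi_entropy(p, q=2.0):
    """Rényi entropy of order q (q > 0, q != 1); tends to Shannon entropy as q -> 1."""
    p = p[p > 0]                                # drop empty classes
    return np.log(np.sum(p ** q)) / (1.0 - q)

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of order q (q != 1); also recovers Shannon entropy as q -> 1."""
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

# Class proportions in a candidate node: lower entropy means a purer node,
# so a split is chosen to maximize the reduction in impurity.
node = np.array([0.7, 0.2, 0.1])
print(renyi_entropy(node), tsallis_entropy(node))
```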

Processes ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 1107
Author(s):  
Krzysztof Gajowniczek ◽  
Tomasz Ząbkowski

This paper presents two new R packages, ImbTreeEntropy and ImbTreeAUC, for building decision trees, including their interactive construction and analysis, a highly regarded feature for field experts who want to be involved in the learning process. ImbTreeEntropy's functionality includes the application of generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure the impurity of a node. ImbTreeAUC provides non-standard measures to choose an optimal split point for an attribute (as well as the optimal attribute for splitting) by employing local, semi-global and global AUC measures. The contribution of both packages is that, thanks to interactive learning, the user can construct a new tree from scratch or, if required, decide on the optimal split in ambiguous situations during the learning phase, taking into account each attribute and its cut-off. The main difference from existing solutions is that our packages provide mechanisms for analyzing the structures of several trees simultaneously after growing and/or pruning. Both packages support cost-sensitive learning, by defining a misclassification cost matrix, as well as weight-sensitive learning. Additionally, the tree structure of the model can be represented as a rule-based model, along with various quality measures, such as support, confidence, lift, conviction, addedValue, cosine, Jaccard and Laplace.
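
For orientation, the rule quality measures listed above typically follow the standard association-rule definitions; the sketch below computes a few of them from hypothetical leaf counts (the counts and the rule are invented for illustration):

```python
# Hypothetical counts for a rule "IF conditions THEN class = c" read off a tree leaf.
n_total = 1000        # observations in the training set
n_antecedent = 120    # observations satisfying the rule's conditions
n_consequent = 300    # observations of class c overall
n_both = 90           # observations satisfying the conditions AND of class c

support = n_both / n_total                     # P(A and C)
confidence = n_both / n_antecedent             # P(C | A)
lift = confidence / (n_consequent / n_total)   # P(C | A) / P(C)
conviction = (1 - n_consequent / n_total) / (1 - confidence)  # implication strength

print(f"support={support:.3f} confidence={confidence:.3f} "
      f"lift={lift:.2f} conviction={conviction:.2f}")
```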


Author(s):  
Yuan Lan ◽  
Xiaohong Han ◽  
Weiwei Zong ◽  
Xiaojian Ding ◽  
Xiaoyan Xiong ◽  
...  

Rolling element bearings are key parts of rotating machinery, and their fault diagnosis is of great importance. In many real bearing fault diagnosis applications, the number of fault samples is much smaller than the number of normal samples, i.e. the data are imbalanced. Many traditional diagnosis methods achieve low accuracy because they have a natural tendency to favor the majority class, assuming a balanced class distribution or equal misclassification costs. To deal with imbalanced data, this article proposes a novel two-step fault diagnosis framework for rolling element bearings. Step 1 uses a weighted extreme learning machine to classify samples as normal or abnormal, and Step 2 diagnoses the underlying anomaly in detail using a preliminary extreme learning machine. In addition, a gravitational search algorithm is applied to extract the significant features and determine the optimal parameters of the weighted extreme learning machine and extreme learning machine classifiers. The effectiveness of our approach is verified on raw data collected from rolling element bearing experiments conducted in our Institute; the empirical results show that our approach is fast and achieves diagnosis accuracies above 96%.
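
A minimal sketch of the weighted ELM idea for the binary normal/abnormal step, assuming inverse-class-frequency weights; the hidden size and regularization constant are illustrative, and the paper's gravitational-search tuning is omitted:

```python
import numpy as np

def weighted_elm_train(X, y, n_hidden=50, C=1.0, seed=0):
    """Minimal weighted ELM sketch for binary labels y in {0, 1}: minority
    samples get weights inversely proportional to their class frequency."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_in = rng.normal(size=(d, n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)           # random hidden biases
    H = np.tanh(X @ W_in + b)               # hidden-layer output matrix
    T = np.where(y == 1, 1.0, -1.0)         # bipolar targets
    n_pos = max(int(y.sum()), 1)
    w = np.where(y == 1, n / (2.0 * n_pos), n / (2.0 * max(n - n_pos, 1)))
    # Regularized weighted least squares for the output weights:
    # beta = (H' W H + I/C)^-1 H' W T
    Wd = np.diag(w)
    beta = np.linalg.solve(H.T @ Wd @ H + np.eye(n_hidden) / C, H.T @ Wd @ T)
    return W_in, b, beta

def weighted_elm_predict(X, W_in, b, beta):
    return (np.tanh(X @ W_in + b) @ beta > 0).astype(int)
```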


Classification is a supervised learning task that assigns items to groups on the basis of class labels; algorithms are trained on labeled datasets to accomplish it. In this process the dataset plays an important role. If instances of one class (the majority class) greatly outnumber instances of another class (the minority class), so that a classifier struggles to learn the characteristics of the minority class, the dataset is termed imbalanced. Such datasets lead to biased prediction or misclassification in the real world: models trained on them may achieve very high accuracy during training yet, being unfamiliar with minority-class instances, fail to predict the minority class. This paper presents a survey of techniques proposed for handling imbalanced data and compares them on the basis of the F-measure.
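
For reference, the F-measure used for the comparison combines precision and recall; a small sketch with the standard definition (the counts in the example are invented):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from confusion-matrix counts; beta weights recall vs. precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# A classifier that never predicts the minority class scores 0 on it,
# no matter how high its overall accuracy is.
print(f_measure(tp=0, fp=0, fn=50))    # 0.0
print(f_measure(tp=40, fp=10, fn=10))  # 0.8
```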


2018 ◽  
Vol 173 ◽  
pp. 01009 ◽  
Author(s):  
Gennady Ososkov ◽  
Pavel Goncharov

The paper demonstrates the advantages of deep learning networks over ordinary neural networks in image classification. An autoassociative neural network is used as a standalone autoencoder to extract the most informative features of the input data before the networks are compared as classifiers. Most of the effort in working with deep networks goes into the painstaking optimization of their structures and components, such as activation functions and weights, as well as the procedures for minimizing the loss function, in order to improve performance and shorten learning time. It is also shown that deep autoencoders, after special training, develop a remarkable ability to denoise images. Convolutional Neural Networks are also used to solve a topical problem in protein genetics, using durum wheat classification as an example. The results of our comparative study demonstrate the clear advantage of deep networks, as well as the denoising power of autoencoders. In our work we use both GPU and cloud services to speed up the calculations.
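
As a toy illustration of the denoising idea, not the paper's architecture: a one-hidden-layer autoencoder trained to reconstruct clean inputs from noise-corrupted ones, with all sizes and rates chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_denoising_autoencoder(X, n_hidden=32, lr=0.1, epochs=200, noise=0.3):
    """Tiny denoising autoencoder trained with plain gradient descent:
    the input is corrupted with Gaussian noise, the target is the clean input."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        Xn = X + noise * rng.normal(size=X.shape)  # corrupt the input
        H = np.tanh(Xn @ W1 + b1)                  # encoder
        R = H @ W2 + b2                            # linear decoder (reconstruction)
        G = 2 * (R - X) / n                        # gradient of the squared error
        gW2 = H.T @ G; gb2 = G.sum(0)
        Gh = (G @ W2.T) * (1 - H**2)               # backprop through tanh
        gW1 = Xn.T @ Gh; gb1 = Gh.sum(0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1
    return W1, b1, W2, b2

# Usage: X holds flattened images scaled to [0, 1]; the hidden activations
# tanh(X @ W1 + b1) then serve as compact features for a downstream classifier.
```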


Author(s):  
YANMIN SUN ◽  
ANDREW K. C. WONG ◽  
MOHAMED S. KAMEL

Classification of data with an imbalanced class distribution suffers a significant drop in the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with respect to: the application domains; the nature of the problem; the learning difficulties of standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Qing-Yan Yin ◽  
Jiang-She Zhang ◽  
Chun-Xia Zhang ◽  
Sheng-Cai Liu

Cost-sensitive boosting algorithms have proven successful for solving difficult class imbalance problems. However, the influence of misclassification costs and imbalance level on algorithm performance is still not clear. This paper conducts an empirical comparison of six representative cost-sensitive boosting algorithms: AdaCost, CSB1, CSB2, AdaC1, AdaC2 and AdaC3. The algorithms are thoroughly evaluated by a comprehensive suite of experiments in which nearly fifty thousand classification models are trained on 17 real-world imbalanced data sets. Experimental results show that the AdaC series of algorithms generally outperforms AdaCost and CSB across data sets of different imbalance levels. Furthermore, the AdaC2 algorithm stands out around the misclassification cost setting CN = 0.7, CP = 1, especially on strongly imbalanced data sets. For data sets with a low level of imbalance, there is no significant difference between the AdaC series algorithms. In addition, the results indicate that AdaC1 is comparatively insensitive to the misclassification costs, which is consistent with the findings of preceding research.
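
For concreteness, one plausible rendering of a single AdaC2 boosting round, in which the cost item multiplies the sample weight outside the exponent (labels are in {-1, +1}; the example mirrors the CN = 0.7, CP = 1 setting highlighted above):

```python
import numpy as np

def adac2_update(D, C, y, h, eps=1e-12):
    """One AdaC2 round: compute alpha and the re-weighted distribution.
    D: current sample weights; C: per-sample misclassification costs;
    y, h: true and predicted labels in {-1, +1}."""
    correct = (y == h)
    num = np.sum(C[correct] * D[correct])            # cost-weighted correct mass
    den = np.sum(C[~correct] * D[~correct]) + eps    # cost-weighted error mass
    alpha = 0.5 * np.log(num / den)                  # classifier vote
    D_new = C * D * np.exp(-alpha * y * h)           # cost scales the weight (AdaC2)
    return alpha, D_new / D_new.sum()                # renormalize to a distribution

# Toy round: one false negative and one false positive, with CP = 1, CN = 0.7.
y = np.array([+1, +1, -1, -1, -1])
h = np.array([+1, -1, -1, -1, +1])
C = np.where(y == +1, 1.0, 0.7)
D = np.full(5, 0.2)
alpha, D = adac2_update(D, C, y, h)
print(alpha, D)
```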


2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Yanqiu Liu ◽  
Huijuan Lu ◽  
Ke Yan ◽  
Haixia Xia ◽  
Chunlin An

Embedding cost-sensitive factors into classifiers increases classification stability and reduces classification costs on large-scale, redundant and imbalanced datasets, such as gene expression data. In this study, we extend our previous work, the Dissimilar ELM (D-ELM), by introducing misclassification costs into the classifier; we name the proposed algorithm cost-sensitive D-ELM (CS-D-ELM). Furthermore, we embed a rejection cost into the CS-D-ELM to increase the classification stability of the algorithm. Experimental results show that the CS-D-ELM algorithm with embedded rejection cost effectively reduces the average and overall cost of the classification process, while classification accuracy remains competitive. The proposed method can be extended to classification problems on other redundant and imbalanced data.
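
A generic sketch of a cost-based decision rule with a reject option, not the exact CS-D-ELM rule: a sample is rejected whenever the rejection cost undercuts every class's expected misclassification cost (the posteriors and costs below are invented):

```python
import numpy as np

def decide_with_rejection(posteriors, cost_matrix, reject_cost):
    """Cost-minimizing decision with a reject option.
    posteriors: (n_samples, n_classes); cost_matrix[j, k] is the cost of
    predicting class k when the true class is j; -1 marks "reject"."""
    expected = posteriors @ cost_matrix        # expected cost of each prediction
    best = expected.argmin(axis=1)
    best_cost = expected.min(axis=1)
    return np.where(best_cost <= reject_cost, best, -1)

# Example: missing class 1 is five times worse, and rejection costs 0.4.
P = np.array([[0.55, 0.45],    # ambiguous sample -> rejected
              [0.95, 0.05]])   # confident sample -> classified
Cm = np.array([[0.0, 1.0],
               [5.0, 0.0]])
print(decide_with_rejection(P, Cm, reject_cost=0.4))   # [-1  0]
```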


2011 ◽  
Vol 271-273 ◽  
pp. 1291-1296
Author(s):  
Jin Wei Zhang ◽  
Hui Juan Lu ◽  
Wu Tao Chen ◽  
Yi Lu

A classifier built from a highly skewed class distribution generally predicts an unknown sample as the majority class much more frequently than as the minority class, because the classifier is designed to maximize overall classification accuracy. We compare three classification methods for data sets whose class distribution is imbalanced and whose misclassification costs are non-uniform: cost-sensitive learning, in which the misclassification cost is embedded in the algorithm; over-sampling; and under-sampling. We compare these three methods to determine which produces the best overall classification under various circumstances, and reach the following conclusions: (1) cost-sensitive learning is suitable for classifying imbalanced data sets; it outperforms the sampling methods overall and is more stable than they are, except when the data set is quite small; (2) if the data set is highly skewed or quite small, over-sampling methods may be better.
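
For reference, the two sampling baselines compared above can be sketched as follows; both simply equalize the class counts, by duplicating minority samples or discarding majority ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(X, y, minority=1, method="over"):
    """Random over-sampling duplicates minority samples with replacement;
    random under-sampling discards majority samples without replacement."""
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)
    if method == "over":   # grow the minority class to the majority size
        extra = rng.choice(idx_min, size=len(idx_maj), replace=True)
        keep = np.concatenate([idx_maj, extra])
    else:                  # shrink the majority class to the minority size
        keep = np.concatenate(
            [idx_min, rng.choice(idx_maj, size=len(idx_min), replace=False)])
    return X[keep], y[keep]
```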


Author(s):  
Parag C. Pendharkar ◽  
Sudhir Nanda ◽  
James A. Rodger ◽  
Rahul Bhaskar

This chapter illustrates how a misclassification cost matrix can be incorporated into an evolutionary classification system for medical diagnosis. Most classification systems for medical diagnosis attempt to minimize misclassifications (or maximize the number of correctly classified cases). This approach assumes that the costs of Type I and Type II errors are equal; there is evidence that they are not, and that incorporating costs into classification systems can lead to superior outcomes. We use principles of evolution to develop and test a genetic algorithm (GA) based approach that incorporates asymmetric Type I and Type II error costs. Using simulated and real-life medical data, we show that the proposed approach, incorporating Type I and Type II misclassification costs, results in lower misclassification costs than linear discriminant analysis (LDA) and GA approaches that do not incorporate these costs.
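
A hedged sketch of the kind of fitness function such a GA could minimize, with asymmetric Type I/Type II costs; the chromosome encoding and cost values are assumptions, not the chapter's exact formulation:

```python
import numpy as np

def misclassification_cost(y_true, y_pred, c_type1=1.0, c_type2=5.0):
    """Total asymmetric cost. Type I: false positive (healthy case flagged);
    Type II: false negative (condition missed). Cost values are illustrative."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return c_type1 * fp + c_type2 * fn

def fitness(chromosome, X, y_true):
    """Score a candidate linear classifier encoded as [weights..., threshold];
    the GA evolves chromosomes so as to minimize this cost."""
    w, t = chromosome[:-1], chromosome[-1]
    y_pred = (X @ w > t).astype(int)
    return misclassification_cost(y_true, y_pred)
```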


2011 ◽  
Vol 41 ◽  
pp. 69-95 ◽  
Author(s):  
M. Bilgic ◽  
L. Getoor

We address the cost-sensitive feature acquisition problem, where misclassifying an instance is costly but the expected misclassification cost can be reduced by acquiring the values of missing features. Because acquiring features is costly as well, the objective is to acquire the right set of features so that the sum of the feature acquisition cost and the misclassification cost is minimized. We describe the Value of Information Lattice (VOILA), an optimal and efficient feature subset acquisition framework. Unlike the common practice of acquiring features greedily, VOILA can reason about subsets of features. It efficiently searches the space of possible feature subsets by discovering and exploiting conditional independence properties between the features, and it reuses probabilistic inference computations to further speed up the process. Through empirical evaluation on five medical datasets, we show that the greedy strategy is often reluctant to acquire features, as it cannot forecast the benefit of acquiring multiple features in combination.
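
A naive sketch of the objective such a framework minimizes, not VOILA's actual algorithm: it enumerates subsets exhaustively, whereas VOILA prunes the lattice via conditional independencies; all names and the cost estimator below are hypothetical:

```python
import itertools

def expected_total_cost(subset, acq_cost, exp_misclf_cost):
    """Acquisition cost of the subset plus the expected misclassification cost
    after observing it; exp_misclf_cost is an assumed callable standing in for
    probabilistic inference over the unobserved features."""
    return sum(acq_cost[f] for f in subset) + exp_misclf_cost(subset)

def best_subset(features, acq_cost, exp_misclf_cost):
    """Brute-force search over all feature subsets for the minimum total cost."""
    candidates = itertools.chain.from_iterable(
        itertools.combinations(features, r) for r in range(len(features) + 1))
    return min(candidates,
               key=lambda s: expected_total_cost(s, acq_cost, exp_misclf_cost))

# Toy example: expected misclassification cost shrinks as features are added.
acq = {"blood_test": 2.0, "x_ray": 5.0, "mri": 9.0}
est = lambda s: 10.0 / (1 + len(s))   # hypothetical estimator, not a real model
print(best_subset(list(acq), acq, est))   # ('blood_test',)
```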

