The Use of Prediction Reliability Estimates on Imbalanced Datasets

Data Mining ◽  
2013 ◽  
pp. 692-703
Author(s):  
Domen Košir ◽  
Zoran Bosnic ◽  
Igor Kononenko

Data mining techniques are extensively used on medical data, which typically consist of many normal examples and few interesting ones. When presented with highly imbalanced data, some standard classifiers tend to ignore the minority class, which leads to poor performance. Various solutions have been proposed to counter this problem; random undersampling, random oversampling, and SMOTE (Synthetic Minority Oversampling Technique) are the best-known approaches. In recent years, several approaches for evaluating the reliability of individual predictions have been developed. Most recently, a simple and efficient approach based on the classifier’s class probability estimates was shown to outperform the other reliability estimates. The authors propose to use this reliability estimate to improve the SMOTE algorithm. In this study, they demonstrate the positive effects of the proposed algorithms on artificial datasets. The authors then apply the developed methodology to the problem of predicting the maximal wall shear stress (MWSS) in the human carotid artery bifurcation. The results indicate that it is feasible to improve the classifier’s performance by balancing the data with their versions of the SMOTE algorithm.
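
The abstract does not spell out how the class-probability-based reliability estimate is combined with SMOTE, so the following is only a minimal sketch of the general idea, assuming that minority examples predicted less reliably (lower estimated probability of their true class) are chosen more often as bases for SMOTE-style interpolation. The function name and the use of a random forest as the probability estimator are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def reliability_weighted_smote(X, y, minority_label, n_new, k=5, random_state=0):
    """Sketch: oversample minority examples in proportion to how unreliably
    the current classifier predicts them (1 - predicted probability of the
    true class), then interpolate SMOTE-style between minority neighbours."""
    rng = np.random.default_rng(random_state)
    clf = RandomForestClassifier(random_state=random_state).fit(X, y)
    X_min = X[y == minority_label]
    # Reliability proxy: the classifier's probability estimate for the minority class.
    proba = clf.predict_proba(X_min)[:, list(clf.classes_).index(minority_label)]
    weights = (1.0 - proba) + 1e-6          # less reliable examples get more weight
    weights /= weights.sum()
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(len(X_min), p=weights)   # pick a base minority example
        j = rng.choice(idx[i][1:])              # one of its minority neighbours
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_new, minority_label)])
    return X_bal, y_bal
```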


2013 ◽  
Vol 41 (1) ◽  
pp. 33-52 ◽  
Author(s):  
Byron C. Wallace ◽  
Issa J. Dahabreh

2020 ◽  
pp. 096228022098048
Author(s):  
Olga Lyashevska ◽  
Fiona Malone ◽  
Eugene MacCarthy ◽  
Jens Fiehler ◽  
Jan-Hendrik Buhk ◽  
...  

Imbalance between positive and negative outcomes, a so-called class imbalance, is a problem commonly found in medical data. Imbalanced data hinder the performance of conventional classification methods, which aim to improve the overall accuracy of the model without accounting for the uneven distribution of the classes. To rectify this, the data can be resampled by oversampling the positive (minority) class until the classes are approximately equally represented. After that, a prediction model such as a gradient boosting algorithm can be fitted with greater confidence. This classification method allows for non-linear relationships and deep interaction effects while focusing on difficult areas by iteratively shifting towards problematic observations. In this study, we demonstrate the application of these methods to medical data and develop a practical framework for evaluating the features that contribute to the probability of stroke.
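
As a rough illustration of the pipeline described above (not the study's actual code), the sketch below randomly oversamples the positive class to parity and then fits a scikit-learn gradient boosting classifier; variable names and the train/test handling are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def random_oversample(X, y, positive_label=1, random_state=0):
    """Duplicate positive (minority) examples at random until both classes
    are equally represented. Assumes positives are the minority class."""
    rng = np.random.default_rng(random_state)
    pos = np.flatnonzero(y == positive_label)
    neg = np.flatnonzero(y != positive_label)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    keep = np.concatenate([neg, pos, extra])
    return X[keep], y[keep]

# Illustrative usage with an already-loaded feature matrix X and labels y:
# X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# X_bal, y_bal = random_oversample(X_train, y_train)
# model = GradientBoostingClassifier().fit(X_bal, y_bal)
# print(model.score(X_test, y_test))
```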


2021 ◽  
Vol 10 (5) ◽  
pp. 2733-2741
Author(s):  
Abeer S. Desuky ◽  
Asmaa Hekal Omar ◽  
Naglaa M. Mostafa

Due to the widespread use of electronic health databases in many healthcare services, healthcare data are available to researchers in the classification field to make disease diagnosis more efficient. However, classifying healthcare and medical data is particularly challenging because the data are often imbalanced. Most existing algorithms tend to assign samples to the majority class, resulting in poor prediction of the minority class. In this paper, a novel preprocessing method is proposed that uses boosting and crossover to optimize the ratio between the two classes by progressively rebuilding the training dataset. This approach is shown to outperform other state-of-the-art ensemble methods, as demonstrated by experiments on seven real-world medical datasets with different imbalance ratios and various distributions.
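
The abstract does not describe how boosting and crossover interact, so the sketch below is only one speculative reading: a uniform crossover operator recombines pairs of minority-class samples, and the training set is rebuilt progressively over several rounds until the class ratio is roughly balanced. All function names and the rebuilding schedule are hypothetical.

```python
import numpy as np

def crossover_minority(X_min, n_new, rng):
    """Uniform crossover: each feature of a synthetic sample is copied from
    one of two randomly chosen minority-class parents."""
    children = []
    for _ in range(n_new):
        a, b = X_min[rng.integers(len(X_min), size=2)]
        mask = rng.random(X_min.shape[1]) < 0.5
        children.append(np.where(mask, a, b))
    return np.array(children)

def progressive_rebalance(X, y, minority_label, rounds=5, random_state=0):
    """Sketch: enlarge the minority class a little each round until the classes
    are roughly balanced, mimicking a progressive rebuilding of the training
    set; the paper's actual interaction with boosting is not shown here."""
    rng = np.random.default_rng(random_state)
    step = max(1, ((y != minority_label).sum() - (y == minority_label).sum()) // rounds)
    for _ in range(rounds):
        X_min = X[y == minority_label]
        deficit = (y != minority_label).sum() - len(X_min)
        if deficit <= 0:
            break
        batch = crossover_minority(X_min, min(step, deficit), rng)
        X = np.vstack([X, batch])
        y = np.concatenate([y, np.full(len(batch), minority_label)])
    return X, y
```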


2020 ◽  
Vol 31 (2) ◽  
pp. 25
Author(s):  
Liqaa M. Shoohi ◽  
Jamila H. Saud

Classification of imbalanced data is an important issue. Many algorithms have been developed for classification, such as Back Propagation (BP) neural networks, decision trees, and Bayesian networks, and have been used repeatedly in many fields. These algorithms suffer from the problem of imbalanced data, where some classes contain far more instances than others; imbalanced data result in poor performance and a bias towards one class at the expense of the others. In this paper, we propose three techniques based on over-sampling (O.S.) for processing an imbalanced dataset, redistributing it, and converting it into a balanced dataset. These techniques are the Improved Synthetic Minority Over-Sampling Technique (Improved SMOTE), Borderline-SMOTE + Imbalance Ratio (IR), and Adaptive Synthetic Sampling (ADASYN) + IR. Each technique generates synthetic samples for the minority class to achieve balance between the minority and majority classes and then calculates the IR between them. Experimental results show that the Improved SMOTE algorithm outperforms the Borderline-SMOTE + IR and ADASYN + IR algorithms because it achieves a higher balance between the minority and majority classes.
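
For reference, the core step shared by these techniques is the SMOTE interpolation x_new = x_i + λ·(x_nn − x_i) between a minority example and one of its nearest minority neighbours, together with the imbalance ratio IR = n_majority / n_minority. The sketch below shows only this common core; the paper's Improved SMOTE variant is not reproduced, and off-the-shelf Borderline-SMOTE and ADASYN implementations are available in the imbalanced-learn package.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def imbalance_ratio(y, minority_label):
    """IR = number of majority examples / number of minority examples."""
    n_min = (y == minority_label).sum()
    return (len(y) - n_min) / n_min

def basic_smote(X_min, n_new, k=5, random_state=0):
    """Core SMOTE step: x_new = x_i + lam * (x_nn - x_i), where x_nn is a
    random nearest neighbour of a randomly chosen minority example x_i."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i][1:])        # skip idx[i][0], the point itself
        lam = rng.random()
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```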


2021 ◽  
Vol 25 (5) ◽  
pp. 1169-1185
Author(s):  
Deniu He ◽  
Hong Yu ◽  
Guoyin Wang ◽  
Jie Li

The problem of initializing active learning is considered in this paper. In particular, the paper studies this problem in an imbalanced-data scenario, referred to as the class-imbalance active learning cold-start problem. The proposed method is a two-stage clustering-based active learning cold-start (ALCS) approach. In the first stage, to separate the instances of the minority class from those of the majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, grouping the data into multiple clusters. In the second stage, the initial training instances are selected from each cluster using an adaptive mechanism for determining candidate representative instances and a cluster-cyclic instance query mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on active learning.
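
The paper's multi-center clustering, inter-cluster tightness measure, and query mechanisms are not detailed in the abstract; the sketch below only illustrates the two-stage shape of such a cold-start procedure, substituting ordinary k-means for the first stage and a simple cluster-cyclic query of centre-nearest instances for the second. All parameters are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def cold_start_queries(X, n_clusters=10, budget=20, random_state=0):
    """Stage 1: group the unlabelled pool into clusters.
    Stage 2: cycle over the clusters, each time querying the not-yet-chosen
    instance closest to the cluster centre, until the labelling budget is spent."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    # Pre-sort each cluster's members by distance to its centre.
    order = {c: sorted(np.flatnonzero(km.labels_ == c),
                       key=lambda i: np.linalg.norm(X[i] - km.cluster_centers_[c]))
             for c in range(n_clusters)}
    chosen, used = [], set()
    while len(chosen) < budget:
        progressed = False
        for c in range(n_clusters):
            for i in order[c]:
                if i not in used:
                    chosen.append(i)
                    used.add(i)
                    progressed = True
                    break
            if len(chosen) == budget:
                break
        if not progressed:
            break                      # unlabelled pool exhausted
    return chosen                      # indices to send to the oracle for labelling
```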


2013 ◽  
Vol 443 ◽  
pp. 741-745
Author(s):  
Hu Li ◽  
Peng Zou ◽  
Wei Hong Han ◽  
Rong Ze Xia

Much real-world data is imbalanced, i.e., one category contains significantly more samples than the others. Traditional classification methods treat all categories equally and are therefore often ineffective. Based on a comprehensive analysis of existing research, we propose a new clustering-based method for imbalanced data classification. The method first clusters both the majority class and the minority class. The clustered minority class is then over-sampled with SMOTE, while the clustered majority class is randomly under-sampled. Through clustering, the proposed method avoids the loss of useful information during resampling. Experiments on several UCI datasets show that the proposed method effectively improves classification results on imbalanced data.
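
A minimal sketch of this cluster-then-resample idea is given below, assuming k-means for the per-class clustering, a simple SMOTE-style interpolation inside minority clusters, and random undersampling inside majority clusters; the number of clusters and the target class size are arbitrary assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_resample(X, y, minority_label, majority_label, k=3, random_state=0):
    """Sketch: cluster each class separately, oversample minority clusters by
    SMOTE-style interpolation, and randomly undersample majority clusters so
    that both classes end up near a common target size."""
    rng = np.random.default_rng(random_state)
    X_min, X_maj = X[y == minority_label], X[y == majority_label]
    target = (len(X_min) + len(X_maj)) // 2          # assumed common class size

    new_X, new_y = [], []
    # Minority side: keep every cluster and add interpolated samples.
    lab = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X_min)
    for c in range(k):
        Xc = X_min[lab == c]
        n_new = max(0, target // k - len(Xc))
        i = rng.integers(len(Xc), size=n_new)
        j = rng.integers(len(Xc), size=n_new)
        lam = rng.random((n_new, 1))
        Xc = np.vstack([Xc, Xc[i] + lam * (Xc[j] - Xc[i])])
        new_X.append(Xc)
        new_y.append(np.full(len(Xc), minority_label))
    # Majority side: keep only a random subset of each cluster.
    lab = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X_maj)
    for c in range(k):
        Xc = X_maj[lab == c]
        keep = rng.permutation(len(Xc))[: max(1, target // k)]
        new_X.append(Xc[keep])
        new_y.append(np.full(len(keep), majority_label))
    return np.vstack(new_X), np.concatenate(new_y)
```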


2018 ◽  
Vol 2018 ◽  
pp. 1-15 ◽  
Author(s):  
Huaping Guo ◽  
Xiaoyu Diao ◽  
Hongbing Liu

Rotation Forest is an ensemble learning approach that achieves better performance than Bagging and Boosting by building accurate and diverse classifiers on rotated feature spaces. However, like other conventional classifiers, Rotation Forest does not work well on imbalanced data, which are characterized by having far fewer examples of one class (the minority class) than the other (the majority class), while the cost of misclassifying minority-class examples is often much higher than in the opposite case. This paper proposes a novel method called Embedding Undersampling Rotation Forest (EURF) to handle this problem by (1) sampling subsets from the majority class and learning a projection matrix from each subset, and (2) obtaining training sets by projecting re-undersampled subsets of the original dataset into the new spaces defined by these matrices and constructing an individual classifier from each training set. In the first step, undersampling forces the rotation matrix to better capture the features of the minority class without harming the diversity between individual classifiers. In the second step, the undersampling technique aims to improve the performance of the individual classifiers on the minority class. Experimental results show that EURF achieves significantly better performance than other state-of-the-art methods.
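
EURF itself builds on the full Rotation Forest procedure (PCA on random feature subsets); the sketch below is a deliberately simplified stand-in that keeps only the two undersampling roles described above, using plain PCA as the per-member rotation and a decision tree as the base learner. It should not be read as the authors' algorithm.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def undersample(X, y, minority_label, rng):
    """Keep all minority examples plus an equally sized random majority subset."""
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    keep = np.concatenate([min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
    return X[keep], y[keep]

class SimplifiedEURF:
    """Rough sketch of the EURF idea: each ensemble member learns its own
    rotation (here plain PCA) from one undersampled subset and is trained on
    another undersampled subset projected by that rotation."""
    def __init__(self, n_members=10, random_state=0):
        self.n_members, self.random_state = n_members, random_state

    def fit(self, X, y, minority_label):
        rng = np.random.default_rng(self.random_state)
        self.members_ = []
        for _ in range(self.n_members):
            Xr, _ = undersample(X, y, minority_label, rng)      # learn the rotation
            rot = PCA().fit(Xr)
            Xt, yt = undersample(X, y, minority_label, rng)     # re-undersample for training
            tree = DecisionTreeClassifier(random_state=self.random_state)
            tree.fit(rot.transform(Xt), yt)
            self.members_.append((rot, tree))
        return self

    def predict(self, X):
        # Majority vote over members; assumes non-negative integer class labels.
        votes = np.array([t.predict(r.transform(X)) for r, t in self.members_])
        return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```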


Author(s):  
Darko Pevec ◽  
Zoran Bosnic ◽  
Igor Kononenko

Current machine learning algorithms perform well in many problem domains, but in risk-sensitive decision making, for example in medicine and finance, experts do not rely on common evaluation methods that provide overall assessments of models, because such techniques do not provide any information about single predictions. This chapter summarizes the research areas that have motivated the development of various approaches to individual prediction reliability. Based on these motivations, the authors describe six approaches to reliability estimation: inverse transduction, local sensitivity analysis, bagging variance, local cross-validation, local error modelling, and density-based estimation. Empirical evaluation on benchmark datasets provides promising results, especially for use with decision and regression trees. The testing results also reveal that the reliability estimators exhibit different performance levels when used with different models and in different domains. The authors show the usefulness of individual prediction reliability estimates in attempts to predict breast cancer recurrence. In this context, estimating the reliability of individual predictions is of crucial importance for physicians seeking to validate predictions derived from classification and regression models.
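
As one concrete example of the six approaches listed above, the sketch below implements a bagging-variance style reliability estimate for a regression model: the spread of predictions across bootstrap replicas serves as an unreliability score for each individual prediction. The base learner and the number of replicas are arbitrary choices, not those used in the chapter.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

def bagging_variance_reliability(X_train, y_train, X_query, n_models=30, random_state=0):
    """Bagging-variance reliability estimate (sketch): train the model on
    bootstrap replicas of the training set and use the variance of the
    replicas' predictions as an unreliability score for each query point."""
    preds = []
    for m in range(n_models):
        Xb, yb = resample(X_train, y_train, random_state=random_state + m)
        model = DecisionTreeRegressor(random_state=random_state).fit(Xb, yb)
        preds.append(model.predict(X_query))
    preds = np.array(preds)                       # shape: (n_models, n_query)
    return preds.mean(axis=0), preds.var(axis=0)  # aggregated prediction, unreliability
```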

