Tversky Similarity-Based Undersampling with Gaussian Kernelized Decision Stump AdaBoost Algorithm for Imbalanced Medical Data Classification

Author(s):  
M. Kamaladevi ◽  
V. Venkatraman

In recent years, imbalanced data classification has been applied in several domains, including fraud detection in the banking sector, disease prediction in the healthcare sector, and so on. To address the imbalanced classification problem at the data level, strategies such as undersampling or oversampling are widely used. Sampling techniques, however, pose the risk of significant information loss. The proposed method involves two processes, namely undersampling and classification. First, undersampling is performed by means of a Tversky Similarity Indexive Regression model. Here, regression together with the Tversky similarity index is used to analyze the relationship between two instances from the dataset. Next, Gaussian Kernelized Decision Stump AdaBoosting is used to classify the instances into two classes. Here, the root node of the decision stump makes a decision on the basis of the Gaussian kernel function, considering the average of neighboring points, and the result is obtained at the leaf node. Weights are also adjusted to minimize the training errors occurring during classification and find the best classifier. Experimental assessment is performed on two imbalanced datasets (the Pima Indian diabetes and Hepatitis datasets). Performance metrics such as precision, recall, area under the ROC curve (AUC), and F1-score are compared with existing undersampling methods. Experimental results show that the prediction accuracy of the minority class improves, thereby minimizing false positives and false negatives.
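The abstract does not give the similarity formula or the selection rule, so the following is a minimal sketch of the standard Tversky index on feature sets, plus a hypothetical undersampling rule that drops majority instances too similar to ones already kept; `undersample_majority` and its threshold are illustrative assumptions, not the paper's method.

```python
def tversky_index(x, y, alpha=0.5, beta=0.5):
    """Tversky similarity of two feature sets.

    alpha weights features unique to x, beta those unique to y;
    alpha = beta = 0.5 recovers the Dice coefficient.
    """
    x, y = set(x), set(y)
    common = len(x & y)
    if not x and not y:
        return 1.0  # two empty sets are identical
    return common / (common + alpha * len(x - y) + beta * len(y - x))


def undersample_majority(instances, threshold=0.8):
    """Hypothetical selection rule: keep a majority instance only if it is
    not Tversky-similar (above threshold) to any instance already kept."""
    kept = []
    for inst in instances:
        if all(tversky_index(inst, k) < threshold for k in kept):
            kept.append(inst)
    return kept
```

For example, with `alpha = beta = 0.5`, the sets `{1, 2, 3}` and `{2, 3, 4}` share two elements and differ by one on each side, giving a similarity of 2 / (2 + 0.5 + 0.5) = 2/3.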

2013 ◽  
Vol 443 ◽  
pp. 741-745
Author(s):  
Hu Li ◽  
Peng Zou ◽  
Wei Hong Han ◽  
Rong Ze Xia

Much real-world data is imbalanced, i.e., one category contains significantly more samples than the others. Traditional classification methods treat all categories equally and are often ineffective on such data. Based on a comprehensive analysis of existing research, we propose a new imbalanced data classification method based on clustering. The method first clusters both the majority class and the minority class. Then, the clustered minority class is over-sampled by SMOTE while the clustered majority class is under-sampled randomly. Through clustering, the proposed method avoids the loss of useful information during resampling. Experiments on several UCI datasets show that the proposed method can effectively improve classification results on imbalanced data.
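The clustering step is specific to the paper, but the SMOTE interpolation it relies on can be sketched in a few lines of numpy: each synthetic point lies on the segment between a minority sample and one of its k nearest minority neighbours. Function name and defaults here are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def smote_like(minority, n_new, k=3):
    """Generate n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        dist = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                      # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.vstack(synthetic)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the bounding box of the minority class.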


2017 ◽  
Vol 14 (3) ◽  
pp. 579-595 ◽  
Author(s):  
Lu Cao ◽  
Hong Shen

Imbalanced datasets exist widely in real life. The identification of the minority class in imbalanced datasets tends to be the focus of classification. As an enhanced variant of the support vector machine (SVM), the twin support vector machine (TWSVM) provides an effective technique for data classification. However, TWSVM assumes a relatively balanced training set and distribution to achieve good accuracy over the whole dataset, so it is not effective at handling imbalanced data classification problems. In this paper, we propose to combine a re-sampling technique, which uses oversampling and under-sampling to balance the training data, with TWSVM to deal with imbalanced data classification. Experimental results show that our proposed approach outperforms other state-of-the-art methods.
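The abstract does not specify how the over- and under-sampling are combined, so the following is one plausible sketch: bring every class to the mean class size, duplicating minority samples and randomly dropping majority ones, before handing the balanced set to TWSVM. The target-size rule is an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)


def hybrid_balance(X, y):
    """Bring every class to the mean class size: minority samples are
    duplicated (sampling with replacement), majority samples are randomly
    dropped (sampling without replacement)."""
    classes, counts = np.unique(y, return_counts=True)
    target = int(round(counts.mean()))
    keep = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        keep.append(rng.choice(members, size=target, replace=n < target))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```

On a 16-vs-4 split this yields ten samples per class, the midpoint between the two original sizes.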


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Chunye Wu ◽  
Nan Wang ◽  
Yu Wang

Imbalanced data classification is gaining importance in data mining and machine learning. The minority class recall rate requires special treatment in fields such as medical diagnosis, information security, industry, and computer vision. This paper proposes a new strategy and algorithm based on a cost-sensitive support vector machine to improve the minority class recall rate to 1, because the misclassification of even a few samples can cause serious losses in some practical problems. In the proposed method, the modification employs a margin compensation to make the margin lopsided, enabling decision boundary drift. When the boundary reaches a certain position, the minority class samples are generalized enough to achieve a recall rate of 1. In the experiments, the effects of different parameters on the performance of the algorithm were analyzed, and the optimal parameters for a recall rate of 1 were determined. The experimental results reveal that, for the imbalanced data classification problem, the traditional definite cost classification scheme and models selected by the area under the receiver operating characteristic curve criterion rarely achieve a recall rate of 1. The new strategy can yield a minority recall of 1 for imbalanced data while the loss on the majority class remains acceptable; moreover, it improves the g-means index. The proposed algorithm provides superior performance in minority recall compared to conventional methods. The proposed method has important practical significance in credit card fraud, medical diagnosis, and other areas.
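The paper's margin compensation applies to a cost-sensitive SVM; the same boundary-drift idea can be illustrated with a simpler stand-in. The sketch below fits a class-weighted logistic regression, then shifts the intercept until every minority sample lands on the positive side, forcing minority recall to 1. This substitutes logistic regression for the SVM, and all names and hyperparameters are assumptions.

```python
import numpy as np


def drift_for_full_recall(X, y, w_minority=5.0, lr=0.1, epochs=300):
    """Fit a class-weighted logistic regression, then drift the decision
    boundary (shift the intercept) until minority (y == 1) recall is 1."""
    n, d = X.shape
    sample_w = np.where(y == 1, w_minority, 1.0)  # cost-sensitive weights
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = sample_w * (p - y)                    # weighted gradient
        w -= lr * (X.T @ g) / n
        b -= lr * g.mean()
    # boundary drift: compensate the margin so no minority sample is missed
    scores = X @ w + b
    worst = scores[y == 1].min()
    if worst <= 0:
        b += -worst + 1e-6
    return w, b
```

The cost of the drift is extra false positives in the majority class, which is exactly the trade-off the paper argues is acceptable when missing a minority sample is very expensive.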


2021 ◽  
Vol 11 (18) ◽  
pp. 8546
Author(s):  
Mohamed S. Kraiem ◽  
Fernando Sánchez-Hernández ◽  
María N. Moreno-García

In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to data intrinsic characteristics, such as imbalance ratio, dataset size and dimensionality, overlap between classes, or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously considering a wide range of values since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies that focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.
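The selection models take dataset characteristics as input; a minimal sketch of computing such meta-features follows. Imbalance ratio, size, and dimensionality are straightforward; the overlap measure here (the fraction of samples whose nearest neighbour carries a different label, a common proxy for borderline examples) is my assumption, not necessarily the measure used in the study.

```python
import numpy as np


def meta_features(X, y):
    """Dataset characteristics of the kind fed to a model that selects
    a resampling strategy."""
    _, counts = np.unique(y, return_counts=True)
    n, d = X.shape
    # pairwise distances; a sample whose nearest neighbour has a different
    # label is treated as a borderline/overlap example
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nearest = D.argmin(axis=1)
    return {
        "imbalance_ratio": counts.max() / counts.min(),
        "n_samples": n,
        "n_features": d,
        "overlap": float((y[nearest] != y).mean()),
    }
```

A vector of such features per dataset, paired with the best-performing resampler found experimentally, is the kind of training data from which a selection model can be induced.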


2021 ◽  
Vol 10 (5) ◽  
pp. 2789-2795
Author(s):  
Seyyed Mohammad Javadi Moghaddam ◽  
Asadollah Noroozi

Data classification performance suffers when the data distribution is imbalanced: classifiers tend toward the majority class, which contains most of the instances. One popular approach is to balance the dataset using over- and under-sampling methods. This paper presents a novel pre-processing technique that performs both over- and under-sampling on an imbalanced dataset. The proposed method uses the SMOTE algorithm to enlarge the minority class. Moreover, a cluster-based approach is used to reduce the majority class, taking into consideration the new size of the minority class. Experimental results on 10 imbalanced datasets show that the suggested algorithm performs better than previous approaches.
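The cluster-based reduction step can be sketched as follows: cluster the majority class with as many clusters as the (post-SMOTE) minority size, then keep one representative per cluster. The tiny k-means and the nearest-to-centroid selection rule are implementation assumptions for illustration.

```python
import numpy as np


def kmeans(X, k, iters=25, seed=0):
    """Plain Lloyd's k-means: alternate nearest-centre assignment and
    centroid update for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels


def cluster_undersample(majority, target):
    """Keep the sample nearest each centroid, with the cluster count set
    to the new (post-SMOTE) minority class size."""
    centers, labels = kmeans(majority, target)
    keep = []
    for j in range(target):
        idx = np.flatnonzero(labels == j)
        if len(idx):
            d = np.linalg.norm(majority[idx] - centers[j], axis=1)
            keep.append(idx[d.argmin()])
    return majority[keep]
```

Unlike random undersampling, every region of the majority class contributes a representative, which is how clustering limits information loss.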


Author(s):  
Anjali S. More ◽  
Dipti P. Rana

In today's era, a wide variety of data mining applications face the challenge of handling imbalanced data classification and its impact on performance metrics. Skewed data distributions are present in an ample range of real-time applications, which has engrossed the attention of researchers. Fraud detection in finance, disease diagnosis in medical applications, oil spill detection, electricity theft, anomaly detection and intrusion detection in security, and other real-time applications exhibit uneven data distributions. Data imbalance affects classification performance metrics and increases the error rate. These challenges have prompted researchers to investigate imbalanced data applications and related machine learning approaches. The intent of this work is to review a wide variety of imbalanced data applications with skewed data distributions, covering both binary-class and multiclass imbalance, the problems encountered, the variety of approaches to resolving the imbalance, and possible open research areas.


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6616
Author(s):  
Leehter Yao ◽  
Tung-Bin Lin

Sensing data are often imbalanced across classes, for which oversampling the minority class is an effective remedy. In this paper, an effective oversampling method called evolutionary Mahalanobis distance oversampling (EMDO) is proposed for multi-class imbalanced data classification. EMDO uses a set of ellipsoids to approximate the decision regions of the minority class. Furthermore, multi-objective particle swarm optimization (MOPSO) is integrated with the Gustafson–Kessel algorithm in EMDO to learn the size, center, and orientation of every ellipsoid. Synthetic minority samples are generated based on Mahalanobis distance within every ellipsoid, with the number generated in each ellipsoid determined by the density of minority samples it contains. The results of computer simulations indicate that EMDO outperforms most of the widely used oversampling schemes.
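Learning the ellipsoids requires MOPSO and Gustafson–Kessel clustering, but the final generation step can be sketched on its own: draw points whose Mahalanobis distance from the ellipsoid centre stays within a radius, by sampling uniformly in the unit ball and mapping through the Cholesky factor of the covariance. Function name and defaults are assumptions.

```python
import numpy as np


def sample_in_ellipsoid(mu, cov, n, r=1.0, seed=0):
    """Draw n points whose Mahalanobis distance from mu (under cov) is at
    most r: sample uniformly in the unit ball, then map through the
    Cholesky factor of the covariance matrix."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    d = len(mu)
    L = np.linalg.cholesky(np.asarray(cov, dtype=float))
    z = rng.standard_normal((n, d))
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # onto the unit sphere
    z *= rng.random((n, 1)) ** (1.0 / d)           # uniform radius in the ball
    return mu + r * z @ L.T
```

Since cov = L Lᵀ, a point x = mu + r L z with |z| ≤ 1 satisfies (x − mu)ᵀ cov⁻¹ (x − mu) = r² |z|² ≤ r², so every sample lies inside the ellipsoid by construction.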

