Research on Expansion and Classification of Imbalanced Data Based on SMOTE Algorithm

Author(s):  
Shujuan Wang ◽  
Yuntao Dai ◽  
Jihong Shen ◽  
Jingxue Xuan

Abstract With the development of artificial intelligence, medical auxiliary diagnosis based on big data classification is regarded as a promising new technology. Because samples are collected under differing conditions, medical big data are often imbalanced. Class imbalance has been reported to severely hinder the classification performance of many standard learning algorithms and has attracted considerable attention from researchers in different fields. Focusing on this problem, an improved SMOTE algorithm based on the Normal distribution is proposed in this paper. Normally distributed random values are used to expand the minority class, so that new sample points fall closer to the center of the minority class with higher probability. In addition, the distribution of the generated data is controlled through the characteristics of the Normal distribution, and the influence of the statistical characteristics of the original data on the choice of the parameter (variance) is analyzed in terms of inter-class distance and sample variance. Experiments show that the proposed algorithm achieves better classification results than the original SMOTE algorithm on the Pima, WDBC, WPBC, Ionosphere, and Breast-cancer-wisconsin imbalanced datasets, as measured by AUC, OOB, F-value, and G-value.
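As a rough illustration of the mechanism described above (not the authors' exact procedure), the sketch below replaces SMOTE's uniform interpolation coefficient with a Normal draw, so synthetic points tend to land nearer the seed minority sample; the function name, the parameters mu and sigma, and the clipping to [0, 1] are assumptions.

```python
# Hypothetical sketch: SMOTE-style oversampling where the interpolation
# coefficient is drawn from a Normal distribution (clipped to [0, 1])
# instead of Uniform(0, 1), so synthetic points fall nearer the seed
# minority sample with higher probability.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def normal_smote(X_min, n_new, k=5, mu=0.5, sigma=0.2, random_state=0):
    """Generate n_new synthetic samples from the (n, d) minority array X_min."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # column 0 is the point itself
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        p = rng.integers(len(X_min))         # random minority seed sample
        q = idx[p, rng.integers(1, k + 1)]   # one of its k minority neighbours
        # Normally distributed gap, clipped to stay on the segment [x_p, x_q]
        lam = np.clip(rng.normal(mu, sigma), 0.0, 1.0)
        synthetic[i] = X_min[p] + lam * (X_min[q] - X_min[p])
    return synthetic
```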

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Shujuan Wang ◽  
Yuntao Dai ◽  
Jihong Shen ◽  
Jingxue Xuan

Abstract With the development of artificial intelligence, big data classification technology provides valuable support for research on medical auxiliary diagnosis. However, because samples are collected under different conditions, medical big data are often imbalanced. The class-imbalance problem has been reported as a serious obstacle to the classification performance of many standard learning algorithms. The SMOTE algorithm can generate sample points randomly to improve the imbalance ratio, but its application suffers from marginalized sample generation and blind parameter selection. Focusing on this problem, an improved SMOTE algorithm based on the Normal distribution is proposed in this paper, so that new sample points fall closer to the center of the minority class with higher probability and marginalization of the expanded data is avoided. Experiments show that expanding the imbalanced Pima, WDBC, WPBC, Ionosphere, and Breast-cancer-wisconsin datasets with the proposed algorithm yields better classification results than the original SMOTE algorithm. In addition, the parameter selection of the proposed algorithm is analyzed, and in our experiments the classification results are best when the parameters are chosen so that the distribution characteristics of the original data are best preserved.
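The "classification effect" in these abstracts is compared with measures such as the F-value and G-value listed above; since their exact definitions are not spelled out here, the snippet below assumes the conventional F-measure and the geometric mean of sensitivity and specificity.

```python
# Conventional-reading sketch of F-value and G-value (G-mean) for a binary
# imbalanced-classification comparison; the abstracts do not define them,
# so the standard definitions are assumed here.
import numpy as np

def f_value_g_mean(y_true, y_pred, positive=1):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    g_mean = np.sqrt(recall * specificity)
    return f_value, g_mean
```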


2021 ◽  
Vol 25 (5) ◽  
pp. 1169-1185
Author(s):  
Deniu He ◽  
Hong Yu ◽  
Guoyin Wang ◽  
Jie Li

This paper considers the problem of initializing active learning, in particular in an imbalanced data scenario, referred to as the class-imbalance active learning cold-start problem. A novel two-stage clustering-based active learning cold-start (ALCS) method is proposed. In the first stage, to separate minority-class instances from majority-class instances, a multi-center clustering is constructed based on a new inter-cluster tightness measure, grouping the data into multiple clusters. In the second stage, the initial training instances are selected from each cluster using an adaptive mechanism for determining candidate representative instances and a cluster-cyclic instance query mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on subsequent active learning.
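A hypothetical sketch of the two-stage idea follows, with KMeans standing in for the paper's multi-center clustering and distance-to-center standing in for its representative-selection mechanism; the function name, cluster count, and query budget are illustrative assumptions.

```python
# Hypothetical cold-start sketch: over-cluster the unlabelled pool so small
# (minority) groups get their own clusters, then query cluster representatives
# in a round-robin fashion until the labelling budget is spent.
import numpy as np
from sklearn.cluster import KMeans

def cold_start_queries(X_pool, n_clusters=10, budget=20, random_state=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_pool)
    # Order each cluster's members by distance to the cluster centre.
    per_cluster = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(X_pool[members] - km.cluster_centers_[c], axis=1)
        per_cluster.append(members[np.argsort(d)])
    # Cycle over clusters, taking the next-most-representative instance each pass.
    queries, round_idx = [], 0
    while len(queries) < budget:
        took = False
        for order in per_cluster:
            if round_idx < len(order) and len(queries) < budget:
                queries.append(order[round_idx])
                took = True
        if not took:
            break
        round_idx += 1
    return queries
```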


2018 ◽  
Vol 61 ◽  
pp. 863-905 ◽  
Author(s):  
Alberto Fernandez ◽  
Salvador Garcia ◽  
Francisco Herrera ◽  
Nitesh V. Chawla

The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data and is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems.
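As one example of the open-source implementations this abstract alludes to, the imbalanced-learn package ships SMOTE behind a simple fit_resample interface; a minimal usage sketch (assuming imbalanced-learn and scikit-learn are installed):

```python
# Usage sketch: SMOTE as shipped in the open-source imbalanced-learn package,
# applied to a synthetic 90/10 imbalanced binary dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))                        # roughly 900 vs 100
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))                    # classes balanced
```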


Author(s):  
Sajad Emamipour ◽  
Rasoul Sali ◽  
Zahra Yousefi

This article describes how class imbalance learning has attracted great attention in recent years, as many real-world applications suffer from this problem. An imbalanced class distribution occurs when the number of training examples of one class far surpasses that of the other class, often the class of greater interest. This problem can cause a marked deterioration in classifier performance, in particular on patterns belonging to the under-represented classes. To this end, the authors developed a hybrid model to address class imbalance learning with a focus on binary-class problems. The model combines the benefits of ensemble classifiers with a multi-objective feature selection technique to achieve higher classification performance, and it also yields non-dominated sets of features. The authors evaluate the proposed model by comparing its results with notable algorithms for the imbalanced data problem. Finally, they apply the model in the medical domain to predict the life expectancy of post-operative thoracic surgery patients.
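The "non-dominated sets of features" can be read as keeping the Pareto-optimal feature subsets under competing objectives; the sketch below assumes those objectives are classification error and subset size, which the abstract does not specify.

```python
# Hypothetical sketch: keep the non-dominated (Pareto-optimal) candidate feature
# subsets, scored on two objectives to be minimised. The objectives assumed here
# are (classification error, number of selected features).
import numpy as np

def non_dominated(scores):
    """scores: (n_subsets, 2) array; returns indices of non-dominated rows."""
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, s in enumerate(scores):
        dominated = np.any(np.all(scores <= s, axis=1) & np.any(scores < s, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# e.g. (error, number of features) for four candidate subsets
print(non_dominated([(0.20, 5), (0.25, 3), (0.18, 9), (0.25, 6)]))  # -> [0, 1, 2]
```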


2017 ◽  
Vol 42 (2) ◽  
pp. 149-176 ◽  
Author(s):  
Szymon Wojciechowski ◽  
Szymon Wilk

Abstract In this paper we describe the results of an experimental study in which we examined the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers, applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically vary factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare, and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was achieved by non-symbolic classifiers, in particular k-NN classifiers (with 1 or 3 neighbors, denoted 1NN and 3NN) and SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
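A reproduction-style sketch of this kind of comparison (not the paper's exact protocol or data): random undersampling versus oversampling for 1NN, 3NN, and SVM on a synthetic imbalanced set, using scikit-learn and imbalanced-learn.

```python
# Sketch: compare random undersampling and oversampling for 1NN, 3NN and SVM
# on a synthetic imbalanced dataset, scored by balanced accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

classifiers = {"1NN": KNeighborsClassifier(1),
               "3NN": KNeighborsClassifier(3),
               "SVM": SVC()}
samplers = {"under": RandomUnderSampler(random_state=0),
            "over": RandomOverSampler(random_state=0)}

for s_name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    for c_name, clf in classifiers.items():
        score = balanced_accuracy_score(y_te, clf.fit(X_res, y_res).predict(X_te))
        print(f"{s_name:5s} + {c_name}: {score:.3f}")
```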


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Lei Wang ◽  
Lei Zhao ◽  
Guan Gui ◽  
Baoyu Zheng ◽  
Ruochen Huang

Class imbalance problems often reduce the classification performance of most standard classifiers. Many methods have been developed to address them, such as cost-sensitive learning, the synthetic minority oversampling technique (SMOTE), and random oversampling (ROS). However, existing methods still suffer from the possible loss of useful information and from overfitting. To address these issues, we propose an adaptive ensemble method whose self-adaptation is driven by the average Euclidean distance between test data and training data, computed with the k-nearest neighbors (KNN) algorithm. Simulation results confirm that the proposed method performs better than existing ensemble methods.
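The one concrete ingredient the abstract names is the average Euclidean distance between a test sample and its nearest training samples, obtained via a KNN search; only that computation is sketched below, since how it drives the ensemble's self-adaptation is not specified here.

```python
# Sketch: per test sample, the average Euclidean distance to its k nearest
# training samples, computed with a KNN search.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def average_knn_distance(X_train, X_test, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(X_test)       # (n_test, k) Euclidean distances
    return dist.mean(axis=1)              # average distance per test sample
```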


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5818
Author(s):  
Axiu Mao ◽  
Endai Huang ◽  
Haiming Gan ◽  
Rebecca S. V. Parkes ◽  
Weitao Xu ◽  
...  

With the recent advances in deep learning, wearable sensors have increasingly been used in automated animal activity recognition. However, there are two major challenges in improving recognition performance: multi-modal feature fusion and imbalanced data modeling. In this study, to improve classification performance for equine activities while tackling these two challenges, we developed a cross-modality interaction network (CMI-Net) comprising a dual convolutional neural network architecture and a cross-modality interaction module (CMIM). The CMIM adaptively recalibrated the temporal- and axis-wise features in each modality by leveraging multi-modal information to achieve deep inter-modality interaction. A class-balanced (CB) focal loss was adopted to supervise the training of CMI-Net and alleviate the class imbalance problem. Motion data were acquired from six neck-attached inertial measurement units on six horses. The CMI-Net was trained and verified with leave-one-out cross-validation. The results demonstrated that our CMI-Net outperformed the existing algorithms with high precision (79.74%), recall (79.57%), F1-score (79.02%), and accuracy (93.37%). The adoption of the CB focal loss improved the performance of CMI-Net, with increases of 2.76%, 4.16%, and 3.92% in precision, recall, and F1-score, respectively. In conclusion, CMI-Net and the CB focal loss effectively enhanced equine activity classification performance on imbalanced multi-modal sensor data.
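For reference, a class-balanced focal loss of the kind adopted here is commonly built from "effective number of samples" weights combined with the focal term; the PyTorch sketch below follows that common formulation, and the hyper-parameters beta and gamma and the normalization are assumptions rather than the paper's settings.

```python
# Sketch of a class-balanced (CB) focal loss: effective-number class weights
# w_c = (1 - beta) / (1 - beta^{n_c}) combined with the focal term
# (1 - p_t)^gamma * (-log p_t).
import torch
import torch.nn.functional as F

def cb_focal_loss(logits, targets, samples_per_class, beta=0.999, gamma=2.0):
    """logits: (N, C) float tensor; targets: (N,) int64 class indices."""
    counts = torch.as_tensor(samples_per_class, dtype=torch.float32, device=logits.device)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, counts))   # effective-number weights
    weights = weights / weights.sum() * len(counts)            # normalise to mean 1
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log prob of true class
    pt = log_pt.exp()
    focal = (1.0 - pt).pow(gamma) * (-log_pt)                  # focal term
    return (weights[targets] * focal).mean()
```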


Author(s):  
Ying Wang ◽  
Yiding Liu ◽  
Minna Xia

Big data is characterized by multiple sources and heterogeneity. Based on the Hadoop and Spark big data platform, a hybrid forest fire analysis system is built in this study. The platform combines big data analysis and processing technology and draws on research results from related technical fields, such as forest fire monitoring. In this system, Hadoop's HDFS is used to store all kinds of data, the Spark module provides various big data analysis methods, and visualization tools such as ECharts, ArcGIS, and Unity3D are used to visualize the analysis results. Finally, an experiment on forest fire point detection is designed to verify the feasibility and effectiveness of the platform and to provide guidance for follow-up research and for establishing a big data platform for forest fire monitoring and visualized early warning. The experiment has two shortcomings: more data types should be included, and compatibility would be better if the original data were converted to XML format. These issues are expected to be addressed in follow-up research.
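A purely illustrative PySpark sketch of the kind of analysis such a platform runs, reading monitoring records from HDFS and aggregating candidate fire points; the path, column names, and threshold are hypothetical, not the study's actual schema or detection logic.

```python
# Illustrative only: read monitoring records from HDFS with Spark and count
# candidate fire points per region. Path, columns and threshold are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("forest-fire-analysis").getOrCreate()

df = spark.read.csv("hdfs:///data/forest/monitoring.csv", header=True, inferSchema=True)
hotspots = (df
            .filter(F.col("brightness") > 330)        # hypothetical detection threshold
            .groupBy("region")
            .count()
            .orderBy(F.col("count").desc()))
hotspots.show()
```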


2021 ◽  
Vol 80 ◽  
pp. 103659
Author(s):  
Meiling Guo ◽  
Chao Bian ◽  
Lingcheng Meng ◽  
Yan Wang
