Imbalanced Data
Recently Published Documents





Classification is a supervised learning task that assigns instances to groups according to their class labels, and classification algorithms are trained on labeled datasets. The dataset therefore plays an important role. If instances of one class (the majority class) greatly outnumber instances of another (the minority class), so that a classifier struggles to learn the characteristics of the minority class, the dataset is termed imbalanced. Such datasets lead to biased predictions and misclassification in the real world: a model trained on them may report very high accuracy during training yet, being unfamiliar with minority class instances, fail to predict the minority class. This survey presents the techniques researchers have proposed for handling imbalanced data and compares them on the basis of F-measure.
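The gap between accuracy and F-measure described above can be illustrated with a small sketch. The data here are hypothetical (990 majority and 10 minority labels), and `f_measure` and `accuracy` are illustrative helper names, not functions from the surveyed work.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: report 0 rather than dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 990 majority (0) and 10 minority (1).
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred, positive=1)
print(f"accuracy={acc:.2f}, minority F-measure={f1:.2f}")
```

On this toy data the always-majority classifier reaches 99% accuracy while its minority-class F-measure is zero, which is why F-measure is the more informative basis for comparison.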

Sensors ◽ 2021 ◽ Vol 21 (19) ◽ pp. 6616
Leehter Yao ◽ Tung-Bin Lin

Sensing data are often imbalanced across classes, and oversampling the minority class is an effective remedy. In this paper, an oversampling method called evolutionary Mahalanobis distance oversampling (EMDO) is proposed for multi-class imbalanced data classification. EMDO uses a set of ellipsoids to approximate the decision regions of the minority class. Multi-objective particle swarm optimization (MOPSO) is integrated with the Gustafson–Kessel algorithm to learn the size, center, and orientation of every ellipsoid. Synthetic minority samples are generated within each ellipsoid based on Mahalanobis distance, and the number generated in each ellipsoid is determined by the density of minority samples it contains. Computer simulations indicate that EMDO outperforms most of the widely used oversampling schemes.
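As a rough illustration of generating synthetic samples inside an ellipsoid bounded by Mahalanobis distance, the sketch below handles a single 2-D ellipsoid. It is not EMDO: the ellipsoid is taken from the plain sample mean and covariance rather than learned with MOPSO and Gustafson–Kessel clustering, points are drawn uniformly inside the ellipsoid (an assumption), and the function names and the radius of 2.0 are hypothetical.

```python
import math
import random


def fit_ellipsoid_2d(points):
    """Sample mean and covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of x from mean, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(d2, 0.0))


def sample_in_ellipsoid(mean, cov, radius, n_samples, rng):
    """Draw points whose Mahalanobis distance from `mean` is at most `radius`."""
    sxx, sxy, syy = cov
    # Cholesky factor L of the covariance (cov = L * L^T), written out for 2x2.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    samples = []
    for _ in range(n_samples):
        # Uniform point in the unit disk, then mapped through radius * L.
        r = math.sqrt(rng.random())
        theta = 2 * math.pi * rng.random()
        u, v = r * math.cos(theta), r * math.sin(theta)
        samples.append((mean[0] + radius * l11 * u,
                        mean[1] + radius * (l21 * u + l22 * v)))
    return samples


# Hypothetical minority-class samples (must not be collinear).
minority = [(1.0, 2.0), (1.5, 2.4), (2.0, 3.1), (2.5, 3.0), (3.0, 4.2)]
mean, cov = fit_ellipsoid_2d(minority)
synthetic = sample_in_ellipsoid(mean, cov, radius=2.0, n_samples=20,
                                rng=random.Random(0))
print(len(synthetic), "synthetic minority samples generated")
```

Mapping a unit-disk point through the Cholesky factor guarantees that each synthetic point stays within the chosen Mahalanobis radius, so the samples follow the orientation and spread of the minority data.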

2021 ◽ pp. 107590
Yang-Geng Fu ◽ Ji-Feng Ye ◽ Ze-Feng Yin ◽ Long-Jiang Chen ◽ Ying-Ming Wang

Himani Tiwari

Abstract: The class imbalance problem is one of the most challenging problems faced by the machine learning community. Imbalance refers to the number of instances in one class being relatively low compared with the other classes. A number of over-sampling and under-sampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the class imbalance issue and examines various balancing methods for dealing with it. To illustrate the differences, an experiment is conducted on multiple simulated data sets, comparing the performance of these oversampling methods with different classifiers under various evaluation criteria. In addition, the effect of parameters such as the number of features and the imbalance ratio on classifier performance is evaluated. Keywords: Imbalanced learning, Over-sampling methods, Under-sampling methods, Classifier performances, Evaluation metrics
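The simplest forms of the two balancing families mentioned above are random over-sampling and random under-sampling. The following is a minimal sketch on hypothetical data; the function names are illustrative, and the study itself may compare more elaborate methods.

```python
import random
from collections import Counter


def random_oversample(X, y, rng):
    """Duplicate random samples of smaller classes until every class matches the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            i = rng.choice(idx)  # sample with replacement
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to match the smallest class."""
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical data set: 10 majority (0) and 2 minority (1) samples.
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2

rng = random.Random(42)
X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print("oversampled:", Counter(y_over))
print("undersampled:", Counter(y_under))
```

Over-sampling keeps all original data at the cost of duplicated minority points, while under-sampling discards majority samples; which trade-off is preferable depends on the data set.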

Zhang Yan ◽ Du Hongle ◽ Ke Gang ◽ Zhang Lin ◽ Yeh-Cheng Chen

Ehsan Aminian ◽ Rita P. Ribeiro ◽ João Gama

2021
Ming Li ◽ Dezhi Han ◽ Dun Li ◽ Han Liu ◽ Chin-Chen Chang

Abstract: Network intrusion detection, which relies mainly on extracting and analyzing network traffic features, plays a vital role in network security protection. Current feature extraction and analysis for network intrusion detection mostly uses deep learning algorithms. However, deep learning requires substantial training resources and handles imbalanced data sets poorly. In this paper, a deep learning model (MFVT) based on a feature fusion network and the Vision Transformer architecture is proposed, which improves the handling of imbalanced data sets and reduces the sample data needed for training. In addition, to improve on traditional raw traffic feature extraction methods, a new raw traffic feature extraction method (CRP) is proposed; CRP uses the PCA algorithm to reduce all the processed digital traffic features to a specified dimension. On the IDS 2017 and IDS 2012 datasets, ablation experiments show that the proposed MFVT model performs significantly better than other network intrusion detection models, with detection accuracy reaching the state-of-the-art level. When the MFVT model is combined with the CRP algorithm, the detection accuracy is further improved to 99.99%.

2021 ◽ Vol 11 (1)
Malihe Javidi ◽ Saeid Abbaasi ◽ Sara Naybandi Atashi ◽ Mahdi Jampour

Abstract: With the emergence of the novel coronavirus disease at the end of 2019, several approaches were proposed to help physicians detect the disease, such as using deep learning to recognize lung involvement based on the pattern of pneumonia. These approaches rely on analyzing CT images and exploring COVID-19 pathologies in the lung. Most of the successful methods are based on deep learning, which is the state of the art. Nevertheless, a major drawback of deep approaches is their need for many samples, which are not always available. This work proposes a combined deep architecture that benefits from both DenseNet and CapsNet. To help the deep model generalize, we propose a regularization term with far fewer parameters. Network convergence improved significantly, especially when the amount of training data is small. We also propose a novel cost-sensitive loss function for imbalanced data that makes our model feasible when the number of positive samples is limited. These contributions make our approach more effective in real-world situations with imbalanced data, which are common in hospitals. We analyzed our approach on two publicly available datasets, HUST and COVID-CT, with different protocols. In the first HUST protocol, we followed the original paper's setup and outperformed it. With the second HUST protocol, we show our approach's superiority on imbalanced data. Finally, with three different validations on COVID-CT, we provide evaluations with a small amount of data, along with a comparison with the state of the art.
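To make the idea of a cost-sensitive loss concrete, the sketch below shows a generic class-weighted binary cross-entropy. This is a common textbook formulation and is not the loss proposed in the abstract, whose exact form is not given here; `weighted_bce` and the weight of 9 are illustrative assumptions.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with errors on positives scaled by `pos_weight`."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical setting: 1 positive for every 9 negatives, so positives are
# weighted by the inverse class frequency ratio (an assumed heuristic).
pos_weight = 9.0

# Cost of confidently missing a positive vs. confidently flagging a negative.
missed_positive = weighted_bce([1], [0.1], pos_weight)
false_alarm = weighted_bce([0], [0.9], pos_weight)
print(f"missed positive: {missed_positive:.3f}, false alarm: {false_alarm:.3f}")
```

With the weight applied, a missed positive costs nine times as much as an equally confident false alarm, which pushes a model trained on this loss to pay attention to the scarce positive class.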

2021 ◽ Vol 11 (4) ◽ pp. 1-13
Khan Md. Hasib ◽ Nurul Akter Towhid ◽ Md Rafiqul Islam

Imbalanced data presents many difficulties, as most learners will be biased toward the majority class and, in severe cases, may disregard the minority class entirely. Over the last few decades, class imbalance has been extensively researched using traditional machine learning techniques. However, there is relatively little analytical research on class imbalance in deep learning. In this article, the authors classify imbalanced data by combining a sampling method with a deep learning method. They propose a novel sampling-based deep learning method (HSDLM) to address the class imbalance problem. They preprocess the data with label encoding and remove noisy data with the edited nearest neighbor (ENN) under-sampling algorithm. They also balance the data using the over-sampling technique SMOTE and apply three types of long short-term memory (LSTM) networks, a deep learning classifier, in parallel. The experimental findings indicate that HSDLM is a promising solution for strongly imbalanced datasets.
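The two resampling steps named above, ENN cleaning followed by SMOTE, can be sketched in plain Python as follows. This is a simplified illustration on hypothetical 2-D data, not the authors' HSDLM pipeline: the LSTM stage is omitted, ENN is restricted to removing majority-class samples (an assumption), and all function names and parameters are illustrative.

```python
import math
import random


def k_nearest(i, X, candidates, k):
    """Indices of the k points among `candidates` closest to X[i], excluding i."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def edited_nearest_neighbours(X, y, removable, k=3):
    """Drop samples of `removable` classes whose k nearest neighbours mostly disagree."""
    all_idx = range(len(X))
    keep = []
    for i in all_idx:
        if y[i] not in removable:
            keep.append(i)  # protected class: never removed
            continue
        nbrs = k_nearest(i, X, all_idx, k)
        agree = sum(1 for j in nbrs if y[j] == y[i])
        if agree * 2 > len(nbrs):  # strict majority of neighbours share the label
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote(X, y, minority, n_new, k=3, rng=None):
    """Add n_new points interpolated between minority samples and their minority neighbours."""
    rng = rng or random.Random(0)
    min_idx = [i for i, lab in enumerate(y) if lab == minority]
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(min_idx)
        j = rng.choice(k_nearest(i, X, min_idx, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return X + synthetic, y + [minority] * n_new


# Hypothetical data: a majority cluster, one noisy majority point sitting
# inside the minority cluster, and four minority points.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.0],
     [5.05, 5.05],
     [5.0, 5.0], [5.2, 5.1], [4.9, 5.2], [5.1, 4.8]]
y = [0] * 7 + [1] * 4

X_clean, y_clean = edited_nearest_neighbours(X, y, removable={0}, k=3)
X_bal, y_bal = smote(X_clean, y_clean, minority=1, n_new=2, k=3,
                     rng=random.Random(1))
print("after ENN:", y_clean.count(0), "majority,", y_clean.count(1), "minority")
print("after SMOTE:", y_bal.count(0), "majority,", y_bal.count(1), "minority")
```

Cleaning before over-sampling means SMOTE interpolates only between minority points and is not misled by majority noise embedded in the minority region.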

2021 ◽ Vol 11 (18) ◽ pp. 8546
Mohamed S. Kraiem ◽ Fernando Sánchez-Hernández ◽ María N. Moreno-García

In many application domains, such as medicine, information retrieval, cybersecurity, and social media, the datasets used for inducing classification models often have an unequal distribution of instances across classes. This situation, known as imbalanced data classification, causes low predictive performance for minority class examples. The prediction model is therefore unreliable even though its overall accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies for dealing with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors, mainly related to intrinsic data characteristics such as imbalance ratio, dataset size and dimensionality, overlap between classes, or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously over a wide range of values, since they are induced from widely varied datasets that cover a broad spectrum of conditions. This differs from most studies, which focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.
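Some of the dataset characteristics listed above are straightforward to compute. The sketch below covers only imbalance ratio, dataset size, and dimensionality on hypothetical data; class overlap and borderline-example measures are not included, and `dataset_characteristics` is an illustrative name rather than anything from the study.

```python
from collections import Counter


def dataset_characteristics(X, y):
    """Basic descriptors of a labeled dataset given as a list of feature rows and labels."""
    counts = Counter(y)
    return {
        "n_samples": len(X),
        "n_features": len(X[0]),
        "class_counts": dict(counts),
        # Imbalance ratio: size of the largest class over size of the smallest.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }


# Hypothetical dataset: 100 rows with 3 features, 90 majority and 10 minority labels.
X = [[float(i), float(i % 3), 1.0] for i in range(100)]
y = ["majority"] * 90 + ["minority"] * 10

stats = dataset_characteristics(X, y)
print(stats)
```

Descriptors of this kind could serve as inputs to a model that recommends a resampling strategy, though the selection models themselves are specific to the study.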
