Addressing the Classification with Imbalanced Data: Open Problems and New Challenges on Class Distribution

Author(s):  
A. Fernández ◽  
S. García ◽  
F. Herrera
Author(s):  
YANMIN SUN ◽  
ANDREW K. C. WONG ◽  
MOHAMED S. KAMEL

Imbalanced class distributions significantly limit the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with respect to: the application domains; the nature of the problem; the learning difficulties with standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.


2021 ◽  
Author(s):  
Tetiana Biloborodova ◽  
Inna Skarga-Bandurova ◽  
Mark Koverha ◽  
Illia Skarha-Bandurov ◽  
Yelyzaveta Yevsieieva

Medical image classification and diagnosis based on machine learning have made significant progress and are gradually penetrating the healthcare industry. However, characteristics of medical data, such as relatively small datasets for rare diseases or imbalanced class distributions for rare conditions, significantly restrain their adoption and reuse. Imbalanced datasets lead to difficulties in learning and in obtaining accurate predictive models. This paper follows the FAIR paradigm and proposes a technique for the alignment of class distribution, which improves image classification performance on imbalanced data and supports data reuse. Experiments on an acne disease dataset indicate that the proposed framework outperforms the baselines and achieves up to a 5% improvement in image classification.


2021 ◽  
Vol 12 (1) ◽  
pp. 1-17
Author(s):  
Swati V. Narwane ◽  
Sudhir D. Sawarkar

Class imbalance is a major hurdle for machine learning-based systems. The data set is the backbone of machine learning and must be studied in order to handle class imbalance. The purpose of this paper is to investigate the effect of class imbalance on data sets. The proposed methodology determines model accuracy as a function of class distribution. To find possible solutions, the behaviour of an imbalanced data set was investigated. The study considers two case studies in which the data set is divided so that the class distribution ranges from balanced to unbalanced. Standard machine learning algorithms were evaluated on the training and test data. Model accuracy for each class distribution was measured on the training data set. The built model was then tested on each binary class individually. The results show that addressing class imbalance is essential for improving system performance. The study concludes that the system produces results biased towards the majority class. In the future, the multiclass imbalance problem can be studied using advanced algorithms.
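The bias towards the majority class described in this abstract can be illustrated with a minimal sketch. It is not the authors' methodology: the 95/5 label split and the always-predict-majority "model" are hypothetical, chosen only to show how overall accuracy can look high while the minority class is never recognised.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall computed separately for each class present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical toy labels: 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
# Balanced accuracy: the unweighted mean of the per-class recalls.
balanced_acc = sum(recalls.values()) / len(recalls)

print(f"accuracy={acc:.2f}, recalls={recalls}, balanced accuracy={balanced_acc:.2f}")
```

Testing each class individually, as the abstract describes, corresponds here to reading the per-class recalls rather than the single accuracy figure.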


Author(s):  
Shivani Vasantbhai Vora ◽  
Rupa G. Mehta ◽  
Shreyas Kishorkumar Patel

Continuously growing technology enhances creativity, simplifies human lives, and offers the possibility of anticipating and satisfying unmet needs. Understanding emotions is a crucial part of human behavior. Machines must deeply understand emotions to be able to predict human needs. Most tweets carry the sentiments of their users, and tweet data sets inherit an imbalanced class distribution. Most machine learning (ML) algorithms are likely to become biased towards the majority classes. The imbalanced distribution of classes has gained extensive attention because it raises many research challenges and demands efficient approaches for handling imbalanced data sets. The strategies used for balancing the class distribution in the case study are handling redundant data, resampling training data, and data augmentation. Six methods related to these techniques were examined in the case study. Experiments on the Twitter dataset show that the merging-minority-classes and shuffle-sentence methods outperform the other techniques.
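The abstract does not spell out how its shuffle-sentence method works. The sketch below shows one plausible reading, stated as an assumption: new minority-class samples are created by reordering the sentences of existing ones. The function names, the regex-based sentence splitter, and the toy tweets are all hypothetical.

```python
import random
import re

def shuffle_sentences(text, rng):
    """Return a copy of text with its sentences in a random order."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return text  # nothing to reorder
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return " ".join(shuffled)

def augment_minority(samples, minority_label, rng):
    """Append one sentence-shuffled copy of every minority-class sample."""
    extra = [(shuffle_sentences(text, rng), label)
             for text, label in samples if label == minority_label]
    return samples + extra

# Hypothetical toy data: three "joy" tweets and one "grief" tweet.
samples = [
    ("Great day. Love it!", "joy"),
    ("So sad. Why now? Not fair.", "grief"),
    ("Nice weather.", "joy"),
    ("Happy!", "joy"),
]
rng = random.Random(0)  # seeded so the run is repeatable
augmented = augment_minority(samples, "grief", rng)
print(augmented[-1])
```

A shuffle may occasionally reproduce the original order, so a real pipeline would likely need to deduplicate the augmented samples.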


Algorithms ◽  
2019 ◽  
Vol 12 (12) ◽  
pp. 256 ◽  
Author(s):  
Laurent Bulteau ◽  
Mathias Weller

Bioinformatics regularly poses new challenges to algorithm engineers and theoretical computer scientists. This work surveys recent developments of parameterized algorithms and complexity for important NP-hard problems in bioinformatics. We cover sequence assembly and analysis, genome comparison and completion, and haplotyping and phylogenetics. Aside from reporting the state of the art, we give challenges and open problems for each topic.


2011 ◽  
Vol 271-273 ◽  
pp. 1291-1296
Author(s):  
Jin Wei Zhang ◽  
Hui Juan Lu ◽  
Wu Tao Chen ◽  
Yi Lu

A classifier built from a data set with a highly skewed class distribution generally predicts an unknown sample as the majority class much more frequently than as the minority class. This is because the classifier is designed to maximize classification accuracy. We compare three classification methods for data sets in which the class distribution is imbalanced and the misclassification costs are non-uniform: a cost-sensitive learning method whose misclassification cost is embedded in the algorithm, an over-sampling method, and an under-sampling method. The aim is to determine which one produces the best overall classification under any circumstance. We draw the following conclusions: 1. Cost-sensitive learning is suitable for the classification of imbalanced data sets. It outperforms the sampling methods overall and is more stable than them, except when the data set is quite small. 2. If the data set is highly skewed or quite small, over-sampling methods may be better.
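The three families of methods compared in this abstract can be sketched as follows. This is a simplified illustration, not the paper's experimental setup: random duplication and random removal stand in for the over- and under-sampling methods, and inverse-frequency class weights stand in for the misclassification costs that a cost-sensitive learner would embed in its algorithm. All function names and the 90/10 toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def group_by_label(samples):
    """Group (features, label) pairs by their label."""
    groups = defaultdict(list)
    for x, y in samples:
        groups[y].append((x, y))
    return groups

def random_oversample(samples, rng):
    """Duplicate random samples of smaller classes up to the largest class size."""
    groups = group_by_label(samples)
    target = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)
        out.extend(rng.choices(g, k=target - len(g)))  # sampling with replacement
    return out

def random_undersample(samples, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    groups = group_by_label(samples)
    target = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(rng.sample(g, target))  # sampling without replacement
    return out

def inverse_frequency_costs(samples):
    """Per-class cost n / (k * count): rarer classes get a higher cost."""
    counts = Counter(y for _, y in samples)
    n, k = len(samples), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical toy data: 90 negative and 10 positive samples.
data = [(i, "neg") for i in range(90)] + [(i, "pos") for i in range(10)]
rng = random.Random(0)

over = random_oversample(data, rng)
under = random_undersample(data, rng)
costs = inverse_frequency_costs(data)

print(Counter(y for _, y in over))
print(Counter(y for _, y in under))
print(costs)
```

Under-sampling discards 80 of the 90 negative samples here, which is one intuition for why the abstract reports sampling methods as less stable, while over-sampling keeps every original sample.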


2012 ◽  
Vol 2 (1) ◽  
pp. 45-63 ◽  
Author(s):  
Iñaki Albisua ◽  
Olatz Arbelaitz ◽  
Ibai Gurrutxaga ◽  
Aritz Lasarguren ◽  
Javier Muguerza ◽  
...  

Author(s):  
Germina K. Augusthy

A set-valuation of a graph G=(V,E) assigns to the vertices or edges of G elements of the power set of a given nonempty set X, subject to certain conditions. A set-indexer of G is an injective set-valuation f:V(G)→2^X such that the induced set-valuation f^⊕:E(G)→2^X on the edges of G, defined by f^⊕(uv)=f(u)⊕f(v) ∀uv∈E(G), is also injective, where ⊕ denotes the symmetric difference of subsets of X. Set-valued graphs such as set-graceful graphs, topological set-graceful graphs, set-sequential graphs, and set-magic graphs are discussed. Set-valuations associated with a metric on pairs of vertices are also defined: for a distance-pattern distinguishing (DPD) set of a graph (open-distance-pattern uniform (ODPU) set of a graph), let ∅≠M⊆V(G) and, for each u∈V(G), let f_M(u)={d(u,v): v∈M} be the distance pattern of u with respect to the marker set M. If f_M is injective (uniform), then M is a DPD (ODPU) set of G and G is a DPD (ODPU) graph. This chapter briefly reports the existing results, new challenges, open problems, and conjectures that abound in this topic.
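The set-indexer definition above can be checked mechanically on small graphs. The following is a minimal sketch: the function name, the dictionary representation of f, and the four-vertex path example are assumptions made for illustration and do not come from the chapter.

```python
def is_set_indexer(edges, f):
    """Check the set-indexer conditions for a vertex labelling f.

    edges: iterable of (u, v) pairs; f: dict mapping each vertex to a subset of X.
    Returns True when f is injective on vertices and the induced edge labelling
    f(u) XOR f(v) (symmetric difference) is injective on edges.
    """
    vertex_sets = [frozenset(s) for s in f.values()]
    if len(set(vertex_sets)) != len(vertex_sets):
        return False  # two vertices share a label
    edge_sets = [frozenset(f[u]) ^ frozenset(f[v]) for u, v in edges]
    return len(set(edge_sets)) == len(edge_sets)

# Hypothetical example: the path a-b-c-d with X = {1, 2, 3}.
path = [("a", "b"), ("b", "c"), ("c", "d")]

# Edge labels are {1}, {2}, {3}: all distinct, so this is a set-indexer.
good = {"a": set(), "b": {1}, "c": {1, 2}, "d": {1, 2, 3}}

# Vertex labels are distinct, but edges ab and cd both receive {1}.
bad = {"a": set(), "b": {1}, "c": {2}, "d": {1, 2}}

print(is_set_indexer(path, good), is_set_indexer(path, bad))
```

The second labelling shows why the edge condition is stated separately: injectivity on the vertices does not by itself make the induced edge valuation injective.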

