scholarly journals Raking and Relabeling for Imbalanced Data

Author(s):  
Seunghwan Park ◽  
Hae-Wwan Lee ◽  
Jongho Im

<div>We consider the binary classification of imbalanced data. A dataset is imbalanced if the proportion of classes are heavily skewed. Imbalanced data classification is often challengeable, especially for high-dimensional data, because unequal classes deteriorate classifier performance. Under sampling the majority class or oversampling the minority class are popular methods to construct balanced samples, facilitating classification performance improvement. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce comparable performance as existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice, considering that it does not require density estimation for synthetic data generation in oversampling and is not bothered by mixed-type variables. In addition, the proposed sampling strategy is robust to classifiers in the sense that classification performance is not sensitive to choosing the classifiers.</div>

2022 ◽  
Author(s):  
Seunghwan Park ◽  
Hae-Wwan Lee ◽  
Jongho Im

<div>We consider the binary classification of imbalanced data. A dataset is imbalanced if the proportion of classes are heavily skewed. Imbalanced data classification is often challengeable, especially for high-dimensional data, because unequal classes deteriorate classifier performance. Under sampling the majority class or oversampling the minority class are popular methods to construct balanced samples, facilitating classification performance improvement. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce comparable performance as existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice, considering that it does not require density estimation for synthetic data generation in oversampling and is not bothered by mixed-type variables. In addition, the proposed sampling strategy is robust to classifiers in the sense that classification performance is not sensitive to choosing the classifiers.</div>


2021 ◽  
Author(s):  
Petros Barmpas ◽  
Sotiris Tasoulis ◽  
Aristidis G. Vrahatis ◽  
Panagiotis Anagnostou ◽  
Spiros Georgakopoulos ◽  
...  

1AbstractRecent technological advancements in various domains, such as the biomedical and health, offer a plethora of big data for analysis. Part of this data pool is the experimental studies that record various and several features for each instance. It creates datasets having very high dimensionality with mixed data types, with both numerical and categorical variables. On the other hand, unsupervised learning has shown to be able to assist in high-dimensional data, allowing the discovery of unknown patterns through clustering, visualization, dimensionality reduction, and in some cases, their combination. This work highlights unsupervised learning methodologies for large-scale, high-dimensional data, providing the potential of a unified framework that combines the knowledge retrieved from clustering and visualization. The main purpose is to uncover hidden patterns in a high-dimensional mixed dataset, which we achieve through our application in a complex, real-world dataset. The experimental analysis indicates the existence of notable information exposing the usefulness of the utilized methodological framework for similar high-dimensional and mixed, real-world applications.


2019 ◽  
Vol 2019 ◽  
pp. 1-13 ◽  
Author(s):  
Wenhao Xie ◽  
Gongqian Liang ◽  
Zhonghui Dong ◽  
Baoyu Tan ◽  
Baosheng Zhang

The imbalance data refers to at least one of its classes which is usually outnumbered by the other classes. The imbalanced data sets exist widely in the real world, and the classification for them has become one of the hottest issues in the field of data mining. At present, the classification solutions for imbalanced data sets are mainly based on the algorithm-level and the data-level. On the data-level, both oversampling strategies and undersampling strategies are used to realize the data balance via data reconstruction. SMOTE and Random-SMOTE are two classic oversampling algorithms, but they still possess the drawbacks such as blind interpolation and fuzzy class boundaries. In this paper, an improved oversampling algorithm based on the samples’ selection strategy for the imbalanced data classification is proposed. On the basis of the Random-SMOTE algorithm, the support vectors (SV) are extracted and are treated as the parent samples to synthesize the new examples for the minority class in order to realize the balance of the data. Lastly, the imbalanced data sets are classified with the SVM classification algorithm. F-measure value, G-mean value, ROC curve, and AUC value are selected as the performance evaluation indexes. Experimental results show that this improved algorithm demonstrates a good classification performance for the imbalanced data sets.


2012 ◽  
Vol 8 (2) ◽  
pp. 44-63 ◽  
Author(s):  
Baoxun Xu ◽  
Joshua Zhexue Huang ◽  
Graham Williams ◽  
Qiang Wang ◽  
Yunming Ye

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach using randomly sampling a few features in the subspace is not suitable for high dimensional data consisting of thousands of features, because such data often contains many features which are uninformative to classification, and the random sampling often doesn’t include informative features in the selected subspaces. Consequently, classification performance of the random forest model is significantly affected. In this paper, the authors propose an improved random forest method which uses a novel feature weighting method for subspace selection and therefore enhances classification performance over high-dimensional data. A series of experiments on 9 real life high dimensional datasets demonstrated that using a subspace size of features where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.


2021 ◽  
Vol 25 (5) ◽  
pp. 1169-1185
Author(s):  
Deniu He ◽  
Hong Yu ◽  
Guoyin Wang ◽  
Jie Li

The problem of initialization of active learning is considered in this paper. Especially, this paper studies the problem in an imbalanced data scenario, which is called as class-imbalance active learning cold-start. The novel method is two-stage clustering-based active learning cold-start (ALCS). In the first stage, to separate the instances of minority class from that of majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, thus the data is grouped into multiple clusters. Then, in the second stage, the initial training instances are selected from each cluster based on an adaptive candidate representative instances determination mechanism and a clusters-cyclic instance query mechanism. The comprehensive experiments demonstrate the effectiveness of the proposed method from the aspects of class coverage, classification performance, and impact on active learning.


2013 ◽  
Vol 443 ◽  
pp. 741-745
Author(s):  
Hu Li ◽  
Peng Zou ◽  
Wei Hong Han ◽  
Rong Ze Xia

Many real world data is imbalanced, i.e. one category contains significantly more samples than other categories. Traditional classification methods take different categories equally and are often ineffective. Based on the comprehensive analysis of existing researches, we propose a new imbalanced data classification method based on clustering. The method clusters both majority class and minority class at first. Then, clustered minority class will be over-sampled by SMOTE while clustered majority class be under-sampled randomly. Through clustering, the proposed method can avoid the loss of useful information while resampling. Experiments on several UCI datasets show that the proposed method can effectively improve the classification results on imbalanced data.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Jing Zhang ◽  
Guang Lu ◽  
Jiaquan Li ◽  
Chuanwen Li

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks due to high dimensionality and small sample size of microarray data. Feature selection is necessary in the process of constructing the model to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough set is proposed. Pathway knowledge is used to select feature subsets, and rough set based on intersection neighborhood is then used to select important feature in each subset, since it can select features without redundancy and deals with numerical features directly. In order to improve the diversity among base classifiers and the efficiency of classification, it is necessary to select part of base classifiers. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster will be selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets showed that the proposed method achieved better classification performance than existing ensemble models.


Sign in / Sign up

Export Citation Format

Share Document