imbalanced data classification Latest Research Papers

Research on Imbalanced Data Classification Based on Classroom-Like Generative Adversarial Networks

Neural Computation ◽

10.1162/neco_a_01470 ◽

2022 ◽

pp. 1-29

Author(s):

Yancheng Lv ◽

Lin Lin ◽

Jie Liu ◽

Hao Guo ◽

Changsheng Tong

Keyword(s):

Loss Function ◽

Imbalanced Data ◽

Data Classification ◽

Generative Adversarial Networks ◽

Complex Data ◽

Channel Measurements ◽

Data Set ◽

Machine Learning Classification ◽

Adversarial Networks ◽

Imbalanced Data Classification

Most of the research on machine learning classification methods is based on balanced data; the research on imbalanced data classification needs improvement. Generative adversarial networks (GANs) are able to learn high-dimensional complex data distribution without relying on a prior hypothesis, which has become a hot technology in artificial intelligence. In this letter, we propose a new structure, classroom-like generative adversarial networks (CLGANs), to construct a model with multiple generators. Taking inspiration from the fact that teachers arrange teaching activities according to students' learning situation, we propose a weight allocation function to adaptively adjust the influence weight of generator loss function on discriminator loss function. All the generators work together to improve the degree of discriminator and training sample space, so that a discriminator with excellent performance is trained and applied to the tasks of imbalanced data classification. Experimental results on the Case Western Reserve University data set and 2.4 GHz Indoor Channel Measurements data set show that the data classification ability of the discriminator trained by CLGANs with multiple generators is superior to that of other imbalanced data classification models, and the optimal discriminator can be obtained by selecting the right matching scheme of the generator models.

Raking and Relabeling for Imbalanced Data

10.36227/techrxiv.17712122.v1 ◽

2022 ◽

Author(s):

Seunghwan Park ◽

Hae-Wwan Lee ◽

Jongho Im

Keyword(s):

High Dimensional Data ◽

Imbalanced Data ◽

Sampling Strategy ◽

Classification Performance ◽

Mixed Data ◽

Categorical Variables ◽

High Dimensional ◽

Data Generation ◽

Minority Class ◽

Imbalanced Data Classification

We consider the binary classification of imbalanced data. A dataset is imbalanced if the class proportions are heavily skewed. Imbalanced data classification is often challenging, especially for high-dimensional data, because unequal class sizes degrade classifier performance. Undersampling the majority class and oversampling the minority class are popular methods for constructing balanced samples, which improves classification performance. However, many existing sampling methods cannot be easily extended to high-dimensional data or to mixed data that include categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce performance comparable to existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice because it does not require density estimation for synthetic data generation in oversampling and is not affected by mixed-type variables. In addition, the proposed sampling strategy is robust to the choice of classifier, in the sense that classification performance is not sensitive to which classifier is used.
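For orientation, the two baseline strategies this abstract mentions (undersampling the majority class and oversampling the minority class) can be sketched in a few lines of plain Python. This is only a minimal illustration of the generic random variants, not the raking and relabeling procedure the authors propose; the function names `random_undersample` and `random_oversample` and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, n_min))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Randomly duplicate rows until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_max = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for i in rng.choices(idx, k=n_max - n):  # sample with replacement
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
```

Undersampling discards majority rows and oversampling repeats minority rows, so neither variant needs a density estimate; methods that synthesize new minority points are the ones that require approximating attribute distributions.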

An Improved Extreme Learning Machine for Imbalanced Data Classification

IEEE Access ◽

10.1109/access.2022.3142724 ◽

2022 ◽

pp. 1-1

Author(s):

Xiaopeng Zhang ◽

Liangxi Qin

Keyword(s):

Extreme Learning Machine ◽

Imbalanced Data ◽

Data Classification ◽

Imbalanced Data Classification ◽

Learning Machine

ADANOISE: Training neural networks with adaptive noise for imbalanced data classification

Expert Systems with Applications ◽

10.1016/j.eswa.2021.116364 ◽

2021 ◽

pp. 116364

Author(s):

Kyoham Shin ◽

Seokho Kang

Keyword(s):

Neural Networks ◽

Imbalanced Data ◽

Data Classification ◽

Imbalanced Data Classification ◽

Adaptive Noise

Tversky Similarity based UnderSampling with Gaussian Kernelized Decision Stump Adaboost Algorithm for Imbalanced Medical Data Classification

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2021.6.4291 ◽

2021 ◽

Vol 16 (6) ◽

Author(s):

M. Kamaladevi ◽

V. Venkatraman

Keyword(s):

Performance Metrics ◽

False Negative ◽

Similarity Index ◽

Data Classification ◽

Healthcare Sector ◽

Gaussian Kernel ◽

Significant Information ◽

Minority Class ◽

Adaboost Algorithm ◽

Imbalanced Data Classification

In recent years, imbalanced data classification has been used in several domains, including detecting fraudulent activities in the banking sector and disease prediction in the healthcare sector. To solve the imbalanced classification problem at the data level, strategies such as undersampling or oversampling are widely used. Sampling techniques, however, pose the challenge of significant information loss. The proposed method involves two processes, namely undersampling and classification. First, undersampling is performed by means of a Tversky Similarity Indexive Regression model. Here, regression along with the Tversky similarity index is used to analyze the relationship between two instances from the dataset. Next, Gaussian Kernelized Decision Stump AdaBoosting is used to classify the instances into two classes. Here, the root node of the decision stump makes a decision on the basis of the Gaussian kernel function, considering the average of neighboring points, and the result is obtained at the leaf node. Weights are also adjusted to minimize the training errors occurring during classification and to find the best classifier. Experimental assessment is performed on two imbalanced datasets (Pima Indian diabetes and Hepatitis). Performance metrics such as precision, recall, area under the ROC curve (AUC), and F1-score are compared with existing undersampling methods. Experimental results showed that the prediction accuracy for the minority class improved, thereby reducing false positives and false negatives.
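The Tversky similarity index referred to in this abstract has a standard set-based definition, S(A, B) = |A ∩ B| / (|A ∩ B| + α|A \ B| + β|B \ A|). The sketch below implements only that textbook formula; how the authors encode dataset instances as sets and combine the index with regression is not stated in the abstract, so the set encoding, the default weights, and the function name `tversky_index` are assumptions.

```python
def tversky_index(a, b, alpha=0.5, beta=0.5):
    """Tversky similarity between two collections treated as sets.

    alpha weights elements found only in `a`, beta weights elements found only in `b`.
    alpha = beta = 1 gives the Jaccard index; alpha = beta = 0.5 gives the Dice coefficient.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    # Convention assumed here: two empty sets are treated as identical.
    return common / denom if denom else 1.0


# Two hypothetical instances described by sets of categorical feature values.
inst_a = {"age:40-50", "glucose:high", "bmi:high"}
inst_b = {"age:40-50", "glucose:high", "bmi:normal"}
jaccard_like = tversky_index(inst_a, inst_b, alpha=1.0, beta=1.0)
dice_like = tversky_index(inst_a, inst_b)
```

Choosing α ≠ β makes the measure asymmetric, which is the property that distinguishes it from Jaccard and Dice.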

LDAS: Local density-based adaptive sampling for imbalanced data classification

Expert Systems with Applications ◽

10.1016/j.eswa.2021.116213 ◽

2021 ◽

pp. 116213

Author(s):

Yuanting Yan ◽

Yifei Jiang ◽

Zhong Zheng ◽

Chengjin Yu ◽

Yiwen Zhang ◽

...

Keyword(s):

Adaptive Sampling ◽

Local Density ◽

Imbalanced Data ◽

Data Classification ◽

Imbalanced Data Classification

Machine-Learning-Based Android Malware Family Classification Using Built-In and Custom Permissions

Applied Sciences ◽

10.3390/app112110244 ◽

2021 ◽

Vol 11 (21) ◽

pp. 10244

Author(s):

Minki Kim ◽

Daehan Kim ◽

Changha Hwang ◽

Seongje Cho ◽

Sangchul Han ◽

...

Keyword(s):

Machine Learning ◽

Positive Impact ◽

Imbalanced Data ◽

Classification Performance ◽

Malware Analysis ◽

Learning Approaches ◽

Android Malware ◽

Open Questions ◽

Imbalanced Data Classification ◽

Family Classification

Malware family classification is the grouping of malware samples that have the same or similar characteristics into the same family. It plays a crucial role in understanding notable malicious patterns and recovering from malware infections. Although many machine learning approaches have been devised for this problem, there are still several open questions, including “Which features, classifiers, and evaluation metrics are better for malware familial classification?” In this paper, we propose a machine learning approach to Android malware family classification using built-in and custom permissions. Each Android app must declare proper permissions to access restricted resources or to perform restricted actions. Permission declaration is an efficient and obfuscation-resilient feature for malware analysis. We developed a malware family classification technique using permissions and conducted extensive experiments with several classifiers on a well-known dataset, DREBIN. We then evaluated the classifiers in terms of four metrics: macrolevel F1-score, accuracy, balanced accuracy (BAC), and the Matthews correlation coefficient (MCC). BAC and the MCC are known to be appropriate for evaluating imbalanced data classification. Our experimental results showed that: (i) custom permissions had a positive impact on classification performance; (ii) even when the same classifier and the same feature information were used, there was a difference of up to 3.67% between accuracy and BAC; (iii) LightGBM and AdaBoost performed better than the other classifiers we considered.
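To see why accuracy and balanced accuracy can diverge on imbalanced data, it may help to compute them side by side. The sketch below uses the standard binary definitions of accuracy, BAC, and MCC from a confusion matrix; the paper itself evaluates a multi-class family classification task, so the binary setting, the function name `binary_metrics`, and the confusion-matrix counts are simplifying assumptions for illustration only.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, balanced accuracy, MCC) for a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on the positive class
    tnr = tn / (tn + fp) if tn + fp else 0.0  # recall on the negative class
    bac = (tpr + tnr) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, bac, mcc


# Hypothetical classifier that labels every one of 100 samples as negative,
# on data with 5 positives and 95 negatives.
acc, bac, mcc = binary_metrics(tp=0, fp=0, fn=5, tn=95)
```

In this made-up case accuracy is 0.95 while BAC is 0.5 and MCC is 0, which is the kind of gap that makes BAC and MCC more informative than accuracy when classes are skewed.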

Binary Imbalanced Data Classification Based on Diversity Oversampling by Generative Models

Information Sciences ◽

10.1016/j.ins.2021.11.058 ◽

2021 ◽

Author(s):

Junhai Zhai ◽

Jiaxing Qi ◽

Chu Shen

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Generative Models ◽

Imbalanced Data Classification

RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification

Machine Learning ◽

10.1007/s10994-021-06012-8 ◽

2021 ◽

Author(s):

Michał Koziarski ◽

Colin Bellinger ◽

Michał Woźniak

Keyword(s):

Binary Data ◽

State Of The Art ◽

Input Parameter ◽

Health And Safety ◽

Imbalanced Data ◽

Classification Performance ◽

The State ◽

Training Data ◽

Classification Model ◽

Imbalanced Data Classification

Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asynchronous misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. The category sub-region for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our $5\times 2$ cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally out-performs the state-of-the-art resampling methods in terms of AUC and G-mean.
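The evaluation protocol named in this abstract combines two standard ingredients: $5\times 2$ cross-validation (five repetitions of a two-fold split) and the G-mean, the geometric mean of the true positive rate and the true negative rate. The sketch below shows one plausible way to generate such splits and compute the G-mean in plain Python; stratifying each split by class label and the function names `five_by_two_splits` and `g_mean` are assumptions, and the abstract does not say how the authors implemented either step.

```python
import math
import random


def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


def five_by_two_splits(y, seed=0):
    """Yield 10 (train_idx, test_idx) pairs: 5 repetitions of stratified 2-fold CV."""
    rng = random.Random(seed)
    for _ in range(5):
        halves = ([], [])
        for label in sorted(set(y)):
            idx = [i for i, lab in enumerate(y) if lab == label]
            rng.shuffle(idx)
            mid = len(idx) // 2
            halves[0].extend(idx[:mid])   # half of this class goes to fold 0
            halves[1].extend(idx[mid:])   # the rest goes to fold 1
        # Each half serves once as the training set and once as the test set.
        yield sorted(halves[0]), sorted(halves[1])
        yield sorted(halves[1]), sorted(halves[0])


# Hypothetical labels: 8 majority samples and 2 minority samples.
labels = [0] * 8 + [1] * 2
splits = list(five_by_two_splits(labels))
score = g_mean(tp=4, fp=10, fn=1, tn=85)
```

Because the G-mean multiplies the two per-class recalls, a classifier that ignores the minority class scores 0 regardless of how well it does on the majority class.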
