Imbalanced Data: Latest Research Papers

Most research on machine learning classification methods is based on balanced data; research on imbalanced data classification needs improvement. Generative adversarial networks (GANs) can learn high-dimensional, complex data distributions without relying on a prior hypothesis, which has made them a popular technology in artificial intelligence. In this letter, we propose a new structure, classroom-like generative adversarial networks (CLGANs), to construct a model with multiple generators. Taking inspiration from the fact that teachers arrange teaching activities according to students' learning situation, we propose a weight allocation function to adaptively adjust the influence weight of the generator loss function on the discriminator loss function. All the generators work together to improve the degree of discriminator and training sample space, so that a discriminator with excellent performance is trained and applied to imbalanced data classification tasks. Experimental results on the Case Western Reserve University data set and the 2.4 GHz Indoor Channel Measurements data set show that the classification ability of the discriminator trained by CLGANs with multiple generators is superior to that of other imbalanced data classification models, and that the optimal discriminator can be obtained by selecting the right matching scheme of the generator models.

Application of Artificial Intelligence in Diagnosis of Craniopharyngioma

Frontiers in Neurology ◽

10.3389/fneur.2021.752119 ◽

2022 ◽

Vol 12 ◽

Author(s):

Caijie Qin ◽

Wenxing Hu ◽

Xinsheng Wang ◽

Xibo Ma

Keyword(s):

Artificial Intelligence ◽

Clinical Diagnosis ◽

Great Influence ◽

Imbalanced Data ◽

Learning Model ◽

Future Research ◽

Data Set ◽

Congenital Brain Tumor ◽

Pituitary Dysfunction ◽

Artificial Intelligence Technology

Craniopharyngioma is a congenital brain tumor with clinical characteristics of hypothalamic-pituitary dysfunction, increased intracranial pressure, and visual field disorder, among other injuries. Its clinical diagnosis mainly depends on radiological examinations (such as Computed Tomography and Magnetic Resonance Imaging). However, assessing numerous radiological images manually is a challenging task, and the experience of doctors has a great influence on the diagnosis result. The development of artificial intelligence has brought about a great transformation in the clinical diagnosis of craniopharyngioma. This study reviewed the application of artificial intelligence technology in the clinical diagnosis of craniopharyngioma, covering differential classification, prediction of tissue invasion and gene mutation, and prognosis prediction, among other aspects. Based on this review, technical routes for intelligent diagnosis based on traditional machine learning models and on deep learning models were further proposed. Additionally, considering the limitations and possibilities of artificial intelligence in craniopharyngioma diagnosis, this study discussed issues that require attention in future research, including few-shot learning, imbalanced data sets, semi-supervised models, and multi-omics fusion.

Raking and Relabeling for Imbalanced Data

10.36227/techrxiv.17712122.v1 ◽

2022 ◽

Author(s):

Seunghwan Park ◽

Hae-Wwan Lee ◽

Jongho Im

Keyword(s):

High Dimensional Data ◽

Imbalanced Data ◽

Sampling Strategy ◽

Classification Performance ◽

Mixed Data ◽

Categorical Variables ◽

High Dimensional ◽

Data Generation ◽

Minority Class ◽

Imbalanced Data Classification

We consider the binary classification of imbalanced data. A dataset is imbalanced if the class proportions are heavily skewed. Imbalanced data classification is often challenging, especially for high-dimensional data, because unequal class sizes degrade classifier performance. Undersampling the majority class and oversampling the minority class are popular methods for constructing balanced samples, which helps improve classification performance. However, many existing sampling methods cannot be easily extended to high-dimensional data or to mixed data that include categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms achieve performance comparable to existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice because it does not require density estimation for synthetic data generation in oversampling and is not affected by mixed-type variables. In addition, the proposed sampling strategy is robust to the choice of classifier, in the sense that classification performance is not sensitive to which classifier is used.
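
The two baseline strategies named in this abstract, undersampling the majority class and oversampling the minority class, can be sketched in a few lines of Python. This is only an illustration of plain random resampling, not the raking and relabeling procedure the paper proposes (the abstract does not specify it). The function names, the toy data, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        out_rows.extend(rng.sample(members, target))  # sample without replacement
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)  # originals plus random duplicates
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 100 majority rows (label 0), 10 minority rows (label 1).
rows = list(range(110))
labels = [0] * 100 + [1] * 10

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)
print(Counter(under_labels))
print(Counter(over_labels))
```

Undersampling discards majority-class information, while oversampling by duplication adds no new information about the minority class; both are usually treated as baselines against which other sampling strategies are compared.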

Boundary-Aware Hashing for Hamming Space Retrieval

Applied Sciences ◽

10.3390/app12010508 ◽

2022 ◽

Vol 12 (1) ◽

pp. 508

Author(s):

Wenjin Hu ◽

Yukun Chen ◽

Lifang Wu ◽

Ge Shi ◽

Meng Jian

Keyword(s):

Exponential Function ◽

Large Scale ◽

Imbalanced Data ◽

Logarithmic Function ◽

Exponential Functions ◽

Boundary Making ◽

Benchmark Datasets ◽

Hamming Space ◽

The Absolute ◽

Large Scale Image Retrieval

Hamming space retrieval is an active area of research in deep hashing because it is effective for large-scale image retrieval. Existing hashing algorithms have not fully used the absolute boundary to discriminate between data inside and outside the Hamming ball, and their performance is unsatisfactory. In this paper, a boundary-aware contrastive loss is designed. It involves an exponential function with absolute boundary (i.e., Hamming radius) information for dissimilar pairs and a logarithmic function to encourage small distances for similar pairs. It achieves a push that is bigger than the pull inside the Hamming ball, and a pull that is bigger than the push outside the ball. Furthermore, a novel Boundary-Aware Hashing (BAH) architecture is proposed. It discriminatively penalizes the dissimilar data inside and outside the Hamming ball. BAH reduces the influence of extremely imbalanced data without up-weighting similar pairs or applying other optimization strategies, because its exponential function rapidly converges outside the absolute boundary, creating a large contrast between the gradients of the logarithmic and exponential functions. Extensive experiments conducted on four benchmark datasets show that the proposed BAH obtains higher performance for different code lengths and has the advantage of handling extremely imbalanced data.
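
As a rough illustration of the retrieval setting this abstract refers to, the sketch below computes the Hamming distance between binary hash codes and returns the database codes that fall inside a Hamming ball around a query. It does not implement the BAH loss or architecture, whose exact form the abstract does not give. The function names, the 4-bit codes, and the radius of 2 are assumptions chosen for the example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def hamming_ball(query: int, database: list, radius: int = 2) -> list:
    """Return the database codes within `radius` bits of the query code."""
    return [code for code in database if hamming_distance(query, code) <= radius]


# Hypothetical 4-bit hash codes stored as integers.
database = [0b0001, 0b0011, 0b0111, 0b1111]
query = 0b0000

for code in database:
    print(format(code, "04b"), hamming_distance(query, code))
print([format(code, "04b") for code in hamming_ball(query, database)])
```

Codes at distance 1 and 2 from the query are retrieved, while those at distance 3 and 4 lie outside the ball. The radius is the "absolute boundary" that separates the two groups.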

Sleep Stage Classification For Medical Purposes: Machine Learning Evaluation For Imbalanced Data

10.21203/rs.3.rs-1208553/v1 ◽

2022 ◽

Author(s):

Bens Pardamean ◽

Arif Budiarto ◽

Bharuno Mahesworo ◽

Alam Ahmad Hidayat ◽

Digdo Sudigyo

Keyword(s):

Machine Learning ◽

Time Series ◽

Sleep Stage ◽

Imbalanced Data ◽

Wearable Device ◽

Prediction Score ◽

Series Data ◽

Sleep Stages ◽

Minority Class ◽

Class Weight

Background: Sleep is commonly associated with physical and mental health status. Sleep quality can be determined from the dynamics of sleep stages during the night. Data from a wearable device can potentially be used as predictors to classify the sleep stage. A robust machine learning (ML) model is needed to learn the patterns in wearable data that are associated with sleep-wake classification, especially to handle the imbalanced proportion between wake and sleep stages. In this study, we used a publicly available dataset consisting of three features captured from a consumer wearable device and the labelled sleep stages from a polysomnogram. We implemented Random Forest, Support Vector Machine, Extreme Gradient Boosting Tree (XGB), Dense Neural Network (DNN), and Long Short-Term Memory (LSTM) models, complemented by three strategies to handle the imbalanced data problem. Results: In total, we included more than 24,815 rows of preprocessed data from 31 samples. The proportion of minority to majority data is 1:10. In classifying this extremely imbalanced data, the DNN model was found to have the best performance compared to the previous best model, which is based on a basic Multi-Layer Perceptron. Our best model achieved a 12% higher specificity score (prediction score for the minority class) and a 1% improvement in the sensitivity score (prediction score for the majority class) by including all features in the model. This improvement was attributable to the implementation of custom class weights and an oversampling strategy. In contrast, when we used only two features, XGB achieved a specificity improvement of only 1%, while keeping the sensitivity at the same level. Conclusions: The non-linear operations within the DNN model could successfully learn the hidden pattern from the combination of three features. Additionally, the class weight parameter kept the model from ignoring the minority class by giving more weight to this class in the loss function. The feature engineering process seemed to obscure the time-series characteristics within the data. This is why LSTM, as one of the best methods for time-series data, failed to perform well in this classification task.
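
The effect of a class weight in the loss function, as described in the conclusions above, can be sketched as follows. The abstract does not state how its custom class weights were chosen, so this example assumes the common inverse-frequency heuristic, weight = n_samples / (n_classes * class_count), applied to a weighted binary cross-entropy. The function names, the toy labels at a 1:10 ratio, and the constant predictions are all hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Weighted binary cross-entropy; each prob is the predicted P(label == 1)."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        w = weights[y]
        total += -w * (math.log(p) if y == 1 else math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical toy labels at a 1:10 ratio: label 1 is the minority class.
labels = [1] * 10 + [0] * 100
weights = balanced_class_weights(labels)
uniform = {0: 1.0, 1: 1.0}

# A model that effectively ignores the minority class: P(label == 1) = 0.1 everywhere.
probs = [0.1] * len(labels)
unweighted = weighted_log_loss(labels, probs, uniform)
weighted = weighted_log_loss(labels, probs, weights)
print(weights)
print(unweighted, weighted)
```

With uniform weights, a model that always favors the majority class incurs a low loss; with inverse-frequency weights, the same predictions are penalized much more heavily, which is what pushes training away from ignoring the minority class.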
