Unbalanced Sequential Data Classification using Extreme Outlier Elimination and Sampling Techniques

Predicting minority class sequence patterns from noisy and unbalanced sequential datasets is a challenging task. To solve this problem, we propose a new approach that combines extreme outlier elimination with a hybrid sampling technique. We use the k Reverse Nearest Neighbors (kRNNs) concept as a data cleaning method for eliminating extreme outliers in minority regions. The hybrid sampling technique, a combination of SMOTE to oversample the minority class sequences and random undersampling to undersample the majority class sequences, is used to improve minority class prediction. The method was evaluated in terms of minority class precision, recall and f-measure on a synthetically simulated, highly overlapped sequential dataset named Hill-Valley. We conducted the experiments with a k-Nearest Neighbour classifier and compared the performance of our approach against a simple hybrid sampling technique. Results indicate that our approach does not sacrifice one class in favor of the other, but produces high predictions for both fraud and non-fraud classes.
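
As a rough illustration of the hybrid sampling step described in this abstract, the Python sketch below oversamples the minority class by SMOTE-style interpolation and randomly undersamples the majority class to a common size. It assumes the sequences are already represented as fixed-length numeric vectors and that the minority class has at least two points. The function names, the `target` parameter, and the toy data are hypothetical, and the kRNN outlier-elimination step is not shown.

```python
import random

def smote_like_oversample(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

def random_undersample(majority, n_keep, rng=None):
    """Keep n_keep majority points chosen uniformly without replacement."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)

def hybrid_sample(minority, majority, target, k=3, seed=0):
    """Bring both classes to `target` points each.
    Assumes len(minority) <= target <= len(majority)."""
    rng = random.Random(seed)
    new_minority = minority + smote_like_oversample(
        minority, target - len(minority), k, rng
    )
    new_majority = random_undersample(majority, target, rng)
    return new_minority, new_majority

# Toy data (hypothetical): 4 minority points, 20 majority points
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(20)]
balanced_min, balanced_maj = hybrid_sample(minority, majority, target=10)
print(len(balanced_min), len(balanced_maj))
```

Because every synthetic point lies on a segment between two existing minority points, the oversampled class stays inside the region already spanned by the minority data.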

HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

Complexity ◽

10.1155/2021/6877284 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Liping Chen ◽

Jiabao Jiang ◽

Yong Zhang

Keyword(s):

Big Data ◽

Sampling Method ◽

Sampling Methods ◽

Data Partition ◽

Minority Class ◽

F Measure ◽

Better Than ◽

Hybrid Sampling

Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the datasets to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all the data samples are partitioned into different data regions. Then, the data samples in the noise minority samples region are removed, and the samples in the boundary minority samples region are selected as oversampling seeds to generate the synthetic samples. Finally, a weighted oversampling process is conducted in which synthetic samples are generated in the same cluster as the oversampling seed. The weight of each selected minority class sample is computed as the ratio between the proportion of majority class samples among the neighbors of this selected sample and the sum of all these proportions. Generating synthetic samples in the same cluster as the oversampling seed guarantees that new synthetic samples are located inside the minority class area. Experiments conducted on eight datasets show that the proposed method, HSDP, is better than or comparable with the typical sampling methods in terms of F-measure and G-mean.
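
The seed-weighting rule stated in this abstract can be sketched in a few lines of Python. This is only a minimal reading of the sentence about weights, not the authors' implementation: the function names, the label strings, and the uniform fallback when no seed has any majority neighbour are assumptions.

```python
def majority_proportion(neighbour_labels, majority_label):
    """Fraction of a seed's neighbours that belong to the majority class."""
    hits = sum(1 for label in neighbour_labels if label == majority_label)
    return hits / len(neighbour_labels)

def seed_weights(proportions):
    """Weight of each seed = its majority proportion / sum of all proportions."""
    total = sum(proportions)
    if total == 0:
        # Assumption: fall back to uniform weights if no seed has majority neighbours
        return [1.0 / len(proportions)] * len(proportions)
    return [p / total for p in proportions]

# Hypothetical neighbour labels for three boundary minority seeds
neighbours = [
    ["maj", "maj", "min", "min"],
    ["maj", "min", "min", "min"],
    ["maj", "maj", "maj", "min"],
]
props = [majority_proportion(n, "maj") for n in neighbours]
weights = seed_weights(props)
print(props, weights)
```

Under this rule, seeds surrounded by more majority samples receive larger weights, so more synthetic samples would be generated near the harder part of the class boundary.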

Multi-Label Classification with PSO based Synthetic Minority Over-Sampling Technique (Psosmote) for Imbalanced Samples

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d8437.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 4039-4042

Keyword(s):

Data Mining ◽

Sampling Rate ◽

Sampling Technique ◽

Unbalanced Data ◽

Optimal Sampling ◽

Minority Class ◽

Swarm Optimization ◽

F Measure ◽

Predictive Clustering Trees

Recently, learning from unbalanced data has emerged as a predominant problem in several applications, and because multi-label classification is an evolving data mining task, learning from unbalanced multi-label data is being examined. However, the available SMOTE-based algorithms use the same sampling rate for every instance of the minority class, which leads to sub-optimal performance. To deal with this problem, a new Particle Swarm Optimization based SMOTE (PSOSMOTE) algorithm is proposed. The PSOSMOTE algorithm employs diverse sampling rates for multiple minority class instances and obtains the fusion of optimal sampling rates to deal with the classification of unbalanced datasets. Then, a Bayesian technique is combined with Random Forest for multi-label classification (BARF-MLC) to address the inherent label dependencies among samples, and it is evaluated against classifiers such as ML-FOREST, Predictive Clustering Trees (PCT), and Hierarchy of Multi Label Classifier (HOMER) using metrics including precision, recall, F-measure, accuracy and error rate.

An under sampled k-means approach for handling imbalanced data using diversified distribution

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i1.8.9984 ◽

2018 ◽

Vol 7 (1.8) ◽

pp. 113 ◽

Cited By ~ 1

Author(s):

G Shobana ◽

Bhanu Prakash Battula

Keyword(s):

Research Issue ◽

Minority Class ◽

Effective Utilization ◽

Specific Strategy ◽

Open Research ◽

Novel Approach ◽

Under Sampling ◽

F Measure

Some real applications reveal difficulties in learning classifiers from imbalanced data. Although several methods for improving classifiers have been introduced, identifying the conditions under which a particular method can be used effectively is still an open research issue. It is also worth studying the nature of imbalanced data, the characteristics of the minority class distribution, and their influence on classification performance. However, current studies on imbalanced data difficulty factors have been carried out mainly with artificial datasets, and their conclusions are not easily applicable to real-world problems, partly because the methods for identifying these factors are not sufficiently developed. In this paper, we propose a novel approach, Under Sampling Utilizing Diversified Distribution (USDD), for solving the problem of class imbalance in real datasets by identifying and removing borderline, rare and outlier subgroups using k-means. USDD uses a special procedure for identifying these types of examples, which is based on analyzing the class distribution in a local neighborhood of the considered example using a k-nearest approach. The experimental results suggest that the proposed USDD approach performs better than the compared approach in terms of AUC, precision, recall and f-measure.

Beyond Uniform Reverse Sampling: A Hybrid Sampling Technique for Misinformation Prevention

IEEE INFOCOM 2019 - IEEE Conference on Computer Communications ◽

10.1109/infocom.2019.8737485 ◽

2019 ◽

Cited By ~ 1

Author(s):

Guangmo Amo Tong ◽

Ding-Zhu Du

Keyword(s):

Sampling Technique ◽

Hybrid Sampling

Improving undersampling-based ensemble with rotation forest for imbalanced problem

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES ◽

10.3906/elk-1805-159 ◽

2019 ◽

pp. 1371-1386 ◽

Cited By ~ 1

Author(s):

Huaping GUO ◽

Xiaoyu DIAO ◽

Hongbing LIU

Keyword(s):

High Performance ◽

State Of The Art ◽

Class Imbalance ◽

Imbalanced Data ◽

Ensemble Methods ◽

Sampling Technique ◽

Robust Methods ◽

Limited Data ◽

Minority Class ◽

Rotation Forest

As one of the most challenging and attractive issues in pattern recognition and machine learning, the imbalanced problem has attracted increasing attention. For two-class data, imbalanced data are characterized by the size of one class (majority class) being much larger than that of the other class (minority class), which makes the constructed models focus more on the majority class and ignore or even misclassify the examples of the minority class. The undersampling-based ensemble, which learns individual classifiers from undersampled balanced data, is an effective method to cope with class-imbalanced data. The problem in this method is that the size of the dataset used to train each classifier is notably small; thus, how to generate individual classifiers with high performance from the limited data is key to the success of the method. In this paper, rotation forest (an ensemble method) is used to improve the performance of the undersampling-based ensemble on the imbalanced problem because rotation forest has higher performance than other ensemble methods such as bagging, boosting, and random forest, particularly for small-sized data. In addition, rotation forest is more sensitive to the sampling technique than some robust methods including SVM and neural networks; thus, it is easier to create individual classifiers with diversity using rotation forest. Two versions of the improved undersampling-based ensemble methods are implemented: 1) undersampling subsets from the majority class and learning each classifier using the rotation forest on the data obtained by combining each subset with the minority class and 2) similarly to the first method, with the exception of removing the majority class examples that are correctly classified with high confidence after learning each classifier for further consideration. The experimental results show that the proposed methods achieve significantly better performance on measures of recall, g-mean, f-measure, and AUC than other state-of-the-art methods on 30 datasets with various data distributions and different imbalance ratios.
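
The first variant described in this abstract starts from balanced training sets built by pairing random majority subsets with the full minority class. The Python sketch below shows only that data-preparation step under simple assumptions: the function name and toy data are hypothetical, subsets are drawn independently (so they may overlap), and the rotation forest base learner is not implemented here.

```python
import random

def balanced_training_sets(majority, minority, n_sets, seed=0):
    """Build n_sets balanced datasets, each made of a random majority subset
    (same size as the minority class) plus all minority examples."""
    rng = random.Random(seed)
    training_sets = []
    for _ in range(n_sets):
        subset = rng.sample(majority, len(minority))  # undersample the majority
        training_sets.append(subset + minority)       # combine with the minority
    return training_sets

# Hypothetical toy data: non-negative ids are majority, negative ids are minority
majority = list(range(100))
minority = [-1, -2, -3, -4, -5]
sets = balanced_training_sets(majority, minority, n_sets=4)
print([len(s) for s in sets])
```

Each resulting set would then be used to train one ensemble member, which is why every member sees only a small amount of data.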

SYNTHETIC OVERSAMPLING OF INSTANCES USING CLUSTERING

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500085 ◽

2013 ◽

Vol 22 (02) ◽

pp. 1350008 ◽

Cited By ~ 2

Author(s):

ATLÁNTIDA I. SÁNCHEZ ◽

EDUARDO F. MORALES ◽

JESUS A. GONZALEZ

Keyword(s):

Imbalanced Data ◽

Data Sets ◽

Minority Class ◽

Imbalanced Data Sets ◽

Tuning Parameters ◽

New Methods ◽

Real World Applications ◽

Noisy Examples ◽

F Measure ◽

Better Than

Imbalance in the class distribution of data sets is common to many real-world applications. As many classifiers tend to degrade in performance on the minority class, several approaches have been proposed to deal with this problem. In this paper, we propose two new cluster-based oversampling methods, SOI-C and SOI-CJ. The proposed methods create clusters from the minority class instances and generate synthetic instances inside those clusters. In contrast with other oversampling methods, the proposed approaches avoid creating new instances in majority class regions. They are more robust to noisy examples (the number of new instances generated per cluster is proportional to the cluster's size). The clusters are automatically generated. Our new methods do not need tuning parameters, and they can deal with both numerical and nominal attributes. The two methods were tested with twenty artificial datasets and twenty-three datasets from the UCI Machine Learning repository. For our experiments, we used six classifiers, and results were evaluated with recall, precision, F-measure, and AUC measures, which are more suitable for class imbalanced datasets. We performed ANOVA and paired t-tests to show that the proposed methods are competitive and in many cases significantly better than the rest of the oversampling methods used during the comparison.
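
The statement that the number of new instances per cluster is proportional to the cluster's size can be illustrated with a small allocation routine. This Python sketch is an assumption about one reasonable way to do the rounding (largest remainder); the abstract does not say how SOI-C or SOI-CJ handle it, and the function name is hypothetical.

```python
def allocate_per_cluster(cluster_sizes, n_synthetic):
    """Split n_synthetic new instances across clusters in proportion to
    cluster size, using largest-remainder rounding so the counts sum exactly."""
    total = sum(cluster_sizes)
    quotas = [n_synthetic * size / total for size in cluster_sizes]
    counts = [int(q) for q in quotas]  # floor of each quota
    leftover = n_synthetic - sum(counts)
    # Hand the remaining instances to the clusters with the largest fractional parts
    order = sorted(range(len(quotas)), key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

print(allocate_per_cluster([10, 30, 60], 50))
```

With proportional allocation, a tiny cluster formed by a few noisy minority points receives few synthetic instances, which is consistent with the robustness claim in the abstract.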

A Survey on Imbalanced Data Handling Techniques for Classification

International Journal of Emerging Trends in Engineering Research ◽

10.30534/ijeter/2021/089102021 ◽

2021 ◽

Vol 9 (10) ◽

pp. 1341-1347

Keyword(s):

Real World ◽

Imbalanced Data ◽

Learning Task ◽

High Accuracy ◽

Data Handling ◽

Imbalanced Dataset ◽

Minority Class ◽

Class Labels ◽

Very High ◽

F Measure

Classification is a supervised learning task that categorizes items into groups on the basis of class labels. Algorithms are trained with labeled datasets to accomplish the task of classification, so datasets play an important role in the process. If the instances of one label/class (majority class) in a dataset far outnumber the instances of another label/class (minority class), such that it becomes hard for a classifier to understand and learn the characteristics of the minority class, the dataset is termed an imbalanced dataset. These types of datasets raise the problem of biased prediction or misclassification in the real world: models based on such datasets may give very high accuracy during training but, being unfamiliar with minority class instances, are unable to predict the minority class and thus perform poorly. A survey of the various techniques proposed by researchers for handling imbalanced data is presented, and a comparison of the techniques based on f-measure is discussed.
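
The point that accuracy can be very high while the minority class is never predicted is easy to check numerically. The Python sketch below uses a made-up dataset with 5 minority and 95 majority labels and a trivial classifier that always predicts the majority class; the helper names are hypothetical, and F-measure here means the usual F1 score of the minority (positive) class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); taken as 0.0 when the denominator is zero
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true = [1] * 5 + [0] * 95   # 5 minority (positive), 95 majority
y_pred = [0] * 100            # always predict the majority class
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, fn, tn), f_measure(tp, fp, fn))
```

The always-majority classifier reaches 95% accuracy on this toy data while its minority-class F-measure is zero, which is why surveys in this area tend to compare techniques on F-measure.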

HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3488280 ◽

2022 ◽

Vol 16 (3) ◽

pp. 1-37

Author(s):

Robert A. Sowah ◽

Bernard Kuditchar ◽

Godfrey A. Mills ◽

Amevi Acakpovi ◽

Raphael A. Twum ◽

...

Keyword(s):

Geometric Mean ◽

Class Imbalance ◽

Sampling Technique ◽

Data Repository ◽

Support Vector ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

High Degree ◽

Hybrid Sampling

The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset with a high degree of skewness toward the negative class. This phenomenon causes classification algorithms to perform poorly when predicting the positive class for new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques to deal with the class imbalance problem. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on the class imbalance problem. The proposed method, called the Hybrid Cluster-Based Undersampling Technique (HCBST), uses a combination of the cluster undersampling technique to undersample the majority instances and an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination to oversample the minority instances, in order to solve the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets from the National Aeronautics and Space Administration Metric Data Program data repository and University of California Irvine Machine Learning data repository with varying degrees of imbalance. Results were compared with classification algorithms such as the K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that for the same datasets, the HCBST performed better, with average performances of 0.73, 0.67, and 0.35 in terms of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used for this study. The HCBST has the potential of improving the performance of the class imbalance problem, which, by extension, will improve the various applications that rely on the concept for a solution.
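
Two of the measures reported in this abstract, the geometric mean and the Matthews Correlation Coefficient, can be computed directly from binary confusion counts. The Python sketch below uses the standard textbook definitions; the function names and the convention of returning 0.0 when a denominator is zero are assumptions, not taken from the article.

```python
import math

def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier that never predicts the positive class on a 5/95 split
print(g_mean(tp=0, fp=0, fn=5, tn=95), mcc(tp=0, fp=0, fn=5, tn=95))
```

Both measures drop to zero for a classifier that ignores the positive class, whereas plain accuracy on the same predictions would still look high.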

Boosting Minority Class Prediction on Imbalanced Point Cloud Data

Applied Sciences ◽

10.3390/app10030973 ◽

2020 ◽

Vol 10 (3) ◽

pp. 973 ◽

Cited By ~ 1

Author(s):

Hsien-I Lin ◽

Mihn Cong Nguyen

Keyword(s):

Point Cloud ◽

Imbalanced Data ◽

Automated Assignment ◽

Test Accuracy ◽

Point Cloud Data ◽

Minority Class ◽

Class Prediction ◽

Cloud Data ◽

Data Imbalance

Data imbalance during the training of deep networks can cause the network to skip directly to learning minority classes. This paper presents a novel framework by which to train segmentation networks using imbalanced point cloud data. PointNet, an early deep network used for the segmentation of point cloud data, proved effective in the point-wise classification of balanced data; however, performance degraded when imbalanced data was used. The proposed approach involves removing between-class data point imbalances and guiding the network to pay more attention to majority classes. Data imbalance is alleviated using a hybrid-sampling method that combines undersampling, to decrease the amount of data in majority classes, with oversampling, to increase the amount of data in minority classes. A balanced focus loss function is also used to emphasize the minority classes through the automated assignment of costs to the various classes based on their density in the point cloud. Experiments demonstrate the effectiveness of the proposed training framework when provided with a point cloud dataset pertaining to six objects. The mean intersection over union (mIoU) test accuracy results obtained using PointNet training were as follows: XYZRGB data (91%) and XYZ data (86%). The mIoU test accuracy results obtained using the proposed scheme were as follows: XYZRGB data (98%) and XYZ data (93%).
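
For readers unfamiliar with the mIoU figures quoted in this abstract, the Python sketch below computes mean intersection over union from per-point labels using the common definition IoU = TP / (TP + FP + FN) per class. The function name and toy labels are hypothetical, and skipping classes that appear in neither the ground truth nor the predictions is an assumption; the paper may average differently.

```python
def mean_iou(y_true, y_pred, classes):
    """Mean of per-class IoU = TP / (TP + FP + FN) over the given classes.
    Classes absent from both y_true and y_pred are skipped (assumption);
    at least one class must be present."""
    ious = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        union = tp + fp + fn
        if union:
            ious.append(tp / union)
    return sum(ious) / len(ious)

# Toy per-point labels for a two-class segmentation
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(mean_iou(y_true, y_pred, classes=[0, 1]))
```

Because each class contributes equally to the mean regardless of how many points it has, mIoU penalizes a model that segments large classes well but misses small ones.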

Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study (Preprint)

10.2196/preprints.14499 ◽

2019 ◽

Author(s):

Chin Lin ◽

Yu-Sheng Lou ◽

Dung-Jang Tsai ◽

Chia-Cheng Lee ◽

Chia-Jung Hsu ◽

...

Keyword(s):

General Hospital ◽

Sampling Method ◽

Model Performance ◽

Word Embedding ◽

Superior Performance ◽

Word Embeddings ◽

Technology Improvement ◽

Icd 10 ◽

F Measure ◽

Hybrid Sampling

BACKGROUND: Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of initial word embeddings. Word embedding trained by electronic health records (EHRs) is considered the best, but the vocabulary diversity is limited by previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider the particularity of the disease classification, wherein discharge notes present only positive disease descriptions.

OBJECTIVE: We aimed to propose a projection word2vec model and a hybrid sampling method. In addition, we aimed to conduct a series of experiments to validate the effectiveness of these methods.

METHODS: We compared the projection word2vec model and traditional word2vec model using two corpora sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify the three–character-level ICD-10-CM diagnostic codes in a set of discharge notes. On the basis of embedding technology improvement, we also tried to apply the hybrid sampling method to improve accuracy. The 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate the model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were tested. The F-measure, which is the major global measure of effectiveness, was adopted.

RESULTS: In medical semantic understanding, the original EHR embeddings and PubMed embeddings exhibited superior performance to the original Wikipedia embeddings. After projection training technology was applied, the projection Wikipedia embeddings exhibited an obvious improvement but did not reach the level of original EHR embeddings or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model that used both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method was found to improve the model performance (F-measure=0.7371/0.6698).

CONCLUSIONS: The word embeddings trained using EHR and PubMed could understand medical semantics better, and the proposed projection word2vec model improved the ability of medical semantics extraction in Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert.
