HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

Complexity, 2021, Vol 2021, pp. 1-9
Author(s): Liping Chen, Jiabao Jiang, Yong Zhang

Classical classifiers are ineffective at classifying imbalanced big datasets. Resampling the data to balance the class distribution before training the classifier is one of the most popular remedies. This paper proposes an effective and simple hybrid sampling method based on data partition (HSDP). First, all data samples are partitioned into different regions. Then, samples in the noise minority region are removed, and samples in the boundary minority region are selected as oversampling seeds for generating synthetic samples. Finally, a weighted oversampling process generates synthetic samples within the same cluster as each oversampling seed. The weight of each selected minority-class sample is the ratio between the proportion of majority-class samples among its neighbors and the sum of these proportions over all selected samples. Generating synthetic samples within the seed's cluster guarantees that new samples lie inside the minority-class area. Experiments on eight datasets show that HSDP is better than or comparable with typical sampling methods in terms of F-measure and G-mean.
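The seed-weighting step reduces to a few lines of NumPy. The sketch below assumes the weights are computed from the k nearest neighbors in the full dataset; the function name, the choice of k, and the label encoding are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversampling_weights(X, y, seed_idx, k=5, majority_label=0):
    """Weight each oversampling seed by the majority-class share of its k
    nearest neighbors, normalized over all seeds (the HSDP weighting rule
    as described in the abstract; surrounding details are assumptions)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[seed_idx])         # column 0 is the seed itself
    neighbor_labels = y[idx[:, 1:]]             # drop self, keep k neighbors
    majority_prop = (neighbor_labels == majority_label).mean(axis=1)
    return majority_prop / majority_prop.sum()  # weights sum to 1
```

Seeds with more majority-class neighbors, i.e. those closer to the class boundary, receive larger weights, so more synthetic samples are generated around them while generation stays inside each seed's cluster.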

2013, Vol 22 (02), pp. 1350008
Author(s): Atlántida I. Sánchez, Eduardo F. Morales, Jesus A. Gonzalez

Imbalanced class distributions are common in many real-world applications. Because many classifiers degrade in performance on the minority class, several approaches have been proposed to deal with this problem. In this paper, we propose two new cluster-based oversampling methods, SOI-C and SOI-CJ. The proposed methods create clusters from the minority-class instances and generate synthetic instances inside those clusters. In contrast with other oversampling methods, the proposed approaches avoid creating new instances in majority-class regions and are more robust to noisy examples, since the number of new instances generated per cluster is proportional to the cluster's size. The clusters are generated automatically, so our new methods need no tuning parameters, and they handle both numerical and nominal attributes. The two methods were tested on twenty artificial datasets and twenty-three datasets from the UCI Machine Learning repository. For our experiments, we used six classifiers and evaluated results with recall, precision, F-measure, and AUC, which are more suitable for class-imbalanced datasets. ANOVA and paired t-tests show that the proposed methods are competitive and in many cases significantly better than the other oversampling methods used in the comparison.
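The core generation step, interpolating only between minority instances that share a cluster, can be sketched as follows. The per-cluster quota and all names here are illustrative assumptions, and the jittering variant (SOI-CJ) is not shown.

```python
import numpy as np

def oversample_within_clusters(X_min, cluster_labels, n_new, seed=0):
    """Create n_new synthetic minority instances by interpolating between
    random pairs drawn from the same cluster, so new points stay inside
    minority clusters rather than drifting into majority regions."""
    rng = np.random.default_rng(seed)
    clusters, counts = np.unique(cluster_labels, return_counts=True)
    quota = np.round(n_new * counts / counts.sum()).astype(int)  # per-cluster share
    synthetic = []
    for c, q in zip(clusters, quota):
        members = X_min[cluster_labels == c]
        if len(members) < 2 or q == 0:
            continue  # singleton clusters have no interpolation partners
        a = members[rng.integers(len(members), size=q)]
        b = members[rng.integers(len(members), size=q)]
        gap = rng.random((q, 1))
        synthetic.append(a + gap * (b - a))
    return np.vstack(synthetic) if synthetic else np.empty((0, X_min.shape[1]))
```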


2019, Vol 20 (S25)
Author(s): Yongqing Zhang, Shaojie Qiao, Rongzhao Lu, Nan Han, Dingxiang Liu, ...

Background: Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, data imbalance leads to underestimating performance on the minority class of positive samples, so balancing bioinformatics data is a challenging and difficult problem. Results: In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be effectively applied when negative samples greatly dominate positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion built on the Pearson correlation coefficient (MMPCC), which chooses pseudo-negative samples from the negative samples and treats them as positive samples. In addition, MMPCC uses an incremental search technique to select optimal pseudo-negative samples and reduce the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to the positive samples and low redundancy with the negative ones. Conclusions: To validate the method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. The results show that MMPCC outperforms other sampling methods in terms of Sensitivity, Specificity, Accuracy, and the Matthews Correlation Coefficient, indicating that pseudo-negative samples are particularly helpful for the imbalanced-dataset problem. Moreover, the Sensitivity gain on the minority class from pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
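The incremental search can be pictured as a greedy mRMR-style loop: at each step, add the negative sample with the highest relevance to the positives minus its redundancy with what was already selected. The sketch below is a generic version of that idea with |Pearson correlation| as the similarity; the exact relevance and redundancy terms in MMPCC may differ.

```python
import numpy as np

def pick_pseudo_negatives(X_pos, X_neg, n_select):
    """Greedy max-relevance min-redundancy selection of pseudo-negatives,
    scored with |Pearson correlation| between feature vectors (a sketch,
    not the paper's exact criterion)."""
    def abs_corr(a, B):
        a0 = a - a.mean()
        B0 = B - B.mean(axis=1, keepdims=True)
        num = B0 @ a0
        den = np.linalg.norm(a0) * np.linalg.norm(B0, axis=1) + 1e-12
        return np.abs(num / den)

    relevance = np.array([abs_corr(x, X_pos).mean() for x in X_neg])
    chosen = [int(np.argmax(relevance))]
    while len(chosen) < n_select:
        redundancy = np.array([abs_corr(x, X_neg[chosen]).mean() for x in X_neg])
        score = relevance - redundancy
        score[chosen] = -np.inf            # never re-pick a selected sample
        chosen.append(int(np.argmax(score)))
    return X_neg[chosen]
```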


2019
Author(s): Chin Lin, Yu-Sheng Lou, Dung-Jang Tsai, Chia-Cheng Lee, Chia-Jung Hsu, ...

BACKGROUND Most current state-of-the-art models for searching International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of the initial word embeddings. Word embeddings trained on electronic health records (EHRs) are considered the best, but their vocabulary diversity is limited by previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical-terminology understanding of EHRs. Moreover, we need to consider the particularity of disease classification: discharge notes present only positive disease descriptions. OBJECTIVE We aimed to propose a projection word2vec model and a hybrid sampling method, and to conduct a series of experiments to validate their effectiveness. METHODS We compared the projection word2vec model and the traditional word2vec model using two corpus sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify three-character-level ICD-10-CM diagnostic codes in a set of discharge notes. Building on the embedding improvements, we also applied the hybrid sampling method to improve accuracy. The 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used, and 74,324 additional discharge notes collected from seven other hospitals were also tested. The F-measure, the major global measure of effectiveness, was adopted. RESULTS In medical semantic understanding, the original EHR embeddings and PubMed embeddings performed better than the original Wikipedia embeddings. After projection training was applied, the projection Wikipedia embeddings improved markedly but did not reach the level of the original EHR or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model using both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method further improved model performance (F-measure=0.7371/0.6698). CONCLUSIONS Word embeddings trained on EHRs and PubMed captured medical semantics better, and the proposed projection word2vec model improved the medical-semantics extraction of Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert.
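The abstract does not spell out the projection model. One standard way to pull a general-purpose embedding space toward an EHR-trained space is a least-squares linear map fitted on the vocabulary the two spaces share; the sketch below shows that generic alignment technique, which may differ from the paper's projection word2vec model.

```python
import numpy as np

def fit_projection(W_src, W_tgt):
    """Least-squares linear map taking source embeddings (e.g. Wikipedia)
    into a target space (e.g. EHR), fitted on rows aligned by shared word.
    W_src, W_tgt: (n_shared_words, dim) matrices."""
    M, *_ = np.linalg.lstsq(W_src, W_tgt, rcond=None)
    return M  # project any source vector v with v @ M
```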


Author(s): Hartono Hartono, Erianto Ongko

Class imbalance is one of the main problems in classification because the number of samples in the majority class far exceeds the number in the minority class. The class-imbalance problem in multi-class datasets is much harder to handle than in two-class datasets, and it becomes even more complicated when accompanied by overlapping. One method that has proven reliable for this problem is Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI), a hybrid approach that combines sampling and classifier ensembles; such combinations tend to give better results in terms of diversity among classifiers. HAR-MI delivers excellent results in handling multi-class imbalance and uses SMOTE to increase the number of samples in the minority class. However, SMOTE has a weakness: on an extremely imbalanced dataset with a large number of attributes, it can over-fit. To overcome over-fitting, a Hybrid Sampling method is proposed. Combining HAR-MI with Hybrid Sampling increases the number of samples in the minority class and at the same time reduces the number of noise samples in the majority class. The preprocessing stage of HAR-MI uses the Minimizing Overlapping Selection under Hybrid Sampling (MOSHS) method, and the processing stage uses Different Contribution Sampling. The results are compared with those of neighbourhood-based undersampling. Overlapping and classifier performance are measured using the Augmented R-Value, the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. The results show that HAR-MI with Hybrid Sampling gives better results in terms of Augmented R-Value, Precision, Recall, and F-Value.
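For reference, the SMOTE step that HAR-MI builds on (and whose over-fitting the Hybrid Sampling variant targets) generates each synthetic point on the segment between a minority sample and one of its nearest minority neighbors. A minimal sketch of standard SMOTE, not of the HAR-MI pipeline itself:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE: interpolate between a random minority sample and a
    random one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                    # column 0 is the point itself
    base = rng.integers(len(X_min), size=n_new)      # random seed samples
    nbr = idx[base, rng.integers(1, k + 1, size=n_new)]  # random neighbor, not self
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

On an extremely imbalanced, high-dimensional dataset these interpolated points can crowd the feature space, which is the over-fitting the Hybrid Sampling step addresses.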


Author(s): T. Maruthi Padmaja, Raju S. Bapi, P. Radha Krishna

Predicting minority-class sequence patterns from noisy and unbalanced sequential datasets is a challenging task. To solve this problem, we propose a new approach combining extreme outlier elimination with a hybrid sampling technique. We use the k Reverse Nearest Neighbors (kRNN) concept as a data-cleaning method for eliminating extreme outliers in minority regions. The hybrid sampling technique, a combination of SMOTE to oversample the minority-class sequences and random undersampling to undersample the majority-class sequences, is used to improve minority-class prediction. The method was evaluated in terms of minority-class precision, recall, and F-measure on a synthetically simulated, highly overlapped sequential dataset named Hill-Valley. We conducted the experiments with a k-Nearest Neighbour classifier and compared the performance of our approach against a simple hybrid sampling technique. Results indicate that our approach does not sacrifice one class in favor of the other, but produces high predictions for both fraud and non-fraud classes.
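The kRNN cleaning rule can be sketched as: a minority sample whose reverse k-nearest-neighbor set is empty (no other sample counts it among its k nearest) is treated as an extreme outlier and removed. The threshold and names below are assumptions; the paper's exact criterion may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_extreme_outliers(X_min, k=3):
    """Drop minority samples that appear in no other sample's k-nearest-
    neighbor list, i.e. samples with an empty reverse-kNN set."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    rnn_count = np.bincount(idx[:, 1:].ravel(), minlength=len(X_min))
    return X_min[rnn_count > 0]
```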


2011, Vol 271-273, pp. 1291-1296
Author(s): Jin Wei Zhang, Hui Juan Lu, Wu Tao Chen, Yi Lu

A classifier built from a highly skewed class distribution generally predicts an unknown sample as the majority class much more often than the minority class, because classifiers are designed to maximize overall classification accuracy. We compare three methods for handling datasets with imbalanced class distributions and non-uniform misclassification costs: a cost-sensitive learning method whose misclassification cost is embedded in the algorithm, an over-sampling method, and an under-sampling method. In this paper, we compare these three methods to determine which produces the best overall classification under various circumstances. We reach the following conclusions: (1) cost-sensitive learning is suitable for classifying imbalanced datasets; it outperforms the sampling methods overall and is more stable than them, except when the dataset is quite small; (2) if the dataset is highly skewed or quite small, over-sampling methods may be better.
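In scikit-learn, the cost-sensitive option amounts to weighting classes inside the learner rather than resampling the data. A hedged illustration with an arbitrary 10:1 cost ratio, not a setting from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 90/10 imbalanced data; the 10:1 class weights make a minority
# error count ten times as much as a majority error during training.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X, y)
```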


2019, Vol 70 (8), pp. 1178
Author(s): Kirsten Work, Coramarie Jifu Jennings

Traditional fish-sampling methods may be problematic because of public use or safety concerns. In this study, we compared one common sampling method with video assessment of fish abundance and diversity in three springs that differed in water clarity and structure. At each of four or five sites per spring, we placed one GoPro camera on each bank for 12 min and followed the filming with seine sampling. On the video, we counted the maximum number of individuals of each species observed within one frame (MaxN) and summed these counts across species to estimate fish abundance (SumMaxN). We then compared abundance (SumMaxN), species richness, and diversity between seine and video samples across all three springs. Video produced higher estimates of abundance, species richness, and diversity than seine sampling. However, this effect was largely confined to richness and diversity differences in the structurally complex spring; differences were subtle or non-existent in the low-structure spring and the turbid spring. In all three springs, video captured relatively more centrarchids, which were captured only rarely in seine samples. Therefore, video sampling performed as well as or better than seine sampling for fish-assemblage assessment in these clear springs.
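MaxN and SumMaxN are simple aggregations over per-frame counts; a toy illustration with made-up numbers:

```python
# species -> individuals counted in each sampled video frame (fabricated data)
frame_counts = {
    "bluegill":        [0, 3, 5, 2],
    "largemouth bass": [1, 1, 0, 2],
}
max_n = {sp: max(counts) for sp, counts in frame_counts.items()}  # MaxN per species
sum_max_n = sum(max_n.values())  # SumMaxN = 5 + 2 = 7
```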


2021, Vol 31 (2)
Author(s): Topi Paananen, Juho Piironen, Paul-Christian Bürkner, Aki Vehtari

Adaptive importance sampling is a class of techniques for finding good proposal distributions for importance sampling. Often the proposals are standard probability distributions whose parameters are adapted based on the mismatch between the current proposal and a target distribution. In this work, we present an implicit adaptive importance sampling method that applies to complicated target distributions not available in closed form. The method iteratively matches the moments of a set of Monte Carlo draws to weighted moments based on the importance weights. We apply the method to Bayesian leave-one-out cross-validation and show that it performs better than many existing parametric adaptive importance sampling methods while being computationally inexpensive.
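A parametric cousin of the method makes the moment-matching loop concrete: draw from a Gaussian proposal, compute importance weights against the target, and move the proposal's mean and covariance to the weighted moments. The implicit method in the paper instead transforms the draws themselves, so the sketch below is illustrative rather than the paper's algorithm; log_target is assumed to evaluate the unnormalized target log-density row-wise.

```python
import numpy as np

def adapt_gaussian_proposal(log_target, mu, cov, n_draws=2000, iters=5, seed=0):
    """Adaptive importance sampling by moment matching with a Gaussian
    proposal (a parametric sketch of the idea described above)."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        draws = rng.multivariate_normal(mu, cov, size=n_draws)
        diff = draws - mu
        # constant terms of log q cancel after self-normalization below
        log_q = -0.5 * np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        log_w = log_target(draws) - log_q
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                        # self-normalized importance weights
        mu = w @ draws                      # weighted mean
        diff = draws - mu
        cov = (w[:, None] * diff).T @ diff  # weighted covariance
    return mu, cov
```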


Healthcare, 2020, Vol 8 (3), pp. 234
Author(s): Hyun Yoo, Soyoung Han, Kyungyong Chung

Recently, massive amounts of biometric big data have been collected by sensor-based IoT devices, and the collected data are classified into different types of health big data by various techniques. Personalized analysis is the basis for judging the risk factors of personal cardiovascular disorders in real time. The objective of this paper is to provide a model for personalized heart-condition classification that combines a fast, effective preprocessing technique with a deep neural network in order to process real-time accumulated biosensor input data. The model learns the input data, develops an approximation function, and can help users recognize risk situations. For the analysis of pulse frequency, a fast Fourier transform is applied in preprocessing, and data reduction is performed using the frequency-by-frequency ratios of the extracted power spectrum. To analyze the preprocessed data, a neural network algorithm is applied; in particular, a deep neural network, which stacks multiple layers of nodes and is trained with gradient descent, is used to analyze and evaluate the linear data. The completed model was trained by classifying previously collected ECG signals into normal, control, and noise groups; thereafter, ECG signals input in real time through the trained network were classified into the same three groups. To evaluate the proposed model, this study used the data-operation cost-reduction ratio and the F-measure. With the fast Fourier transform and cumulative frequency percentages, the ECG data were reduced at a ratio of 1:32, and according to the F-measure analysis, the deep neural network achieved 83.83% accuracy. Given these results, the modified deep neural network technique can reduce the size of big data in terms of computing work, and it is an effective system for reducing operation time.
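The preprocessing step, an FFT followed by per-band power ratios, can be sketched in a few lines. The band count of 32 matches the reported 1:32 reduction only loosely; windowing and band layout here are illustrative assumptions.

```python
import numpy as np

def reduce_ecg(signal, n_bands=32):
    """Compress an ECG window to n_bands frequency-ratio features:
    FFT -> power spectrum -> per-band power, normalized to ratios."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(power, n_bands)   # contiguous frequency bands
    band_power = np.array([b.sum() for b in bands])
    return band_power / band_power.sum()
```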


Author(s): Yue Liu, Hongyan Bai

With the arrival of the big data era and the opening of translation majors in colleges and universities, translation teaching is gradually receiving attention. However, there are still many problems in the training of translators in colleges and universities in terms of teachers, teaching time, and teaching mode. In the context of the big data era, this article uses questionnaires and data analysis, starting from the PACTE translation-competence model and combining constructivist learning theory, blended learning theory, and instructional design theory, to analyze the problems of undergraduate translation ability. The article surveys the 2018 cohort of students in a major at XX University and analyzes their English scores. The students' bilingual ability is weak, and they find it difficult to account for context when translating; their strategic competence is not ideal, and they lack the ability to solve specific translation problems. The English performance of the experimental-class students who received one semester of English translation teaching is significantly better than that of the control-class students who did not. Teachers can combine teaching theories to design English translation instruction and cultivate students' awareness of comparative analysis in English learning, thereby developing students' English thinking ability, helping them master English better, and improving their ability to apply English.

