probability of overfitting Latest Research Papers

RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets

Electronics ◽

10.3390/electronics11020228 ◽

2022 ◽

Vol 11 (2) ◽

pp. 228

Author(s):

Ahmad B. Hassanat ◽

Ahmad S. Tarawneh ◽

Samer Subhi Abed ◽

Ghada Awad Altarawneh ◽

Malek Alrashidi ◽

...

Keyword(s):

Machine Learning ◽

Linear Time ◽

Class Imbalance ◽

Data Partitioning ◽

Majority Voting ◽

Random Data ◽

Imbalanced Datasets ◽

Resampling Methods ◽

Voting Rule ◽

Probability Of Overfitting

Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.

MULTILABEL OVER-SAMPLING AND UNDER-SAMPLING WITH CLASS ALIGNMENT FOR IMBALANCED MULTILABEL TEXT CLASSIFICATION

Journal of Information and Communication Technology ◽

10.32890/jict2021.20.3.6 ◽

2021 ◽

Vol 20 (Number 3) ◽

pp. 423-456

Author(s):

Adil Yaseen Taha ◽

Sabrina Tiun ◽

Abdul Hadi Abd Rahman ◽

Ali Sabah

Keyword(s):

Text Classification ◽

Training Set ◽

Average Precision ◽

Size Classes ◽

Resampling Method ◽

Benchmark Datasets ◽

Under Sampling ◽

Probability Of Overfitting ◽

Multiple Labelling ◽

Better Than

Simultaneous multiple labelling of documents, also known as multilabel text classification, will not perform optimally if the class is highly imbalanced. Class imbalanced entails skewness in the fundamental data for distribution that leads to more difficulty in classification. Random over-sampling and under-sampling are common approaches to solve the class imbalanced problem. However, these approaches have several drawbacks; the under-sampling is likely to dispose of useful data, whereas the over-sampling can heighten the probability of overfitting. Therefore, a new method that can avoid discarding useful data and overfitting problems is needed. This study proposes a method to tackle the class imbalanced problem by combining multilabel over-sampling and under-sampling with class alignment (ML-OUSCA). In the proposed ML-OUSCA, instead of using all the training instances, it draws a new training set by over-sampling small size classes and under-sampling big size classes. To evaluate our proposed ML-OUSCA, evaluation metrics of average precision, average recall and average F-measure on three benchmark datasets, namely, Reuters-21578, Bibtex, and Enron datasets, were performed. Experimental results showed that the proposed ML-OUSCA outperformed the chosen baseline random resampling approaches; K-means SMOTE and KNN-US. Thus, based on the results, we can conclude that designing a resampling method based on the class imbalanced together with class alignment will improve multilabel classification even better than just the random resampling method.

Cover-based combinatorial bounds on probability of overfitting

Doklady Mathematics ◽

10.1134/s1064562414020136 ◽

2014 ◽

Vol 89 (2) ◽

pp. 185-187

Author(s):

A. I. Frey ◽

I. O. Tolstikhin

Keyword(s):

Probability Of Overfitting

Exact estimates of the probability of overfitting for multidimensional modeling families of algorithms

Pattern Recognition and Image Analysis ◽

10.1134/s1054661811010032 ◽

2011 ◽

Vol 21 (1) ◽

pp. 52-65 ◽

Cited By ~ 1

Author(s):

P. V. Botov

Keyword(s):

Multidimensional Modeling ◽

Probability Of Overfitting

Exact combinatorial bounds on the probability of overfitting for empirical risk minimization

Pattern Recognition and Image Analysis ◽

10.1134/s105466181003003x ◽

2010 ◽

Vol 20 (3) ◽

pp. 269-285 ◽

Cited By ~ 9

Author(s):

K. V. Vorontsov

Keyword(s):

Empirical Risk Minimization ◽

Risk Minimization ◽

Empirical Risk ◽

Probability Of Overfitting

Tight bounds for the probability of overfitting

Doklady Mathematics ◽

10.1134/s1064562409060032 ◽

2009 ◽

Vol 80 (3) ◽

pp. 793-796 ◽

Cited By ~ 6

Author(s):

K. V. Vorontsov

Keyword(s):

Probability Of Overfitting

Splitting and similarity phenomena in the sets of classifiers and their effect on the probability of overfitting

Pattern Recognition and Image Analysis ◽

10.1134/s1054661809030055 ◽

2009 ◽

Vol 19 (3) ◽

pp. 412-420 ◽

Cited By ~ 10

Author(s):

K. V. Vorontsov

Keyword(s):

Probability Of Overfitting

probability of overfitting
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets

MULTILABEL OVER-SAMPLING AND UNDER-SAMPLING WITH CLASS ALIGNMENT FOR IMBALANCED MULTILABEL TEXT CLASSIFICATION

Cover-based combinatorial bounds on probability of overfitting

Exact estimates of the probability of overfitting for multidimensional modeling families of algorithms

Exact combinatorial bounds on the probability of overfitting for empirical risk minimization

Tight bounds for the probability of overfitting

Splitting and similarity phenomena in the sets of classifiers and their effect on the probability of overfitting

Export Citation Format

probability of overfittingRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets

MULTILABEL OVER-SAMPLING AND UNDER-SAMPLING WITH CLASS ALIGNMENT FOR IMBALANCED MULTILABEL TEXT CLASSIFICATION

Cover-based combinatorial bounds on probability of overfitting

Exact estimates of the probability of overfitting for multidimensional modeling families of algorithms

Exact combinatorial bounds on the probability of overfitting for empirical risk minimization

Tight bounds for the probability of overfitting

Splitting and similarity phenomena in the sets of classifiers and their effect on the probability of overfitting

probability of overfitting
Recently Published Documents