Imbalanced Class Handling and Classification on Educational Dataset

In data mining, class imbalance is a challenging problem. This is probably because machine learning algorithms are built on the assumption that the number of instances in each class is balanced, so when the classes are imbalanced the prediction results may be unreliable. Solutions offered for class imbalance include oversampling, undersampling, and the synthetic minority oversampling technique (SMOTE). Both oversampling and undersampling have their disadvantages, so SMOTE is an alternative that overcomes them. Integrating SMOTE into data mining classification methods such as Naive Bayes, Support Vector Machine (SVM), and Random Forest (RF) is expected to improve accuracy. In this research, the SMOTE-balanced data gave better accuracy than the original data. Of the three classification methods used, RF gives the highest average AUC, F-measure, and G-means score.
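
The core idea of SMOTE named in the abstract above can be illustrated with a minimal sketch: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy data are assumptions made for illustration; this is not the authors' code or any library's implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class with four samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. In practice the synthetic points would be added to the training split only, before fitting a classifier such as Naive Bayes, SVM, or RF.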

A Deep Analysis of the Precision Formula for Imbalanced Class Distribution

International Journal of Machine Learning and Computing ◽

10.7763/ijmlc.2014.v4.447 ◽

2014 ◽

Vol 4 (5) ◽

pp. 417-422 ◽

Cited By ~ 3

Author(s):

Gabriel Kofi Armah ◽

Guangchun Luo ◽

Ke Qin

Keyword(s):

Class Distribution ◽

Imbalanced Class ◽

Imbalanced Class Distribution

Classification of Questions Based on Difficulty Levels using Support Vector Machine and Naïve Bayes Algorithms for Imbalanced Class

10.1109/ic2ie53219.2021.9649149 ◽

2021 ◽

Author(s):

Danny Naufal Pratama ◽

Oktariani Nurul Pratiwi ◽

Edi Sutoyo

Keyword(s):

Support Vector Machine ◽

Naive Bayes ◽

Support Vector ◽

Imbalanced Class

Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm

Genes ◽

10.3390/genes11070717 ◽

2020 ◽

Vol 11 (7) ◽

pp. 717

Author(s):

Garba Abdulrauf Sharifai ◽

Zurinahni Zainol

Keyword(s):

Feature Selection ◽

Optimization Algorithm ◽

Imbalanced Data ◽

High Dimensional ◽

Data Sets ◽

Biomedical Data ◽

Data Set ◽

Grasshopper Optimization Algorithm ◽

Imbalanced Class ◽

Grasshopper Optimization

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding when samples are limited but the number of features is massive (high dimensionality). High dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced class or high dimensional data sets and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the high dimensional and imbalanced class problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA), which employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper Optimization Algorithm (BGOA) is used to formulate feature selection as an optimization problem and to select the best (near-optimal) combination of features from the majority and minority classes. The results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve classification performance for high dimensional and imbalanced data sets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
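
The two metrics named above, G-mean and AUC, are preferred over plain accuracy on imbalanced data because they do not reward a classifier for ignoring the minority class. The sketch below computes both for a binary problem; the function names and toy labels are assumptions for illustration and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # assumes at least one positive example
    specificity = tn / (tn + fp)  # assumes at least one negative example
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 2 minority (1) and 8 majority (0) examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.1]

print(g_mean(y_true, always_majority))  # 0.0, although accuracy is 0.8
print(auc(y_true, scores))              # 0.9375
```

A classifier that always predicts the majority class scores 80% accuracy on this toy set but a G-mean of zero, which is why the metric suits imbalanced evaluation.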

Hybrid resampling to handle imbalanced class on classification of student performance in classroom

2017 1st International Conference on Informatics and Computational Sciences (ICICoS) ◽

10.1109/icicos.2017.8276363 ◽

2017 ◽

Cited By ~ 1

Author(s):

Yoga Pristyanto ◽

Noor Akhmad Setiawan ◽

Igi Ardiyanto

Keyword(s):

Student Performance ◽

Imbalanced Class ◽

Hybrid Resampling

Imbalanced Class Learning in Epigenetics

Journal of Computational Biology ◽

10.1089/cmb.2014.0008 ◽

2014 ◽

Vol 21 (7) ◽

pp. 492-507 ◽

Cited By ~ 6

Author(s):

M. Muksitul Haque ◽

Michael K. Skinner ◽

Lawrence B. Holder

Keyword(s):

Imbalanced Class

Entropy-Based Classifier Enhancement to Handle Imbalanced Class Problem

Procedia Computer Science ◽

10.1016/j.procs.2017.01.176 ◽

2017 ◽

Vol 104 ◽

pp. 586-591 ◽

Cited By ~ 3

Author(s):

Arnis Kirshners ◽

Sergei Parshutin ◽

Henrihs Gorskis

Keyword(s):

Imbalanced Class

COVID-19 County Level Severity Classification with Imbalanced Class: A NearMiss Under-sampling Approach

10.1101/2021.05.21.21257603 ◽

2021 ◽

Author(s):

Timothy Oladunni ◽

Sourou Tossou ◽

Yayehyrad Haile ◽

Adonias Kidane

Keyword(s):

Class Imbalance ◽

County Level ◽

Policy Makers ◽

Ensemble Models ◽

Class A ◽

Proposed Model ◽

Under Sampling ◽

Imbalanced Class ◽

Severity Of The Disease ◽

Sampling Approach

The COVID-19 pandemic, which broke out in late 2019, has spread across the globe. The disease has infected millions of people, and thousands of lives have been lost. The momentum of the disease has been slowed by the introduction of vaccines. However, some countries are still recording a high number of casualties. The focus of this work is to design, develop, and evaluate a machine learning classifier of county-level COVID-19 severity. The proposed model classifies the severity of the disease in a county as low, moderate, or high. Policy makers will find the work useful in the distribution of vaccines. Four learning algorithms (two ensembles and two non-ensembles) were trained and evaluated. Class imbalance was addressed using NearMiss under-sampling of the majority classes. The results of our experiment show that the ensemble models outperformed the non-ensemble models by a considerable margin.
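
NearMiss under-sampling, as mentioned above, keeps only those majority samples that lie close to the minority class. The abstract does not say which NearMiss variant was used, so the sketch below assumes version 1 (keep the majority points with the smallest average distance to their k nearest minority points) on a binary toy problem. The function name, parameters, and data are hypothetical.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their k nearest minority points."""
    k = min(k, len(minority))

    def avg_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        return sum(dists[:k]) / k

    # Rank majority points from nearest to farthest and keep the first n_keep.
    ranked = sorted(majority, key=avg_dist_to_nearest_minority)
    return ranked[:n_keep]


# Hypothetical two-feature data: 2 minority points, 6 majority points.
minority = [(0.0, 0.0), (0.2, 0.1)]
majority = [(0.1, 0.2), (1.0, 1.0), (3.0, 3.0), (0.5, 0.4), (5.0, 5.0), (2.0, 2.5)]

# Shrink the majority class to the size of the minority class.
kept = near_miss_1(majority, minority, n_keep=len(minority), k=2)
print(kept)  # [(0.1, 0.2), (0.5, 0.4)]
```

For a three-class problem such as low, moderate, and high severity, one plausible extension is to apply the same selection to each majority class in turn against the smallest class.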

Explanation and Prediction of Clinical Data with Imbalanced Class Distribution based on Pattern Discovery and Disentanglement

10.21203/rs.3.rs-28409/v1 ◽

2020 ◽

Author(s):

Peiyuan Zhou ◽

Andrew K.C. Wong

Keyword(s):

Data Analysis ◽

Clinical Data ◽

Pattern Discovery ◽

Synthetic Data ◽

General Setting ◽

Clinical Practices ◽

Class Distribution ◽

Clinical Data Analysis ◽

Imbalanced Class

Abstract

Background: Statistical data analysis, especially advanced machine learning (ML) methods, has attracted considerable interest and application in clinical practice. First, the interpretability of diagnostic/prognostic results brings confidence to doctors, patients, and their relatives in therapeutics and clinical practice. Furthermore, from the clinical aspect, when the datasets are imbalanced in diagnostic categories, ordinary ML methods might produce results overwhelmed by the majority classes, diminishing prediction accuracy. Hence, it is desirable to have a method that can produce explicit, transparent, and interpretable results in decision-making, even for data with imbalanced groups.

Methods: In order to interpret clinical patterns and conduct diagnostic prediction of patients, we present our new method, Pattern Discovery and Disentanglement for Clinical Data Analysis (cPDD), which is able to discover patterns (correlated traits/indicants) and use them to classify clinical data even if the class distribution is imbalanced. In the most general setting, a relational dataset is a large table in which each column represents an attribute (trait/indicant) and each row contains a set of attribute values (AVs) of an entity (patient). Compared to existing pattern discovery approaches, cPDD can discover a small and succinct set of statistically significant high-order patterns from clinical data for interpreting and predicting the disease class of patients, even for small and rare groups.

Results: Experiments on synthetic and thoracic clinical datasets showed that cPDD can 1) discover fewer patterns than other existing pattern discovery methods; 2) allow users to interpret succinct sets of patterns coming from uncorrelated sources, even when the groups are rare/small; and 3) obtain better prediction performance than other interpretable classification approaches.

Conclusions: cPDD discovers fewer patterns with greater comprehensive coverage, improving the interpretability of the patterns discovered. Experimental results on synthetic data validated that cPDD discovers all patterns implanted in the data and displays them precisely and succinctly with statistical support for interpretation and prediction, a capability that traditional ML methods lack. The success of cPDD as a novel explainable method for solving the imbalanced class problem shows its great potential for clinical data analysis for years to come.
