Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study

High dimensionality and class imbalance are widely recognized as important issues in machine learning. A vast literature has investigated approaches to the multiple challenges that arise in high-dimensional feature spaces (where each problem instance is described by a large number of features). Likewise, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impair the generalization ability of the induced models. Nevertheless, although both issues have been studied for many years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has so far investigated which approaches are best suited to datasets that are both high-dimensional and class-imbalanced. To contribute in this direction, our work presents a comparative study of learning strategies that combine feature selection, to cope with high dimensionality, with cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. Different feature selection heuristics, both univariate and multivariate, have also been considered, to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain and give insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
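The abstract does not say which ways of incorporating misclassification costs were compared. As a minimal sketch of one common option, the snippet below shifts the decision threshold of a probabilistic binary classifier according to the costs of false positives and false negatives. The function names, the cost values, and the example probabilities are all hypothetical, and correct predictions are assumed to cost nothing.

```python
def cost_sensitive_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimizes expected cost for a binary task.

    Predicting positive costs (1 - p) * cost_fp in expectation and predicting
    negative costs p * cost_fn, so positive is the cheaper choice whenever
    p >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def predict_with_costs(probabilities, cost_fp, cost_fn):
    """Label each instance 1 (minority) or 0 using the cost-based threshold."""
    threshold = cost_sensitive_threshold(cost_fp, cost_fn)
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical minority-class probabilities from any probabilistic classifier.
probs = [0.05, 0.2, 0.4, 0.7]

# Equal costs reproduce the usual 0.5 threshold.
default_labels = predict_with_costs(probs, cost_fp=1.0, cost_fn=1.0)

# Making a missed minority instance nine times as costly lowers the threshold.
costly_miss_labels = predict_with_costs(probs, cost_fp=1.0, cost_fn=9.0)
```

Raising the false-negative cost moves more borderline instances into the minority class, which is the intended effect on skewed data. Choosing the cost values themselves is a separate problem that this sketch does not address.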

Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests

Information ◽ 10.3390/info12080286 ◽ 2021 ◽ Vol 12 (8) ◽ pp. 286

Author(s): Barbara Pes

Keyword(s): Feature Selection ◽ Learning Community ◽ Learning Strategies ◽ Hybrid Approach ◽ Real Life ◽ Class Imbalance ◽ Research Area ◽ High Dimensional ◽ Imbalance Learning ◽ Feature Selection Techniques

Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has thus far been conducted on which approaches are best suited to datasets that are class-imbalanced and high-dimensional (i.e., with a large number of features) at the same time. This work attempts to contribute to this challenging research area by studying the effectiveness of hybrid learning strategies that integrate feature selection techniques, to reduce the data dimensionality, with methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach compared to using feature selection or imbalance learning methods alone.
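One simple cost-sensitive device that could play the imbalance-handling role in such a hybrid strategy is to weight each class inversely to its frequency. The sketch below is an assumption for illustration, not the configuration used in the paper: it computes the common "balanced" weights, n_samples / (n_classes * class_count), and shows how they change the error of a trivial majority-class predictor. The helper names and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Fraction of the total instance weight that is misclassified."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Toy labels with a 9:1 imbalance (hypothetical data).
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)

# A trivial model that always predicts the majority class.
always_majority = [0] * len(y)
plain_error = sum(t != p for t, p in zip(y, always_majority)) / len(y)
cost_error = weighted_error(y, always_majority, weights)
```

Under plain error the majority-class predictor looks acceptable because it is wrong on only the minority tenth of the data. Under the weighted error both classes carry equal total weight, so ignoring the minority class costs half of it.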

Research of Medical High-Dimensional Imbalanced Data Classification Ensemble Feature Selection Algorithm with Random Forest

2017 International Conference on Smart Grid and Electrical Automation (ICSGEA) ◽ 10.1109/icsgea.2017.158 ◽ 2017 ◽ Cited By ~ 2

Author(s): Min Zhu ◽ Bo Su ◽ Gangmin Ning

Keyword(s): Feature Selection ◽ Random Forest ◽ Imbalanced Data ◽ Data Classification ◽ High Dimensional ◽ Selection Algorithm ◽ Feature Selection Algorithm ◽ Imbalanced Data Classification

Clustering-guided particle swarm feature selection algorithm for high-dimensional imbalanced data with missing values

IEEE Transactions on Evolutionary Computation ◽ 10.1109/tevc.2021.3106975 ◽ 2021 ◽ pp. 1-1

Author(s): Yong Zhang ◽ Yan-hu Wang ◽ Dun-wei Gong ◽ Xiao-yan Sun

Keyword(s): Feature Selection ◽ Missing Values ◽ Particle Swarm ◽ Imbalanced Data ◽ High Dimensional ◽ Selection Algorithm ◽ Feature Selection Algorithm

A Measure Optimized Cost-Sensitive Learning Framework for Imbalanced Data Classification

Advances in Data Mining and Database Management - Biologically-Inspired Techniques for Knowledge Discovery and Data Mining ◽ 10.4018/978-1-4666-6078-6.ch003 ◽ 2014 ◽ pp. 48-75 ◽ Cited By ~ 2

Author(s): Peng Cao ◽ Osmar Zaiane ◽ Dazhe Zhao

Keyword(s): Real World ◽ Class Imbalance ◽ Imbalanced Data ◽ Support Vector ◽ Feature Subset ◽ Cost Sensitive Learning ◽ Intrinsic Parameters ◽ Real World Problem ◽ Benchmark Datasets ◽ Feed Forward Neural Networks

Class imbalance is one of the challenging problems for machine learning in many real-world applications. Many methods have been proposed to address it, including sampling and cost-sensitive learning. The latter has attracted significant attention in recent years, but it is difficult to determine precise misclassification costs in practice. Other factors also influence classification performance, including the input feature subset and the intrinsic parameters of the classifier. This chapter presents an effective wrapper framework that incorporates the evaluation measures (AUC and G-mean) directly into the objective function of cost-sensitive learning, improving classification performance by simultaneously optimizing the feature subset, the intrinsic parameters, and the misclassification cost parameter. The optimization is based on Particle Swarm Optimization (PSO). The authors use two common methods, support vector machines and feed-forward neural networks, to evaluate the proposed framework. Experimental results on various standard benchmark datasets with different imbalance ratios and on a real-world problem show that the proposed method is effective in comparison with commonly used sampling techniques.
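To make the idea of using the evaluation measure as the objective concrete, the sketch below computes the G-mean and picks a misclassification cost ratio that maximizes it on validation data. It is deliberately simplified: an exhaustive search over a few candidate ratios is swapped in for the chapter's PSO, only the cost parameter is tuned (not the feature subset or classifier parameters), and the function names, candidate ratios, and validation scores are hypothetical.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return math.sqrt((tp / pos) * (tn / neg))


def best_cost_ratio(y_true, scores, candidate_ratios):
    """Return the (ratio, G-mean) pair with the highest G-mean.

    A false-negative/false-positive cost ratio r corresponds to the
    decision threshold 1 / (1 + r) on the minority-class score.
    """
    best = None
    for ratio in candidate_ratios:
        threshold = 1.0 / (1.0 + ratio)
        preds = [1 if s >= threshold else 0 for s in scores]
        score = g_mean(y_true, preds)
        if best is None or score > best[1]:
            best = (ratio, score)
    return best


# Hypothetical validation labels and minority-class scores.
y_val = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.12, 0.15, 0.22, 0.30, 0.45, 0.35, 0.60]
ratio, score = best_cost_ratio(y_val, scores, [1, 2, 4, 9])
```

Because the G-mean is zero whenever one class is entirely misclassified, it penalizes both a cost that is too low (minority instances missed) and one that is too high (majority instances flooded into the minority class).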

Ranking Based Unsupervised Feature Selection Methods: An Empirical Comparative Study in High Dimensional Datasets

Advances in Soft Computing - Lecture Notes in Computer Science ◽ 10.1007/978-3-030-04491-6_16 ◽ 2018 ◽ pp. 205-218

Author(s): Saúl Solorio-Fernández ◽ J. Ariel Carrasco-Ochoa ◽ José Fco. Martínez-Trinidad

Keyword(s): Feature Selection ◽ Comparative Study ◽ High Dimensional ◽ Selection Methods ◽ Unsupervised Feature Selection ◽ High Dimensional Datasets

Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm

Genes ◽ 10.3390/genes11070717 ◽ 2020 ◽ Vol 11 (7) ◽ pp. 717

Author(s): Garba Abdulrauf Sharifai ◽ Zurinahni Zainol

Keyword(s): Feature Selection ◽ Optimization Algorithm ◽ Imbalanced Data ◽ High Dimensional ◽ Data Sets ◽ Biomedical Data ◽ Data Set ◽ Grasshopper Optimization Algorithm ◽ Imbalanced Class ◽ Grasshopper Optimization

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding when samples are limited and the number of features is massive (high dimensionality). High-dimensional, imbalanced data sets pose severe challenges in many real-world applications, such as those involving biomedical data sets. Numerous researchers have investigated either class imbalance or high dimensionality and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the two problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent both the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA), which employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper Optimization Algorithm (BGOA) is used to cast feature selection as an optimization problem and select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve classification performance on high-dimensional, imbalanced datasets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.

LICIC: Less Important Components for Imbalanced Multiclass Classification

Information ◽ 10.3390/info9120317 ◽ 2018 ◽ Vol 9 (12) ◽ pp. 317 ◽ Cited By ~ 5

Author(s): Vincenzo Dentamaro ◽ Donato Impedovo ◽ Giuseppe Pirlo

Keyword(s): Gene Expression ◽ Class Imbalance ◽ Imbalanced Data ◽ Multiclass Classification ◽ Cancer Diagnostics ◽ Mass Spectrometry Data ◽ High Dimensional ◽ Or Gene ◽ High Dimensional Datasets

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. In this study, a new oversampling technique called LICIC is presented as a valuable instrument for countering both class imbalance and the well-known "curse of dimensionality" problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. The method is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
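For readers unfamiliar with the baselines named above, the following is a simplified sketch of the interpolation step at the core of SMOTE-style oversampling. It is not LICIC, whose details the abstract does not give. Each synthetic point is placed at a random position on the segment between a minority instance and one of its nearest minority neighbors. The function name, the value of `k`, and the toy data are hypothetical.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Hypothetical two-dimensional minority-class instances.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
new_points = smote_like(minority, n_new=6)
```

Every synthetic point lies between two existing minority instances, so the method cannot generate anything outside the region the minority class already occupies. In very high-dimensional spaces nearest neighbors become less meaningful, which is one reason such interpolation may struggle there.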

Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data

BMC Bioinformatics ◽ 10.1186/s12859-020-3411-3 ◽ 2020 ◽ Vol 21 (1) ◽ Cited By ~ 2

Author(s): Guang-Hui Fu ◽ Yuan-Jiao Wu ◽ Min-Jie Zong ◽ Jianxin Pan

Keyword(s): Feature Selection ◽ Imbalanced Data ◽ Hellinger Distance ◽ High Dimensional ◽ Selection For ◽ Sparse Feature Selection

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽ 10.1142/s0218194014400105 ◽ 2014 ◽ Vol 24 (09) ◽ pp. 1229-1253 ◽ Cited By ~ 3

Author(s): Kehan Gao ◽ Taghi M. Khoshgoftaar ◽ Amri Napolitano

Keyword(s): Feature Selection ◽ Prediction Models ◽ Measurement Data ◽ Class Imbalance ◽ Data Preprocessing ◽ High Dimensionality ◽ Training Dataset ◽ Defect Prediction ◽ Software Defect Prediction ◽ Software Defect

Software defect prediction models, which use software metrics such as code-level measurements together with defect data to build classification models, are useful tools for identifying potentially problematic program modules. How effectively such modules are detected depends on the software measurements used, making data preprocessing an important step in software quality prediction. Generally, two problems affect software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods in the technique, Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE), we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique that do not incorporate feature selection, and compare all four techniques (the two ensemble-based approaches that use feature selection and the two versions that use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection yield better predictions than the techniques without it.
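The Random Undersampling step mentioned above can be sketched in a few lines. This is only the sampling component, shown in isolation under the assumption that classes are reduced to the size of the smallest one. It does not reproduce selectRUSBoost, which combines sampling with boosting and feature selection. The function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Drop instances at random until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = min(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        # Sample without replacement so no instance is duplicated.
        for row in rng.sample(rows, target):
            X_bal.append(row)
            y_bal.append(label)
    return X_bal, y_bal


# Hypothetical module metrics: 4 defective (label 1) and 16 clean (label 0).
X = [[i, i % 3] for i in range(20)]
y = [1 if i < 4 else 0 for i in range(20)]
X_bal, y_bal = random_undersample(X, y)
class_counts = Counter(y_bal)
```

Undersampling discards majority-class information, which is one motivation for embedding it in an ensemble: different members can see different random subsets of the majority class.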

SU-CCE: A Novel Feature Selection Approach for Reducing High Dimensionality

10.3233/apc210196 ◽ 2021

Author(s): A B Pawar ◽ M A Jawale ◽ Ravi Kumar Tirandasu ◽ Saiprasad Potharaju

Keyword(s): Feature Selection ◽ Classification Accuracy ◽ Feature Space ◽ Microarray Dataset ◽ Classification Model ◽ High Dimensionality ◽ High Dimensional ◽ Selection Approach ◽ Feature Selection Approach ◽ Careful Investigation

High dimensionality is a serious issue in the preprocessing stage of data mining. A large number of features in a dataset leads to several complications when classifying an unknown instance. The initial data space may contain redundant and irrelevant features, which lead to high memory consumption and confuse the learning model built from them. It is always advisable to select the best features before generating the classification model, for better accuracy. In this research, we propose a novel feature selection approach, Symmetrical Uncertainty and Correlation Coefficient (SU-CCE), for reducing the high-dimensional feature space and increasing classification accuracy. The experiment is performed on a colon cancer microarray dataset with 2000 features, from which the proposed method derived the 38 best features. To measure the strength of the proposed method, the top 38 features extracted by four traditional filter-based methods are compared using various classifiers. After careful investigation of the results, the proposed approach is found to be competitive with most of the traditional methods.
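The abstract does not describe how SU-CCE combines its two ingredients, so the sketch below covers only the standard definition of symmetrical uncertainty for discrete variables, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where I is mutual information and H is entropy. Continuous microarray values would first need to be discretized, which is an assumption here and not something the abstract states. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


# Toy class labels and two discretized features (hypothetical data).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]  # mirrors the labels exactly
noise = [0, 1, 0, 1, 0, 1, 0, 1]        # independent of the labels

su_informative = symmetrical_uncertainty(informative, labels)
su_noise = symmetrical_uncertainty(noise, labels)
```

A filter built on this score would rank features by their SU with the class label and keep the highest-ranked ones. SU between pairs of features can likewise flag redundancy.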
