A Measure Optimized Cost-Sensitive Learning Framework for Imbalanced Data Classification

Class imbalance is a challenging problem for machine learning in many real-world applications. Many methods have been proposed to address it, including sampling and cost-sensitive learning. The latter has attracted significant attention in recent years, but it is difficult to determine precise misclassification costs in practice. Other factors also influence classification performance, including the input feature subset and the intrinsic parameters of the classifier. This chapter presents an effective wrapper framework that incorporates the evaluation measure (AUC and G-mean) directly into the objective function of cost-sensitive learning, improving classification performance by simultaneously optimizing the feature subset, the intrinsic parameters, and the misclassification cost parameter. The optimization is based on Particle Swarm Optimization (PSO). The authors use two common methods, support vector machines and feed-forward neural networks, to evaluate the proposed framework. Experimental results on various standard benchmark datasets with different imbalance ratios and on a real-world problem show that the proposed method is effective in comparison with commonly used sampling techniques.

Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Mathematics ◽ 10.3390/math9090936 ◽ 2021 ◽ Vol 9 (9) ◽ pp. 936

Author(s): Jianli Shao ◽ Xin Liu ◽ Wenqing He

Keyword(s): Machine Learning ◽ Spatial Association ◽ Class Imbalance ◽ Imbalanced Data ◽ Real Data ◽ Kernel Functions ◽ Support Vector ◽ Classification Problems ◽ Rare Class ◽ Data Adaptive

Imbalanced data exist in many classification problems, and classifying them poses considerable challenges in machine learning. Among classifiers, the support vector machine (SVM) and its variants are widely used thanks to their flexibility and interpretability. However, the performance of SVMs suffers when the data are imbalanced, which is a typical data structure in multi-category classification problems. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM that considers class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. The proposed method not only outperforms the competitor methods in terms of commonly used accuracy measures such as the F-score and G-mean, but also successfully detects more than 60% of the instances from the rare class in the real data, while the competitors detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.

An Empirical Study of Boosting Methods on Severely Imbalanced Data

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.513-517.2510 ◽ 2014 ◽ Vol 513-517 ◽ pp. 2510-2513 ◽ Cited By ~ 1

Author(s): Xu Ying Liu

Keyword(s): Empirical Study ◽ Real World ◽ Class Imbalance ◽ Imbalanced Data ◽ Real World Applications ◽ Under Sampling ◽ The Difference ◽ Imbalance Learning ◽ Class Imbalance Learning ◽ F Measure

Nowadays there are large volumes of data in real-world applications, which pose a great challenge to class-imbalance learning: a large number of majority class examples and severe class-imbalance. Previous studies on class-imbalance learning mainly focused on relatively small or moderate class-imbalance. In this paper we conduct an empirical study to explore the difference between learning with small or moderate class-imbalance and learning with severe class-imbalance. The experimental results show that: (1) Traditional methods cannot handle severe class-imbalance effectively. (2) AUC, G-mean and F-measure can be very inconsistent for severe class-imbalance, which seldom happens when class-imbalance is moderate. Moreover, G-mean is not appropriate for severe class-imbalance learning because it is not sensitive to changes in the imbalance ratio. (3) When AUC and G-mean are the evaluation metrics, EasyEnsemble is the best method, followed by BalanceCascade and under-sampling. (4) For under-sampling, stopping slightly short of full balance handles severe class-imbalance better. It is also important to handle false positives when designing methods for severe class-imbalance.
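
Finding (2) can be illustrated with a small calculation. The sketch below is not taken from the paper: the confusion-matrix counts are made up, and the helper names (`rates`, `g_mean`, `f_measure`) are hypothetical. It keeps the classifier's recall and specificity fixed and changes only the imbalance ratio, which leaves G-mean unchanged while the F-measure drops sharply.

```python
import math


def rates(tp, fn, fp, tn):
    """Return (recall, specificity, precision) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, specificity, precision


def g_mean(tp, fn, fp, tn):
    """Geometric mean of recall and specificity."""
    recall, specificity, _ = rates(tp, fn, fp, tn)
    return math.sqrt(recall * specificity)


def f_measure(tp, fn, fp, tn):
    """Harmonic mean of precision and recall (F1)."""
    recall, _, precision = rates(tp, fn, fp, tn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: identical recall (0.8) and specificity (0.9) in both cases,
# only the number of majority (negative) examples differs.
moderate = dict(tp=80, fn=20, fp=100, tn=900)        # 1:10 imbalance
severe = dict(tp=80, fn=20, fp=10_000, tn=90_000)    # 1:1000 imbalance

for name, cm in (("moderate", moderate), ("severe", severe)):
    print(name, round(g_mean(**cm), 3), round(f_measure(**cm), 3))
```

Because G-mean depends only on the per-class rates, it cannot register the flood of false positives that appears when the majority class grows, whereas precision (and hence the F-measure) does.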

DATA IMBALANCE IN LANDSLIDE SUSCEPTIBILITY ZONATION: UNDER-SAMPLING FOR CLASS-IMBALANCE LEARNING

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽ 10.5194/isprs-archives-xlii-3-w11-51-2020 ◽ 2020 ◽ Vol XLII-3/W11 ◽ pp. 51-57

Author(s): S. K. Gupta ◽ M. Jhunjhunwalla ◽ A. Bhardwaj ◽ D. P. Shukla

Keyword(s): Neural Network ◽ Artificial Neural Network ◽ Class Imbalance ◽ Imbalanced Data ◽ Training Data ◽ Support Vector ◽ Fisher Discriminant Analysis ◽ Minority Class ◽ Data Imbalance ◽ Artificial Neural

Abstract. Machine learning methods such as artificial neural networks and support vector machines require a large amount of training data; however, the number of landslide occurrences in a study area is limited. The limited number of landslides leads to a small number of positive class pixels in the training data. In contrast, the number of non-landslide pixels (negative class pixels) is enormous. This under-represented data and severe class distribution skew create a data imbalance for learning algorithms and lead to suboptimal models, which are biased towards the majority class (non-landslide pixels) and perform poorly on the minority class (landslide pixels). In this work, we have used two algorithms, EasyEnsemble and BalanceCascade, for balancing the data. The balanced data are used with feature selection methods such as Fisher discriminant analysis (FDA), logistic regression (LR) and artificial neural network (ANN) to generate landslide susceptibility zonation (LSZ) maps. The results of the study show that ANN with balanced data yields major improvements in the preparation of susceptibility maps over imbalanced data, whereas the LR method is adversely affected by data balancing algorithms. The FDA does not show significant changes between balanced and imbalanced data.
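
The balancing step can be sketched as follows. This is a simplified illustration of the under-sampling idea behind EasyEnsemble only (several random majority subsets, each the size of the minority class, every one paired with all minority examples). It omits the per-subset classifier training and the ensemble combination, and it is not the authors' implementation. The function name `balanced_subsets` and the toy pixel IDs are assumptions.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Draw n_subsets random majority samples, each the size of the minority
    class, and pair each one with the full minority class."""
    rng = random.Random(seed)
    k = len(minority)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, k)  # without replacement within a subset
        subsets.append((sampled, list(minority)))
    return subsets


# Toy pixel IDs (hypothetical): 1000 non-landslide pixels, 20 landslide pixels.
non_landslide = list(range(1000))
landslide = list(range(1000, 1020))

subsets = balanced_subsets(non_landslide, landslide, n_subsets=5)
for neg, pos in subsets:
    print(len(neg), len(pos))
```

Each balanced subset would then be used to train one base learner. Drawing several subsets lets the ensemble see far more of the majority class than a single under-sampled training set would.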

Dynamic financial distress prediction based on class-imbalanced data batches

International Journal of Financial Engineering ◽ 10.1142/s2424786321500262 ◽ 2021 ◽ pp. 2150026

Author(s): Jie Sun ◽ Xin Liu ◽ Wenguo Ai ◽ Qianyuan Tian

Keyword(s): Financial Distress ◽ Time Window ◽ Concept Drift ◽ Class Imbalance ◽ Imbalanced Data ◽ Sampling Technique ◽ Support Vector ◽ Multiple Discriminant Analysis ◽ Financial Distress Prediction ◽ Distress Prediction

This study proposes two approaches for dynamic financial distress prediction (FDP) based on class-imbalanced data batches, considering both concept drift and class imbalance. One is based on a sliding time window and the synthetic minority over-sampling technique (SMOTE), and the other is based on a sliding time window and majority class partition. Support vector machine, multiple discriminant analysis (MDA) and logistic regression are used as base classifiers in experiments on a real-world dataset. The results indicate that the two approaches perform better than the pure dynamic FDP (DFDP) models without class imbalance processing and the static FDP models either with or without class imbalance processing.
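
A minimal sketch of the first approach, assuming a simple reading of it: keep only the most recent data batches (the sliding window), then over-sample the minority class inside that window with SMOTE-style interpolation between a minority point and one of its nearest minority neighbours. The function names, the toy two-feature data, and the window width are all hypothetical; the paper's actual procedure may differ.

```python
import math
import random


def latest_window(batches, width):
    """Flatten the most recent `width` batches into one training set."""
    return [row for batch in batches[-width:] for row in batch]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy batches of (features, label); label 1 marks the rare distressed class.
batches = [
    [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1)],
    [((0.3, 0.2), 0), ((0.2, 0.3), 0), ((0.8, 0.9), 1)],
    [((0.1, 0.3), 0), ((0.3, 0.1), 0), ((0.7, 0.8), 1)],
]

window = latest_window(batches, width=2)
distressed = [x for x, y in window if y == 1]
healthy = [x for x, y in window if y == 0]
synthetic = smote(distressed, n_new=len(healthy) - len(distressed), k=1)
print(len(healthy), len(distressed) + len(synthetic))
```

Restricting the data to the latest window is what addresses concept drift here, while the interpolation step addresses the imbalance within that window.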

Exploiting Correlation Subspace to Predict Heterogeneous Cross-Project Defects

International Journal of Software Engineering and Knowledge Engineering ◽ 10.1142/s0218194016710017 ◽ 2016 ◽ Vol 26 (09n10) ◽ pp. 1571-1580 ◽ Cited By ~ 6

Author(s): Ming Cheng ◽ Guoqing Wu ◽ Hongyan Wan ◽ Guoan You ◽ Mengting Yuan ◽ ...

Keyword(s): Class Imbalance ◽ Imbalanced Data ◽ Feature Space ◽ Support Vector ◽ Class Imbalance Problem ◽ Classifier Design ◽ Imbalance Problem ◽ Project Data ◽ The Impact ◽ Cross Project

Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed that the cross-project data share the same metric set, meaning that both the metrics used and the size of the metric set are the same. However, this assumption may not hold in practical scenarios. In addition, software defect datasets suffer from the class-imbalance problem, which makes it harder for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm that incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration so that the classifier is inclined to label a module as defective, alleviating the impact of imbalanced data. The experimental results show that our method is more effective than state-of-the-art methods.

A Novel Approach to Maximize G-mean in Nonstationary Data with Recurrent Imbalance Shifts

The International Arab Journal of Information Technology ◽ 10.34028/iajit/18/1/12 ◽ 2020 ◽ Vol 18 (1) ◽ pp. 103-113

Keyword(s): State Of The Art ◽ Class Imbalance ◽ Imbalanced Data ◽ Streaming Data ◽ Novel Approach ◽ Benchmark Datasets ◽ Boosting Algorithms ◽ Nonstationary Data ◽ Test Outcomes

One of the noteworthy difficulties in classifying nonstationary data is handling class imbalance. Imbalanced data have many more samples of one class than of the other, which biases a classifier's accuracy in favour of the majority class. Streaming data may have inherent imbalance resulting from the nature of the data space, or extrinsic imbalance due to the nonstationary environment. In streaming data, time-varying class priors may lead to a shift in the imbalance ratio. Researchers have studied ensemble learning, online learning, class imbalance, and cost-sensitive algorithms separately, but have rarely addressed all of these issues jointly to deal with imbalance shift in nonstationary data. This paper presents a novel approach that combines these perspectives to maximize G-mean in nonstationary data with Recurrent Imbalance Shifts (RIS). It modifies two state-of-the-art boosting algorithms: (1) AdaC2, to obtain G-mean based Online AdaC2 for Recurrent Imbalance Shifts (GOA-RIS) and Ageing and G-mean based Online AdaC2 for Recurrent Imbalance Shifts (AGOA-RIS); and (2) CSB2, to obtain G-mean based Online CSB2 for Recurrent Imbalance Shifts (GOC-RIS) and Ageing and G-mean based Online CSB2 for Recurrent Imbalance Shifts (AGOC-RIS). The study empirically and statistically analyses the performance of the proposed algorithms against the Online AdaC2 (OA) and Online CSB2 (OC) algorithms using benchmark datasets. The test outcomes demonstrate that the proposed algorithms outperform OA and OC overall.

Semi-Automatization of Support Vector Machines to Map Lithium (Li) Bearing Pegmatites

Remote Sensing ◽ 10.3390/rs12142319 ◽ 2020 ◽ Vol 12 (14) ◽ pp. 2319 ◽ Cited By ~ 3

Author(s): Joana Cardoso-Fernandes ◽ Ana C. Teodoro ◽ Alexandre Lima ◽ Encarnación Roda-Robles

Keyword(s): Negative Impact ◽ Methodological Approach ◽ Class Imbalance ◽ Imbalanced Data ◽ Region Of Interest ◽ Support Vector ◽ Sensing Applications ◽ Great Performance ◽ Vector Machines ◽ The Impact

Machine learning (ML) algorithms have shown great performance in geological remote sensing applications. The study area of this work was the Fregeneda–Almendra region (Spain–Portugal), where the support vector machine (SVM) was employed. Lithium (Li)-pegmatite exploration using satellite data presents some challenges since pegmatites are, by nature, small, narrow bodies. Consequently, the following objectives were defined: (i) train several SVMs on Sentinel-2 images with different parameters to find the optimal model; (ii) assess the impact of imbalanced data; (iii) develop a successful methodological approach to delineate target areas for Li-exploration. Parameter optimization and model evaluation were accomplished by a two-stage grid search with cross-validation. Several new methodological advances were proposed, including a region of interest (ROI)-based splitting strategy to create the training and test subsets, a semi-automatization of the classification process, and the application of a more innovative and adequate metric score to choose the best model. The proposed methodology obtained good results, identifying known Li-pegmatite occurrences as well as other target areas for Li-exploration. Also, the results showed that the class imbalance had a negative impact on the SVM performance since known Li-pegmatite occurrences were not identified. The potential and limitations of the proposed methodology are highlighted, and its applicability to other case studies is discussed.

m5CPred-SVM: A Novel Method for Predicting m5C Sites of RNA

10.21203/rs.3.rs-39526/v1 ◽ 2020

Author(s): Xiao Chen ◽ Yi Xiong ◽ Yinbo Liu ◽ Yuqing Chen ◽ Shoudong Bi ◽ ...

Keyword(s): Cell Fate ◽ Prediction Accuracy ◽ Cytosine Methylation ◽ Computational Method ◽ Selection Strategy ◽ Support Vector ◽ Feature Subset ◽ Accurate Identification ◽ Benchmark Datasets ◽ Optimal Feature Subset

Abstract. Background: As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods for predicting m5C sites have attracted a lot of interest because of their efficiency and low cost. However, both the accuracy and the efficiency of these methods are not yet satisfactory and need further improvement. Results: In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species, H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated based on RNA segments, and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms was compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, our proposed method, m5CPred-SVM, was compared with several existing methods, and the result showed that m5CPred-SVM offered substantially higher prediction accuracy than previously published methods. It is expected that our method, m5CPred-SVM, can become a useful tool for accurate identification of m5C sites. Conclusion: In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites of three different species. The result shows that our model outperformed the existing state-of-the-art models. Our model is available for users through a web server at http://zhulab.ahu.edu.cn/m5CPred-SVM.
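
The sequential forward feature selection step mentioned in the Results can be sketched generically as below. This is the textbook greedy procedure, not the authors' code: it starts from an empty set, repeatedly adds the single feature that raises a score the most, and stops when no addition helps. The feature names `f1` to `f5`, the `GAIN` table, and `toy_score` are hypothetical stand-ins for a real cross-validated accuracy.

```python
def sequential_forward_selection(features, score):
    """Greedily add the feature that most improves score(subset); stop when
    no remaining feature improves it."""
    selected = []
    best = float("-inf")
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best:
            break  # no candidate improves on the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical per-feature gains; f4 and f5 carry no signal.
GAIN = {"f1": 0.30, "f2": 0.20, "f3": 0.25, "f4": 0.0, "f5": 0.0}


def toy_score(subset):
    """Stand-in for model accuracy: summed gains minus a small size penalty."""
    return sum(GAIN[f] for f in subset) - 0.01 * len(subset)


selected, best = sequential_forward_selection(GAIN, toy_score)
print(selected, round(best, 2))
```

In a real pipeline `score` would train and evaluate a classifier on the candidate subset, which is why the greedy search, with at most one evaluation per remaining feature per round, is preferred over testing every subset.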

Selecting the Optimal Combination Model of FSSVM for the Imbalance Datasets

Mathematical Problems in Engineering ◽ 10.1155/2014/539430 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-6

Author(s): Chuandong Qin ◽ Huixia Zhao

Keyword(s): Support Vector Machines ◽ Class Imbalance ◽ Imbalanced Data ◽ Support Vector ◽ Smooth Functions ◽ Separating Hyperplane ◽ Vector Machines ◽ Imbalance Learning ◽ Class Imbalance Learning ◽ Imbalanced Data Learning

Imbalanced data learning is one of the most active and important fields in machine learning research. Although existing class imbalance learning methods can make Support Vector Machines (SVMs) less sensitive to class imbalance, they still suffer from the disturbance of outliers and noise present in the datasets. A family of Fuzzy Smooth Support Vector Machines (FSSVMs) is proposed based on the Smooth Support Vector Machine (SSVM) of O. L. Mangasarian. SSVM can be solved easily by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm or the Newton-Armijo algorithm. Two kinds of fuzzy memberships and three smooth functions can be chosen in the algorithms. The fuzzy memberships consider the contribution rate of each sample to the optimal separating hyperplane. The polynomial smooth functions make the optimization problem more accurate at the inflection point. These changes have a positive effect in the trials. The experimental results show that the FSSVMs achieve better accuracy and shorter running time than the SSVMs and some of the other methods.

New Fuzzy Support Vector Machine for the Class Imbalance Problem in Medical Datasets Classification

The Scientific World JOURNAL ◽ 10.1155/2014/536434 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-12 ◽ Cited By ~ 7

Author(s): Xiaoqing Gu ◽ Tongguang Ni ◽ Hongyuan Wang

Keyword(s): Support Vector Machine ◽ Real World ◽ Class Imbalance ◽ Support Vector ◽ Manifold Regularization ◽ Medical Database ◽ Class Imbalance Problem ◽ Fuzzy Support Vector Machine ◽ Imbalance Problem ◽ Misclassification Costs

In medical dataset classification, the support vector machine (SVM) is considered to be one of the most successful methods. However, most real-world medical datasets contain some outliers or noise, and the data often have class imbalance problems. In this paper, a fuzzy support vector machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented, which can be seen as a modified class of FSVM obtained by extending manifold regularization and assigning two misclassification costs for the two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers or noise, and to enhance the locality maximum margin. Five real-world medical datasets (breast, heart, hepatitis, BUPA liver, and pima diabetes) from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show that FSVM-CIP achieves superior or comparable effectiveness.
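
The idea of combining fuzzy memberships with two class-specific misclassification costs can be illustrated with a weighted hinge penalty. This is a generic sketch of that idea and not the FSVM-CIP objective itself, which also involves manifold regularization. The function name, the linear decision function, and the sample values are assumptions.

```python
def weighted_hinge_penalty(samples, w, b, cost_pos, cost_neg):
    """Sum of cost * membership * hinge slack over all samples.

    samples: iterable of (x, y, membership) with y in {+1, -1} and
    membership in (0, 1]; low memberships down-weight suspected outliers.
    """
    total = 0.0
    for x, y, membership in samples:
        decision = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack = max(0.0, 1.0 - y * decision)  # zero when outside the margin
        cost = cost_pos if y == 1 else cost_neg
        total += cost * membership * slack
    return total


# One-dimensional toy data: the third sample is a positive inside the margin
# with a reduced membership of 0.5.
samples = [((1.0,), 1, 1.0), ((-1.0,), -1, 1.0), ((0.5,), 1, 0.5)]

equal_costs = weighted_hinge_penalty(samples, w=(1.0,), b=0.0, cost_pos=1.0, cost_neg=1.0)
rare_positive = weighted_hinge_penalty(samples, w=(1.0,), b=0.0, cost_pos=10.0, cost_neg=1.0)
print(equal_costs, rare_positive)
```

Raising `cost_pos` makes margin violations on the rare class more expensive, which pushes the learned boundary away from minority samples, while the membership factor limits how much a noisy sample can contribute.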
