Imbalanced Data Classification Using Cost-Sensitive Support Vector Machine Based on Information Entropy

2014 ◽  
Vol 989-994 ◽  
pp. 1756-1761 ◽  
Author(s):  
Wei Duan ◽  
Liang Jing ◽  
Xiang Yang Lu

As a supervised classification algorithm, the Support Vector Machine (SVM) excels at small-sample, nonlinear, and high-dimensional classification problems. However, SVM performs poorly on imbalanced data sets. Therefore, a cost-sensitive SVM (CSSVM) should be designed for imbalanced data set classification. This paper proposes a method that constructs a CSSVM based on information entropy, in which the information entropies of the different classes of the data set are used to determine the values of the penalty factors of the CSSVM.
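The abstract does not give the exact entropy-to-penalty mapping, so the sketch below is one plausible reading, not the authors' formula: each class's penalty weight is set to its surprisal, so rarer classes receive larger penalty factors. The function name and the choice of `-log2(p)` are assumptions.

```python
import math
from collections import Counter

def entropy_based_penalties(y):
    """Map each class label to a penalty weight derived from its
    probability in the training set (surprisal, -log2(p)).
    Rarer classes get larger weights, mimicking a cost-sensitive SVM
    that penalizes errors on the minority class more heavily."""
    counts = Counter(y)
    n = len(y)
    return {c: -math.log2(cnt / n) for c, cnt in counts.items()}
```

The resulting dictionary could then be handed to an SVM implementation that supports per-class costs, e.g. scikit-learn's `SVC(class_weight=entropy_based_penalties(y))`.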

2020 ◽  
Vol 122 ◽  
pp. 289-307 ◽  
Author(s):  
Xinmin Tao ◽  
Qing Li ◽  
Chao Ren ◽  
Wenjie Guo ◽  
Qing He ◽  
...  

2011 ◽  
Vol 219-220 ◽  
pp. 151-155 ◽  
Author(s):  
Hua Ji ◽  
Hua Xiang Zhang

Learning from imbalanced data sets is a frequent problem in many real-world domains. Because the skewed class distribution leads traditional classifiers to much lower accuracy on rare classes, we propose a novel classification method based on local clustering over the data distribution of the imbalanced data set. First, we divide the whole data set into several groups according to the data distribution. Then we perform local clustering within each group, on both the normal class and the disjointed rare class. For the rare class, over-sampling is subsequently applied at different rates. Finally, we apply support vector machines (SVMs) for classification, using the traditional cost-matrix tactic to enhance classification accuracy. Experimental results on several UCI data sets show that this method produces much higher prediction accuracy on the rare class than state-of-the-art methods.
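The abstract leaves the per-cluster over-sampling rates unspecified; the sketch below illustrates just the cluster-wise over-sampling step, with the local cluster assignments taken as given and `target_per_cluster` as an assumed input rather than anything defined in the paper.

```python
import random
from collections import defaultdict

def cluster_oversample(minority, cluster_ids, target_per_cluster):
    """Over-sample the rare class cluster by cluster: each local
    cluster is grown (by random duplication) to its own target size,
    so different regions of the rare class get different rates."""
    clusters = defaultdict(list)
    for x, c in zip(minority, cluster_ids):
        clusters[c].append(x)
    resampled = []
    for c, members in clusters.items():
        resampled.extend(members)
        need = target_per_cluster.get(c, len(members)) - len(members)
        for _ in range(max(need, 0)):
            resampled.append(random.choice(members))
    return resampled
```

The balanced output would then be fed to the SVM together with the cost matrix described above.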


2021 ◽  
Vol 11 (11) ◽  
pp. 4970
Author(s):  
Łukasz Rybak ◽  
Janusz Dudczyk

The history of gravitational classification started in 1977. Over the years, gravitational approaches have gained many extensions, which have been adapted to different classification problems. This article is the next stage of research concerning algorithms that create data particles by their geometrical divide. Previous analyses established that the Geometrical Divide (GD) method outperforms the algorithm creating data particles based on classes by a compound of 1 ÷ 1 cardinality. This occurs in the classification of balanced data sets in which class centroids are close to each other and the groups of objects described by different labels overlap. The purpose of this article is to examine the efficiency of the Geometrical Divide method in unbalanced data set classification, using a real case as an example: occupancy detection. In addition, the paper develops the concept of the Unequal Geometrical Divide (UGD). The approaches were evaluated on 26 unbalanced data sets: 16 with the features of the Moons and Circles data sets, and 10 created from a real occupancy data set. The experiment compared the GD method and its unbalanced variant (UGD) as well as the 1CT1P approach. Each method was combined with three data particle mass determination algorithms: the n-Mass Model (n-MM), the Stochastic Learning Algorithm (SLA), and the Batch-update Algorithm (BLA). k-fold cross-validation, precision, recall, F-measure, and the number of data particles used were applied in the evaluation process. The results showed that the methods based on the geometrical divide outperform the 1CT1P approach in imbalanced data set classification. The article's conclusion describes the observations and indicates potential directions for further research on and development of methods for creating data particles through their geometrical divide.


2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Peng Li ◽  
Tian-ge Liang ◽  
Kai-hui Zhang

This paper proposes a cluster boundary sampling method based on density clustering to solve the resampling problem in imbalanced data set (IDS) classification, and verifies its effectiveness experimentally. We use a clustering density threshold and a boundary density threshold to determine the cluster boundaries, in order to guide the resampling process more scientifically and accurately. We then adopt a penalty factor to regulate the effect of data imbalance on the SVM classification algorithm. This paper does not claim to propose the best classifier or a complete solution for imbalanced data sets; its contribution is to verify the validity and stability of the proposed IDS resampling method. Experiments show that our method achieves a clear improvement on various imbalanced data sets.
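The two thresholds are not defined precisely in the abstract; the sketch below assumes local density is measured as the neighbor count within a radius `eps`, and that a point lies on a cluster boundary when its density falls between the boundary threshold and the clustering threshold. The names and the density definition are illustrative assumptions.

```python
def local_density(X, i, eps):
    """Number of other points within distance eps of point i."""
    xi = X[i]
    return sum(
        1 for j, xj in enumerate(X)
        if j != i and sum((a - b) ** 2 for a, b in zip(xi, xj)) <= eps ** 2
    )

def cluster_boundary(X, eps, boundary_thresh, cluster_thresh):
    """Indices of points whose local density lies between the two
    thresholds: dense enough to belong to a cluster, but sparse
    enough to sit on its rim rather than in its core."""
    return [
        i for i in range(len(X))
        if boundary_thresh <= local_density(X, i, eps) < cluster_thresh
    ]
```

Resampling could then be focused on these boundary indices, where the classes are most likely to overlap.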


2020 ◽  
Vol 26 (4) ◽  
pp. 380-395
Author(s):  
Peisong Gong ◽  
Haixiang Guo ◽  
Yuanyue Huang ◽  
Shengyu Guo

Safety risk evaluations of deep foundation construction schemes are important to ensure safety. However, the amount of knowledge involved in these evaluations is large, and the historical data of deep foundation engineering are imbalanced. Some adverse factors influence the quality and efficiency of evaluations made with traditional manual evaluation tools. Machine learning can guarantee the quality of imbalanced data classifications. In this study, three strategies are proposed to improve the classification accuracy of imbalanced data sets. First, data set information redundancy is reduced using a binary particle swarm optimization algorithm. Then, the classification algorithm is modified using an Adaboost-enhanced support vector machine classifier. Finally, a new classification evaluation standard, namely, the area under the ROC curve (AUC), is adopted to ensure that the classifier is impartial to the minority class. A transverse comparison experiment using multiple classification algorithms shows that the proposed integrated classification algorithm can overcome the difficulties associated with correctly classifying minority samples in imbalanced data sets. The algorithm can also improve construction safety management evaluations, relieve the pressure from the lack of experienced experts accompanying rapid infrastructure construction, and facilitate knowledge reuse in the field of architecture, engineering, and construction.
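The AUC evaluation standard adopted above has a standard probabilistic definition: for a binary problem, it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties counting half). The sketch below is a minimal pairwise implementation of that definition, not code from the study.

```python
def auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive),
    computed as the fraction of positive/negative pairs in which
    the positive sample is ranked higher (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Because every positive/negative pair is weighted equally regardless of class sizes, this measure does not reward a classifier for simply predicting the majority class, which is why it suits imbalanced data.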


2013 ◽  
Vol 756-759 ◽  
pp. 3652-3658
Author(s):  
You Li Lu ◽  
Jun Luo

Within the study of kernel methods, this paper puts forward two improved algorithms, called R-SVM and I-SVDD, to cope with imbalanced data sets in closed systems. R-SVM uses the K-means algorithm to cluster space samples, while I-SVDD improves the performance of the original SVDD through imbalanced sample training. Experiments on two sets of system call data show that both algorithms are more effective, and that R-SVM has lower complexity.
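SVDD describes one class by a minimal enclosing hypersphere and flags points outside it. The sketch below is a crude, kernel-free stand-in that fixes the centre at the centroid and picks a radius covering a quantile of the training data; it illustrates only the decision rule, and every name in it is an assumption rather than the authors' I-SVDD.

```python
def fit_ball(X, quantile=0.95):
    """Crude SVDD stand-in: centroid plus the radius that
    encloses the given quantile of the training points."""
    dim = len(X[0])
    center = [sum(x[d] for x in X) / len(X) for d in range(dim)]
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(x, center)) ** 0.5 for x in X
    )
    radius = dists[min(int(quantile * len(dists)), len(dists) - 1)]
    return center, radius

def is_normal(x, center, radius):
    """A point inside the ball is classified as the described class."""
    return sum((a - b) ** 2 for a, b in zip(x, center)) ** 0.5 <= radius
```

Trained on normal system call traces only, such a one-class boundary would label anything far outside the ball as anomalous, which is what makes the SVDD family attractive when one class vastly outnumbers the other.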

