Classification with Local Clustering in Imbalanced Data Sets

Learning from imbalanced data sets is a recurring problem in many real-world domains. A skewed class distribution challenges traditional classifiers, which achieve much lower classification accuracy on rare classes. To address this, we propose a novel classification method with local clustering based on the data distribution of the imbalanced data set. First, we divide the whole data set into several data groups based on the data distribution. Then we perform local clustering within each group, both on the normal class and on the disjoint rare class. For the rare class, over-sampling is subsequently applied at different rates. Finally, we apply support vector machines (SVMs) for classification, using the traditional cost-matrix tactic to enhance classification accuracy. Experimental results on several UCI data sets show that this method can produce much higher prediction accuracy on the rare class than state-of-the-art methods.
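The over-sampling step can be illustrated with a small sketch. The abstract does not say how synthetic rare-class samples are generated or how the rates are chosen, so everything below is an assumption for illustration: each rare-class cluster gets its own rate, and new points are interpolated between a randomly chosen cluster member and the cluster centroid. The name `oversample_cluster` and the example rates are hypothetical.

```python
import random

def oversample_cluster(points, rate, rng):
    """Create int(round(rate * len(points))) synthetic points for one cluster.

    Assumption (not from the paper): each synthetic point lies on the segment
    between a randomly chosen cluster member and the cluster centroid.
    """
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    n_new = int(round(rate * len(points)))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + t * (c - b) for b, c in zip(base, centroid)])
    return synthetic

# Two hypothetical rare-class clusters with different over-sampling rates.
rng = random.Random(0)
sparse_cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
dense_cluster = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]]
new_sparse = oversample_cluster(sparse_cluster, rate=2.0, rng=rng)
new_dense = oversample_cluster(dense_cluster, rate=0.5, rng=rng)
```

Because every synthetic point stays inside its own cluster, over-sampling one rare-class cluster does not push samples into the region of another cluster or of the normal class.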


SAFETY RISK EVALUATIONS OF DEEP FOUNDATION CONSTRUCTION SCHEMES BASED ON IMBALANCED DATA SETS

Journal of Civil Engineering and Management ◽

10.3846/jcem.2020.12321 ◽

2020 ◽

Vol 26 (4) ◽

pp. 380-395

Author(s):

Peisong Gong ◽

Haixiang Guo ◽

Yuanyue Huang ◽

Shengyu Guo

Keyword(s):

Imbalanced Data ◽

Classification Algorithm ◽

Support Vector ◽

Data Sets ◽

Foundation Engineering ◽

Deep Foundation ◽

Data Set ◽

Safety Risk ◽

Imbalanced Data Sets ◽

Foundation Construction

Safety risk evaluations of deep foundation construction schemes are important to ensure safety. However, the amount of knowledge involved in these evaluations is large, and the historical data of deep foundation engineering is imbalanced. These adverse factors affect the quality and efficiency of evaluations made with traditional manual evaluation tools. Machine learning guarantees the quality of imbalanced data classifications. In this study, three strategies are proposed to improve the classification accuracy of imbalanced data sets. First, data set information redundancy is reduced using a binary particle swarm optimization algorithm. Then, a classification algorithm is modified using an Adaboost-enhanced support vector machine classifier. Finally, a new classification evaluation standard, namely, the area under the ROC curve, is adopted to ensure that the classifier is impartial to the minority class. A transverse comparison experiment using multiple classification algorithms shows that the proposed integrated classification algorithm can overcome difficulties associated with correctly classifying minority samples in imbalanced data sets. The algorithm can also improve construction safety management evaluations, relieve the pressure from the lack of experienced experts accompanying rapid infrastructure construction, and facilitate knowledge reuse in the field of architecture, engineering, and construction.
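To see why the area under the ROC curve is a fairer yardstick than accuracy on imbalanced data, consider the following minimal sketch. It computes AUC through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half). Treating label 1 as the minority class and the toy data are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Nine majority samples (0) and one minority sample (1).
labels = [0] * 9 + [1]
# A degenerate classifier that gives every sample the same score,
# i.e. always predicts the majority class.
constant_scores = [0.0] * 10
accuracy = sum(1 for y in labels if y == 0) / len(labels)
auc_constant = roc_auc(labels, constant_scores)

# A small ranking example with one misordered pair out of four.
auc_ranked = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The degenerate classifier reaches an accuracy of 0.9 while its AUC is only 0.5, the value of a random ranking, which is the imbalance effect that accuracy hides.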


Imbalanced Data Classification Using Cost-Sensitive Support Vector Machine Based on Information Entropy

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.989-994.1756 ◽

2014 ◽

Vol 989-994 ◽

pp. 1756-1761 ◽

Cited By ~ 3

Author(s):

Wei Duan ◽

Liang Jing ◽

Xiang Yang Lu

Keyword(s):

Support Vector Machine ◽

Information Entropy ◽

Imbalanced Data ◽

Support Vector ◽

Data Sets ◽

Classification Problems ◽

Data Set ◽

Imbalanced Data Sets ◽

Penalty Factor ◽

Imbalanced Data Classification

As a supervised classification algorithm, the Support Vector Machine (SVM) is well suited to small-sample, nonlinear, and high-dimensional classification problems. However, SVM is less effective at classifying imbalanced data sets. Therefore, a cost-sensitive SVM (CSSVM) should be designed for imbalanced data set classification. This paper proposes a method that constructs a CSSVM based on information entropy: the information entropies of the different classes in the data set are used to determine the values of the CSSVM penalty factors.
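The abstract does not give the formula that maps entropy to penalty factors, so the sketch below is only one plausible reading, not the paper's method. It assumes that the penalty for each class is proportional to the self-information of that class, -log2(p), where p is the class frequency, so that rarer classes receive larger penalties. The function name `class_penalties` and the parameter `base_c` are hypothetical.

```python
import math
from collections import Counter

def class_penalties(labels, base_c=1.0):
    """Return a per-class penalty factor based on class self-information.

    Assumption (not from the paper): penalty = base_c * -log2(p_class).
    A class that makes up the whole data set gets a penalty of zero.
    """
    counts = Counter(labels)
    total = len(labels)
    return {c: base_c * -math.log2(n / total) for c, n in counts.items()}

# Toy data set: three majority samples and one minority sample.
labels = ["normal", "normal", "normal", "rare"]
penalties = class_penalties(labels)
```

With a 3:1 class ratio, the rare class (p = 0.25) gets a penalty of 2, while the normal class (p = 0.75) gets about 0.415, so a misclassified rare sample costs roughly five times as much.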


Imbalanced Data Detection Kernel Method in Closed Systems

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.3652 ◽

2013 ◽

Vol 756-759 ◽

pp. 3652-3658

Author(s):

You Li Lu ◽

Jun Luo

Keyword(s):

Kernel Methods ◽

Kernel Method ◽

Imbalanced Data ◽

Data Detection ◽

Data Sets ◽

System Call ◽

Data Set ◽

Imbalanced Data Sets ◽

Lower Complexity ◽

Closed Systems

Building on kernel methods, this paper puts forward two improved algorithms, R-SVM and I-SVDD, to cope with imbalanced data sets in closed systems. R-SVM uses the K-means algorithm to cluster the samples, while I-SVDD improves the performance of the original SVDD by training on imbalanced samples. Experiments on two system call data sets show that both algorithms are more effective and that R-SVM has lower complexity.
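For readers unfamiliar with K-means, the clustering step can be sketched as follows. This is plain Lloyd-style K-means with a deterministic initialization (the first k points), chosen here only to keep the example reproducible. How the resulting clusters are used inside R-SVM is not described in the abstract, so the sketch stops at the clustering itself.

```python
def kmeans(points, k, iters=20):
    """Plain K-means; returns the final list of k centroids.

    Assumption for reproducibility: centroids start at the first k points.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: attach each point to its nearest centroid.
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

# Two well-separated toy groups of samples.
samples = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans(samples, k=2)
```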


Embedding Undersampling Rotation Forest for Imbalanced Problem

Computational Intelligence and Neuroscience ◽

10.1155/2018/6798042 ◽

2018 ◽

Vol 2018 ◽

pp. 1-15 ◽

Cited By ~ 3

Author(s):

Huaping Guo ◽

Xiaoyu Diao ◽

Hongbing Liu

Keyword(s):

Imbalanced Data ◽

Feature Space ◽

Original Data ◽

Training Set ◽

Data Set ◽

Minority Class ◽

Rotation Forest ◽

Novel Method ◽

Individual Classifier ◽

The Cost

Rotation Forest is an ensemble learning approach that achieves better performance than Bagging and Boosting by building accurate and diverse classifiers using a rotated feature space. However, like other conventional classifiers, Rotation Forest does not work well on imbalanced data, which are characterized by far fewer examples of one class (the minority class) than the other (the majority class), and where misclassifying minority class examples is often much more costly than the reverse. This paper proposes a novel method called Embedding Undersampling Rotation Forest (EURF) to handle this problem by (1) sampling subsets from the majority class and learning a projection matrix from each subset and (2) obtaining training sets by projecting re-undersampled subsets of the original data set to new spaces defined by the matrices and constructing an individual classifier from each training set. In the first step, undersampling forces the rotation matrix to better capture the features of the minority class without harming the diversity between individual classifiers. In the second step, undersampling aims to improve the performance of individual classifiers on the minority class. The experimental results show that EURF achieves significantly better performance than other state-of-the-art methods.
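The undersampling idea behind the method can be sketched in a few lines. The code below only draws random subsets of the majority class, one per ensemble member; learning projection matrices and training classifiers are left out. The subset size (equal to the minority class size), the pairing of each subset with the full minority class, and the name `majority_subsets` are assumptions for illustration and are not taken from the paper.

```python
import random

def majority_subsets(majority, size, n_subsets, seed=0):
    """Draw n_subsets random subsets of the majority class without replacement.

    Each subset is drawn independently, so different subsets may overlap.
    """
    rng = random.Random(seed)
    return [rng.sample(majority, size) for _ in range(n_subsets)]

# Toy data: 100 majority samples (ids 0..99) and 5 minority samples.
majority = list(range(100))
minority = [-1, -2, -3, -4, -5]

subsets = majority_subsets(majority, size=len(minority), n_subsets=3)
# Assumed usage: pair every majority subset with all minority samples
# to obtain one balanced training set per ensemble member.
balanced_sets = [subset + minority for subset in subsets]
```

Each balanced set sees a different slice of the majority class, which is one way to keep ensemble members diverse while giving the minority class equal weight in every member.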


Affinity and class probability-based fuzzy support vector machine for imbalanced data sets

Neural Networks ◽

10.1016/j.neunet.2019.10.016 ◽

2020 ◽

Vol 122 ◽

pp. 289-307 ◽

Cited By ~ 8

Author(s):

Xinmin Tao ◽

Qing Li ◽

Chao Ren ◽

Wenjie Guo ◽

Qing He ◽

...

Keyword(s):

Support Vector Machine ◽

Imbalanced Data ◽

Support Vector ◽

Data Sets ◽

Fuzzy Support Vector Machine ◽

Imbalanced Data Sets ◽

Class Probability


Exemplar-Based Learning Classifier System with Dynamic Matching Range for Imbalanced Data

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2017.p0868 ◽

2017 ◽

Vol 21 (5) ◽

pp. 868-875

Author(s):

Hiroyasu Matsushima ◽

Keiki Takadama

Keyword(s):

Imbalanced Data ◽

Data Sets ◽

Sigmoid Function ◽

Learning Classifier System ◽

Data Set ◽

Imbalanced Data Sets ◽

Dynamic Matching ◽

Learning Classifier ◽

Stable Performance ◽

The Given

In this paper, we propose a method to improve ECS-DMR so that it produces appropriate output for imbalanced data sets. To control the generalization of LCS on an imbalanced data set, we propose applying the imbalance ratio of the data set to a sigmoid function and then updating the matching range accordingly. Compared with our previous work (ECS-DMR), the proposed method automatically controls generalization through an appropriate matching range, extracting exemplars that cover the given problem space, which consists of an imbalanced data set. The experimental results suggest that the proposed method provides stable performance on imbalanced data sets and show the effect of using the sigmoid function to account for data balance.


An Instance Selection Algorithm Based on ReliefF

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213019500015 ◽

2019 ◽

Vol 28 (01) ◽

pp. 1950001 ◽

Cited By ~ 2

Author(s):

Zeinab Abbasi ◽

Mohsen Rahmani

Keyword(s):

Missing Values ◽

Imbalanced Data ◽

Jaccard Index ◽

Instance Selection ◽

Data Sets ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Data Set ◽

Imbalanced Data Sets ◽

Numeric Data

Due to the rapid growth of data, many methods have been proposed to extract useful data and remove noisy data. Instance selection is one such method: it selects some instances of a data set and removes the others. This paper proposes a new instance selection algorithm based on ReliefF, which is a feature selection algorithm. In the proposed algorithm, the nearest instances of each class are found for each instance based on the Jaccard index. Then, the weight of each instance is calculated from its set of nearest neighbors. Finally, only the instances with higher weights are selected. The algorithm can reduce data at a specified rate and can run in parallel over the instances. It works on a variety of data sets with nominal and numeric data and with missing values, and it is also suitable for imbalanced data sets. The proposed algorithm is tested on three data sets. Results show that it can reduce the volume of data without a significant change in classification accuracy on these data sets.
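The neighbor search can be illustrated with a minimal sketch for nominal data. It assumes that each instance is turned into a set of (feature index, value) pairs and that similarity is the Jaccard index of two such sets; the abstract does not state how the index is applied to numeric or missing values, and the weight calculation is not specified either, so both are left out. The names `jaccard` and `nearest_per_class` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two instances viewed as sets of (index, value) pairs."""
    set_a, set_b = set(enumerate(a)), set(enumerate(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def nearest_per_class(idx, X, y):
    """For instance idx, return {class: (neighbor index, similarity)}.

    The first instance found with the highest similarity wins ties.
    """
    best = {}
    for j, (xj, yj) in enumerate(zip(X, y)):
        if j == idx:
            continue  # an instance is not its own neighbor
        sim = jaccard(X[idx], xj)
        if yj not in best or sim > best[yj][1]:
            best[yj] = (j, sim)
    return best

# Toy nominal data set with two classes.
X = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "q"), ("a", "y", "p")]
y = [0, 0, 1, 1]
neighbors = nearest_per_class(0, X, y)
```

Here instance 0 shares two of three feature values with instance 1 (same class) and with instance 3 (other class), giving a Jaccard index of 2/4 = 0.5 in both cases.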


SHOCK PHYSICS DATA RECONSTRUCTION USING SUPPORT VECTOR REGRESSION

International Journal of Modern Physics C ◽

10.1142/s0129183106009813 ◽

2006 ◽

Vol 17 (09) ◽

pp. 1313-1325 ◽

Cited By ~ 8

Author(s):

NIKITA A. SAKHANENKO ◽

GEORGE F. LUGER ◽

HANNA E. MAKARUK ◽

JOYSREE B. AUBREY ◽

DAVID B. HOLTKAMP

Keyword(s):

Experimental Data ◽

Support Vector ◽

Data Sets ◽

Shock Physics ◽

Data Set ◽

Velocity Surface ◽

The Cost ◽

Physical Phenomena ◽

Physics Experiments ◽

Data Estimation

This paper considers a set of shock physics experiments that investigate how materials respond to the extremes of deformation, pressure, and temperature when exposed to shock waves. Due to the complexity and the cost of these tests, the available experimental data set is often very sparse. A support vector machine (SVM) technique for regression is used for data estimation of velocity measurements from the underlying experiments. Because of good generalization performance, the SVM method successfully interpolates the experimental data. The analysis of the resulting velocity surface provides more information on the physical phenomena of the experiment. Additionally, the estimated data can be used to identify outlier data sets, as well as to increase the understanding of the other data from the experiment.
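As background on support vector regression in general, the standard formulation scores a fit with the epsilon-insensitive loss, which ignores errors smaller than a tolerance epsilon and penalizes larger errors linearly. The sketch below shows only that loss; the kernel, the tolerance, and the other settings used in the paper are not given in the abstract, and the values here are made up for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample epsilon-insensitive loss: max(0, |y - f(x)| - epsilon)."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Hypothetical velocity measurements and model predictions.
measured = [1.0, 2.0, 3.0]
predicted = [1.05, 2.5, 3.0]
losses = epsilon_insensitive_loss(measured, predicted, epsilon=0.1)
```

Only the second prediction, which is off by 0.5, contributes to the loss; the small error on the first sample falls inside the tolerance band and is ignored.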
