SecProMTB: Support Vector Machine-Based Classifier for Secretory Proteins Using Imbalanced Data Sets Applied to Mycobacterium tuberculosis

As a supervised classification algorithm, the Support Vector Machine (SVM) performs well on small-sample, nonlinear, and high-dimensional classification problems. However, SVM is less effective on imbalanced data sets. A cost-sensitive SVM (CSSVM) is therefore needed for imbalanced classification. This paper proposes a method for constructing a CSSVM based on information entropy, in which the information entropies of the different classes in the data set determine the values of the CSSVM penalty factors.
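The abstract does not give the rule that maps class entropies to penalty factors, so the following is only a minimal sketch under a stated assumption: each class receives a penalty proportional to the self-information of its prior, -log2(p_k), normalized so that the average penalty equals a base value. The function name `class_penalties` and the parameter `base_c` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def class_penalties(labels, base_c=1.0):
    """Return a per-class penalty factor that grows as the class gets rarer.

    Assumption: the self-information -log2(p_k) of each class prior stands in
    for the entropy-based rule, which the abstract does not spell out.
    """
    counts = Counter(labels)
    n = len(labels)
    # Self-information of each class prior: rare classes carry more bits.
    info = {k: -math.log2(c / n) for k, c in counts.items()}
    mean_info = sum(info.values()) / len(info)
    if mean_info == 0:
        # Only one class present: nothing to rebalance.
        return {k: base_c for k in counts}
    # Normalize so the average penalty across classes equals base_c.
    return {k: base_c * v / mean_info for k, v in info.items()}

# Toy data set: 90 majority-class labels and 10 minority-class labels.
labels = [0] * 90 + [1] * 10
penalties = class_penalties(labels)
```

A dictionary of this shape could then be supplied as per-class weights to an SVM implementation that supports them, for example the `class_weight` argument of scikit-learn's `SVC`, which scales the penalty parameter C per class.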


Classification with Local Clustering in Imbalanced Data Sets

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.219-220.151 ◽ 2011 ◽ Vol 219-220 ◽ pp. 151-155 ◽ Cited By ~ 2

Author(s): Hua Ji ◽ Hua Xiang Zhang

Keyword(s): Data Distribution ◽ Imbalanced Data ◽ Support Vector ◽ Data Sets ◽ Data Set ◽ Imbalanced Data Sets ◽ Local Clustering ◽ Rare Class ◽ Novel Method ◽ The Cost

Learning from imbalanced data sets arises in many real-world domains. A skewed class distribution challenges traditional classifiers, which achieve much lower classification accuracy on rare classes. To address this problem, we propose a novel classification method that applies local clustering based on the data distribution of the imbalanced data set. First, we divide the whole data set into several groups based on the data distribution. Then we perform local clustering within each group, on both the normal class and the disjoint rare class. For the rare class, over-sampling is then applied at different rates. Finally, we apply support vector machines (SVMs) for classification, using the traditional cost-matrix approach to improve classification accuracy. Experimental results on several UCI data sets show that this method can produce much higher prediction accuracy on the rare class than state-of-the-art methods.
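The over-sampling step described above can be illustrated with a small sketch. The abstract does not say how the per-cluster rates are chosen, so this example assumes one simple rule: every rare-class cluster is grown to a common target size by random duplication, which gives smaller clusters a higher over-sampling rate. The function `oversample_clusters` and the parameter `target_size` are hypothetical names, and the clusters are assumed to have been found already by some local clustering step.

```python
import random

def oversample_clusters(clusters, target_size, seed=0):
    """Duplicate random points so each rare-class cluster reaches target_size.

    Assumption: a shared target size is used, so smaller clusters get a
    higher over-sampling rate. Clusters already at or above the target,
    and empty clusters, are returned unchanged.
    """
    rng = random.Random(seed)
    result = []
    for points in clusters:
        points = list(points)
        extra = max(0, target_size - len(points)) if points else 0
        # Draw the extra samples with replacement from the cluster itself.
        result.append(points + [rng.choice(points) for _ in range(extra)])
    return result

# Two hypothetical rare-class clusters of different sizes.
rare_clusters = [
    [(0.1, 0.2), (0.2, 0.1)],
    [(5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.3)],
]
balanced = oversample_clusters(rare_clusters, target_size=4)
```

Here the first cluster is doubled in size while the second is left as it is, so the effective rate differs per cluster. Plain duplication is used only to keep the sketch short; it is not claimed to be the sampling scheme of the paper.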


Improving secretory proteins prediction in Mycobacterium tuberculosis using the unbiased dipeptide composition with support vector machine

International Journal of Data Mining and Bioinformatics ◽ 10.1504/ijdmb.2018.10018958 ◽ 2018 ◽ Vol 21 (3) ◽ pp. 212

Author(s): Saeed Ahmed ◽ Farman Ali ◽ Zakir Ali ◽ Muhammad Arif ◽ Muhammad Kabir ◽ ...

Keyword(s): Support Vector Machine ◽ Mycobacterium Tuberculosis ◽ Support Vector ◽ Secretory Proteins ◽ Dipeptide Composition


Boosting Support Vector Machines for Imbalanced Data Sets

Lecture Notes in Computer Science - Foundations of Intelligent Systems ◽ 10.1007/978-3-540-68123-6_4 ◽ 2008 ◽ pp. 38-47 ◽ Cited By ~ 22

Author(s): Benjamin X. Wang ◽ Nathalie Japkowicz

Keyword(s): Support Vector Machines ◽ Imbalanced Data ◽ Support Vector ◽ Data Sets ◽ Imbalanced Data Sets ◽ Vector Machines


A Novel Weighted Ensemble Method to Overcome the Impact of Under-fitting and Over-fitting on the Classification Accuracy of the Imbalanced Data Sets

Pakistan Journal of Statistics and Operation Research ◽ 10.18187/pjsor.v17i2.3640 ◽ 2021 ◽ pp. 483-496

Author(s): Ghulam Fatima ◽ Sana Saeed

Keyword(s): Data Mining ◽ Imbalanced Data ◽ Ensemble Method ◽ Support Vector ◽ Data Sets ◽ K Nearest Neighbor ◽ Imbalanced Data Sets ◽ Sampling Procedures ◽ The Impact

In the data mining community, data sets with imbalanced class distributions have received growing attention. The evolving field of data mining and knowledge discovery seeks to establish precise and effective computational tools for analyzing such data sets and extracting new knowledge from them. Sampling methods rebalance imbalanced data sets and consequently improve classifier performance. Over-fitting and under-fitting are the two most prominent problems when classifying imbalanced data sets. In this study, a novel weighted ensemble method is proposed to reduce the influence of over-fitting and under-fitting when classifying such data sets. Forty imbalanced data sets with varying imbalance ratios are used in a comparative study. The performance of the proposed method is compared with four conventional classifiers: decision tree (DT), k-nearest neighbor (KNN), support vector machine (SVM), and neural network (NN). The evaluation is carried out with two over-sampling procedures: the adaptive synthetic sampling approach (ADASYN) and the synthetic minority over-sampling technique (SMOTE). The proposed scheme was effective in reducing the impact of over-fitting and under-fitting on the classification of these data sets.
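The abstract names a weighted ensemble but gives neither its base learners nor its weighting rule. As a generic illustration of weighted voting only, the sketch below combines one predicted label per model, with each vote scaled by a weight. The model names, the weights, and the function `weighted_vote` are all hypothetical; in practice the weights might come from a validation score, which is likewise an assumption rather than the paper's scheme.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across models.

    predictions maps a model name to its predicted label for one sample;
    weights maps the same model names to non-negative vote weights.
    """
    scores = defaultdict(float)
    for name, label in predictions.items():
        scores[label] += weights[name]
    # Ties are resolved in favor of the label that was scored first.
    return max(scores, key=scores.get)

# Hypothetical weights, e.g. derived from a validation metric.
weights = {"model_a": 0.6, "model_b": 0.7, "model_c": 0.9, "model_d": 0.8}
# Two models vote for class 0 and two for class 1.
predictions = {"model_a": 0, "model_b": 0, "model_c": 1, "model_d": 1}
label = weighted_vote(predictions, weights)
```

With equal weights this sample would be a tie between the two classes. The weights break it in favor of class 1, which holds a total of 1.7 against 1.3.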
