Text Classification Using Ensemble Of Non-Linear Support Vector Machines

With the advent of the digital era, billions of documents are generated every day that need to be managed, processed, and classified. An enormous volume of text data is available on the World Wide Web and from other sources. A first step in managing this mammoth volume of data is classifying the available documents into the right categories. Supervised machine learning approaches try to solve the problem of document classification, but working on large data sets with heterogeneous classes is a big challenge. Automatic tagging and classification of text documents is useful because of its many potential applications, such as classifying emails as spam or non-spam, or news articles as political, entertainment, stock market, sports news, etc. The paper proposes a novel approach for classifying text into known classes using an ensemble of refined Support Vector Machines. The advantage of the proposed technique is that it can considerably reduce the size of the training data by adopting dimensionality reduction as a pre-training step. The proposed technique has been applied to three benchmark data sets, namely the CMU Dataset, the 20 Newsgroups Dataset, and the Classic Dataset. Experimental results show that the proposed approach is more accurate and efficient than other state-of-the-art methods.
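
The abstract above does not say which kernels, reduction method, or voting rule the authors use, so the following is only a minimal sketch of the general idea, assuming scikit-learn is available. It reduces TF-IDF features with truncated SVD and then combines three non-linear SVMs by majority vote. The toy corpus, the choice of `TruncatedSVD`, the three kernels, and all parameter values are illustrative assumptions, not the paper's configuration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus; the paper evaluates on CMU, 20 Newsgroups and Classic.
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the goalkeeper saved a penalty in the match",
    "the stock market fell sharply today",
    "investors sold shares as prices dropped",
    "the central bank raised interest rates",
    "shares rallied after strong company earnings",
]
labels = ["sports"] * 4 + ["finance"] * 4

# Ensemble of non-linear SVMs; each member uses a different kernel (assumption).
ensemble = VotingClassifier(
    estimators=[
        ("rbf", SVC(kernel="rbf")),
        ("poly", SVC(kernel="poly", degree=2)),
        ("sigmoid", SVC(kernel="sigmoid")),
    ],
    voting="hard",  # majority vote over the predicted labels
)

model = make_pipeline(
    TfidfVectorizer(),                             # sparse term weights
    TruncatedSVD(n_components=4, random_state=0),  # dimensionality reduction before training
    ensemble,
)
model.fit(docs, labels)

predictions = model.predict(
    ["the players scored in the match", "prices of shares fell"]
)
print(list(predictions))
```

On a corpus this small the predictions carry no weight; the point is only the order of the steps: vectorize, reduce, then train the ensemble on the reduced features.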

Support Vector Machines on Large Data Sets: Simple Parallel Approaches

Studies in Classification, Data Analysis, and Knowledge Organization - Data Analysis, Machine Learning and Knowledge Discovery, 2013, pp. 87-95

DOI: 10.1007/978-3-319-01595-8_10

Cited By ~ 5

Author(s): Oliver Meyer, Bernd Bischl, Claus Weihs

Keyword(s): Support Vector Machines, Large Data, Large Data Sets, Support Vector, Data Sets, Vector Machines

Multi-Class Support Vector Machines for Large Data Sets via Minimum Enclosing Ball Clustering

2007 4th International Conference on Electrical and Electronics Engineering, 2007

DOI: 10.1109/iceee.2007.4344994

Cited By ~ 2

Author(s): Jair Cervantes, Xiaoou Li, Wen Yu, Javier Bejarano

Keyword(s): Support Vector Machines, Large Data, Large Data Sets, Support Vector, Data Sets, Vector Machines, Minimum Enclosing Ball

Using support vector machines for mining regression classes in large data sets

2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering. TENCON '02. Proceedings., 2004

DOI: 10.1109/tencon.2002.1181221

Author(s): Zonghai Sun, Lixin Gao, Youxian Sun

Keyword(s): Support Vector Machines, Large Data, Large Data Sets, Support Vector, Data Sets, Vector Machines

Study on PMV Index Forecasting Method Based on Fuzzy C-Means Clustering

Advanced Materials Research, 2011, Vol 383-390, pp. 925-930

DOI: 10.4028/www.scientific.net/amr.383-390.925

Author(s): Chun Cheng Zhang, Xiang Guang Chen, Yuan Qing Xu

Keyword(s): Support Vector Machines, Clustering Algorithm, Large Data, Support Vector, Data Sets, Forecasting Accuracy, Fuzzy C Means, Forecasting Method, Vector Machines, Fuzzy C Means Clustering

In order to improve the forecasting accuracy of indoor thermal comfort, the basic principles of the fuzzy c-means clustering algorithm (FCM) and support vector machines (SVM) are analyzed. An SVM forecasting method based on FCM data preprocessing is proposed in this paper. Using the proposed method, large data sets can be divided into multiple mixed groups, each represented by a single regression model. Support vector machines based on the fuzzy c-means clustering algorithm (FCM+SVM) and a BP neural network based on the fuzzy c-means clustering algorithm (FCM+BPNN) are each applied to forecast the PMV index. The experimental results demonstrate that the FCM+SVM method has better forecasting accuracy than the FCM+BPNN method.
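
The FCM preprocessing step named in the abstract can be illustrated with a minimal pure-Python sketch of standard fuzzy c-means. The abstract gives no implementation details, so the fuzzifier `m = 2`, the farthest-point initialization, the tolerance, and the toy points are all assumptions. Only the grouping step is shown; in the approach described above, each resulting group would then be fitted with its own regression model.

```python
import math

def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6):
    """Standard fuzzy c-means on a list of equal-length tuples.

    Returns (centers, memberships), where memberships[i][k] is the degree
    to which point i belongs to cluster k (each row sums to 1).
    """
    # Farthest-point initialization (an assumption; keeps the sketch deterministic).
    centers = [list(points[0])]
    while len(centers) < n_clusters:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(farthest))

    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if min(dists) == 0.0:
                # The point coincides with a center: full membership there.
                hit = dists.index(0.0)
                row = [1.0 if k == hit else 0.0 for k in range(n_clusters)]
            else:
                row = [1.0 / sum((dk / dj) ** power for dj in dists) for dk in dists]
            memberships.append(row)

        # Center update: mean of the points weighted by u_ik^m.
        shift = 0.0
        for k in range(n_clusters):
            weights = [row[k] ** m for row in memberships]
            total = sum(weights)
            new_center = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(len(points[0]))
            ]
            shift = max(shift, math.dist(new_center, centers[k]))
            centers[k] = new_center
        if shift < tol:
            break
    return centers, memberships

# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, memberships = fuzzy_c_means(points, n_clusters=2)

# Hard assignment: each point goes to the cluster with the highest membership.
groups = [max(range(2), key=lambda k: row[k]) for row in memberships]
print(groups)
```

Unlike k-means, every point keeps a graded membership in every cluster, so points near a boundary can be treated as belonging to more than one group before any hard assignment is made.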

Fast classification for large data sets via random selection clustering and Support Vector Machines

Intelligent Data Analysis, 2012, Vol 16 (6), pp. 897-914

DOI: 10.3233/ida-2012-00558

Cited By ~ 5

Author(s): Xiaoou Li, Jair Cervantes, Wen Yu

Keyword(s): Support Vector Machines, Large Data, Random Selection, Large Data Sets, Support Vector, Data Sets, Vector Machines, Fast Classification

Scaling Large Learning Problems with Hard Parallel Mixtures

International Journal of Pattern Recognition and Artificial Intelligence, 2003, Vol 17 (03), pp. 349-365

DOI: 10.1142/s0218001403002411

Cited By ~ 15

Author(s): Ronan Collobert, Yoshua Bengio, Samy Bengio

Keyword(s): Large Data, Generative Models, Training Data, Learning Problems, Support Vector, Data Sets, Training Time, Vector Machines, Research Challenge, Probabilistic Version

A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a "hard parallelizable mixture" methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a "gater" model in such a way that it becomes easy to learn an "expert" model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allow representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative log-likelihood.
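
A back-of-the-envelope calculation shows why partitioning the data can turn quadratic training time into roughly linear time. It rests on simplifying assumptions that are not taken from the paper: training one SVM on $n$ examples costs exactly $c\,n^{2}$, the $N$ examples are split evenly among $K$ experts, and the cost of training the gater and of repeating the partitioning step is ignored.

```latex
\begin{align*}
T_{\text{single}}(N) &= c\,N^{2}, \\
T_{\text{mixture}}(N, K) &= K \cdot c \left(\frac{N}{K}\right)^{2} = \frac{c\,N^{2}}{K}, \\
T_{\text{mixture}}\!\left(N, \tfrac{N}{n}\right) &= c\,n\,N
  \quad \text{(each expert sees a fixed number } n = N/K \text{ of examples)}.
\end{align*}
```

Under these assumptions, keeping the expert size $n$ fixed while $N$ grows makes the total work grow linearly in $N$. If the $K$ experts are trained in parallel on separate machines, the wall-clock time for one round of expert training is $c\,n^{2}$, independent of $N$.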
