An Examination of Feature Selection Frameworks in Text Categorization

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research on multi-label classification. In particular, feature selection methods have been developed to identify relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific to the multi-label context. Experimental results show that the proposed technique is competitive with the multi-label feature selection techniques currently used in the literature, and is clearly more scalable as the amount of data increases.


A feature selection based on deviation from feature centroid for text categorization

2011 2nd International Conference on Intelligent Control and Information Processing ◽ 10.1109/icicip.2011.6008227 ◽ 2011

Author(s): Jieming Yang ◽ Zhiying Liu

Keyword(s): Feature Selection ◽ Text Categorization


Design of Text Categorization System Based on SVM

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.532-533.1191 ◽ 2012 ◽ Vol 532-533 ◽ pp. 1191-1195 ◽ Cited By ~ 1

Author(s): Zhen Yan Liu ◽ Wei Ping Wang ◽ Yong Wang

Keyword(s): Feature Extraction ◽ Feature Selection ◽ Text Categorization ◽ Feature Selection Method ◽ Extraction Methods ◽ Support Vector ◽ Text Representation ◽ Text Feature ◽ Categorization System ◽ Classifier Training

This paper introduces the design of a text categorization system based on the Support Vector Machine (SVM). It analyzes the high-dimensional nature of text data and explains why SVM is well suited to text categorization. The system is constructed according to its data flow and consists of three subsystems: text representation, classifier training and text classification. The core of the system is classifier training, but text representation directly influences the accuracy of the classifier and the performance of the system. The text feature vector space can be built with different kinds of feature selection and feature extraction methods. No research has established which method is best, so the system implements many feature selection and feature extraction methods. For a specific classification task, every feature selection method and every feature extraction method is tested, and the best-performing set of methods is then adopted.


A HYBRID FEATURE SELECTION METHOD FOR TEXT CATEGORIZATION

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽ 10.1142/s0218488507004492 ◽ 2007 ◽ Vol 15 (02) ◽ pp. 133-151 ◽ Cited By ~ 2

Author(s): E. MONTAÑÉS ◽ J. R. QUEVEDO ◽ E. F. COMBARRO ◽ I. DÍAZ ◽ J. RANILLA

Keyword(s): Feature Selection ◽ Text Categorization ◽ Hybrid Approach ◽ Feature Selection Method ◽ Selection Method ◽ Fast Method ◽ Evaluation Function ◽ Wrapper Approach ◽ Wrapper Method ◽ Filtering Approach

Feature Selection is an important task within Text Categorization, where irrelevant or noisy features are usually present, causing a loss in the performance of the classifiers. Feature Selection in Text Categorization has usually been performed using a filtering approach based on selecting the features with the highest scores according to certain measures. Measures of this kind come from the Information Retrieval, Information Theory and Machine Learning fields. Wrapper approaches are known to perform better in Feature Selection than filtering approaches, although they are time-consuming and sometimes infeasible, especially in text domains. A wrapper that explores a reduced number of feature subsets and uses a fast method as its evaluation function could, however, overcome these difficulties. The wrapper presented in this paper satisfies these properties. Since exploring a reduced number of subsets could lead to less promising subsets, a hybrid approach that combines the wrapper method with some scoring measures makes it possible to explore more promising feature subsets. A comparison among some scoring measures, the wrapper method and the hybrid approach is performed. The results reveal that the hybrid approach outperforms both the wrapper approach and the scoring measures, particularly for corpora whose features are less scattered over the categories.
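
The general filter-plus-wrapper idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function name `hybrid_select`, the choice to evaluate only nested top-k subsets, and the toy evaluation function are all assumptions made for illustration. A filter score ranks the features, and a wrapper then calls a (presumably fast) evaluation function on a small number of candidate subsets taken from the top of that ranking.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def hybrid_select(
    scores: Dict[str, float],
    evaluate: Callable[[List[str]], float],
    sizes: Sequence[int],
) -> Tuple[List[str], float]:
    """Rank features by a filter score, then let a wrapper choose a top-k subset.

    `scores` maps each feature to a filter score (hypothetical; any scoring
    measure could be plugged in). `evaluate` is the wrapper's evaluation
    function: it takes a feature subset and returns a quality estimate.
    """
    # Filter step: order features from highest to lowest score.
    ranked = sorted(scores, key=lambda feature: scores[feature], reverse=True)

    best_subset: List[str] = []
    best_value = float("-inf")
    # Wrapper step: evaluate only the nested top-k subsets, so the number
    # of evaluations equals len(sizes) rather than growing exponentially.
    for k in sizes:
        subset = ranked[:k]
        value = evaluate(subset)
        if value > best_value:
            best_subset, best_value = subset, value
    return best_subset, best_value


# Toy usage: two features are "useful"; each extra feature costs a small penalty.
USEFUL = {"goal", "match"}


def toy_evaluate(subset: List[str]) -> float:
    return len(USEFUL & set(subset)) - 0.1 * len(subset)


toy_scores = {"goal": 0.9, "match": 0.7, "the": 0.1, "and": 0.05}
selected, quality = hybrid_select(toy_scores, toy_evaluate, sizes=[1, 2, 3, 4])
```

In this toy run the wrapper settles on the two highest-scoring features, because adding the remaining low-scoring ones only lowers the evaluation value. In practice `evaluate` would be something like the cross-validated accuracy of a fast classifier.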


Experiments on the Use of Feature Selection and Machine Learning Methods in Automatic Malay Text Categorization

Procedia Technology ◽ 10.1016/j.protcy.2013.12.254 ◽ 2013 ◽ Vol 11 ◽ pp. 748-754 ◽ Cited By ~ 6

Author(s): Hamood Alshalabi ◽ Sabrina Tiun ◽ Nazlia Omar ◽ Mohammed Albared

Keyword(s): Machine Learning ◽ Feature Selection ◽ Text Categorization ◽ Learning Methods ◽ Machine Learning Methods


CLDA: Feature Selection for Text Categorization Based on Constrained LDA

International Conference on Semantic Computing (ICSC 2007) ◽ 10.1109/icsc.2007.108 ◽ 2007 ◽ Cited By ~ 4

Author(s): Cui Zifeng ◽ Xu Baowen ◽ Zhang Weifeng ◽ Jiang Dawei ◽ Xu Junling

Keyword(s): Feature Selection ◽ Text Categorization ◽ Selection For


An Algorithm of Feature Selection in Text Categorization Based on Gini-index

Proceedings of the 2015 International Conference on Management Science and Management Innovation ◽ 10.2991/msmi-15.2015.50 ◽ 2015

Author(s): Wei-Dong Zhu ◽ Bo Wang ◽ Yong-Min Lin

Keyword(s): Feature Selection ◽ Gini Index ◽ Text Categorization


Feature selection and feature extraction for text categorization

Proceedings of the workshop on Speech and Natural Language - HLT '91 ◽ 10.3115/1075527.1075574 ◽ 1992 ◽ Cited By ~ 205

Author(s): David D. Lewis

Keyword(s): Feature Extraction ◽ Feature Selection ◽ Text Categorization


A Probabilistic Approach to Feature Selection for Multi-class Text Categorization

Advances in Neural Networks – ISNN 2007 - Lecture Notes in Computer Science ◽ 10.1007/978-3-540-72383-7_153 ◽ 2007 ◽ pp. 1310-1317 ◽ Cited By ~ 5

Author(s): Ke Wu ◽ Bao-Liang Lu ◽ Masao Uchiyama ◽ Hitoshi Isahara

Keyword(s): Feature Selection ◽ Text Categorization ◽ Probabilistic Approach ◽ Selection For


Using the Text Categorization Framework for Protein Classification

Handbook of Research on Text and Web Mining Technologies ◽ 10.4018/978-1-59904-990-8.ch008 ◽ 2010 ◽ pp. 128-140 ◽ Cited By ~ 1

Author(s): Ricco Rakotomalala ◽ Faouzi Mhamdi

Keyword(s): Support Vector Machine ◽ Feature Selection ◽ Predictive Model ◽ Text Categorization ◽ Learning Algorithm ◽ Support Vector ◽ Protein Classification ◽ Fixed Length ◽ Selection Algorithms ◽ Proteins Classification

In this chapter, we are interested in protein classification starting from primary structures. The goal is to automatically assign protein sequences to their families. The main originality of the approach is that we directly apply the text categorization framework to protein classification with very minor modifications. The main steps of the task are clearly identified: we extract features from the unstructured dataset, using fixed-length n-gram descriptors; we select and combine the most relevant ones for the learning phase; and then we select the most promising learning algorithm in order to produce an accurate predictive model. We obtain two main results. First, the approach is credible, giving accurate results with descriptors of length 2 only (2-grams). Second, in our context, where many irrelevant descriptors are automatically generated, we must combine aggressive feature selection algorithms with low-variance classifiers such as SVM (Support Vector Machine).
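
The fixed-length n-gram descriptors mentioned in this abstract can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the chapter's implementation: the helper names `ngram_counts` and `to_vector`, the use of overlapping n-grams, and the raw-count weighting are all choices made here for clarity.

```python
from collections import Counter
from typing import List, Sequence


def ngram_counts(sequence: str, n: int = 2) -> Counter:
    """Count the overlapping n-grams of a sequence (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty range, hence no n-grams.
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def to_vector(sequence: str, vocabulary: Sequence[str], n: int = 2) -> List[int]:
    """Map a sequence to a count vector over a fixed n-gram vocabulary."""
    counts = ngram_counts(sequence, n)
    # Counter returns 0 for n-grams that never occur in the sequence.
    return [counts[gram] for gram in vocabulary]


# Toy usage with a made-up amino-acid string and a made-up vocabulary.
vocabulary = ["MK", "KK", "KV", "AA"]
vector = to_vector("MKKV", vocabulary)
```

Each sequence then becomes a fixed-length vector, just as a document becomes a bag-of-words vector in text categorization. With 20 amino acids there are at most 400 distinct 2-grams, and the vocabulary grows quickly with n, which is why the abstract stresses aggressive feature selection.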
