FIVE NEW FEATURE SELECTION METRICS IN TEXT CATEGORIZATION

Author(s):  
FENGXI SONG ◽  
DAVID ZHANG ◽  
YONG XU ◽  
JIZHONG WANG

Feature selection has been extensively applied in statistical pattern recognition as a mechanism for cleaning up the set of features used to represent data and as a way of improving classifier performance. Four schemes commonly used for feature selection are Exponential Searches, Stochastic Searches, Sequential Searches, and Best Individual Features. The most popular scheme in text categorization is Best Individual Features, as the extremely high dimensionality of text feature spaces renders the other three schemes prohibitively time-consuming. This paper proposes five new metrics for selecting Best Individual Features for use in text categorization. Their effectiveness has been empirically tested on two well-known data collections, Reuters-21578 and 20 Newsgroups. Experimental results show that the performance of two of the five new metrics, Bayesian Rule and F-one Value, is not significantly below that of a good traditional text categorization selection metric, Document Frequency. The performance of another two of the five new metrics, Low Loss Dimensionality Reduction and Relative Frequency Difference, is equal to or better than that of good conventional feature selection metrics such as Mutual Information and the Chi-square Statistic.
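As an illustration of how Best Individual Features selection works, two of the traditional metrics named above can be scored per term from a labeled corpus, after which the top-scoring terms are kept. The toy corpus and function names below are illustrative assumptions, not the paper's data or code; the chi-square statistic is computed from the standard 2x2 term/category contingency table.

```python
def document_frequency(docs, term):
    # Document Frequency: number of documents containing the term.
    return sum(term in doc for doc in docs)

def chi_square(docs, labels, term, category):
    # Chi-square statistic from the 2x2 contingency table:
    # A = term present & in category,  B = term present & not in category,
    # C = term absent  & in category,  D = term absent  & not in category.
    A = B = C = D = 0
    for doc, lab in zip(docs, labels):
        if term in doc:
            if lab == category:
                A += 1
            else:
                B += 1
        else:
            if lab == category:
                C += 1
            else:
                D += 1
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# Toy corpus: each document is a set of terms (illustrative only).
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"law", "court"}]
labels = ["sport", "sport", "politics", "politics"]

df_ball = document_frequency(docs, "ball")              # 2
chi_ball = chi_square(docs, labels, "ball", "sport")    # 4.0 on this corpus
```

Ranking every vocabulary term by such a score and keeping the top k is exactly the Best Individual Features scheme: each feature is judged in isolation, which is what makes the approach tractable in very high-dimensional text spaces.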

2021 ◽  
Vol 25 (1) ◽  
pp. 21-34
Author(s):  
Rafael B. Pereira ◽  
Alexandre Plastino ◽  
Bianca Zadrozny ◽  
Luiz H.C. Merschmann

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification, and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific to the multi-label context. Experimental results show that the proposed technique is competitive with the multi-label feature selection techniques currently used in the literature, and is clearly more scalable as the amount of data increases.


2012 ◽  
Vol 532-533 ◽  
pp. 1191-1195 ◽  
Author(s):  
Zhen Yan Liu ◽  
Wei Ping Wang ◽  
Yong Wang

This paper introduces the design of a text categorization system based on the Support Vector Machine (SVM). It analyzes the high-dimensional nature of text data, which is the reason SVM is well suited to text categorization. The system is constructed according to its data flow and consists of three subsystems: text representation, classifier training, and text classification. The core of the system is classifier training, but text representation directly influences the accuracy of the classifier and the performance of the system as a whole. The text feature vector space can be built by different kinds of feature selection and feature extraction methods. Since no research indicates which method is best, several feature selection and feature extraction methods are implemented in the system. For a specific classification task, every feature selection method and every feature extraction method is tested, and the best-performing set of methods is then adopted.
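The three-subsystem flow described above can be sketched with scikit-learn (an assumption for illustration; the paper does not name its tooling): TF-IDF vectors serve as the text representation, a linear SVM is fitted in the classifier-training step, and text classification is a call to `predict` on unseen documents. The training corpus below is a toy example, not the paper's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training corpus (illustrative only).
train_docs = [
    "the match ended with a late goal",
    "the team won the championship game",
    "parliament passed the new tax law",
    "the court ruled on the election case",
]
train_labels = ["sport", "sport", "politics", "politics"]

# Text representation (TF-IDF) + classifier training (linear SVM),
# chained so both subsystems share one fit/predict interface.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# Text classification subsystem: label an unseen document.
prediction = model.predict(["the referee awarded a penalty goal"])
```

Because the vectorizer and the classifier are separate pipeline stages, swapping in a different representation (e.g., another feature selection or extraction method) leaves the training and classification subsystems untouched, which mirrors the modular design the abstract describes.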


Entropy ◽  
2019 ◽  
Vol 21 (6) ◽  
pp. 602 ◽  
Author(s):  
Jaesung Lee ◽  
Jaegyun Park ◽  
Hae-Cheon Kim ◽  
Dae-Won Kim

Multi-label feature selection is an important task for text categorization. This is because it enables learning algorithms to focus on essential features that foreshadow relevant categories, thereby improving the accuracy of text categorization. Recent studies have considered the hybridization of evolutionary feature wrappers and filters to enhance the evolutionary search process. However, the relative effectiveness of the evolutionary and feature-filter operators in the feature subset search has not been considered, which can result in degenerate final feature subsets. In this paper, we propose a novel hybridization approach based on competition between the operators. This enables the proposed algorithm to apply each operator selectively and modify the feature subset according to its relative effectiveness, unlike conventional methods. The experimental results on 16 text datasets verify that the proposed method is superior to conventional methods.
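The competition idea can be illustrated with a generic adaptive-operator-selection sketch (hypothetical names and objective; this is not the authors' exact algorithm): each operator earns credit whenever it improves the current feature subset, and operators with more accumulated credit are chosen more often, so the more effective operator gradually dominates the search.

```python
import random

def adaptive_search(evaluate, operators, subset, iters=200, seed=0):
    # Generic competition sketch: credit-proportional operator choice.
    rng = random.Random(seed)
    credit = {name: 1.0 for name in operators}
    best, best_score = subset, evaluate(subset)
    for _ in range(iters):
        # Roulette-wheel selection proportional to accumulated credit.
        pick = rng.uniform(0, sum(credit.values()))
        for name, op in operators.items():
            pick -= credit[name]
            if pick <= 0:
                break
        cand = op(best, rng)
        score = evaluate(cand)
        if score > best_score:
            credit[name] += score - best_score  # reward the winning operator
            best, best_score = cand, score
    return best, best_score

# Toy objective (illustrative): reward a hidden relevant set, penalize size.
relevant = {0, 2, 5}
def evaluate(s):
    return len(s & relevant) - 0.1 * len(s)

operators = {
    # "Evolutionary" operator: flip membership of a random feature.
    "flip": lambda s, r: s ^ {r.randrange(10)},
    # "Filter" operator: add the best still-missing feature per a fixed ranking.
    "add_ranked": lambda s, r: s | {min(relevant - s)} if relevant - s else s,
}

best, best_score = adaptive_search(evaluate, operators, set())
```

On this toy objective the search converges to the hidden relevant set, with the deterministic filter-style operator collecting most of the credit; the real method competes evolutionary and multi-label filter operators over actual classification quality.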


2014 ◽  
Vol 599-601 ◽  
pp. 1824-1828
Author(s):  
Juan Wang ◽  
Zhi Xun Zhang ◽  
Yong Dong Wang

Feature extraction is a key step in text categorization[1]. The accuracy of extraction directly affects the accuracy of text classification. This paper introduces and compares four commonly used methods of text feature extraction: IG (Information Gain), MI (Mutual Information), CHI (Chi-square statistic), and DF (Document Frequency), and proposes an improved method based on CHI. Experimental results show that the proposed method can improve the accuracy of text categorization.
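As an illustration of one of the compared metrics, Information Gain scores a term by the reduction in category-label entropy obtained from knowing whether a document contains that term. The corpus and function names below are illustrative assumptions, not the paper's data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a category-label distribution, in bits.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    # IG(t) = H(C) - [ P(t) H(C|t) + P(not t) H(C|not t) ]
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_t, without_t) if part
    )
    return entropy(labels) - conditional

# Toy corpus: each document is a set of terms (illustrative only).
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"law", "court"}]
labels = ["sport", "sport", "politics", "politics"]

ig_ball = information_gain(docs, labels, "ball")  # 1.0: "ball" perfectly splits the classes
```

A term that appears in every class regardless of label scores near zero, whereas a term confined to one class scores high; ranking terms by this score and keeping the top k is the usual IG-based feature extraction step.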


Author(s):  
Kwan Yi ◽  
Jamshid Beheshti

In document representation for digitized text, feature selection refers to the selection of the terms that represent a document and distinguish it from other documents. This study probes different feature selection methods for HMM learning models to explore how they affect model performance, evaluated in the context of a text categorization task.

