A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization

2014 ◽  
Vol 65 (10) ◽  
pp. 1964-1987 ◽  
Author(s):  
Yindalon Aphinyanaphongs ◽  
Lawrence D. Fu ◽  
Zhiguo Li ◽  
Eric R. Peskin ◽  
Efstratios Efstathiadis ◽  
...  
2015 ◽  
Vol 42 (4) ◽  
pp. 1941-1949 ◽  
Author(s):  
Roberto H.W. Pinheiro ◽  
George D.C. Cavalcanti ◽  
Tsang Ing Ren

2014 ◽  
Vol 2014 ◽  
pp. 1-17 ◽  
Author(s):  
Jieming Yang ◽  
Zhaoyang Qu ◽  
Zhiying Liu

Filtering feature-selection algorithms are an important approach to dimensionality reduction in text categorization. Most of them evaluate the significance of a feature for a category on the assumption of a balanced dataset and do not account for class imbalance. In this paper, a new scheme is proposed that weakens the adverse effect of class imbalance in the corpus. We evaluated improved versions of nine well-known feature-selection methods (Information Gain, Chi statistic, Document Frequency, Orthogonal Centroid Feature Selection, DIA association factor, Comprehensive Measurement Feature Selection, Deviation from Poisson Feature Selection, improved Gini index, and Mutual Information) using naïve Bayes and support vector machines on three benchmark document collections (20-Newsgroups, Reuters-21578, and WebKB). The experimental results show that the improved scheme significantly enhances the performance of the feature-selection methods.
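As a rough illustration of the kind of filter scoring these methods share, the sketch below computes Information Gain, one of the nine measures evaluated above, for each term of a toy corpus: IG(t) = H(C) − [P(t)H(C|t) + P(¬t)H(C|¬t)]. The documents, labels, and terms are invented for illustration and are not from the paper's corpora.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t)H(C|t) + P(~t)H(C|~t)] over binary term presence."""
    n = len(docs)
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    h_c = entropy(list(Counter(labels).values()))
    h_cond = 0.0
    for subset in (present, absent):
        if subset:
            h_cond += (len(subset) / n) * entropy(list(Counter(subset).values()))
    return h_c - h_cond

# Toy corpus: each document is a set of terms, each with one category label.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "goal"}]
labels = ["sport", "sport", "politics", "politics"]

scores = {t: information_gain(docs, labels, t) for t in ("ball", "vote", "goal", "team")}
# A filter method keeps the top-k terms by score and discards the rest.
top = sorted(scores, key=scores.get, reverse=True)
```

Here "ball" and "vote" perfectly separate the two categories (IG = 1 bit), while "goal" occurs once in each and carries no information (IG = 0); a filter method would prune it first.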


2018 ◽  
Vol 8 (2) ◽  
pp. 1-24 ◽  
Author(s):  
Abdullah Saeed Ghareb ◽  
Azuraliza Abu Bakar ◽  
Qasem A. Al-Radaideh ◽  
Abdul Razak Hamdan

The filtering of large amounts of data is an important process in data mining tasks, particularly in the categorization of unstructured, high-dimensional data. A feature selection process is therefore desired to reduce the high-dimensional space to a small subset of relevant dimensions that represent the best features for text categorization. In this article, three enhanced filter feature selection methods, Category Relevant Feature Measure, Modified Category Discriminated Measure, and Odd Ratio2, are proposed. These methods combine relevant information about features both within and between categories. The effectiveness of the proposed methods with naïve Bayes and associative classification is evaluated by traditional measures of text categorization, namely macro-averaged precision, recall, and F-measure. Experiments are conducted on three Arabic text datasets used for text categorization. The experimental results showed that the proposed methods achieve results that are better than or comparable to those of 12 well-known traditional methods.
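The macro-averaged measures used in this evaluation average per-category precision, recall, and F-measure so that every category counts equally regardless of size. A minimal sketch (the labels below are invented for illustration):

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged precision, recall, and F1 over the given classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Illustrative predictions over three categories.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
p, r, f = macro_f1(y_true, y_pred, ["a", "b", "c"])
```

Because each category contributes one vote to the average, macro measures penalize poor performance on rare categories, which micro-averaging would hide.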


Author(s):  
Kwan Yi ◽  
Jamshid Beheshti

In document representation for digitized text, feature selection refers to the selection of the terms that represent a document and distinguish it from other documents. This study probes different feature selection methods for HMM learning models to explore how they affect model performance, which is examined in the context of a text categorization task.


2015 ◽  
Vol 43 (2) ◽  
pp. 174-185 ◽  
Author(s):  
Deniz Kılınç ◽  
Akın Özçift ◽  
Fatma Bozyigit ◽  
Pelin Yıldırım ◽  
Fatih Yücalar ◽  
...  

Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet increases explosively with each passing day. In news portals in particular, documents belonging to categories such as technology, sports, and politics sometimes appear in the wrong category, or are placed in a generic category called "others". At this point, text categorization (TC), generally addressed as a supervised learning task, is needed. Although a substantial number of TC studies have been conducted in other languages, the number of studies conducted in Turkish is very limited owing to the limited accessibility and usability of existing datasets. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used TC classifiers and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy value, 91.03%, is obtained with the combination of the Random Forest classifier and an attribute-ranking-based feature selection method in all comparisons performed after the pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.


Author(s):  
Le Nguyen Hoai Nam ◽  
Ho Bao Quoc

The bag-of-words technique is often used to represent a document in text categorization. However, for a large set of documents, where the dimension of the bag-of-words vector is very high, text categorization becomes a serious challenge as a result of sparse data, over-fitting, and irrelevant features. A filter feature selection method reduces the number of features by eliminating irrelevant features from the bag-of-words vector. In this paper, we analyze the weak points and strong points of two filter feature selection approaches: the frequency-based approach and the cluster-based approach. Based on this analysis, we propose hybrid filter feature selection methods, named the Frequency-Cluster Feature Selection (FCFS) and the Detailed Frequency-Cluster Feature Selection (DtFCFS), to further improve the performance of the filter feature selection process in text categorization. The FCFS is a combination of the frequency-based approach and the cluster-based approach, while the DtFCFS, a detailed version of the FCFS, is a comprehensively hybrid cluster-based method. We conduct experiments on four benchmark datasets (the Reuters-21578 and Newsgroup datasets for news classification, the Ohsumed dataset for medical document classification, and the LingSpam dataset for email classification) to compare the proposed methods with six related well-known methods: the Comprehensive Measurement Feature Selection (CMFS), the Optimal Orthogonal Centroid Feature Selection (OCFS), the Crossed Centroid Feature Selection (CIIC), the Information Gain (IG), the Chi-square (CHI), and the Deviation from Poisson Feature Selection (DFPFS). In terms of the Micro-F1, the Macro-F1, and the dimension reduction rate, the DtFCFS is superior to the other methods, while the FCFS shows performance competitive with and even superior to the good methods, especially for the Macro-F1.
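Among the baselines named above, the Chi-square (CHI) score can be sketched directly from the 2x2 contingency table between term presence and category membership; the counts below are invented for illustration, not drawn from the benchmark datasets.

```python
def chi_square(n_tc, n_t, n_c, n):
    """Chi-square score of a term t for a category c.

    n_tc: documents in c containing t
    n_t:  documents containing t
    n_c:  documents in c
    n:    total documents
    """
    a = n_tc                   # t present, in c
    b = n_t - n_tc             # t present, not in c
    c = n_c - n_tc             # t absent, in c
    d = n - n_t - n_c + n_tc   # t absent, not in c
    num = n * (a * d - c * b) ** 2
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return num / den if den else 0.0

# Illustrative counts: 100 docs, 50 in the category, term in 40 docs,
# 35 of which fall in the category -- a strongly associated term.
strong = chi_square(35, 40, 50, 100)

# A term distributed exactly as chance predicts (40 * 50 / 100 = 20) scores 0.
independent = chi_square(20, 40, 50, 100)
```

A filter method ranks all terms by such a score (often taking the maximum or average over categories) and keeps only the top-ranked ones, which is what the dimension reduction rate above measures.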

