A new feature selection method based on distributional information for Text Classification

Text classification is the problem of assigning a set of documents to a pre-defined set of classes. A major difficulty in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that help determine a document's class, while the rest add noise, can make the results unreliable, and significantly increase computational time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated, including a new feature selection method called the GU metric. The other methods evaluated are the Chi-Squared (χ2) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.
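
As an illustration of how a filter-style metric from the list above scores words, the following is a minimal sketch of the Chi-Squared statistic computed from a 2x2 term/class contingency table. It is not the GU metric, whose definition is not given in the abstract. The function names `chi_squared` and `rank_terms`, and the representation of documents as token lists, are assumptions made for this example.

```python
from collections import Counter


def chi_squared(a, b, c, d):
    """Chi-squared score for one (term, class) pair from a 2x2 table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A term (or class) that covers all or none of the documents
        # carries no information, so score it as zero.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels, target):
    """Rank terms by chi-squared score with respect to class `target`.

    docs: list of token lists; labels: parallel list of class labels.
    Ties are broken alphabetically so the order is deterministic.
    """
    n_in = sum(1 for y in labels if y == target)
    n_out = len(labels) - n_in
    df_in, df_out = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (df_in if y == target else df_out).update(set(tokens))
    scores = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        scores[term] = chi_squared(a, b, n_in - a, n_out - b)
    return sorted(scores, key=lambda t: (-scores[t], t))
```

Because the statistic squares the difference `a * d - c * b`, a word that is strongly associated with the absence of a class scores as high as one associated with its presence. One-sided variants such as the NGL coefficient keep the sign of that difference instead.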

English Text Classification Using Improved Recursive Feature Elimination (IRFE) Algorithm

Journal of Engineering Sciences and Information Technology ◽

10.26389/ajsrp.r080420 ◽

2020 ◽

Vol 4 (2) ◽

Author(s):

Esraa H. Abd Al-Ameer, Ahmed H. Aliwy

Keyword(s):

Feature Selection ◽

Language Processing ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method ◽

English Text ◽

Recursive Feature Elimination ◽

Chi Square ◽

Data Set ◽

New Feature

Document classification is one of the most important fields in natural language processing and text mining, and many algorithms can be used for this task. This paper focuses on improving text classification through feature selection, that is, selecting a subset of the original features without affecting the accuracy of the work. A new feature selection method is suggested, which can be regarded as a general formulation and mathematical model of Recursive Feature Elimination (RFE). The method was compared with two other well-known feature selection methods: Chi-square and threshold. The best results were 83% when 60% of the features were used, 82% when 40% were used, and 82% when 20% were used. The tests were done with the Naïve Bayes (NB) and decision tree (DT) classification algorithms on the well-known English data set “20 newsgroups text”, which consists of approximately 18846 files. The results showed that the suggested feature selection method is comparable with standard methods such as Chi-square.
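
The general control flow of Recursive Feature Elimination can be sketched as follows. This is a minimal illustration of plain RFE, not the IRFE formulation of the paper, whose details are not given in the abstract. The function names and the `class_mean_gap` importance measure are assumptions made for this example. In practice the importance scores usually come from refitting a classifier on the remaining features at each round.

```python
def class_mean_gap(rows, labels, active):
    """Stand-in importance: |mean in class 1 - mean in class 0| per feature.

    This is a hypothetical placeholder for model-derived weights.
    """
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    gaps = {}
    for j in active:
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        gaps[j] = abs(mean_pos - mean_neg)
    return gaps


def recursive_feature_elimination(rows, labels, keep,
                                  importance=class_mean_gap, step=1):
    """Drop the weakest features, `step` at a time, until `keep` remain.

    rows: list of equal-length numeric feature vectors
    labels: parallel list of 0/1 class labels
    Returns the indices of the surviving features in ascending order.
    """
    active = list(range(len(rows[0])))
    while len(active) > keep:
        # Re-score only the features that are still active.
        scores = importance(rows, labels, active)
        n_drop = min(step, len(active) - keep)
        weakest = sorted(active, key=lambda j: (scores[j], j))[:n_drop]
        active = [j for j in active if j not in weakest]
    return active
```

With a model-based importance the scores change after every removal, which is what distinguishes RFE from a single ranking pass. The stand-in used here scores each feature independently, so it only demonstrates the elimination loop.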

A new feature selection method for handling redundant information in text classification

Frontiers of Information Technology & Electronic Engineering ◽

10.1631/fitee.1601761 ◽

2018 ◽

Vol 19 (2) ◽

pp. 221-234 ◽

Cited By ~ 6

Author(s):

You-wei Wang ◽

Li-zhou Feng

Keyword(s):

Feature Selection ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method ◽

Redundant Information ◽

New Feature

A lazy feature selection method for multi-label classification

Intelligent Data Analysis ◽

10.3233/ida-194878 ◽

2021 ◽

Vol 25 (1) ◽

pp. 21-34

Author(s):

Rafael B. Pereira ◽

Alexandre Plastino ◽

Bianca Zadrozny ◽

Luiz H.C. Merschmann

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Selection Method ◽

Video Classification ◽

Classification Problems ◽

Class Label ◽

New Feature ◽

Feature Selection Techniques ◽

Biomolecular Analysis

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific for the multi-label context. Experimental results show that the proposed technique is competitive when compared to multi-label feature selection techniques currently used in the literature, and is clearly more scalable, in a scenario where there is an increasing amount of data.

A New Feature Selection Method for Enhancing Cancer Diagnosis Based on DNA Microarray

2020 37th National Radio Science Conference (NRSC) ◽

10.1109/nrsc49500.2020.9235095 ◽

2020 ◽

Author(s):

Mostafa Atlam ◽

Hanaa Torkey ◽

Hanaa Salem ◽

Nawal El-Fishawy

Keyword(s):

Feature Selection ◽

DNA Microarray ◽

Cancer Diagnosis ◽

Feature Selection Method ◽

Selection Method ◽

New Feature

A CATEGORY CLASSIFICATION ALGORITHM FOR INDONESIAN AND MALAY NEWS DOCUMENTS

Jurnal Teknologi ◽

10.11113/jt.v78.9549 ◽

2016 ◽

Vol 78 (8-2) ◽

Cited By ~ 1

Author(s):

Jafreezal Jaafar ◽

Zul Indra ◽

Nurshuhaini Zamin

Keyword(s):

Feature Selection ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method ◽

Online News ◽

Language Identification ◽

Computational Time ◽

Accuracy Rate ◽

Similar Morphology ◽

Manual Classification

Text classification (TC) provides a better way to organize information since it allows better understanding and interpretation of the content. It deals with the assignment of labels to groups of similar textual documents. However, TC research for Asian-language documents is relatively limited compared to English documents, and even more so for news articles. In addition, TC research on classifying textual documents in languages with similar morphology, such as Indonesian and Malay, is still scarce. Hence, the aim of this study is to develop an integrated generic TC algorithm that identifies the language and then classifies the category of the identified news documents. Furthermore, a top-n feature selection method is used to improve TC performance and to overcome the challenges of classifying online news corpora: the rapid growth of online news documents and the high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from 2014 to 2015. The classification method produced good results, with an accuracy rate of up to 95.63% for language identification and 97.5% for category classification. The category classifier works optimally at n = 60%, with an average computational time of 35 seconds. This highlights that the integrated generic TC has an advantage over manual classification and is suitable for Indonesian and Malay news classification.
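
A top-n selection step with n given as a percentage can be sketched as below. The abstract does not state which score is used to rank the features, so this example assumes ranking by document frequency. The function name `top_n_percent_terms` and the token-list input format are likewise assumptions.

```python
import math
from collections import Counter


def top_n_percent_terms(docs, percent):
    """Keep the top `percent` of distinct terms, ranked by document frequency.

    docs: list of token lists
    percent: share of the vocabulary to keep, e.g. 60 for n = 60%
    Ties are broken alphabetically so the result is deterministic.
    """
    df = Counter()
    for tokens in docs:
        # Count each term once per document.
        df.update(set(tokens))
    ranked = sorted(df, key=lambda t: (-df[t], t))
    # Round up and always keep at least one term.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]
```

Shrinking the vocabulary this way reduces the number of features the classifier has to process, which is the lever the abstract ties to computational time.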