Efficient method for feature selection in text classification

Text classification (TC) provides a better way to organize information since it allows better understanding and interpretation of the content. It deals with the assignment of labels to groups of similar textual documents. However, TC research for Asian-language documents is relatively limited compared to English documents, and even more so for news articles. Apart from that, TC research to classify textual documents in languages with similar morphology, such as Indonesian and Malay, is still scarce. Hence, the aim of this study is to develop an integrated generic TC algorithm which is able to identify the language and then classify the category of the identified news documents. Furthermore, a top-n feature selection method is utilized to improve TC performance and to overcome the challenges of classifying online news corpora: the rapid growth of online news documents and the high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from 2014 to 2015. The classification method achieves accuracy of up to 95.63% for language identification and 97.5% for category classification. The category classifier works optimally at n = 60%, with an average computational time of 35 seconds. This highlights that the integrated generic TC has an advantage over manual classification and is suitable for Indonesian and Malay news classification.
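
The abstract does not say how features are ranked before the top n% are kept. The following is a minimal sketch of top-n selection, assuming (purely for illustration) that terms are ranked by document frequency and that documents are whitespace-tokenized; the function name `top_n_percent_features` is hypothetical and not taken from the paper.

```python
from collections import Counter

def top_n_percent_features(docs, n_percent):
    """Return the top n_percent of vocabulary terms, ranked by document frequency.

    Ranking by document frequency is an assumption of this sketch; the
    abstract does not state which score the authors use.
    """
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.lower().split()))
    # Highest document frequency first; ties broken alphabetically for determinism.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    keep = max(1, round(len(ranked) * n_percent / 100))
    return ranked[:keep]
```

With n_percent set to 60, a six-term vocabulary is reduced to its four highest-ranked terms, which is the kind of reduction the abstract credits for the lower computational time.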

Incorporate Syntactic Information for Short Text Classification

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.268-270.697 ◽ 2011 ◽ Vol 268-270 ◽ pp. 697-700

Author(s): Rui Xue Duan ◽ Xiao Jie Wang ◽ Wen Feng Li

Keyword(s): Machine Learning ◽ Feature Selection ◽ Learning Environment ◽ Text Classification ◽ The Internet ◽ Selection Methods ◽ Text Documents ◽ Short Text ◽ Syntactic Information ◽ Dependency Relations

As the volume of online short text documents grows tremendously on the Internet, it is increasingly urgent to organize short texts well. However, traditional feature selection methods are not suitable for short text. In this paper, we propose a method to incorporate syntactic information for short text. It emphasizes features that have more dependency relations with other words. The SVM classifier and the Weka machine learning environment are used in our experiments. The experimental results show that, by incorporating syntactic information in short text, we can get more powerful features than with traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
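
The idea of emphasizing words with more dependency relations can be sketched as follows. This is an illustration only, not the authors' formula: it assumes the dependency edges have already been produced by some parser and are given as (head, dependent) word pairs, and it assumes a weight of one plus the number of relations a word takes part in. The name `dependency_weights` is hypothetical.

```python
from collections import Counter

def dependency_weights(tokens, edges):
    """Weight each token by 1 + the number of dependency relations it participates in.

    `edges` is a list of (head, dependent) word pairs from any dependency
    parser; the 1 + degree weighting is an assumption of this sketch.
    """
    degree = Counter()
    for head, dependent in edges:
        degree[head] += 1
        degree[dependent] += 1
    # Words with no relations keep the baseline weight of 1.
    return {token: 1 + degree[token] for token in tokens}
```

In a short text such as "cheap flights to paris", a word like "flights" that governs several other words would receive a larger weight than a plain bag-of-words count gives it, which is the kind of emphasis the abstract describes.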

The Evaluation of Accuracy Performance in an Enhanced Embedded Feature Selection for Unstructured Text Classification

Iraqi Journal of Science ◽ 10.24996/ijs.2020.61.12.28 ◽ 2020 ◽ pp. 3397-3407

Author(s): Nur Syafiqah Mohd Nafis ◽ Suryanti Awang

Keyword(s): Feature Selection ◽ Text Classification ◽ Training Dataset ◽ Recursive Feature Elimination ◽ High Dimensional ◽ Significant Feature ◽ Support Vector ◽ SVM Classifier ◽ Text Documents ◽ Text Document

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant features from the sparse feature space. Thus, this paper proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high-dimensional text classification. This technique has the ability to measure a feature's importance in a high-dimensional text document. In addition, it aims to increase the efficiency of feature selection and hence obtain promising text classification accuracy. TF-IDF acts as a filter approach that measures the importance of features in the text documents at the first stage. SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research executes sets of experiments using text documents retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is applied for feature ranking as the next feature selection step. Only top-ranked features are selected for text classification using the SVM classifier. The experiments show that the proposed technique is able to achieve 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high-dimensional text documents.
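
A rough sketch of the two ingredients named above, TF-IDF weighting followed by SVM-RFE, is given below in plain Python. It is not the authors' implementation. Several things are assumptions of this sketch: the TF-IDF variant (relative term frequency times log(N/df) + 1), the tiny linear SVM trained by full-batch subgradient descent without a bias term, the hyperparameters, and the rule of dropping one feature per round. The abstract does not say how TF-IDF scores are thresholded in the filter stage, so that cut-off is left out. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows), where rows[i][j] is the TF-IDF of vocab[j] in docs[i]."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, rows

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularised hinge loss (no bias)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # only margin violators contribute to the hinge subgradient
                for j, xj in enumerate(xi):
                    grad[j] -= yi * xj / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def svm_rfe(X, y, names, n_keep):
    """Retrain and drop the feature with the smallest squared weight until n_keep remain."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        sub = [[row[j] for j in active] for row in X]
        w = train_linear_svm(sub, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        active.pop(worst)
    return [names[j] for j in active]
```

On a toy sentiment set where words like "film" and "movie" occur equally in both classes, their SVM weights stay at zero, so the elimination loop removes them first and keeps the class-specific words. Labels are expected as +1 and -1.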

Review of feature selection methods for text classification

International Journal of Advanced Computer Research ◽ 10.19101/ijacr.2020.1048037 ◽ 2020 ◽ Vol 10 (49) ◽ pp. 138-152

Author(s): Muhammad Iqbal ◽ Malik Muneeb Abid ◽ Muhammad Noman Khalid ◽ Amir Manzoor

Keyword(s): Feature Selection ◽ Text Classification ◽ Selection Methods

Impact of feature selection techniques in Text Classification: An Experimental study

JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL SCIENCES ◽ 10.26782/jmcms.spl.3/2019.09.00004 ◽ 2019 ◽ Vol 1 (3)

Author(s): S Rahamat Basha

Keyword(s): Experimental Study ◽ Feature Selection ◽ Text Classification ◽ Feature Selection Techniques

Performance Analysis of Feature Selection Techniques for Text Classification

International Research Journal on Advanced Science Hub ◽ 10.47392/irjash.2020.259 ◽ 2020 ◽ Vol 2 (Special Issue ICSTM 12S) ◽ pp. 44-50

Author(s): Hemlata Patel ◽ Dhanraj Verma

Keyword(s): Feature Selection ◽ Performance Analysis ◽ Text Classification ◽ Feature Selection Techniques

Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection

BioMed Research International ◽ 10.1155/2015/751646 ◽ 2015 ◽ Vol 2015 ◽ pp. 1-10 ◽ Cited By ~ 1

Author(s): Yifei Chen ◽ Yuxing Sun ◽ Bing-Qing Han

Keyword(s): Feature Selection ◽ Protein Interaction ◽ Text Classification ◽ Protein Interactions ◽ Reduction Rate ◽ Importance Measure ◽ Context Information ◽ Selection Methods ◽ Term Frequency ◽ Context Similarity

Protein interaction article classification is a text classification task in the biological domain that determines which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, we first design a similarity measure between context information that takes word co-occurrences and phrase chunks around the features into account. Then we introduce the similarity of context information into the importance measure of the features, as a substitute for document and term frequency. On this basis we propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
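
The abstract does not give the similarity measure itself, so the following only sketches the two building blocks it mentions: a context vector built from the words co-occurring around a feature, and a similarity between two such contexts. The fixed token window, the use of cosine similarity, and the names `context_vector` and `cosine` are all assumptions of this sketch. Phrase chunks, and the way the similarity is folded into a feature importance score, are not covered.

```python
import math
from collections import Counter

def context_vector(tokens, term, window=2):
    """Count the words that occur within `window` positions of each occurrence of `term`."""
    context = Counter()
    for i, token in enumerate(tokens):
        if token == term:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Collect neighbours on both sides, skipping the term's own position.
            context.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return context

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Comparing the context of a term such as "binds" across two sentences gives a value close to 1 when the surrounding words match and 0 when they share nothing, which is information that frequency-only measures discard.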
