Impact of feature selection techniques in Text Classification: An Experimental study

In this paper, we present a novel approach for n-gram generation in text classification. The a-priori algorithm is adapted to prune word sequences by combining three feature selection techniques. Unlike the traditional two-step approach to text classification, in which feature selection is performed after the n-gram construction process, our proposal performs an embedded feature elimination during the application of the a-priori algorithm. The proposed strategy reduces the number of branches to be explored, which speeds up the process and makes the construction of all the word sequences tractable. Our proposal has the additional advantage of constructing a low-dimensional dataset that contains only the features relevant for classification and can be used directly, without a separate feature selection step. Experiments on text classification datasets for sentiment analysis demonstrate that our approach yields the best predictive performance when compared with other feature selection approaches, while also facilitating a better understanding of the words and phrases that explain a given task; in our case, online reviews and ratings in various domains.
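The embedded pruning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the three feature selection techniques, so the sketch assumes a single stand-in criterion, a minimum document frequency (`min_df`). The function name `grow_ngrams` and its parameters are hypothetical.

```python
from collections import Counter

def grow_ngrams(docs, min_df=2, max_n=3):
    """Apriori-style n-gram growth: only n-grams whose (n-1)-gram
    sub-sequences survived the previous level are counted at all.

    Assumption: pruning uses document frequency only, as a stand-in
    for the (unspecified) combination of feature selection scores.
    """
    tokenized = [d.lower().split() for d in docs]
    kept = {}        # n-gram tuple -> document frequency
    frontier = None  # surviving (n-1)-grams from the previous level
    for n in range(1, max_n + 1):
        df = Counter()
        for toks in tokenized:
            seen = set()
            for i in range(len(toks) - n + 1):
                gram = tuple(toks[i:i + n])
                # Skip candidates whose prefix or suffix was already pruned.
                if frontier is not None and (
                    gram[:-1] not in frontier or gram[1:] not in frontier
                ):
                    continue
                seen.add(gram)
            df.update(seen)  # count each n-gram once per document
        frontier = {g for g, c in df.items() if c >= min_df}
        if not frontier:
            break  # nothing survived, so no longer n-gram can survive
        kept.update({g: df[g] for g in frontier})
    return kept

docs = ["not good at all", "not good", "very good movie", "not bad"]
print(grow_ngrams(docs))
```

Because pruned sequences are never extended, the number of candidates examined at each level shrinks, which is the source of the speed-up the abstract describes.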

Understanding of Data Preprocessing for Dimensionality Reduction Using Feature Selection Techniques in Text Classification

10.1007/978-981-16-3153-5_48 ◽

2021 ◽

pp. 455-464

Author(s):

Varun Dogra ◽

Aman Singh ◽

Sahil Verma ◽

Kavita ◽

N. Z. Jhanjhi ◽

...

Keyword(s):

Feature Selection ◽

Dimensionality Reduction ◽

Text Classification ◽

Data Preprocessing ◽

Feature Selection Techniques

Combining Feature Selection Methods with BERT: An In-depth Experimental Study of Long Text Classification

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering - Collaborative Computing: Networking, Applications and Worksharing ◽

10.1007/978-3-030-67537-0_34 ◽

2021 ◽

pp. 567-582

Author(s):

Kai Wang ◽

Jiahui Huang ◽

Yuqi Liu ◽

Bin Cao ◽

Jing Fan

Keyword(s):

Experimental Study ◽

Feature Selection ◽

Text Classification ◽

Selection Methods

Research on the feature selection techniques used in text classification

2012 9th International Conference on Fuzzy Systems and Knowledge Discovery ◽

10.1109/fskd.2012.6234223 ◽

2012 ◽

Cited By ~ 2

Author(s):

Yan Li ◽

Chungang Chen

Keyword(s):

Feature Selection ◽

Text Classification ◽

Feature Selection Techniques

Comparison of feature selection techniques in classifying stroke documents

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v14.i3.pp1244-1250 ◽

2019 ◽

Vol 14 (3) ◽

pp. 1244

Author(s):

Nur Syaza Izzati Mohd Rafei ◽

Rohayanti Hassan ◽

RD Rohmat Saedudin ◽

Anis Farihan Mat Raffei ◽

Zalmiyah Zakaria ◽

...

Keyword(s):

Feature Selection ◽

Text Classification ◽

Information Gain ◽

Biomedical Literature ◽

High Dimensionality ◽

Support Vector ◽

Pearson's Correlation ◽

Feature Selection Techniques ◽

Selection Phase

The volume of digital biomedical literature keeps growing, which makes it difficult for researchers to manage and retrieve the information they need from the Internet. Applying text classification to biomedical literature is one way to address this problem, but the high dimensionality of the data is a common issue in text classification. The aim of this research is therefore to compare techniques for selecting the relevant features when classifying biomedical text abstracts. The research focuses on Pearson's Correlation and Information Gain as feature selection techniques for reducing the high dimensionality of the data. To this end, we conduct and evaluate several experiments using a dataset of 100 abstracts of stroke documents retrieved from the PubMed database. The dataset underwent text pre-processing, which is crucial before proceeding to the feature selection phase, where Information Gain and Pearson's Correlation are applied. A Support Vector Machine classifier is used to evaluate and compare the effectiveness of the two feature selection techniques. For this dataset, Information Gain outperformed Pearson's Correlation by 3.3%. This research aims to extract meaningful features from a subset of stroke documents that can be used in various applications, especially in diagnosing stroke.
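The two scoring criteria compared in this abstract can be sketched as follows. This is an illustration under assumptions, not the paper's pipeline: features are taken to be binary term-presence indicators, class labels are 0/1, and the function names are hypothetical. The Support Vector Machine evaluation step is not reproduced.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on the feature's values."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # term present exactly in the positive class
uninformative = [1, 0, 1, 0]  # term spread evenly across both classes
for name, feat in [("informative", informative), ("uninformative", uninformative)]:
    print(name, information_gain(feat, labels), pearson(feat, labels))
```

In a feature selection phase, each term would be scored this way and only the highest-ranked terms kept. When ranking by Pearson correlation, the absolute value is the usual choice, since a strongly negative correlation is as informative as a positive one.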

An Experimental Study of Feature Selection Methods for Text Classification

Series in Machine Perception and Artificial Intelligence - Personalization Techniques and Recommender Systems ◽

10.1142/9789812797025_0012 ◽

2008 ◽

pp. 303-320

Author(s):

Gulden Uchyigit ◽

Keith Clark

Keyword(s):

Experimental Study ◽

Feature Selection ◽

Text Classification ◽

Selection Methods

A Survey on Feature Selection Techniques and Classification Algorithms for Efficient Text Classification

International Journal of Science and Research (IJSR) ◽

10.21275/v5i5.nov163675 ◽

2016 ◽

Vol 5 (5) ◽

pp. 1267-1275 ◽

Cited By ~ 4

Keyword(s):

Feature Selection ◽

Text Classification ◽

Classification Algorithms ◽

Feature Selection Techniques

A Comparative Study on Feature Selection in Chinese Text Classification Problem

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.380-384.2854 ◽

2013 ◽

Vol 380-384 ◽

pp. 2854-2857

Author(s):

Hu Li ◽

Peng Zou ◽

Wei Hong Han

Keyword(s):

Feature Selection ◽

Chinese Text ◽

Text Classification ◽

Imbalanced Data ◽

Classification Problem ◽

English Text ◽

Feature Selection Techniques ◽

Actual Classification ◽

Better Than

The information explosion brings many challenges to text classification. The curse of dimensionality leads to a sharp increase in computational complexity and to lower classification accuracy. It is therefore critical to apply feature selection techniques before the actual classification. Automatic classification of English text has been researched for many years, but Chinese text has received little attention. In this paper, several classic feature selection methods, namely term frequency (TF), information gain (IG) and the chi-square statistic (CHI), are compared on the classification of Chinese text. Imbalanced data is also taken into consideration. Experimental results show that CHI performs better than IG and TF when the dataset is imbalanced, but there is no obvious difference on balanced data.
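For reference, the CHI criterion named in this abstract is commonly computed from a 2x2 term/class contingency table. The sketch below uses that standard formula; the helper names and the toy documents are hypothetical, and the paper's own experimental setup (Chinese word segmentation, datasets, classifier) is not reproduced.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def term_class_counts(term, docs, labels, target):
    """Build the (a, b, c, d) table from tokenized documents and labels."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return a, b, c, d

docs = [["good", "film"], ["good", "plot"], ["bad", "film"], ["bad", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
print(chi_square(*term_class_counts("good", docs, labels, "pos")))
print(chi_square(*term_class_counts("film", docs, labels, "pos")))
```

A term that occurs only in one class scores high, while a term spread evenly over the classes scores zero, so ranking by this value keeps the class-discriminating terms.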
