Efficient n-gram construction for text categorization using feature selection techniques

2021, Vol 25 (3), pp. 509-525
Author(s): Maximiliano García, Sebastián Maldonado, Carla Vairetti

In this paper, we present a novel approach for n-gram generation in text classification. The a-priori algorithm is adapted to prune word sequences by combining three feature selection techniques. Unlike the traditional two-step approach to text classification, in which feature selection is performed after n-gram construction, our proposal performs embedded feature elimination during the application of the a-priori algorithm. The proposed strategy reduces the number of branches to be explored, speeding up the process and making the construction of all word sequences tractable. Our proposal has the additional advantage of constructing a low-dimensional dataset containing only the features that are relevant for classification, which can be used directly without a separate feature selection step. Experiments on text classification datasets for sentiment analysis demonstrate that our approach yields the best predictive performance compared with other feature selection approaches, while also facilitating a better understanding of the words and phrases that explain a given task; in our case, online reviews and ratings in various domains.
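The a-priori-style pruning is the core of the proposal. As a minimal sketch of the general idea (not the authors' exact method, which combines three feature selection criteria rather than the raw support threshold used here), candidate (k+1)-grams are generated only from k-grams that survived the filter, so pruned branches of the sequence space are never explored:

```python
from collections import Counter

def ngrams(tokens, k):
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def apriori_ngrams(docs, max_n=3, min_support=2):
    """A-priori-style n-gram construction: a candidate (k+1)-gram is
    counted only if both of its k-gram sub-sequences already passed the
    filter, so unpromising branches are pruned early (illustrative
    sketch; the paper replaces plain support with feature selection scores)."""
    tokenized = [d.lower().split() for d in docs]
    # 1-grams that meet the threshold seed the level-wise search.
    counts = Counter(g for t in tokenized for g in ngrams(t, 1))
    frequent = {g for g, c in counts.items() if c >= min_support}
    selected = set(frequent)
    for k in range(2, max_n + 1):
        counts = Counter(
            g for t in tokenized for g in ngrams(t, k)
            if g[:-1] in frequent and g[1:] in frequent  # prune here
        )
        frequent = {g for g, c in counts.items() if c >= min_support}
        if not frequent:
            break
        selected |= frequent
    return selected

docs = ["the movie was great", "the movie was dull", "great movie"]
print(sorted(apriori_ngrams(docs)))  # surviving 1-, 2- and 3-grams
```

Because a (k+1)-gram is examined only when both of its k-gram constituents survive, the embedded filter shrinks the candidate set at every level instead of after the fact.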

2021, Vol 25 (1), pp. 21-34
Author(s): Rafael B. Pereira, Alexandre Plastino, Bianca Zadrozny, Luiz H.C. Merschmann

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification, and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to identify relevant and informative features for multi-label classification. This work presents a new feature selection method based on the lazy feature selection paradigm and specific to the multi-label context. Experimental results show that the proposed technique is competitive with the multi-label feature selection techniques currently used in the literature, and is clearly more scalable in scenarios with increasing amounts of data.
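The abstract does not spell out the selection criterion, but the defining trait of the lazy paradigm is that feature selection is deferred until a test instance arrives and is guided by that instance's own feature values. The sketch below is a rough, assumed illustration of that idea (the scoring rule, the kNN classifier, and all names are mine, not the paper's):

```python
import numpy as np

def lazy_select(X_train, Y_train, x, k_feat=2):
    """Score each feature using only the value it takes in the test
    instance x: features whose value co-occurs with skewed label
    distributions in the training set score higher (assumed criterion)."""
    scores = np.zeros(X_train.shape[1])
    for j in range(X_train.shape[1]):
        mask = X_train[:, j] == x[j]       # training rows matching x on feature j
        if mask.sum() == 0:
            continue
        p = Y_train[mask].mean(axis=0)     # label frequencies in the match set
        scores[j] = np.abs(p - 0.5).mean() # reward low label uncertainty
    return np.argsort(scores)[::-1][:k_feat]

def predict(X_train, Y_train, x, k_feat=2, k_nn=3):
    feats = lazy_select(X_train, Y_train, x, k_feat)
    d = np.abs(X_train[:, feats] - x[feats]).sum(axis=1)  # Hamming distance
    nn = np.argsort(d)[:k_nn]
    return (Y_train[nn].mean(axis=0) >= 0.5).astype(int)  # per-label vote

# Toy data: 4 binary features, 2 labels.
X = np.array([[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 1, 1], [0, 1, 0, 1]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(predict(X, Y, np.array([1, 0, 1, 0])))  # -> [1 0]
```

Since nothing is precomputed globally, the per-query cost depends only on the training set and the single test instance, which is one way a lazy method can stay scalable as data grows.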


Feature selection simplifies the task of Tamil text classification. In the current information age, applications on the World Wide Web have grown rapidly, and material in regional languages such as Tamil (web pages, e-mails, e-books, and other digital data) has grown enormously, so retrieval of Tamil digital documents is in high demand. For quick retrieval of the needed documents among the millions of Tamil web documents, these documents should be classified by content into their respective classes. Tamil text classification underlies many Tamil NLP applications, such as question answering, information extraction, and summarization, and the implementation of text categorization is very important in the information retrieval field. Text categorization assigns a document an appropriate category from a predefined set of categories; Tamil text classification classifies documents based on the Tamil text they contain. Tamil is morphologically very rich, so the language has a very large set of word forms, which makes it important to reduce the feature set of Tamil text. This paper discusses feature selection using normalized weights computed over the large set of keywords extracted from the preprocessed corpus. Feature selection by the normalized term weighting (TF*IDF) method reduces the size of the keyword list, which is very useful for training and testing Tamil text classification algorithms.
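As a concrete illustration of the normalized TF*IDF weighting step (the paper's exact normalization is not specified here, so the formula below is an assumption), the following sketch scores every term in the corpus and keeps only the highest-weighted keywords as the reduced feature set:

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=5):
    """Rank corpus terms by normalized TF*IDF and keep the top_k as the
    reduced feature set. Illustrative sketch: TF is normalized by the
    most frequent term in each document, and a term's corpus score is
    its best per-document weight (both choices are assumptions)."""
    N = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    scores = {}
    for doc in tokenized:
        tf = Counter(doc)
        max_tf = max(tf.values())
        for t, f in tf.items():
            w = (f / max_tf) * math.log(1 + N / df[t])  # normalized TF * IDF
            scores[t] = max(scores.get(t, 0.0), w)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = ["tamil text classification", "tamil feature selection",
        "text feature weighting"]
print(tfidf_keywords(docs, top_k=3))
```

The same routine applies unchanged to whitespace-separated Tamil tokens; only the top-ranked keywords are then kept for training and testing the classifier.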


2014, Vol 543-547, pp. 1896-1900
Author(s): Deng Zhou, Wen Huang He, Tao Tao Wu

Compared with traditional text classification models, Tibetan text classification based on the N-Gram model adopts N-Grams in place of word-level features. In other words, word segmentation is not required during text classification, and feature selection and elaborate pre-processing steps are also avoided. This paper not only carries out in-depth research on N-Gram models, but also discusses the selection of the parameter N in the model by adopting a Naïve Bayes Multinomial classifier.
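A hedged sketch of how the choice of N can be studied: train a Naïve Bayes Multinomial classifier over character n-grams for several values of N and compare cross-validated accuracy. The toy English corpus and the candidate range for N below are assumptions; the paper's experiments are on Tibetan text, where character-level n-grams sidestep word segmentation entirely.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-in corpus (binary sentiment), repeated for stable CV folds.
docs = ["good film, well acted", "great story and cast",
        "dull plot, weak acting", "boring and predictable"] * 5
labels = ["pos", "pos", "neg", "neg"] * 5

for n in range(1, 5):  # candidate values for the n-gram order N
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(n, n)),
        MultinomialNB(),
    )
    acc = cross_val_score(model, docs, labels, cv=5).mean()
    print(f"N={n}: mean CV accuracy = {acc:.2f}")
```

No tokenizer or feature selection step appears in the pipeline: the character-level vectorizer is the whole pre-processing stage, which is the simplification the paper emphasizes.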

