Enhancing Effectiveness of Dimension Reduction in Text Classification

2016 ◽  
Vol 26 (03) ◽  
pp. 1750008 ◽  
Author(s):  
Seyyed Hossein Seyyedi ◽  
Behrouz Minaei-Bidgoli

Nowadays, text is one of the most prevalent forms of data, and text classification is a widely used data mining task with various fields of application. One mass-produced instance of text is email. Despite its many advantages as a communication medium, email suffers from a serious problem: the number of spam emails has steadily increased in recent years, causing considerable irritation. Spam detection has therefore emerged as a separate field of text classification. A primary challenge of text classification, which is even more severe in spam detection and impedes the process, is the high dimensionality of the feature space. Various dimension reduction methods have been proposed that produce a lower-dimensional space than the original; they fall mainly into two groups: feature selection and feature extraction. This research deals with dimension reduction in the text classification task and performs experiments especially in the spam detection field. We employ Information Gain (IG) and Chi-square Statistic (CHI) as well-known feature selection methods. We also propose a new feature extraction method called Sprinkled Semantic Feature Space (SSFS). Furthermore, this paper presents a new hybrid method called IG_SSFS, which combines the selection and extraction processes to reap the benefits of both. To evaluate these methods in the spam detection field, experiments are conducted on several well-known email datasets. According to the results, SSFS is more effective than the basic selection methods at improving classifier performance, and IG_SSFS enhances performance further while consuming less processing time.
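The SSFS and IG_SSFS methods themselves are not specified in the abstract, but the two baseline selectors it names, IG and CHI, can be sketched with scikit-learn; `mutual_info_classif` stands in here as an estimator of Information Gain, and the toy documents and labels are invented:

```python
# Baseline feature selection for spam detection with CHI (chi-square) and an
# IG-style score (mutual information), using scikit-learn on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(docs)           # original feature space
chi_sel = SelectKBest(chi2, k=4).fit(X, labels)     # CHI-based selection
ig_sel = SelectKBest(mutual_info_classif, k=4).fit(X, labels)  # IG-style

print(X.shape[1], chi_sel.transform(X).shape[1])    # 13 features reduced to 4
```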

Author(s):  
GULDEN UCHYIGIT ◽  
KEITH CLARK

Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major problem in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that help determine a document's class; the rest add noise, can make the results unreliable, and significantly increase computational time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present a comparative experimental study of feature selection methods for text classification. Ten feature selection methods were evaluated, including a new method called the GU metric. The other methods evaluated in this study are: the Chi-Squared (χ2) statistic, the NGL coefficient, the GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, the Fisher Criterion, and the BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups dataset with the Naive Bayesian probabilistic classifier.
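For reference, the χ2 statistic evaluated in studies like this one is computed per term/class pair from a 2×2 contingency table of document counts; a minimal sketch, with invented counts:

```python
# Chi-square score of a term for one class, from document counts:
# a = class docs containing the term, b = other docs containing the term,
# c = class docs without the term,   d = other docs without the term.
def chi_square(a, b, c, d):
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# A term present in 40 of 50 in-class docs but only 5 of 50 out-of-class docs
# scores highly, i.e. its occurrence is strongly class-dependent.
print(chi_square(40, 5, 10, 45))  # ~49.49
```

A term distributed identically across classes scores 0, which is why low-χ2 words are the ones discarded.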


2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of its drawbacks is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ2), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigate four well-known algorithms: Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree. We evaluate them against a benchmark Arabic textual dataset, the Saudi Press Agency (SPA) corpus, to measure the impact of feature selection methods. Using the WEKA tool, we ran the four classification algorithms with and without feature selection. The results provide clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.
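The with/without-selection experimental setup can be sketched outside WEKA as well; below is a minimal scikit-learn analogue with an invented toy corpus (the SPA dataset itself is not reproduced here):

```python
# Naive Bayes trained with and without chi-square feature selection.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

docs = ["free money win prize", "free lottery win cash", "cheap free offer now",
        "meeting schedule project team", "project deadline meeting notes",
        "team lunch meeting agenda"]
labels = [1, 1, 1, 0, 0, 0]

plain = Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())])
selected = Pipeline([("vec", TfidfVectorizer()),
                     ("sel", SelectKBest(chi2, k=5)),  # keep 5 best features
                     ("clf", MultinomialNB())])

plain.fit(docs, labels)
selected.fit(docs, labels)
print(plain.score(docs, labels), selected.score(docs, labels))
```

On real data the comparison would of course use held-out test documents rather than training accuracy.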


Author(s):  
M. Ali Fauzi ◽  
Agus Zainal Arifin ◽  
Sonny Christiano Gosaria

Since the rise of the WWW, the information available online has grown rapidly; Indonesian online news is one example. Automatic text classification has therefore become a very important task for information filtering. One of the major issues in text classification is the high dimensionality of the feature space. Most of the features are irrelevant, noisy, and redundant, which may degrade the accuracy of the system; hence, feature selection is needed. Maximal Marginal Relevance for Feature Selection (MMR-FS) has been proven to be a good feature selector for text with many redundant features, but it has high computational complexity. In this paper, we propose a two-phase feature selection method. In the first phase, to lower the complexity of MMR-FS, we first use Information Gain to reduce the features. The reduced feature set is then refined by MMR-FS in the second phase. The experimental results show that our new method reaches the best accuracy of 86%. It lowers the complexity of MMR-FS while retaining its accuracy.
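A rough sketch of the redundancy-aware greedy step in the second phase (the exact MMR-FS formulation is not given in the abstract; the relevance scores stand in for a phase-one Information Gain ranking, and the data are invented):

```python
# Greedy max-marginal-relevance-style selection: each step picks the feature
# balancing relevance against correlation with already-chosen features.
import numpy as np

def mmr_select(X, relevance, k, lam=0.5):
    chosen = [int(np.argmax(relevance))]          # seed with most relevant
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            redundancy = max(abs(np.corrcoef(X[:, j], X[:, c])[0, 1])
                             for c in chosen)
            score = lam * relevance[j] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen

X = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 0], [0, 0, 1, 1]], float)
relevance = np.array([0.9, 0.85, 0.8, 0.1])  # e.g. phase-one IG scores
# Features 1 and 2 are perfectly (anti-)correlated with feature 0, so the
# weakly relevant but novel feature 3 is chosen instead:
print(mmr_select(X, relevance, k=2))  # [0, 3]
```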


2019 ◽  
pp. 016555151986159 ◽  
Author(s):  
Ala’ M Al-Zoubi ◽  
Ja’far Alqatawna ◽  
Hossam Faris ◽  
Mohammad A Hassonah

In online social networks, spam profiles represent one of the most serious security threats on the Internet; besides producing unwanted advertisements, they can be exploited by criminals for various purposes. This article investigates the nature and characteristics of spam profiles in a social network such as Twitter to improve spam detection, based on a number of publicly available language-independent features. To assess the effectiveness of these features in spam detection, four datasets are extracted for four different language contexts (Arabic, English, Korean and Spanish), and a fifth is formed by combining them all. We conduct our experiments with five classification algorithms well known in the spam detection field, k-Nearest Neighbours (k-NN), Random Forest (RF), Naive Bayes (NB), Decision Tree (DT) (J48) and Multilayer Perceptron (MLP), along with five filter-based feature selection methods, namely Information Gain, Chi-square, ReliefF, Correlation and Significance. The results show that each classifier's performance fluctuates across the datasets, but feature selection improves the classification results. In addition, detailed analysis and comparisons are carried out on two levels: on the first level, we compare the importance of the selected features across the feature selection methods; on the second level, we observe the relations and importance of the selected features across all datasets. The findings lead to a better understanding of social spam and help improve detection methods by considering the important features that emerge from the different lingual contexts.
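The first-level analysis, comparing which features different filter methods rank as important, can be sketched as follows; the synthetic matrix stands in for the profile features, and mutual information approximates Information Gain (ReliefF and the Significance filter are not in scikit-learn and are omitted):

```python
# Compare top-ranked features under chi-square vs mutual information.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 6)).astype(float)  # 6 discrete features
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries the label signal

chi_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
# Both filters agree that feature 0 is the most important:
print(int(np.argmax(chi_scores)), int(np.argmax(mi_scores)))
```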


2014 ◽  
Vol 1044-1045 ◽  
pp. 1258-1261
Author(s):  
Su Fen Chen

Feature selection is an effective pre-processing technique for text mining on high-dimensional feature spaces. In recent years, many effective redundant-feature selection methods have been proposed from different motivations. However, a comparative experimental study of redundant-feature selection methods in the field of text mining has not yet been reported. To fill this gap, this paper presents an extensive empirical comparison on the task of text classification. The experimental results indicate that 3-way Mutual Information represents redundancy much better than traditional 2-way Mutual Information, since 3-way Mutual Information takes the label information into account. As a result, redundant-feature selection methods based on 3-way Mutual Information outperform the other methods.
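The 3-way Mutual Information referred to above extends the 2-way measure with the class label C: I(f1; f2; C) = I(f1; f2) − I(f1; f2 | C), so redundancy between two features is measured relative to the classification task. A minimal plug-in sketch on invented binary data:

```python
# Plug-in entropies and (conditional) mutual information over sample lists.
import math
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def mi(xs, ys):                       # 2-way: I(X;Y)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def cond_mi(xs, ys, cs):              # I(X;Y|C)
    return (entropy(list(zip(xs, cs))) + entropy(list(zip(ys, cs)))
            - entropy(list(zip(xs, ys, cs))) - entropy(cs))

def three_way_mi(xs, ys, cs):         # 3-way: I(X;Y;C)
    return mi(xs, ys) - cond_mi(xs, ys, cs)

# Two identical features are fully redundant, and the 3-way score confirms
# their shared information is entirely about the class label C:
f1 = [0, 0, 1, 1]; f2 = [0, 0, 1, 1]; c = [0, 0, 1, 1]
print(mi(f1, f2), three_way_mi(f1, f2, c))  # 1.0 1.0
```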


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of online short text documents on the Internet grows tremendously, organizing short texts well has become an increasingly urgent task. However, traditional feature selection methods are not well suited to short texts. In this paper, we propose a method that incorporates syntactic information for short texts. It emphasizes features that have more dependency relations with other words. Our experiments use the SVM classifier within the Weka machine learning environment. The experimental results show that by incorporating syntactic information into short texts, we obtain more powerful features than traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
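The idea of emphasizing features with more dependency relations can be sketched as a weighting scheme; the dependency edges below are hand-written stand-ins for a parser's output, and the weighting formula is an invented illustration rather than the paper's exact model:

```python
# Weight each term by term frequency boosted by its dependency degree.
from collections import Counter

def dependency_weights(tokens, edges, alpha=0.5):
    degree = Counter()                 # relations each token takes part in
    for head, dep in edges:
        degree[head] += 1
        degree[dep] += 1
    tf = Counter(tokens)
    return {t: tf[t] * (1 + alpha * degree[t]) for t in tf}

tokens = ["stocks", "rally", "after", "fed", "decision"]
edges = [("rally", "stocks"), ("rally", "after"),
         ("after", "decision"), ("decision", "fed")]
# Terms in two relations ("rally", "after", "decision") outweigh the others:
print(dependency_weights(tokens, edges))
```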


2020 ◽  
Vol 10 (49) ◽  
pp. 138-152
Author(s):  
Muhammad Iqbal ◽  
Malik Muneeb Abid ◽  
Muhammad Noman Khalid ◽  
Amir Manzoor

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yifei Chen ◽  
Yuxing Sun ◽  
Bing-Qing Han

Protein interaction article classification is a text classification task in the biological domain that determines which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of the features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency; one potential drawback of these methods is that they treat features separately. Hence, we first design a similarity measure over context information that takes the word co-occurrences and phrase chunks around the features into account. We then introduce this context similarity into the importance measure of the features, in place of document and term frequency, and thereby propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. By benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
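A rough sketch of scoring a feature by how its surrounding context differs between classes; the paper's similarity measure over co-occurrences and phrase chunks is more elaborate, so the window-based cosine below and the toy sentences are assumptions:

```python
# Score = 1 - cosine(context in positive docs, context in negative docs):
# features whose contexts differ sharply across classes are more distinctive.
import math
from collections import Counter

def context_vector(docs, feature, window=2):
    ctx = Counter()
    for doc in docs:
        toks = doc.split()
        for i, t in enumerate(toks):
            if t != feature:
                continue
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):    # count words in the window around it
                if j != i:
                    ctx[toks[j]] += 1
    return ctx

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

pos = ["the kinase binds the substrate", "protein a binds protein b"]
neg = ["the weather binds nothing here", "storms gather over the coast"]
score = 1 - cosine(context_vector(pos, "binds"), context_vector(neg, "binds"))
print(round(score, 3))  # 0.711: "binds" keeps quite different company per class
```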

