An empirical evaluation of text classification and feature selection methods

As the volume of short text documents on the Internet grows rapidly, organizing them well has become increasingly urgent. However, traditional feature selection methods are not well suited to short text. In this paper, we propose a method that incorporates syntactic information into feature selection for short text by emphasizing features that have more dependency relations with other words. Our experiments use an SVM classifier and the Weka machine learning environment. The results show that incorporating syntactic information yields more powerful features than traditional feature selection methods such as DF and CHI: the precision of short text classification improved from 86.2% to 90.8%.
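
For context, the DF (document frequency) baseline named in this abstract can be sketched as below. This is only a minimal illustration of the traditional baseline, not the syntactic method the paper proposes; the function names and the toy corpus are assumptions made for the example.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def select_top_k_by_df(docs, k):
    """Return the k terms with the highest document frequency.

    Ties are broken alphabetically so the result is deterministic.
    """
    df = document_frequency(docs)
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus of tokenized short texts (hypothetical data).
docs = [
    ["cheap", "flights", "to", "paris"],
    ["cheap", "hotels", "in", "paris"],
    ["paris", "weather", "today"],
]
top_terms = select_top_k_by_df(docs, 2)
print(top_terms)
```

DF ignores class labels and word relations entirely, which is one reason it can struggle on short texts where most terms occur in very few documents.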

Review of feature selection methods for text classification

International Journal of Advanced Computer Research ◽

10.19101/ijacr.2020.1048037 ◽

2020 ◽

Vol 10 (49) ◽

pp. 138-152

Author(s):

Muhammad Iqbal ◽

Malik Muneeb Abid ◽

Muhammad Noman Khalid ◽

Amir Manzoor

Keyword(s):

Feature Selection ◽

Text Classification ◽

Selection Methods

Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection

BioMed Research International ◽

10.1155/2015/751646 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Yifei Chen ◽

Yuxing Sun ◽

Bing-Qing Han

Keyword(s):

Feature Selection ◽

Protein Interaction ◽

Text Classification ◽

Protein Interactions ◽

Reduction Rate ◽

Importance Measure ◽

Context Information ◽

Selection Methods ◽

Term Frequency ◽

Context Similarity

Protein interaction article classification is a text classification task in the biological domain that determines which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of the features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. We therefore first design a similarity measure between context information that takes word co-occurrences and phrase chunks around the features into account. We then introduce this context similarity into the importance measure of the features, in place of document and term frequency, and propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
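
The abstract does not give the similarity measure itself, so the following is only a rough sketch of the general idea of comparing the contexts in which a feature occurs. It assumes a fixed co-occurrence window and cosine similarity, neither of which is confirmed by the source; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def context_vector(docs, term, window=2):
    """Count the words that occur within `window` tokens of `term`."""
    ctx = Counter()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                # Neighbours on both sides, excluding the term itself.
                ctx.update(tokens[lo:i] + tokens[i + 1:hi])
    return ctx


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical tokenized documents: positive = describes an interaction.
pos_docs = [
    ["kinase", "binds", "receptor", "in", "cell"],
    ["kinase", "binds", "ligand", "in", "cell"],
]
neg_docs = [
    ["kinase", "gene", "expression", "in", "cell"],
    ["kinase", "gene", "sequence", "in", "cell"],
]

# Similarity between a term's context in the two classes.
kinase_sim = cosine(context_vector(pos_docs, "kinase"),
                    context_vector(neg_docs, "kinase"))
cell_sim = cosine(context_vector(pos_docs, "cell"),
                  context_vector(neg_docs, "cell"))
print(kinase_sim, cell_sim)
```

In this toy data the contexts of "kinase" differ completely between the two classes while those of "cell" largely overlap, so under this assumed measure "kinase" would look like the more distinctive feature.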

Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods

Applied Soft Computing ◽

10.1016/j.asoc.2019.105836 ◽

2020 ◽

Vol 86 ◽

pp. 105836 ◽

Cited By ~ 56

Author(s):

Gang Kou ◽

Pei Yang ◽

Yi Peng ◽

Feng Xiao ◽

Yang Chen ◽

...

Keyword(s):

Decision Making ◽

Feature Selection ◽

Text Classification ◽

Multiple Criteria Decision Making ◽

Multiple Criteria ◽

Selection Methods

A NEW FEATURE SELECTION METHOD FOR TEXT CLASSIFICATION

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001407005466 ◽

2007 ◽

Vol 21 (02) ◽

pp. 423-438 ◽

Cited By ~ 9

Author(s):

GULDEN UCHYIGIT ◽

KEITH CLARK

Keyword(s):

Feature Selection ◽

Text Classification ◽

Information Gain ◽

Feature Selection Method ◽

Feature Space ◽

Selection Method ◽

Computational Time ◽

Small Subset ◽

Selection Methods ◽

New Feature

Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major difficulty in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that can be used to determine a document's class, while the rest add noise, can make the results unreliable, and significantly increase computational time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present the experiments of a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated in this study, including a new feature selection method called the GU metric. The other feature selection methods evaluated are: the Chi-Squared (χ²) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.
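
As an illustration of one of the baseline metrics listed here, the Chi-Squared statistic for a term and a class can be computed from a 2x2 contingency table of document counts. The sketch below uses the common formulation N(AD - CB)² / ((A+C)(B+D)(A+B)(C+D)); it does not implement the GU metric, whose definition is not given in the abstract, and the function name and toy corpus are assumptions.

```python
def chi_squared(docs, labels, term, target_class):
    """Chi-squared statistic for the association between a term and a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target_class
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is constant: no measurable association
    return n * (a * d - c * b) ** 2 / denom


# Toy labeled corpus (hypothetical data).
docs = [
    ["goal", "match"],
    ["goal", "team"],
    ["vote", "election"],
    ["vote", "party"],
]
labels = ["sport", "sport", "politics", "politics"]

goal_score = chi_squared(docs, labels, "goal", "sport")
match_score = chi_squared(docs, labels, "match", "sport")
print(goal_score, match_score)
```

A term that appears in every document of one class and in none of the other, such as "goal" here, gets the highest score; feature selection then keeps the top-scoring terms.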

An Empirical Evaluation of Feature Selection Methods

Improving Knowledge Discovery through the Integration of Data Mining Techniques - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8513-0.ch012 ◽

2015 ◽

pp. 233-258 ◽

Cited By ~ 1

Author(s):

Mohsin Iqbal ◽

Saif Ur Rehman ◽

Saira Gillani ◽

Sohail Asghar

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Classification Accuracy ◽

Information Gain ◽

Learning Algorithm ◽

Empirical Evaluation ◽

Machine Learning Algorithms ◽

Selection Methods ◽

The One ◽

Processing And Storage

The key objective of this chapter is to study classification accuracy when feature selection is combined with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how integrated feature selection affects the accuracy of three classifiers by applying several feature selection methods. Among the filter methods, Information Gain (IG), Gain Ratio (GR), and Relief-F, and among the wrapper methods, Bagging and Naive Bayes (NB), enabled the classifiers to achieve the largest gains in classification accuracy relative to the average while reducing the number of unnecessary attributes. These conclusions can advise machine learning users on which classifier and feature selection methods to use to optimize classification accuracy. This is especially important in risk-sensitive applications of machine learning, where one of the aims is to reduce the costs of collecting, processing, and storing unnecessary data.
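
Information Gain, one of the filter methods named above, scores an attribute by how much knowing its value reduces the entropy of the class labels. The following is a minimal sketch for a discrete attribute under the standard definition IG = H(C) - Σ P(v) H(C | v); the function names and the toy attribute values are assumptions, and the chapter's own experimental setup is not reproduced.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data set (hypothetical): two attributes and one class label per row.
outlook = ["sunny", "sunny", "rain", "rain"]
windy = [True, False, True, False]
play = ["no", "no", "yes", "yes"]

ig_outlook = information_gain(outlook, play)
ig_windy = information_gain(windy, play)
print(ig_outlook, ig_windy)
```

Here "outlook" fully determines the class and so carries one full bit of information, while "windy" carries none; a filter method would keep the former and discard the latter.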

Research on N-grams feature selection methods for text classification

IOP Conference Series Materials Science and Engineering ◽

10.1088/1757-899x/1031/1/012048 ◽

2021 ◽

Vol 1031 (1) ◽

pp. 012048

Author(s):

Tsvetanka Georgieva-Trifonova ◽

Mahmut Duraku

Keyword(s):

Feature Selection ◽

Text Classification ◽

Selection Methods

Ensemble of Feature Selection Methods for Text Classification: An Analytical Study

Advances in Intelligent Systems and Computing - Intelligent Systems Design and Applications ◽

10.1007/978-3-319-76348-4_33 ◽

2018 ◽

pp. 337-349 ◽

Cited By ~ 1

Author(s):

D. S. Guru ◽

Mahamad Suhil ◽

S. K. Pavithra ◽

G. R. Priya

Keyword(s):

Feature Selection ◽

Text Classification ◽

Analytical Study ◽

Selection Methods

The Impact of Feature Selection Methods for Classifying Arabic Textual Data

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7163.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 1333-1338

Keyword(s):

Feature Selection ◽

Text Classification ◽

Information Gain ◽

Feature Space ◽

Support Vector ◽

Selection Methods ◽

K Nearest Neighbors ◽

Chi Square ◽

Selection Algorithms ◽

The Impact

Text classification is a vital process due to the large volume of electronic articles. One of its main difficulties is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ²), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known classification algorithms: Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree, on a benchmark Arabic textual dataset from the Saudi Press Agency (SPA), to evaluate the impact of feature selection methods. Using the WEKA tool, we applied the four classification algorithms with and without the feature selection algorithms. The results provide clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.

Combining Feature Selection Methods with BERT: An In-depth Experimental Study of Long Text Classification

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering - Collaborative Computing: Networking, Applications and Worksharing ◽

10.1007/978-3-030-67537-0_34 ◽

2021 ◽

pp. 567-582

Author(s):

Kai Wang ◽

Jiahui Huang ◽

Yuqi Liu ◽

Bin Cao ◽

Jing Fan

Keyword(s):

Experimental Study ◽

Feature Selection ◽

Text Classification ◽

Selection Methods
