A CATEGORY CLASSIFICATION ALGORITHM FOR INDONESIAN AND MALAY NEWS DOCUMENTS

Text classification (TC) provides a better way to organize information because it supports understanding and interpretation of the content. It deals with assigning labels to groups of similar textual documents. However, TC research on Asian-language documents is limited compared with research on English documents, and research on news articles is scarcer still. In addition, TC research on documents in morphologically similar languages such as Indonesian and Malay remains scarce. Hence, the aim of this study is to develop an integrated generic TC algorithm that identifies the language of a news document and then classifies its category. Furthermore, a top-n feature selection method is used to improve TC performance and to address two challenges of classifying online news corpora: the rapid growth of online news documents and high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from 2014 to 2015. The classification method achieved an accuracy of up to 95.63% for language identification and 97.5% for category classification. The category classifier works optimally at n = 60%, with an average computational time of 35 seconds. This suggests that the integrated generic TC algorithm has an advantage over manual classification and is suitable for Indonesian and Malay news classification.
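
The abstract does not say how terms are scored before the top n percent are kept, so the following is only a minimal sketch of the general idea. It assumes document frequency as the ranking score, and the function name `top_n_percent_terms` and the three toy headlines are hypothetical, not taken from the study.

```python
from collections import Counter


def top_n_percent_terms(docs, n_percent):
    """Return the n_percent highest-ranked terms of the vocabulary.

    Assumption: terms are ranked by document frequency; the study may
    use a different score.
    """
    df = Counter()
    for doc in docs:
        # Count each term once per document.
        df.update(set(doc.lower().split()))
    # Highest document frequency first; ties broken alphabetically
    # so the result is deterministic.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    keep = max(1, round(len(ranked) * n_percent / 100))
    return ranked[:keep]


# Toy headlines, invented for illustration only.
docs = [
    "harga minyak naik",
    "harga beras naik",
    "pasukan bola menang",
]
selected = top_n_percent_terms(docs, 60)
print(selected)
```

With n = 60 the sketch keeps 4 of the 7 vocabulary terms; a classifier would then be trained on those terms only, which is where the reduction in computational time would come from.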

A NEW FEATURE SELECTION METHOD FOR TEXT CLASSIFICATION

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001407005466 ◽

2007 ◽

Vol 21 (02) ◽

pp. 423-438 ◽

Cited By ~ 9

Author(s):

GULDEN UCHYIGIT ◽

KEITH CLARK

Keyword(s):

Feature Selection ◽

Text Classification ◽

Information Gain ◽

Feature Selection Method ◽

Feature Space ◽

Selection Method ◽

Computational Time ◽

Small Subset ◽

Selection Methods ◽

New Feature

Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major problem in text classification is the high dimensionality of the feature space. Only a small subset of the words in the feature space are feature words that help determine a document's class; the rest add noise, can make the results unreliable, and significantly increase computational time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated, including a new feature selection method called the GU metric. The other methods evaluated are: Chi-Squared (χ2) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.
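
As an illustration of one of the baseline methods listed above, the sketch below computes the χ2 statistic of a term for a class from a 2×2 contingency table of document counts. The GU metric itself is not defined in the abstract and is not reproduced here; the function name and the example counts are assumptions made for illustration.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic of a term for a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# A term concentrated in one class scores high ...
strong = chi_squared(40, 10, 10, 40)
# ... while a term spread evenly over both classes scores zero.
weak = chi_squared(25, 25, 25, 25)
print(strong, weak)
```

Feature selection then keeps the terms with the highest scores and discards the rest before the classifier is trained.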

A New Feature Selection Method for Text Classification Based on Independent Feature Space Search

Mathematical Problems in Engineering ◽

10.1155/2020/6076272 ◽

2020 ◽

Vol 2020 ◽

pp. 1-14 ◽

Cited By ~ 3

Author(s):

Yong Liu ◽

Shenggen Ju ◽

Junfeng Wang ◽

Chong Su

Keyword(s):

Feature Selection ◽

Text Classification ◽

Predictive Accuracy ◽

Feature Selection Method ◽

Feature Space ◽

Selection Method ◽

The Other ◽

Feature Subset ◽

Search Range ◽

Text Documents

A feature selection method selects representative feature subsets from the original feature set by evaluating feature relevance, with the aim of reducing the dimension of the features while maintaining the predictive accuracy of a classifier. In this study, we propose a feature selection method for text classification based on independent feature space search. First, a relative document-term frequency difference (RDTFD) method is proposed to divide the features in all text documents into two independent feature sets according to their ability to discriminate between positive and negative samples. This division has two functions: it increases the correlation between features and classes while reducing the correlation among features, and it narrows the search range of the feature space while maintaining appropriate feature redundancy. Second, a feature search strategy is used to find the optimal feature subset in the independent feature space, which can improve the performance of text classification. Finally, experiments conducted on six benchmark corpora show that the RDTFD method based on independent feature space search is more robust than the other feature selection methods.

Hybrid Feature Selection Method Based on Harmony Search and Naked Mole-Rat Algorithms for Spoken Language Identification From Audio Signals

IEEE Access ◽

10.1109/access.2020.3028121 ◽

2020 ◽

Vol 8 ◽

pp. 182868-182887 ◽

Cited By ~ 1

Author(s):

Samarpan Guha ◽

Aankit Das ◽

Pawan Kumar Singh ◽

Ali Ahmadian ◽

Norazak Senu ◽

...

Keyword(s):

Feature Selection ◽

Harmony Search ◽

Feature Selection Method ◽

Spoken Language ◽

Selection Method ◽

Language Identification ◽

Audio Signals ◽

Mole Rat ◽

Naked Mole Rat

A Novel Feature Selection Method Based on Category Distribution Ratio in Text Classification

Proceedings of the 2019 International Conference on Mathematics, Big Data Analysis and Simulation and Modelling (MBDASM 2019) ◽

10.2991/mbdasm-19.2019.45 ◽

2019 ◽

Author(s):

Pujian Zong ◽

Jian Bian

Keyword(s):

Feature Selection ◽

Text Classification ◽

Distribution Ratio ◽

Feature Selection Method ◽

Selection Method

Penerapan Ensemble Feature Selection dan Klasterisasi Fitur pada Klasifikasi Dokumen Teks (Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification)

ComTech Computer Mathematics and Engineering Applications ◽

10.21512/comtech.v4i1.2745 ◽

2013 ◽

Vol 4 (1) ◽

pp. 333

Author(s):

Mediana Aryuni

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Selection Method ◽

Majority Voting ◽

Iterative Refinement ◽

Ensemble Method ◽

Computational Time ◽

Feature Clustering

An ensemble method is an approach in which several classifiers are created from the training data; the ensemble is often more accurate than any single classifier, especially if the base classifiers are accurate and differ from one another. Meanwhile, feature clustering can reduce the feature space by joining similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The research methodology consists of text document preprocessing, feature subspace generation using genetic algorithm-based iterative refinement, implementation of base classifiers by applying feature clustering, and integration of the classification results of the base classifiers using both the static selection and majority voting methods. Experimental results show that classifying the dataset into 2 and 3 categories using the feature clustering method is 1.18 and 27.04 seconds faster, respectively, than classification that does not employ the feature selection method. Also, using the static selection method, the ensemble feature selection method with genetic algorithm-based iterative refinement produces 10% and 10.66% better accuracy than the single classifier in classifying the dataset into 2 and 3 categories, respectively. Using the majority voting method for the same experiment, the same ensemble method produces 10% and 12% better accuracy than the single classifier, respectively.
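
The majority voting step mentioned above can be sketched as follows. This is a generic illustration rather than the paper's implementation: the function name, the tie-breaking rule, and the example labels are assumptions, and the base classifiers are represented only by their predicted labels.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per document.

    predictions: one list of predicted labels per base classifier,
    all of the same length (one entry per document).
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Assumption: ties go to the tied label voted first.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical outputs of three base classifiers on three documents.
preds = [
    ["sport", "politics", "sport"],
    ["sport", "sport", "economy"],
    ["economy", "politics", "economy"],
]
result = majority_vote(preds)
print(result)
```

Static selection, the other integration method named in the abstract, would instead pick a single base classifier ahead of time and use only its predictions.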

A mutual information and information entropy pair based feature selection method in text classification

2010 International Conference on Computer Application and System Modeling (ICCASM 2010) ◽

10.1109/iccasm.2010.5620805 ◽

2010 ◽

Author(s):

Zhili Pei ◽

Yuxin Zhou ◽

Lisha Liu ◽

Lihua Wang ◽

Yinan Lu ◽

...

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Text Classification ◽

Information Entropy ◽

Feature Selection Method ◽

Selection Method ◽

Entropy Pair

New Feature Selection Method Based on SVM-RFE

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.926-930.3100 ◽

2014 ◽

Vol 926-930 ◽

pp. 3100-3104 ◽

Cited By ~ 1

Author(s):

Xi Wang ◽

Qiang Li ◽

Zhi Hong Xie

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Recognition Accuracy ◽

Feature Selection Method ◽

Selection Method ◽

Testing Time ◽

Feature Subset ◽

Selection Algorithm ◽

Accuracy Rate ◽

New Feature

This article analyzes the shortcomings of the SVM-RFE feature selection algorithm and proposes a new feature selection method that combines SVM-RFE with PCA. First, the best feature subset is obtained with SVM-RFE using k-fold cross-validation. Then PCA reduces the dimension of that subset to yield an independent feature subset, which serves as the training and testing input of the SVM. Experiments on five UCI data sets indicate that training and testing time is shortened and the recognition accuracy of the SVM is improved.

Distance Variance Score: An Efficient Feature Selection Method in Text Classification

Mathematical Problems in Engineering ◽

10.1155/2015/695720 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Heyong Wang ◽

Ming Hong

Keyword(s):

Feature Selection ◽

Text Mining ◽

Text Classification ◽

Web Applications ◽

Rapid Development ◽

Feature Selection Method ◽

Selection Method ◽

Text Documents ◽

Unsupervised Feature Selection ◽

Classification Feature

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, which has increased interest in text mining. Text classification is one of the most important subfields of text mining. Text documents are often represented as a high-dimensional sparse document term matrix (DTM) before classification. Feature selection is essential for text classification because of the high dimensionality and sparsity of the DTM. An efficient feature selection method both reduces the dimensions of the DTM and selects discriminative features for text classification. Laplacian Score (LS) is an unsupervised feature selection method that has been used successfully in areas such as face recognition. However, LS is unable to select discriminative features for text classification or to effectively reduce the sparsity of the DTM. To improve on it, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses feature distance contribution (a ratio) to rank the importance of features for text documents so as to select discriminative features. Experimental results indicate that DVS is able to select discriminative features and reduce the sparsity of the DTM. Thus, it is much more efficient than LS.
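
To make the notion of a sparse document term matrix concrete, the sketch below builds a small DTM of raw term counts and measures the fraction of zero cells. It does not implement DVS or LS, whose formulas are not given in the abstract; the helper names `build_dtm` and `sparsity` and the three sample texts are hypothetical.

```python
def build_dtm(docs):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: j for j, word in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.lower().split():
            dtm[i][index[word]] += 1
    return vocab, dtm


def sparsity(dtm):
    """Fraction of cells in the matrix that are zero."""
    cells = sum(len(row) for row in dtm)
    zeros = sum(1 for row in dtm for count in row if count == 0)
    return zeros / cells if cells else 0.0


# Sample texts invented for illustration.
docs = [
    "text mining on the web",
    "feature selection for text",
    "social network data",
]
vocab, dtm = build_dtm(docs)
print(len(vocab), sparsity(dtm))
```

Even with three short documents most cells are zero; real corpora with vocabularies of tens of thousands of terms are far sparser, which is the problem feature selection addresses here.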

A novel probabilistic feature selection method for text classification

Knowledge-Based Systems ◽

10.1016/j.knosys.2012.06.005 ◽

2012 ◽

Vol 36 ◽

pp. 226-235 ◽

Cited By ~ 145

Author(s):

Alper Kursat Uysal ◽

Serkan Gunal

Keyword(s):

Feature Selection ◽

Text Classification ◽

Feature Selection Method ◽

Selection Method

Enhanced Classification Method for Phishing Emails Detection

Journal of Information Security and Cybercrimes Research ◽

10.26735/ygmy6142 ◽

2020 ◽

Vol 3 (1) ◽

pp. 58-63

Author(s):

Y. Mansour Mansour ◽

Majed A. Alenizi

Keyword(s):

Feature Selection ◽

Information Gain ◽

Hybrid Approach ◽

Feature Selection Method ◽

Search Space ◽

Selection Method ◽

Classification Model ◽

Selection Methods ◽

Accuracy Rate ◽

Communication Method

Email is currently the main communication method worldwide, as it has proven its efficiency. Phishing emails, on the other hand, are one of the major threats and result in significant losses, estimated at billions of dollars. Phishing is a dynamic problem, a struggle between phishers and defenders in which phishers have more flexibility in manipulating email features and evading anti-phishing techniques. Many solutions have been proposed to mitigate the impact of phishing emails on the targeted sectors, but none has achieved 100% detection and accuracy. As phishing techniques evolve, the solutions need to evolve and generalize in order to mitigate as much as possible. This article presents a new classification model based on a hybrid feature selection method that combines two common feature selection methods, Information Gain and Genetic Algorithm, to keep only significant and high-quality features in the final classifier. The proposed hybrid approach achieved a 98.9% accuracy rate on a phishing email dataset comprising 8266 instances, an improvement of almost 4%. Furthermore, the presented technique reduces the search space by reducing the number of selected features.
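
The Information Gain half of the hybrid can be illustrated with the sketch below, which scores a binary email feature by how much it reduces class entropy. The Genetic Algorithm stage and the paper's actual feature set are not shown; the function names and the example counts are assumptions made for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(present, absent):
    """Information gain of a binary feature.

    present: class counts for emails that have the feature
    absent:  class counts for emails that lack it
    Both lists use the same class order, e.g. [phishing, legitimate].
    """
    totals = [p + a for p, a in zip(present, absent)]
    n = sum(totals)
    if n == 0:
        return 0.0
    # Entropy left after splitting on the feature, weighted by group size.
    conditional = sum(sum(group) / n * entropy(group) for group in (present, absent))
    return entropy(totals) - conditional


# Hypothetical counts: the feature appears mostly in phishing emails.
ig_useful = information_gain([40, 10], [10, 40])
# A feature spread evenly over both classes carries no information.
ig_useless = information_gain([25, 25], [25, 25])
print(ig_useful, ig_useless)
```

Ranking features by this score and keeping only the top ones shrinks the space that a later search stage, such as a genetic algorithm, has to explore.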
