An Experimental Study for the Effect of Stop Words Elimination for Arabic Text Classification Algorithms

In this paper, an experimental study was conducted on three techniques for Arabic text classification: Support Vector Machine (SVM) trained with Sequential Minimal Optimization (SMO), Naïve Bayesian (NB), and J48. The paper assesses the accuracy of each classifier and determines which classifier is most accurate for Arabic text classification when stop words are eliminated. The accuracy of each classifier is measured using the percentage split (holdout) method and k-fold cross-validation, along with the time needed to classify Arabic text. The results show that the SMO classifier achieves the highest accuracy and the lowest error rate, and that the time needed to build the SMO model is much lower than for the other classification techniques.
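
The two evaluation modes named in the abstract can be sketched in a few lines. This is only an illustration using the Python standard library, not the authors' setup: the helper names, the 66% training fraction, the fold count, and the placeholder document list are all assumptions.

```python
import random

def holdout_split(items, train_fraction=0.66, seed=0):
    """Percentage split: shuffle once, then cut into a train part and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(items, k=10, seed=0):
    """k-fold cross-validation: every item is used for testing exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical document identifiers standing in for a real corpus.
docs = [f"doc{i}" for i in range(20)]
train, test = holdout_split(docs)
folds = list(k_fold_splits(docs, k=5))
```

With the holdout method each classifier is trained and scored once, whereas k-fold cross-validation trains it k times and averages the scores, which costs more time but uses every document for testing.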

The Effect of Stemming on Arabic Text Classification

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2011070104 ◽

2011 ◽

Vol 1 (3) ◽

pp. 54-70 ◽

Cited By ~ 11

Author(s):

Abdullah Wahbeh ◽

Mohammed Al-Kabi ◽

Qasem Al-Radaideh ◽

Emad Al-Shawakfa ◽

Izzat Alsmadi

Keyword(s):

Text Classification ◽

Digital Libraries ◽

Arabic Language ◽

Support Vector ◽

SVM Classifier ◽

Arabic Text ◽

Text Documents ◽

Information Retrieval Systems ◽

Arabic Text Classification ◽

The Web

The information world is rich in documents in different formats or applications, such as databases, digital libraries, and the Web. Text classification aids the search functionality offered by search engines and information retrieval systems in dealing with the large number of documents on the Web. Many text classification studies have been applied to English, Dutch, Chinese, and other languages, whereas fewer have been applied to Arabic. This paper addresses the automatic classification of Arabic text documents. It applies text classification to Arabic text documents using stemming as part of the preprocessing steps. The results show that, without stemming, the support vector machine (SVM) classifier achieved the highest classification accuracy under the two test modes, with 87.79% and 88.54%. Stemming, on the other hand, negatively affected accuracy: the SVM accuracy under the two test modes dropped to 84.49% and 86.35%.

The impact of indexing approaches on Arabic text classification

Journal of Information Science ◽

10.1177/0165551515625030 ◽

2016 ◽

Vol 43 (2) ◽

pp. 159-173 ◽

Cited By ~ 10

Author(s):

Amer Al-Badarneh ◽

Emad Al-Shawakfa ◽

Basel Bani-Ismail ◽

Khaleel Al-Rababah ◽

Safwan Shatnawi

Keyword(s):

Cross Validation ◽

Arabic Text ◽

Word Form ◽

Bayes Classifier ◽

Stem Form ◽

Average Accuracy ◽

Arabic Text Classification ◽

The Impact ◽

And Storage ◽

Fold Cross Validation

This paper investigates the impact of using different indexing approaches (full-word, stem, and root) when classifying Arabic text. In this study, the naïve Bayes classifier is used to construct the multinomial classification models and is evaluated using stratified k-fold cross-validation (k ranges from 2 to 10). The study uses a corpus of 1000 normalized Arabic documents. The results of one experiment show significant accuracy improvements when the full-word form is used in most k-folds. Further experiments show that the classifier achieved its highest accuracy with eight folds, that is, a 7/8–1/8 train–test ratio, regardless of the indexing approach used. Overall, the classifier achieved a maximum micro-average accuracy of 99.36% using either the full-word form or the stem form. This indicates that the stem is the better choice when classifying Arabic text, because it makes the corpus dataset smaller, which improves both processing time and storage utilization, while still achieving the highest level of accuracy.
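
As a rough illustration of the stratified k-fold evaluation mentioned above, the sketch below deals document indices into folds class by class so that each fold keeps the class proportions of the whole corpus. The function name, the two class labels, and the toy corpus size are assumptions made for this example; the paper's own tooling is not described in the abstract.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each document index to one of k folds, one class at a time."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Round-robin dealing keeps each class evenly spread over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical toy corpus: 16 documents of one class, 8 of another.
labels = ["sport"] * 16 + ["economy"] * 8
folds = stratified_folds(labels, k=8)

# With eight folds, one fold is held out for testing and seven are used
# for training, which gives the 7/8-1/8 train-test ratio.
test_fold = folds[0]
train_size = len(labels) - len(test_fold)
train_ratio = train_size / len(labels)
```

Here every fold receives two "sport" documents and one "economy" document, so each test fold mirrors the 2:1 class balance of the full toy corpus.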

An experimental study for Arabic text classification techniques

10.1117/12.946039 ◽

2012 ◽

Author(s):

Bassam Al-Shargabi ◽

Fekry Olayah

Keyword(s):

Experimental Study ◽

Text Classification ◽

Arabic Text ◽

Classification Techniques ◽

Arabic Text Classification

Different Classification Algorithms Based on Arabic Text Classification: Feature Selection Comparative Study

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2015.060228 ◽

2015 ◽

Vol 6 (2) ◽

Cited By ~ 6

Author(s):

Ghazi Raho ◽

Riyad Al-Shalabi ◽

Ghassan Kanaan ◽

Asmaa Nassar

Keyword(s):

Feature Selection ◽

Comparative Study ◽

Text Classification ◽

Classification Algorithms ◽

Arabic Text ◽

Classification Feature ◽

Arabic Text Classification

A comparative study for Arabic text classification algorithms based on stop words elimination

Proceedings of the 2011 International Conference on Intelligent Semantic Web-Services and Applications - ISWSA '11 ◽

10.1145/1980822.1980833 ◽

2011 ◽

Cited By ~ 14

Author(s):

Bassam Al-Shargabi ◽

Waseem Al-Romimah ◽

Fekry Olayah

Keyword(s):

Comparative Study ◽

Text Classification ◽

Classification Algorithms ◽

Arabic Text ◽

Arabic Text Classification

The Effect of Stemming on Arabic Text Classification

Information Retrieval Methods for Multidisciplinary Applications ◽

10.4018/978-1-4666-3898-3.ch013 ◽

2013 ◽

pp. 207-225 ◽

Cited By ~ 3

Author(s):

Abdullah Wahbeh ◽

Mohammed Al-Kabi ◽

Qasem Al-Radaideh ◽

Emad Al-Shawakfa ◽

Izzat Alsmadi

Keyword(s):

Text Classification ◽

Digital Libraries ◽

Arabic Language ◽

Support Vector ◽

SVM Classifier ◽

Arabic Text ◽

Text Documents ◽

Information Retrieval Systems ◽

Arabic Text Classification ◽

The Web

The information world is rich in documents in different formats or applications, such as databases, digital libraries, and the Web. Text classification aids the search functionality offered by search engines and information retrieval systems in dealing with the large number of documents on the Web. Many text classification studies have been applied to English, Dutch, Chinese, and other languages, whereas fewer have been applied to Arabic. This paper addresses the automatic classification of Arabic text documents. It applies text classification to Arabic text documents using stemming as part of the preprocessing steps. The results show that, without stemming, the support vector machine (SVM) classifier achieved the highest classification accuracy under the two test modes, with 87.79% and 88.54%. Stemming, on the other hand, negatively affected accuracy: the SVM accuracy under the two test modes dropped to 84.49% and 86.35%.

Improving Arabic Text Classification Using P-Stemmer

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904114023 ◽

2020 ◽

Vol 13 ◽

Author(s):

Tarek Kanan ◽

Bilal Hawashin ◽

Shadi Alzubi ◽

Eyad Almaita ◽

Ahmad Alkhatib ◽

...

Keyword(s):

Language Processing ◽

Text Classification ◽

Text Categorization ◽

English Language ◽

Arabic Language ◽

Online News ◽

Support Vector ◽

Arabic Text ◽

Fast Learning ◽

Arabic Text Classification

Introduction: Stemming is an important preprocessing step in text classification and can contribute to increasing text classification accuracy. Although many works have proposed stemmers for the English language, few stemmers have been proposed for Arabic text. The Arabic language has gained increasing attention in recent decades, and there is a vital need to further improve Arabic text classification. Method: This work combined the recently proposed P-Stemmer with various classifiers to find the optimal classifier for the P-Stemmer in terms of Arabic text classification. As part of this work, a synthesized dataset was collected. Result: The experiments show that the use of the P-Stemmer has a positive effect on classification. The degree of improvement was classifier-dependent, which is reasonable as classifiers vary in their methodologies. Moreover, the experiments show that the best classifier with the P-Stemmer was NB. This is an interesting result, as this classifier is well known for its fast learning and classification time. Discussion: First, continued improvement of the P-Stemmer through further optimization steps is necessary to further improve Arabic text categorization. This can be done by combining more classifiers with the stemmer, by optimizing the other natural language processing steps, and by improving the set of stemming rules. Second, the lack of sufficient Arabic datasets, especially large ones, is still an issue. Conclusion: In this work, an improved P-Stemmer was proposed by combining its use with various classifiers. To evaluate its performance, and due to the lack of Arabic datasets, a novel Arabic dataset was synthesized from various online news pages. Next, the P-Stemmer was combined with Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star.

Arabic Text Mining Using Rule Based Classification

Journal of Information & Knowledge Management ◽

10.1142/s0219649212500062 ◽

2012 ◽

Vol 11 (01) ◽

pp. 1250006 ◽

Cited By ~ 5

Author(s):

Fadi Thabtah ◽

Omar Gharaibeh ◽

Rashid Al-Zubaidy

Keyword(s):

Text Mining ◽

Text Classification ◽

Business Intelligence ◽

Classification Problem ◽

Decision Making Process ◽

Classification Algorithms ◽

Arabic Text ◽

Essential Information ◽

Rule Based ◽

Arabic Text Classification

A well-known classification problem in the domain of text mining is text classification, which concerns mapping textual documents into one or more predefined categories based on their content. Text classification has recently attracted many researchers because of the massive amounts of online documents and text archives that hold essential information for decision-making processes. Most research in this field focuses on classifying English documents, while only limited studies have been conducted on other languages such as Arabic. In this respect, the paper investigates the problem of Arabic text classification comprehensively. More specifically, the study measures the performance of different rule-based classification approaches, adopted from machine learning and data mining, on the problem of Arabic text classification. In particular, four rule-based classification approaches are evaluated against the published Corpus of Contemporary Arabic text collection: Decision trees (C4.5), Rule Induction (RIPPER), Hybrid (PART), and Simple Rule (One Rule). The experimentation is carried out using a modified version of the WEKA business intelligence tool. By analysing the results of the experimentation, we determine the most suitable classification algorithms for classifying Arabic texts.
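
Of the four rule-based approaches listed, One Rule is simple enough to sketch briefly. The code below is a minimal, generic version of the idea (pick the single feature whose value-to-class rule makes the fewest training errors). It is not the modified WEKA implementation used in the paper, and the function names, binary term-presence features, and class labels are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(rows, labels):
    """Return (feature index, value -> class rule, fallback class)."""
    fallback = Counter(labels).most_common(1)[0][0]
    best = None
    for feature in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[feature]][label] += 1
        # For each feature value, predict the class seen most often with it.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[feature]] != label
                     for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    _, feature, rule = best
    return feature, rule, fallback

def predict_one_rule(model, row):
    feature, rule, fallback = model
    # Unseen feature values fall back to the most common training class.
    return rule.get(row[feature], fallback)

# Hypothetical features: (contains term A, contains term B).
rows = [(1, 0), (1, 0), (0, 1), (0, 1), (1, 1), (0, 0)]
labels = ["sport", "sport", "economy", "economy", "sport", "economy"]
model = train_one_rule(rows, labels)
```

On this toy data the first feature separates the classes without error, so the learned model is a single rule on that feature.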

A MACHINE LEARNING APPROACH BASED ON SVM FOR CLASSIFICATION OF LIVER DISEASES

Biomedical Engineering Applications Basis and Communications ◽

10.4015/s1016237220500180 ◽

2020 ◽

Vol 32 (03) ◽

pp. 2050018

Author(s):

Mohammad Fathi ◽

Mohammadreza Nemati ◽

Seyed Mohsen Mohammadi ◽

Reza Abbasi-Kesbi

Keyword(s):

Cross Validation ◽

The Body ◽

Support Vector ◽

Classification Algorithms ◽

Data Partition ◽

Linear Quadratic ◽

Machine Learning Approach ◽

And Performance ◽

Fold Cross Validation

The liver is an organ in the body that plays an important role in the production and secretion of bile. Recently, the number of liver patients has been increasing because of the inhalation of harmful gases and the consumption of contaminated foods, herbs, and narcotics. Today, classification algorithms are widely used in diverse medical applications. In this paper, the classification of liver and non-liver patients is performed using a support vector machine (SVM) on two datasets. To this end, each dataset is normalized and then sorted based on a proposed algorithm. After that, feature selection is performed in order to remove outliers and missing data. Then, 10-fold cross-validation is used for the data partition. Finally, Linear, Quadratic, and Gaussian SVM classification models are defined, and the performance of the proposed method is evaluated by calculating the F1-score, accuracy, and sensitivity. The results show that the ILPD data reach a maximum accuracy, sensitivity, and F1-score of 90.9%, 89.2%, and 94%, respectively, an accuracy improvement of at least 17.9% over previous works. Additionally, the highest accuracy, sensitivity, and F1-score on the BUPA data are 92.2%, 89%, and 94.3%, respectively.
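
The three reported measures follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions of accuracy, sensitivity (recall), and F1-score; the function name and the counts are made up for illustration and are not taken from the ILPD or BUPA results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that were right
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity, "f1": f1}

# Hypothetical counts for one test fold (not from the paper).
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because the F1-score depends on precision as well as sensitivity, it can be higher or lower than accuracy on the same predictions, depending on how the errors are distributed.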
