A machine learning approach for Arabic text classification using N-gram frequency statistics

2009 ◽ Vol 3 (1) ◽ pp. 72-77 ◽ Author(s): Laila Khreisat

2016 ◽ Vol 57 ◽ pp. 117-126 ◽ Author(s): Abinash Tripathy, Ankit Agrawal, Santanu Kumar Rath

2017 ◽ Vol 69 ◽ pp. 40-58 ◽ Author(s): Thiago Salles, Leonardo Rocha, Fernando Mourão, Marcos Gonçalves, Felipe Viegas, ...

2021 ◽ Author(s): Dana Dannélls, Shafqat Virk

Training machine learning models with high accuracy requires careful feature engineering, which involves finding the best feature combinations and extracting their values from the data. The task becomes extremely laborious for specific problems such as post-Optical Character Recognition (OCR) error detection because of the diversity of errors in the data. In this paper we present a machine learning approach that exploits character n-gram statistics as the only feature for the OCR error detection task. Our method achieves a significant improvement over the baseline, reaching state-of-the-art results of 91% and 89% F1 score on English and Swedish datasets, respectively. We report various experiments conducted to select the appropriate machine learning algorithm and to compare our approach with previously reported traditional approaches.
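The abstract does not include an implementation, but the core idea, using character n-gram statistics as the sole feature for detecting OCR errors, can be illustrated with a short scikit-learn sketch. The training tokens, labels, n-gram range, and the logistic regression classifier below are illustrative assumptions, not the authors' setup; the paper compares several learning algorithms.

```python
# Minimal sketch (assumed setup, not the authors' code): token-level OCR error
# detection using character n-gram counts as the only feature.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: OCR tokens labelled 1 (erroneous) or 0 (correct).
tokens = ["tlie", "the", "recognitlon", "recognition", "qnick", "quick"]
labels = [1, 0, 1, 0, 1, 0]

# Character n-grams (here 1-3 characters) are the only feature extracted.
features = CountVectorizer(analyzer="char", ngram_range=(1, 3))
clf = make_pipeline(features, LogisticRegression(max_iter=1000))
clf.fit(tokens, labels)

# With a realistic amount of training data, misrecognized tokens such as
# "machlne" would be flagged as errors while "machine" would not.
print(clf.predict(["machlne", "machine"]))
```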


Author(s): Syed Md. Minhaz Hossain, Iqbal H. Sarker

Recently, spam emails have become a significant problem with the expanding usage of the Internet, so filtering emails has become, to some extent, a necessity. A spam filter is a system that detects undesired and malicious emails and blocks them from reaching users' inboxes. Spam filters check emails for anything "suspicious" in the text, email address, header, attachments, and language. In our proposed approach, we use different features such as word2vec, word n-grams, character n-grams, and a combination of variable-length n-grams for comparative analysis. Different machine learning models, such as support vector machine (SVM), decision tree (DT), logistic regression (LR), and multinomial naïve Bayes (MNB), are trained on the extracted features. We use different evaluation metrics, such as precision, recall, F1-score, and accuracy, to evaluate the experimental results. Among them, SVM achieves 97.6% accuracy, 98.8% precision, and a 94.9% F1-score using a combination of n-gram features.
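As a rough illustration of the strongest reported configuration, combined n-gram features with an SVM, the sketch below unions word and character n-gram representations and trains a linear SVM with scikit-learn. The example emails, the TF-IDF weighting, and the specific n-gram ranges are assumptions for illustration only and are not taken from the paper.

```python
# Illustrative sketch (assumed setup): spam detection with a combination of
# word and character n-gram features feeding a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled emails: 1 = spam, 0 = ham.
emails = [
    "WIN a FREE prize now, click here!!!",
    "Meeting moved to 3pm, see agenda attached.",
    "Cheap meds, limited offer, buy today",
    "Can you review the quarterly report draft?",
]
labels = [1, 0, 1, 0]

# Combine word n-grams (1-2) and character n-grams (2-4) into one feature space.
combined_ngrams = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

model = make_pipeline(combined_ngrams, LinearSVC())
model.fit(emails, labels)

print(model.predict(["Free offer, click now", "Agenda for tomorrow's meeting"]))
```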

