Keyword Extraction from Arabic Text Using the PageRank Algorithm

This paper describes how keywords are extracted from Arabic text using the PageRank algorithm. A graph is constructed whose vertices are candidate words extracted from the title and abstract of a given Arabic text after applying a tagging filter to that text. Next, a co-occurrence relation within a specified window size is applied to draw the edges between the vertices. The PageRank algorithm is then applied to the graph to rank the importance of each candidate; finally, the vertices are sorted in descending order by their PageRank scores and the highest-scoring tokens are chosen as the keywords. Several experiments were conducted on a dataset of 100 Arabic academic articles for training and 50 for testing, and the results were evaluated using precision, recall, and the F-measure. The maximum achievable recall on the dataset was 63%, as not all of the manually identified keywords and keyphrases appeared in the article abstracts and titles. The proposed method achieved 25% recall, which is acceptable given that a comparable method in the literature, applied to an English testing dataset of 500 documents, achieved 42% recall against a maximum achievable recall of 78%. Despite the difficulties and challenges of searching for keywords in the Arabic language, and the smaller Arabic testing dataset compared to the English one, it can be concluded that the proposed keyword and keyphrase extraction system using the PageRank algorithm works well.
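The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names, the round of plain iterative PageRank, and the English tokens standing in for tagged Arabic candidates are all assumptions for demonstration.

```python
# Sketch of the abstract's pipeline: co-occurrence graph over candidate
# tokens -> PageRank -> top-scoring tokens become keywords.
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link every pair of distinct tokens appearing within `window` positions."""
    edges = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                edges[w].add(tokens[j])
                edges[tokens[j]].add(w)
    return edges

def pagerank(edges, damping=0.85, iterations=50):
    """Plain iterative PageRank on an undirected graph."""
    nodes = list(edges)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        score = {
            n: (1 - damping) / len(nodes)
               + damping * sum(score[m] / len(edges[m]) for m in edges[n])
            for n in nodes
        }
    return score

def top_keywords(tokens, k=3, window=2):
    """Sort vertices by PageRank score, descending, and keep the top k."""
    scores = pagerank(cooccurrence_graph(tokens, window))
    return sorted(scores, key=scores.get, reverse=True)[:k]

tokens = ["graph", "ranking", "keyword", "graph", "algorithm",
          "keyword", "extraction", "graph", "keyword"]
print(top_keywords(tokens))
```

Frequent, well-connected tokens ("graph", "keyword") accumulate the highest scores, which is the intuition behind using PageRank for keyword extraction.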

Author(s):  
Suhaibah Jusoh ◽  
Aida Mustapha ◽  
Azizan Ismail ◽  
Roshidi Din

Steganography is a strategy for hiding secret information in a cover document so that an attacker cannot detect the hidden information. Steganography exploits a cover medium, for instance text, audio, image, or video, to hide the secret message. Until recently, linguistic text steganographic techniques were implemented only for the English language, but other languages, such as Arabic, are now being used to hide data. Arabic is still new to steganography and needs further work for empowerment. This paper presents text steganographic methods for the Arabic language: scholarly papers from the past five years are analyzed and compared. The main objective of this paper is to give a comparative analysis of the Arabic steganography methods that have been applied by previous researchers. Finally, the advantages and disadvantages of each method are also presented.


Author(s):  
Tarek Kanan ◽  
Bilal Hawashin ◽  
Shadi Alzubi ◽  
Eyad Almaita ◽  
Ahmad Alkhatib ◽  
...  

Introduction: Stemming is an important preprocessing step in text classification and can contribute to increasing classification accuracy. Although many works have proposed stemmers for the English language, few stemmers have been proposed for Arabic text. The Arabic language has gained increasing attention in recent decades, and there is a vital need to further improve Arabic text classification. Method: This work combined the recently proposed P-Stemmer with various classifiers to find the optimal classifier for the P-Stemmer in terms of Arabic text classification. As part of this work, a synthesized dataset was collected. Result: The experiments show that the use of the P-Stemmer has a positive effect on classification. The degree of improvement was classifier-dependent, which is reasonable as classifiers vary in their methodologies. Moreover, the experiments show that the best classifier with the P-Stemmer was Naïve Bayes (NB), an interesting result as this classifier is well known for its fast learning and classification time. Discussion: First, continuous improvement of the P-Stemmer through further optimization steps is necessary to improve Arabic text categorization. This can be done by combining more classifiers with the stemmer, by optimizing the other natural language processing steps, and by improving the set of stemming rules. Second, the lack of sufficient Arabic datasets, especially large ones, is still an issue. Conclusion: In this work, an improved P-Stemmer was proposed by combining its use with various classifiers. To evaluate its performance, and due to the lack of Arabic datasets, a novel Arabic dataset was synthesized from various online news pages. Next, the P-Stemmer was combined with Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star.


2021 ◽  
Author(s):  
Thomas Hegghammer

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n=322) and Arabic-language article scans (n=100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) were substantially more accurate than Tesseract, especially on noisy documents. Accuracy for English was considerably better than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available "Noisy OCR Dataset" (NOD).


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 04) ◽  
pp. 319-326
Author(s):  
Ammar Sabeeh Hmoud Altamimi ◽  
Ali Mohsin Kaittan

Most encryption techniques deal with the English language; few deal with Arabic. Therefore, many researchers are interested in encryption ciphers applied to text written in Arabic, which is the motivation behind this paper. In this paper, three cipher methods are implemented together on Arabic text; using more than one cipher method increases the security of the algorithm. Each letter of the plaintext is encrypted by a specified cipher method, and a controlling process selects which of the three methods encrypts each letter. The cipher methods used in this paper are RSA, Playfair, and Vigenère, each of which has a different underlying mathematical model. The proposed Arabic text encryption method gives better results than previous related papers.
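The controller idea, one cipher selected per plaintext letter, can be sketched as follows. This is only an illustration of the selection mechanism: a Latin alphabet stands in for Arabic, simple classical steps (Caesar, Vigenère, Atbash) stand in for the paper's RSA/Playfair/Vigenère, and the round-robin selection rule is an assumption, not the paper's controlling process.

```python
# Illustrative controller: pick one of three cipher steps per letter.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def caesar(ch, shift=3):
    """Fixed-shift substitution (stand-in for a heavier cipher like RSA)."""
    return ALPHABET[(ALPHABET.index(ch) + shift) % 26]

def vigenere(ch, key_ch):
    """Keyed shift substitution, one Vigenère step."""
    return ALPHABET[(ALPHABET.index(ch) + ALPHABET.index(key_ch)) % 26]

def atbash(ch):
    """Alphabet reversal (stand-in for Playfair)."""
    return ALPHABET[25 - ALPHABET.index(ch)]

def encrypt(plaintext, key="key"):
    out = []
    for i, ch in enumerate(plaintext):
        method = i % 3  # controller: round-robin over the three ciphers
        if method == 0:
            out.append(caesar(ch))
        elif method == 1:
            out.append(vigenere(ch, key[i % len(key)]))
        else:
            out.append(atbash(ch))
    return "".join(out)

print(encrypt("abc"))  # -> "dfx"
```

Because each letter's cipher is determined by position, a decryptor with the key can invert each step in the same order, while an attacker must break three different schemes at once.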




2010 ◽  
Vol 12 (1-2) ◽  
pp. 337-314
Author(s):  
ʿAbd Allāh Muḥammad al-Shāmī

The question of clarifying the meaning of a given Arabic text is a subtle one, especially as high literature texts can often be read in more than one way. Arabic is rich in figurative language and this can lead to variety in meaning, sometimes in ways that either adhere closely or diverge far from the ‘original’ meaning. In order to understand a fine literary text in Arabic, one must have a comprehensive understanding of the issue of taʾwīl, and the concept that multiplicity of meaning does not necessarily lead to contradiction. This article surveys the opinions of various literary critics and scholars of balāgha on this issue with a brief discussion of the concepts of tafsīr and sharḥ, which sometimes overlap with taʾwīl.


2019 ◽  
Vol 8 (2) ◽  
Author(s):  
Dinh Thi Bac Binh ◽  
Dinh Thi Kieu Trinh

The International English Language Testing System (IELTS) is recognized as an accountable tool to assess whether a person is able to study or train in English. Every year, thousands of students sit for IELTS. However, the number of those who are recognized to be capable enough to take a course in English is somewhat limited, especially for those who are not majoring in English at their universities. IELTS Reading is considered a discerning skill, and it is of equal importance to listening, speaking, and writing in obtaining the IELTS objective of band 6 or 6.5. Being teachers of English at a training institution, the authors recognize that students can make time-saving improvements in their reading command under their teachers' insightful guidance.


2021 ◽  
Vol 11 (15) ◽  
pp. 6851
Author(s):  
Reema Thabit ◽  
Nur Izura Udzir ◽  
Sharifah Md Yasin ◽  
Aziah Asmawi ◽  
Nuur Alifah Roslan ◽  
...  

Protecting sensitive information transmitted via public channels is a significant issue faced by governments, militaries, organizations, and individuals. Steganography protects secret information by concealing it in a transferred object such as video, audio, image, text, network traffic, or DNA. As text uses low bandwidth, it is commonly used by Internet users in their daily activities, resulting in a vast amount of text sent daily as social media posts and documents. Accordingly, text is an ideal object for steganography, since hiding a secret message in a text makes it difficult for the attacker to detect the hidden message among the massive text content on the Internet. A language's characteristics are utilized in text steganography. Despite the richness of the Arabic language in linguistic characteristics, only a few studies have been conducted in Arabic text steganography. To draw further attention to the prospects of Arabic text steganography, this paper reviews the classifications of these methods from their inception. For analysis, this paper presents a comprehensive study based on the key evaluation criteria (i.e., capacity, invisibility, robustness, and security). It opens new areas for further research based on the trends in this field.


2014 ◽  
Vol 4 (1) ◽  
pp. 29-45 ◽  
Author(s):  
Rami Ayadi ◽  
Mohsen Maraoui ◽  
Mounir Zrigui

In this paper, the authors present a latent topic model to index and represent Arabic text documents while reflecting more of their semantics. Text representation in a language with highly inflectional morphology such as Arabic is not a trivial task and requires special treatment. The authors describe their approach for analyzing and preprocessing Arabic text, and then describe the stemming process. Finally, the latent model (LDA) is adapted to extract Arabic latent topics: the authors extract the significant topics of all texts, each topic is described by a particular distribution of descriptors, and each text is then represented as a vector over these topics. The classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different topic numbers K (25, 50, 75, and 100), and the result is compared with classification in the full word space. The results show that the performance of classification in the reduced topic space, in terms of precision, recall, and F-measure, outperforms classification in the full word space and classification using LSI reduction.
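The payoff of representing each text as a vector over K topics rather than over the full vocabulary can be seen in a toy comparison. The numbers below are invented for illustration (they are not the paper's data): two documents that share no vocabulary are orthogonal in word space, yet clearly related in a 3-topic space.

```python
# Compare document similarity in word space vs. a reduced topic space.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Word space: raw term counts; the documents share no vocabulary.
doc1_words = [1, 2, 0, 0]
doc2_words = [0, 0, 3, 1]

# Topic space (K = 3): topic weights, as produced by a model like LDA;
# both documents load mostly on the same latent topic.
doc1_topics = [0.8, 0.1, 0.1]
doc2_topics = [0.7, 0.2, 0.1]

print(cosine(doc1_words, doc2_words))    # 0.0 -- orthogonal in word space
print(cosine(doc1_topics, doc2_topics))  # close to 1 in topic space
```

This dimensionality reduction is why classification in the topic space can outperform the sparse full-word representation, especially for a morphologically rich language like Arabic.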


Author(s):  
Ahmed Maher Khafaga Shehata ◽  
Amr Hassan Fatouh Hassan

The purpose of this paper is to report the findings of a study of information-seeking behavior among a group of Arab postgraduate students in social science and humanities disciplines. The paper also explores information-seeking styles and examines how information seeking is affected by external factors. The study employed a qualitative approach to explore information-seeking behavior in the sample and the sources of information used to obtain scholarly information. A sample of 33 participants was interviewed to elucidate the information-seeking behavior of Arabic language speakers. The analysis of the interviews revealed that the participants use different methods to find information on the internet, varying from search engines to sites that provide pirated scholarly papers. The data showed that most of the sampled students use search engines and the databases provided by their universities, but they should be trained in research ethics to avoid unacceptable research practices. The results also indicate that searching in other languages represents a challenge for Arab postgraduates in the social sciences and humanities. This study was conducted with social science and humanities postgraduates as part of a series of studies aiming to explore Arabic language speakers' scholarly practices. The information-seeking behavior in science disciplines may differ, as the teaching language there is mainly English. This study contributes to the field by expanding our understanding of how non-English speakers seek scholarly information and which sources they use to obtain scholarly papers.

