Keyword Extraction from Arabic Text Using the PageRank Algorithm

This paper describes how keywords are extracted from Arabic text using the PageRank algorithm. A graph is constructed whose vertices are candidate words extracted from the title and abstract of a given Arabic text after applying a tagging filter to that text. Next, a co-occurrence relation within a specified window size is applied to draw the edges between the vertices. The PageRank algorithm is then applied to the graph to rank the importance of each candidate; finally, the vertices are sorted in descending order by their PageRank scores and the highest-scoring tokens are chosen as the keywords. Several experiments were conducted on a dataset of 100 Arabic academic articles for training and 50 for testing, and the results were evaluated using precision, recall, and the F-measure. The maximum achievable recall on the dataset was 63%, as not all of the manually identified keywords and keyphrases appeared in the article abstracts and titles. The proposed method achieved 25% recall, which is acceptable given that a comparable method in the literature, applied to an English testing dataset of 500 documents, achieved 42% recall against a maximum achievable recall of 78%. Despite the difficulties and challenges of searching for keywords in the Arabic language, and the smaller Arabic testing dataset compared to the English one, it can be concluded that the proposed keyword and keyphrase extraction system using the PageRank algorithm works well.
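The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names, the round of plain iterative PageRank, and the English tokens standing in for tagged Arabic candidates are all assumptions for demonstration.

```python
# Sketch of the abstract's pipeline: co-occurrence graph over candidate
# tokens -> PageRank -> top-scoring tokens become keywords.
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link every pair of distinct tokens appearing within `window` positions."""
    edges = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                edges[w].add(tokens[j])
                edges[tokens[j]].add(w)
    return edges

def pagerank(edges, damping=0.85, iterations=50):
    """Plain iterative PageRank on an undirected graph."""
    nodes = list(edges)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        score = {
            n: (1 - damping) / len(nodes)
               + damping * sum(score[m] / len(edges[m]) for m in edges[n])
            for n in nodes
        }
    return score

def top_keywords(tokens, k=3, window=2):
    """Sort vertices by PageRank score, descending, and keep the top k."""
    scores = pagerank(cooccurrence_graph(tokens, window))
    return sorted(scores, key=scores.get, reverse=True)[:k]

tokens = ["graph", "ranking", "keyword", "graph", "algorithm",
          "keyword", "extraction", "graph", "keyword"]
print(top_keywords(tokens))
```

Frequent, well-connected tokens ("graph", "keyword") accumulate the highest scores, which is the intuition behind using PageRank for keyword extraction.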

Author(s):  
Suhaibah Jusoh ◽  
Aida Mustapha ◽  
Azizan Ismail ◽  
Roshidi Din

Steganography is a strategy for hiding secret information in a cover document so that an attacker cannot detect the hidden information. Steganography exploits a cover medium, for instance text, audio, image, or video, to hide the secret message. Until recently, linguistic text steganographic techniques were implemented only for the English language, but other languages, such as Arabic, are now being used to hide data. Arabic is still new to steganography and needs further work for empowerment. This paper presents text steganographic methods for the Arabic language: scholarly papers from the past five years are analyzed and compared. The main objective of this paper is to give a comparative analysis of the Arabic steganography methods that have been applied by previous researchers. Finally, the advantages and disadvantages of each method are also presented.


Author(s):  
Tarek Kanan ◽  
Bilal Hawashin ◽  
Shadi Alzubi ◽  
Eyad Almaita ◽  
Ahmad Alkhatib ◽  
...  

Introduction: Stemming is an important preprocessing step in text classification and can contribute to increasing classification accuracy. Although many works have proposed stemmers for the English language, few stemmers have been proposed for Arabic text. The Arabic language has gained increasing attention in recent decades, and there is a vital need to further improve Arabic text classification. Method: This work combined the recently proposed P-Stemmer with various classifiers to find the optimal classifier for the P-Stemmer in terms of Arabic text classification. As part of this work, a synthesized dataset was collected. Result: The experiments show that the use of the P-Stemmer has a positive effect on classification. The degree of improvement was classifier-dependent, which is reasonable as classifiers vary in their methodologies. Moreover, the experiments show that the best classifier with the P-Stemmer was Naïve Bayes (NB), an interesting result as this classifier is well known for its fast learning and classification time. Discussion: First, continuous improvement of the P-Stemmer through further optimization steps is necessary to improve Arabic text categorization. This can be done by combining more classifiers with the stemmer, by optimizing the other natural language processing steps, and by improving the set of stemming rules. Second, the lack of sufficient Arabic datasets, especially large ones, is still an issue. Conclusion: In this work, an improved P-Stemmer was proposed by combining its use with various classifiers. To evaluate its performance, and due to the lack of Arabic datasets, a novel Arabic dataset was synthesized from various online news pages. Next, the P-Stemmer was combined with Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star.


2021 ◽  
Author(s):  
Thomas Hegghammer

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n=322) and Arabic-language article scans (n=100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) were substantially more accurate than Tesseract, especially on noisy documents. Accuracy for English was considerably better than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available "Noisy OCR Dataset" (NOD).


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 04) ◽  
pp. 319-326
Author(s):  
Ammar Sabeeh Hmoud Altamimi ◽  
Ali Mohsin Kaittan

Most encryption techniques deal with the English language; few deal with Arabic. Therefore, many researchers are interested in encryption ciphers applied to text written in Arabic, which is the motivation behind this paper. In this paper, three cipher methods are implemented together on Arabic text; using more than one cipher method increases the security of the algorithm. Each letter of the plaintext is encrypted by a specified cipher method, and a controlling process selects which of the three methods encrypts each letter. The cipher methods used in this paper are RSA, Playfair, and Vigenère, each of which has a different underlying mathematical model. The proposed Arabic text encryption method gives better results than previous related papers.
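The controller idea, one cipher selected per plaintext letter, can be sketched as follows. This is only an illustration of the selection mechanism: a Latin alphabet stands in for Arabic, simple classical steps (Caesar, Vigenère, Atbash) stand in for the paper's RSA/Playfair/Vigenère, and the round-robin selection rule is an assumption, not the paper's controlling process.

```python
# Illustrative controller: pick one of three cipher steps per letter.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def caesar(ch, shift=3):
    """Fixed-shift substitution (stand-in for a heavier cipher like RSA)."""
    return ALPHABET[(ALPHABET.index(ch) + shift) % 26]

def vigenere(ch, key_ch):
    """Keyed shift substitution, one Vigenère step."""
    return ALPHABET[(ALPHABET.index(ch) + ALPHABET.index(key_ch)) % 26]

def atbash(ch):
    """Alphabet reversal (stand-in for Playfair)."""
    return ALPHABET[25 - ALPHABET.index(ch)]

def encrypt(plaintext, key="key"):
    out = []
    for i, ch in enumerate(plaintext):
        method = i % 3  # controller: round-robin over the three ciphers
        if method == 0:
            out.append(caesar(ch))
        elif method == 1:
            out.append(vigenere(ch, key[i % len(key)]))
        else:
            out.append(atbash(ch))
    return "".join(out)

print(encrypt("abc"))  # -> "dfx"
```

Because each letter's cipher is determined by position, a decryptor with the key can invert each step in the same order, while an attacker must break three different schemes at once.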




2010 ◽  
Vol 12 (1-2) ◽  
pp. 337-314
Author(s):  
ʿAbd Allāh Muḥammad al-Shāmī

The question of clarifying the meaning of a given Arabic text is a subtle one, especially as high literature texts can often be read in more than one way. Arabic is rich in figurative language and this can lead to variety in meaning, sometimes in ways that either adhere closely or diverge far from the ‘original’ meaning. In order to understand a fine literary text in Arabic, one must have a comprehensive understanding of the issue of taʾwīl, and the concept that multiplicity of meaning does not necessarily lead to contradiction. This article surveys the opinions of various literary critics and scholars of balāgha on this issue with a brief discussion of the concepts of tafsīr and sharḥ, which sometimes overlap with taʾwīl.


2019 ◽  
Vol 8 (2) ◽  
Author(s):  
Dinh Thi Bac Binh ◽  
Dinh Thi Kieu Trinh

The International English Language Testing System (IELTS) is recognized as an accountable tool to assess whether a person is able to study or train in English. Every year, thousands of students sit for IELTS. However, the number of those who are recognized to be capable enough to take a course in English is somewhat limited, especially for those who are not majoring in English at their universities. IELTS Reading is considered a discerning skill, and it is of equal importance to listening, speaking, and writing in obtaining the IELTS objective of band 6 or 6.5. Being teachers of English at a training institution, the authors recognize that students can make time-saving improvements in their reading command under their teachers' insightful guidance.


2021 ◽  
Vol 11 (15) ◽  
pp. 6851
Author(s):  
Reema Thabit ◽  
Nur Izura Udzir ◽  
Sharifah Md Yasin ◽  
Aziah Asmawi ◽  
Nuur Alifah Roslan ◽  
...  

Protecting sensitive information transmitted via public channels is a significant issue faced by governments, militaries, organizations, and individuals. Steganography protects secret information by concealing it in a transferred object such as video, audio, image, text, network traffic, or DNA. As text uses low bandwidth, it is commonly used by Internet users in their daily activities, resulting in a vast amount of text sent daily as social media posts and documents. Accordingly, text is an ideal object for steganography, since hiding a secret message in a text makes it difficult for the attacker to detect the hidden message among the massive text content on the Internet. A language's characteristics are utilized in text steganography. Despite the richness of the Arabic language in linguistic characteristics, only a few studies have been conducted in Arabic text steganography. To draw further attention to the prospects of Arabic text steganography, this paper reviews the classifications of these methods from their inception. For analysis, this paper presents a comprehensive study based on the key evaluation criteria (i.e., capacity, invisibility, robustness, and security). It opens new areas for further research based on the trends in this field.


2014 ◽  
Vol 4 (1) ◽  
pp. 29-45 ◽  
Author(s):  
Rami Ayadi ◽  
Mohsen Maraoui ◽  
Mounir Zrigui

In this paper, the authors present a latent topic model to index and represent Arabic text documents while reflecting more of their semantics. Text representation in a language with highly inflectional morphology such as Arabic is not a trivial task and requires special treatment. The authors describe their approach for analyzing and preprocessing Arabic text, and then describe the stemming process. Finally, the latent model (LDA) is adapted to extract Arabic latent topics: the authors extract the significant topics of all texts, each topic is described by a particular distribution of descriptors, and each text is then represented as a vector over these topics. The classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different topic numbers K (25, 50, 75, and 100), and the result is compared with classification in the full word space. The results show that the performance of classification in the reduced topic space, in terms of precision, recall, and F-measure, outperforms classification in the full word space and classification using LSI reduction.
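The payoff of representing each text as a vector over K topics rather than over the full vocabulary can be seen in a toy comparison. The numbers below are invented for illustration (they are not the paper's data): two documents that share no vocabulary are orthogonal in word space, yet clearly related in a 3-topic space.

```python
# Compare document similarity in word space vs. a reduced topic space.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Word space: raw term counts; the documents share no vocabulary.
doc1_words = [1, 2, 0, 0]
doc2_words = [0, 0, 3, 1]

# Topic space (K = 3): topic weights, as produced by a model like LDA;
# both documents load mostly on the same latent topic.
doc1_topics = [0.8, 0.1, 0.1]
doc2_topics = [0.7, 0.2, 0.1]

print(cosine(doc1_words, doc2_words))    # 0.0 -- orthogonal in word space
print(cosine(doc1_topics, doc2_topics))  # close to 1 in topic space
```

This dimensionality reduction is why classification in the topic space can outperform the sparse full-word representation, especially for a morphologically rich language like Arabic.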


Author(s):  
Ahmed Maher Khafaga Shehata ◽  
Amr Hassan Fatouh Hassan

The purpose of this paper is to report the findings of a study of information-seeking behavior among a group of Arab postgraduate students in social science and humanities disciplines. The paper also explores information-seeking styles and examines how information seeking is affected by external factors. The study employed a qualitative approach to explore information-seeking behavior in the sample and the sources of information used to obtain scholarly information. A sample of 33 participants was interviewed to elucidate the information-seeking behavior of Arabic language speakers. The analysis of the interviews revealed that the participants use different methods to find information on the internet, varying from search engines to sites that provide pirated scholarly papers. The data showed that most of the sampled students use search engines and the databases provided by their universities, but they should be trained in research ethics to avoid unacceptable research practices. The results also indicate that searching in other languages represents a challenge for Arab postgraduates in the social sciences and humanities. This study was conducted with social science and humanities postgraduates as part of a series of studies aiming to explore Arabic language speakers' scholarly practices. The information-seeking behavior in science disciplines may differ, as the teaching language there is mainly English. This study contributes to the field by expanding our understanding of how non-English speakers seek scholarly information and which sources they use to obtain scholarly papers.

