I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical Theory

Author(s):  
Dana Halabi ◽  
Ebaa Fayyoumi ◽  
Arafat Awajan

Treebanks are valuable linguistic resources that record the syntactic structure of a language's sentences in addition to part-of-speech tags and morphological features. They are mainly used to train statistical parsers. Although statistical natural language parsers have recently become more accurate for languages such as English, those for Arabic still have low accuracy. The purpose of this article is to construct a new Arabic dependency treebank based on traditional Arabic grammatical theory and the characteristics of the Arabic language, and to investigate their effects on the accuracy of statistical parsers. The proposed Arabic dependency treebank, called I3rab, differs from existing Arabic dependency treebanks in two main respects. The first is the approach to determining the main word of the sentence, and the second is the representation of joined and covert pronouns. To evaluate I3rab, we compared its performance against a subset of the Prague Arabic Dependency Treebank that shares a comparable level of detail. The experiments show improvements of up to 10.24% in UAS (unlabeled attachment score) and 18.42% in LAS (labeled attachment score).
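As a reminder of what the two reported metrics measure, the following minimal Python sketch computes the unlabeled attachment score (share of tokens whose predicted head is correct) and the labeled attachment score (share of tokens whose head and relation label are both correct). The function name, the tuple layout, and the toy sentence with UD-style labels are assumptions made for illustration; they are not taken from I3rab or from the authors' evaluation setup.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency relation label).
# Head index 0 stands for the artificial root. This layout is an assumption.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions between 0 and 1."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    # UAS: only the head has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head and label both have to match.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / total, full_hits / total


# Hypothetical four-token sentence; labels are illustrative only.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
predicted = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]

uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 0.75 0.5
```

In the toy sentence the third token has the right head but the wrong label, so it counts toward UAS only, while the fourth token has the wrong head and counts toward neither score.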

2021 ◽  
pp. 1-52
Author(s):  
Marie-Catherine de Marneffe ◽  
Christopher D. Manning ◽  
Joakim Nivre ◽  
Daniel Zeman

Universal Dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are centrally used to explain how predicate–argument structures are encoded morphosyntactically in different languages, while morphological features and part-of-speech classes give the properties of words. We argue that this theory is a good basis for cross-linguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.
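To make the three layers of annotation concrete (part-of-speech class, morphological features, and grammatical relation to a head word), here is a minimal Python sketch that reads one word line in CoNLL-U, the ten-column tab-separated format in which UD treebanks are distributed. The `Token` type, the function name, and the English example line are assumptions for illustration; the sketch handles only plain word lines, not multiword-token ranges, empty nodes, or comment lines.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    upos: str               # universal part-of-speech class
    feats: Dict[str, str]   # morphological features
    head: int               # id of the head word, 0 for the root
    deprel: str             # grammatical relation to the head


def parse_conllu_word_line(line: str) -> Token:
    """Parse a basic CoNLL-U word line (integer ID, ten columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("a CoNLL-U word line has exactly 10 columns")
    # FEATS is '_' when empty, otherwise 'Name=Value' pairs joined by '|'.
    feats = {} if cols[5] == "_" else dict(
        pair.split("=", 1) for pair in cols[5].split("|")
    )
    return Token(
        id=int(cols[0]), form=cols[1], lemma=cols[2], upos=cols[3],
        feats=feats, head=int(cols[6]), deprel=cols[7],
    )


# Hypothetical example line: "Dogs" as the subject of the word with id 2.
line = "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_conllu_word_line(line)
print(token.upos, token.feats, token.head, token.deprel)
```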


2020 ◽  
Vol 30 (1) ◽  
pp. 287-296
Author(s):  
Shihadeh Alqrainy ◽  
Muhammed Alawairdhi

This paper presents a comprehensive tag set as a fundamental component for developing an automated word class/part-of-speech (PoS) tagging system for the Arabic language. The aim is to develop a standard and comprehensive PoS tag set, based upon PoS classes and Arabic inflectional morphology, that helps linguists and Natural Language Processing (NLP) developers extract more linguistic information. The tag names in the developed tag set use terminology from traditional Arabic grammar rather than English grammar. The usability of the presented tag set was tested in manual tagging, which produced a set of tagged text that serves as a reference corpus against which the tagger's results are compared. The tagger achieved an average accuracy of 90% using the developed detailed tag set.


Author(s):  
Shuhrat Mirziyatov ◽  

This article, devoted to the analysis of parts of speech in the works of Makhmud Zamakhshari, addresses the conjugation of verbs in the last chapter, “Tasrifu-l-af’al”, of the book “Mukaddamatu-l-adab”. The article emphasizes that the verb is an important part of speech in Arabic, that the grammatical rules and categories cannot be mastered without knowing its morphological features, and that some parts of speech, especially masdars and the degrees of adjectives, are formed from verbal roots. According to “Mukaddamatu-l-adab”, Arabic verbs are divided into verbs with three roots and verbs with four roots, and the majority have three roots. Verbs with four roots, like verbs with three roots, are inflected with the help of suffixes and prefixes. Their present tense forms, imperative forms, masdars, and participles are formed by the same rules as for three-root verbs. Makhmud Zamakhshari defines the doubled verbs as three-root verbs in which the second and third roots consist of the same letter. He emphasizes that the hamza is a “healthy” letter, not a defective one, and that because of its complex pronunciation it is either replaced by another letter or sometimes omitted, which eases pronunciation. The writing and spelling of hamza has always been a difficult question in the language. Since Zamakhshari created his work so that non-Arab people could quickly study Arabic and its grammar, he did not go deeply into the essence of some difficult questions of the Arabic language. The scholar notes that endings are added to verbs in the active voice, gives sample conjugations of regular verbs in the past tense, and says that all regular verbs and verbs similar to regular verbs are conjugated in the above order.
In his work, Zamakhshari gave sample conjugations of verbs in the passive voice and examples of adding personal endings to such verbs, as well as conjugations of regular verbs, verbs similar to regular verbs, and empty and defective verbs. The work not only gives the conjugation of verbs but also provides exceptions to the rules, and it devotes a separate chapter to the imperative form in Arabic. The work states that the imperative form is derived from verbs of the present-future tense. The article emphasizes that the verbs of surprise are formed only from the first chapter of the three-root verbs and that such forms are not formed from verbs expressing physical imperfection. Ways of expressing astonishment with doubled and defective verbs are commented on. In the part of the verb conjugation chapter devoted to infinitives (masdars), the author dwells on the names of actions and on ways of forming masdars from empty verbs, defines active and passive participles, and gives examples of their formation. This chapter provides information on the formation of active and passive participles from the derived chapters and four-root verbs, and an interpretation of the comparative and superlative forms of adjectives.


2013 ◽  
Vol 6 (1) ◽  
pp. 43-99 ◽  
Author(s):  
Majdi Sawalha ◽  
Eric Atwell

The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of grammar and Arabic, in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in analysis of morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared; and we review generic design criteria for corpus tag sets. For a morphologically-rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and possible values. In our analysis, a tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash ‘-’ represents a feature not relevant to a given word. The first character shows the main Parts of Speech, from: noun, verb, particle, punctuation, and Other (residual); these last two are an extension to the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. The characters 2, 3, and 4 are used to represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3), 21 subclasses of particle (letter 4). Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively. 
The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally, there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.
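The positional layout described above can be sketched in a few lines of Python. The position names follow the abstract; the function name and, in particular, the letter codes in the example tag are hypothetical placeholders, since the abstract does not list the actual value letters of the SALMA Tag Set.

```python
# Feature carried by each of the 22 character positions, in order,
# as enumerated in the abstract.
SALMA_POSITIONS = [
    "main part of speech",                          # 1
    "noun subclass",                                # 2
    "verb subclass",                                # 3
    "particle subclass",                            # 4
    "other (residual)",                             # 5
    "punctuation",                                  # 6
    "gender",                                       # 7
    "number",                                       # 8
    "person",                                       # 9
    "inflectional morphology",                      # 10
    "case or mood",                                 # 11
    "case and mood marks",                          # 12
    "definiteness",                                 # 13
    "voice",                                        # 14
    "emphasized and non-emphasized",                # 15
    "transitivity",                                 # 16
    "rational",                                     # 17
    "declension and conjugation",                   # 18
    "unaugmented and augmented",                    # 19
    "number of root letters",                       # 20
    "verb root",                                    # 21
    "types of nouns according to final letters",    # 22
]


def decode_tag(tag: str) -> dict:
    """Map each relevant position to its value letter; '-' means not relevant."""
    if len(tag) != len(SALMA_POSITIONS):
        raise ValueError("a tag must have exactly 22 characters")
    return {
        feature: value
        for feature, value in zip(SALMA_POSITIONS, tag)
        if value != "-"
    }


# Hypothetical tag: the letters 'n', 'q', 'm', 's' are placeholders,
# NOT the real SALMA value codes.
example = "nq----ms" + "-" * 14
print(decode_tag(example))
```

Because every tag has the same length and each position has a fixed meaning, a decoder needs no parsing beyond indexing, which is what makes the notation compact yet transparent.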


2020 ◽  
Vol 0 (0) ◽  
Author(s):  
Fridah Katushemererwe ◽  
Andrew Caines ◽  
Paula Buttery

This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. First therefore, we need to collect corpora for these languages, before we can proceed to the design of a spell-checker, grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, we outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely-related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.


2014 ◽  
Vol 40 (2) ◽  
pp. 469-510 ◽  
Author(s):  
Khaled Shaalan

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic languages family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, a review of the state of the art of Arabic NER research is discussed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.


2021 ◽  
Author(s):  
Moez Krichen ◽  
Seifeddine Mechti

We propose a new model-based testing approach which takes as input a set of requirements described in Arabic Controlled Natural Language (CNL) which is a subset of the Arabic language generated by a specific grammar. The semantics of the considered requirements is defined using the Case Grammar Theory (CTG). The requirements are translated into Transition Relations which serve as an input for test cases generation tools.


2022 ◽  
pp. 1-13
Author(s):  
Denis Paperno

Can recurrent neural nets, inspired by human sequential data processing, learn to understand language? We construct simplified datasets reflecting core properties of natural language as modeled in formal syntax and semantics: recursive syntactic structure and compositionality. We find LSTM and GRU networks to generalise to compositional interpretation well, but only in the most favorable learning settings, with a well-paced curriculum, extensive training data, and left-to-right (but not right-to-left) composition.



Software development begins with requirements analysis. The requirements process runs from analysing the requirements to sketching the design of the program, which is critical work for programmers and software engineers. Moreover, many errors made during requirements analysis carry over to later stages, which drives the cost of the process above what was initially specified. The reason is that software requirements specifications are written in natural language. To minimize these errors, the software requirements can be converted into a computerized form as UML diagrams. To this end, a tool has been designed that provides semi-automated aid for designers in producing a UML class model from software specifications using Natural Language Processing techniques. The proposed technique lays out the class diagram in a well-known format and also identifies the relationships between classes. In this research, we propose to improve the production of UML diagrams from natural language, which will help software developers analyze software requirements efficiently and with fewer errors. The proposed approach uses a parser and a Part of Speech (POS) tagger to analyze the requirements entered by the user in English and then extracts the verbs, phrases, etc. from the user text. The results show that the proposed method performs better than other methods published in the literature. It gives a better analysis of the given requirements and a better presentation of the diagrams, which can help software engineers. Key words: Part of Speech,UM

