Development of Automatic Rule-based Semantic Tagger and Karaka Analyzer for Hindi

Author(s):  
Pragya Katyayan ◽  
Nisheeth Joshi

Hindi is the third most-spoken language in the world (615 million speakers) and has the fourth-largest number of native speakers (341 million). It is an inflectionally rich, relatively free word-order language with an immense vocabulary. Despite being such a widely spoken language, very few Natural Language Processing (NLP) applications and tools have been developed to support it computationally, and most of the existing ones are not efficient enough because they lack semantic information (contextual knowledge). Hindi grammar is based on Paninian grammar and derives most of its rules from it, and Paninian grammar strongly emphasizes the role of karaka theory in free word-order languages. In this article, we present an application that extracts all possible karakas from simple Hindi sentences with an accuracy of 84.2% and an F1 score of 88.5%. We consider features such as Part-of-Speech tags, post-position markers (vibhaktis), semantic tags for nouns and syntactic structure to capture context in word windows of varying size within a sentence. With these features, we built a rule-based inference engine to extract karakas from a sentence. The application takes a text file of clean (punctuation-free) simple Hindi sentences as input and returns karaka-tagged sentences in a separate text file.
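As a rough illustration of the kind of rule-based inference such a system performs, the Python sketch below maps vibhakti markers found near a noun to candidate karaka labels. The rule table, window size and the `assign_karakas` helper are illustrative assumptions for demonstration, not the rule set of the system described above, which additionally uses semantic tags for nouns to disambiguate markers such as को and से.

```python
# Minimal, illustrative sketch of a rule-based karaka tagger.
# The rules, tags and window size below are assumptions for demonstration,
# not the rule set of the cited system.

# Mapping from Hindi post-position markers (vibhaktis) to candidate karakas.
VIBHAKTI_RULES = {
    "ने": "karta",        # ergative marker -> agent
    "को": "karma",        # accusative/dative -> patient (may also mark sampradana)
    "से": "karana",       # instrumental/ablative -> instrument (may also mark apadana)
    "में": "adhikarana",   # locative -> locus
    "पर": "adhikarana",    # locative -> locus
}

def assign_karakas(tokens):
    """tokens: list of (word, pos_tag) pairs for one simple Hindi sentence.
    Returns a list of (word, pos_tag, karaka-or-None) triples."""
    tagged = []
    for i, (word, pos) in enumerate(tokens):
        karaka = None
        if pos.startswith("N"):                       # only nouns take karaka roles
            # look ahead in a small window for a post-position marker
            window = [w for w, p in tokens[i + 1:i + 3] if p == "PSP"]
            for marker in window:
                if marker in VIBHAKTI_RULES:
                    karaka = VIBHAKTI_RULES[marker]
                    break
            if karaka is None and i == 0:
                karaka = "karta"                      # unmarked sentence-initial noun
            elif karaka is None and i + 1 < len(tokens) and tokens[i + 1][1].startswith("V"):
                karaka = "karma"                      # unmarked noun right before the verb
        tagged.append((word, pos, karaka))
    return tagged

# Example: "राम ने फल खाया" (Ram ate the fruit)
sentence = [("राम", "NNP"), ("ने", "PSP"), ("फल", "NN"), ("खाया", "VM")]
print(assign_karakas(sentence))
```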

1995 ◽  
Vol 04 (03) ◽  
pp. 301-321 ◽  
Author(s):  
S.E. MICHOS ◽  
N. FAKOTAKIS ◽  
G. KOKKINAKIS

This paper deals with the problems stemming from the parsing of long sentences in quasi-free word order languages. Because a large class of languages, including Greek, permits considerable word order freedom, and because rule-based grammar parsers are limited when parsing unrestricted texts in such languages, we propose a flexible and effective method for parsing their long sentences that combines heuristic information and pattern-matching techniques at early processing levels. The method is characterized by its simplicity and robustness. Although it has been developed and tested for the Greek language, its theoretical background, implementation algorithm and results are language independent and can be of considerable value for many practical natural language processing (NLP) applications involving parsing of unrestricted texts.
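As a rough sketch of what pattern matching at an early processing level can look like, the Python fragment below pre-segments a long sentence into clause-sized chunks before each chunk is parsed separately. The delimiter list, example sentence and function name are illustrative assumptions, not the heuristics of the method described above.

```python
import re

# Illustrative sketch only: split a long sentence into clause-sized chunks at
# conjunctions and punctuation before handing each chunk to the parser.
# The delimiter list is an assumption for demonstration purposes.
CLAUSE_DELIMITERS = re.compile(r",|;|\b(?:και|αλλά|ότι|επειδή)\b")  # and, but, that, because

def pre_segment(sentence: str) -> list[str]:
    """Return clause-sized chunks of a long sentence for separate parsing."""
    return [chunk.strip() for chunk in CLAUSE_DELIMITERS.split(sentence) if chunk.strip()]

sentence = ("Ο διευθυντής είπε ότι η εταιρεία θα επεκταθεί, "
            "αλλά οι εργαζόμενοι ανησυχούν επειδή δεν έχουν ενημερωθεί")
for chunk in pre_segment(sentence):
    print(chunk)          # each chunk would then be parsed by the rule-based parser
```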


2019 ◽  
pp. 002383091988408 ◽  
Author(s):  
Tatiana Luchkina ◽  
Jennifer S. Cole

This study examines the contribution of constituent order, prosody, and information structure to the perception of word-level prominence in Russian, a free word order language. Prominence perception is investigated through the analysis of prominence ratings of nominal words in two published narrative texts. Word-level perceived prominence ratings were obtained from linguistically naïve native speakers of Russian in two tasks: a silent prominence rating task of the read text passages, and an auditory prominence rating task of the same texts as read aloud by a native Russian speaker. Analyses of the prominence ratings reveal a greater likelihood of perceived prominence for words introducing discourse-new referents, as well as words occurring in a non-canonical sentence position, and featuring acoustic-prosodic enhancement. The results show that prosody and word order vary probabilistically in relation to information structure in read-aloud narrative, suggesting a complex interaction of prosody, word order, and information structure underlying the perception of prominence.


Author(s):  
A. M. Devine ◽  
Laurence D. Stephens

Latin is often described as a free word order language, but in general each word order encodes a particular information structure: in that sense, each word order has a different meaning. This book provides a descriptive analysis of Latin information structure based on detailed philological evidence and elaborates a syntax-pragmatics interface that formalizes the informational content of the various word orders. The book covers a wide range of issues including broad scope focus, narrow scope focus, double focus, topicalization, tails, focus alternates, association with focus, scrambling, information structure inside the noun phrase and hyperbaton (discontinuous constituency). Using a slightly adjusted version of the structured meanings theory, the book shows how the pragmatic meanings matching the different word orders arise naturally and spontaneously out of the compositional process as an integral part of a single semantic derivation covering denotational and informational meaning at one and the same time.


2011 ◽  
Vol 64 (2) ◽  
Author(s):  
Stavros Skopeteas

Classical Latin is a free word order language, i.e., the order of the constituents is determined by information structure rather than by syntactic rules. This article presents a corpus study on the word order of locative constructions and shows that the choice between a Theme-first and a Locative-first order is influenced by the discourse status of the referents. Furthermore, the corpus findings reveal a striking impact of the syntactic construction: complements of motion verbs do not have the same ordering preferences as complements of static verbs and adjuncts. This finding supports the view that the influence of discourse status on word order is indirect, i.e., it is mediated by information structural domains.


1986 ◽  
Vol 1 (3) ◽  
pp. 123-128
Author(s):  
K.-S. CHOI

2013 ◽  
Vol 4 (2) ◽  
pp. 174-184 ◽  
Author(s):  
Deniz Zeyrek ◽  
Işın Demirşahin ◽  
Ayışığı B. Sevdik Çallı

This paper briefly describes the Turkish Discourse Bank, the first publicly available annotated discourse resource for Turkish. It focuses on the challenges posed by annotating Turkish, a free word order language with rich inflectional and derivational morphology. It shows the usefulness of PDTB-style annotation but points out the need to adapt this annotation style to the needs of the target language.
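To make the notion of PDTB-style annotation concrete, the sketch below models one explicit discourse relation with its connective, two argument spans and a sense label. The field names and the Turkish example are assumptions for demonstration and do not reproduce the Turkish Discourse Bank's actual file format or sense inventory.

```python
from dataclasses import dataclass

# Illustrative sketch of a PDTB-style discourse relation record.
# Field names and the example are assumptions, not the TDB format.
@dataclass
class DiscourseRelation:
    conn_type: str    # "Explicit" or "Implicit"
    connective: str   # the discourse connective token(s), if explicit
    arg1: str         # text span of the first argument
    arg2: str         # text span of the argument bound to the connective
    sense: str        # sense label, e.g. "Contingency.Cause"

rel = DiscourseRelation(
    conn_type="Explicit",
    connective="çünkü",               # Turkish "because"
    arg1="Toplantıya gelemedi",       # "He/she could not come to the meeting"
    arg2="hastaydı",                  # "he/she was ill"
    sense="Contingency.Cause",
)
print(rel)
```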


2020 ◽  
Vol 10 (2) ◽  
pp. 255
Author(s):  
Zafar Iqbal Bhatti ◽  
Muhammad Asad Habib ◽  
Tamsila Naeem

The aim of this paper is to explore the number system in Thali, a variety of Punjabi spoken by natives of the Thal desert. There are three possible number categories (singular, dual, and plural), but all modern Indo-Aryan languages have only singular and plural (Bashir & Kazmi, 2012, p. 119). Thali is one of the indigenous languages of Pakistan from the Lahnda group, as described by Grierson (1919) in his landmark Linguistic Survey of India. Layyah is one of the prominent areas of the Thal region, and native speakers of Thali use this sub-dialect of Saraiki in their household and professional life. The linguistic boundaries of the present Saraiki belt have changed under different principles of linguistic variation as described by Labov (1963), Trudgill (2004), Eckert (2002) and Meyerhoff (2008). There are many differences between Thali and Saraiki at the phonological, morphological and orthographical levels. Husain (2017) has pointed out linguistic differences between Saraiki and Lahnda, and Thali is one of the widely spoken Lahnda varieties in different parts of the Thal region. According to local language activists, Thali has been greatly influenced by Saraiki and Punjabi. The lexicon of Thali is composed of 20% Punjabi, 45% Saraiki, and 5% loan words, particularly from English. Another particularity is that Perso-Arabic characters are used to write Thali. The most distinguishing characteristics of Thali are its parts of speech, word order, case marking, verb conjugation and, finally, the use of grammatical categories in terms of number, person, tense, voice and gender. From this perspective, number marking is the area of focus within noun morphology, specifically the recognition of the number system in Thali nouns. The analysis of linguistic systems including grammar, lexicon, and phonology provides sound justification for number marking systems in the languages of the world (Chohan & García, 2019).


Part-of-speech tagging is the initial step in the development of NLP (natural language processing) applications. POS tagging is a sequence labelling task in which we assign a part-of-speech tag (Ti) to every word (Wi) in a sentence, producing a labelled sequence (W1/T1 … Wn/Tn). In this research project, part-of-speech tagging is performed on Hindi. Hindi has the fourth-largest number of native speakers in the world and is spoken by several hundred million people across the globe. It is a free word-order and morphologically rich language, which makes part-of-speech tagging a challenging task. In this paper, we describe the development of a POS tagger for Hindi using a neural approach.
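The abstract does not specify the network architecture, so the sketch below shows one common neural approach to sequence labelling: a small bidirectional LSTM tagger in PyTorch. The layer sizes, vocabulary size and tag set are assumptions for illustration, not the configuration of the cited system.

```python
import torch
import torch.nn as nn

# Illustrative BiLSTM POS tagger: one common neural approach to sequence
# labelling. Vocabulary size, tag set and hyperparameters are assumptions.
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, word_ids):          # word_ids: (batch, seq_len)
        emb = self.embedding(word_ids)    # (batch, seq_len, emb_dim)
        out, _ = self.lstm(emb)           # (batch, seq_len, 2*hidden_dim)
        return self.fc(out)               # per-token tag scores

# Training-step sketch: cross-entropy over per-token tag predictions.
model = BiLSTMTagger(vocab_size=20000, tagset_size=30)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

word_ids = torch.randint(0, 20000, (8, 12))   # dummy batch of 8 sentences
gold_tags = torch.randint(0, 30, (8, 12))
scores = model(word_ids)                       # (8, 12, 30)
loss = loss_fn(scores.view(-1, 30), gold_tags.view(-1))
loss.backward()
optimizer.step()
```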


2018 ◽  
Vol 9 (1) ◽  
pp. 23-32
Author(s):  
Surjya Kanta Daimary ◽  
Vishal Goyal ◽  
Madhumita Barbora ◽  
Umrinderpal Singh

This article presents work on a Part-of-Speech tagger for Assamese based on the Hidden Markov Model (HMM). Over the years, many language processing tasks have been addressed for Western and South-Asian languages; however, very little work has been done for the Assamese language. With this in view, a POS tagger for Assamese using a stochastic approach is being developed. Assamese is a free word-order, highly agglutinative and morphologically rich language, so developing a POS tagger with good accuracy will help the development of other NLP tasks for Assamese. For this work, an annotated corpus of 271,890 words with a BIS tagset consisting of 38 tag labels is used. The model is trained on 256,690 words and the remaining words are used for testing. The system obtained an accuracy of 89.21% and is compared with other existing stochastic models.
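The core machinery of an HMM tagger is Viterbi decoding over transition and emission probabilities estimated from the tagged corpus. The sketch below is a minimal, generic Viterbi decoder with toy probability tables; it does not reproduce the cited system's tagset, smoothing or estimation details.

```python
import math

# Toy Viterbi decoder for an HMM POS tagger. Transition and emission
# probabilities would normally be estimated from the annotated corpus;
# the tiny tables below are illustrative assumptions.
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under the HMM."""
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-8)), [t])
          for t in tags}]
    for word in words[1:]:
        layer = {}
        for t in tags:
            best_prev, (best_score, best_path) = max(
                ((p, V[-1][p]) for p in tags),
                key=lambda kv: kv[1][0] + math.log(trans_p[kv[0]][t]))
            score = (best_score + math.log(trans_p[best_prev][t])
                     + math.log(emit_p[t].get(word, 1e-8)))
            layer[t] = (score, best_path + [t])
        V.append(layer)
    return max(V[-1].values(), key=lambda sp: sp[0])[1]

# Tiny toy model (all probabilities are made up for illustration).
tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"fish": 0.6, "dogs": 0.4}, "V": {"fish": 0.3, "bark": 0.7}}
print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))  # -> ['N', 'V']
```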

