Development of Automatic Rule-based Semantic Tagger and Karaka Analyzer for Hindi

Author(s):  
Pragya Katyayan ◽  
Nisheeth Joshi

Hindi is the third most-spoken language in the world (615 million speakers) and has the fourth-largest number of native speakers (341 million). It is an inflectionally rich, relatively free word-order language with an immense vocabulary. Despite being such a widely spoken language, very few Natural Language Processing (NLP) applications and tools have been developed to support it computationally, and most of the existing ones are not efficient enough because they lack semantic information (contextual knowledge). Hindi grammar is based on Paninian grammar and derives most of its rules from it, and Paninian grammar strongly emphasizes the role of karaka theory in free word-order languages. In this article, we present an application that extracts all possible karakas from simple Hindi sentences with an accuracy of 84.2% and an F1 score of 88.5%. We consider features such as Part-of-Speech tags, post-position markers (vibhaktis), semantic tags for nouns and syntactic structure to capture context in word windows of varying size within a sentence. With these features, we built a rule-based inference engine to extract karakas from a sentence. The application takes a text file of clean (punctuation-free) simple Hindi sentences as input and returns karaka-tagged sentences in a separate text file.
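As a rough illustration of the kind of rule-based inference such a system performs, the Python sketch below maps vibhakti markers found near a noun to candidate karaka labels. The rule table, window size and the `assign_karakas` helper are illustrative assumptions for demonstration, not the rule set of the system described above, which additionally uses semantic tags for nouns to disambiguate markers such as को and से.

```python
# Minimal, illustrative sketch of a rule-based karaka tagger.
# The rules, tags and window size below are assumptions for demonstration,
# not the rule set of the cited system.

# Mapping from Hindi post-position markers (vibhaktis) to candidate karakas.
VIBHAKTI_RULES = {
    "ने": "karta",        # ergative marker -> agent
    "को": "karma",        # accusative/dative -> patient (may also mark sampradana)
    "से": "karana",       # instrumental/ablative -> instrument (may also mark apadana)
    "में": "adhikarana",   # locative -> locus
    "पर": "adhikarana",    # locative -> locus
}

def assign_karakas(tokens):
    """tokens: list of (word, pos_tag) pairs for one simple Hindi sentence.
    Returns a list of (word, pos_tag, karaka-or-None) triples."""
    tagged = []
    for i, (word, pos) in enumerate(tokens):
        karaka = None
        if pos.startswith("N"):                       # only nouns take karaka roles
            # look ahead in a small window for a post-position marker
            window = [w for w, p in tokens[i + 1:i + 3] if p == "PSP"]
            for marker in window:
                if marker in VIBHAKTI_RULES:
                    karaka = VIBHAKTI_RULES[marker]
                    break
            if karaka is None and i == 0:
                karaka = "karta"                      # unmarked sentence-initial noun
            elif karaka is None and i + 1 < len(tokens) and tokens[i + 1][1].startswith("V"):
                karaka = "karma"                      # unmarked noun right before the verb
        tagged.append((word, pos, karaka))
    return tagged

# Example: "राम ने फल खाया" (Ram ate the fruit)
sentence = [("राम", "NNP"), ("ने", "PSP"), ("फल", "NN"), ("खाया", "VM")]
print(assign_karakas(sentence))
```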

1995 ◽  
Vol 04 (03) ◽  
pp. 301-321 ◽  
Author(s):  
S.E. MICHOS ◽  
N. FAKOTAKIS ◽  
G. KOKKINAKIS

This paper deals with the problems stemming from the parsing of long sentences in quasi-free word order languages. Because a large class of languages, including Greek, permits considerable word order freedom, and because rule-based grammar parsers are limited when parsing unrestricted texts in such languages, we propose a flexible and effective method for parsing their long sentences that combines heuristic information and pattern-matching techniques at early processing levels. The method is characterized by its simplicity and robustness. Although it has been developed and tested for the Greek language, its theoretical background, implementation algorithm and results are language independent and can be of considerable value for many practical natural language processing (NLP) applications involving parsing of unrestricted texts.
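As a rough sketch of what pattern matching at an early processing level can look like, the Python fragment below pre-segments a long sentence into clause-sized chunks before each chunk is parsed separately. The delimiter list, example sentence and function name are illustrative assumptions, not the heuristics of the method described above.

```python
import re

# Illustrative sketch only: split a long sentence into clause-sized chunks at
# conjunctions and punctuation before handing each chunk to the parser.
# The delimiter list is an assumption for demonstration purposes.
CLAUSE_DELIMITERS = re.compile(r",|;|\b(?:και|αλλά|ότι|επειδή)\b")  # and, but, that, because

def pre_segment(sentence: str) -> list[str]:
    """Return clause-sized chunks of a long sentence for separate parsing."""
    return [chunk.strip() for chunk in CLAUSE_DELIMITERS.split(sentence) if chunk.strip()]

sentence = ("Ο διευθυντής είπε ότι η εταιρεία θα επεκταθεί, "
            "αλλά οι εργαζόμενοι ανησυχούν επειδή δεν έχουν ενημερωθεί")
for chunk in pre_segment(sentence):
    print(chunk)          # each chunk would then be parsed by the rule-based parser
```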


2019 ◽  
pp. 002383091988408 ◽  
Author(s):  
Tatiana Luchkina ◽  
Jennifer S. Cole

This study examines the contribution of constituent order, prosody, and information structure to the perception of word-level prominence in Russian, a free word order language. Prominence perception is investigated through the analysis of prominence ratings of nominal words in two published narrative texts. Word-level perceived prominence ratings were obtained from linguistically naïve native speakers of Russian in two tasks: a silent prominence rating task of the read text passages, and an auditory prominence rating task of the same texts as read aloud by a native Russian speaker. Analyses of the prominence ratings reveal a greater likelihood of perceived prominence for words introducing discourse-new referents, as well as words occurring in a non-canonical sentence position, and featuring acoustic-prosodic enhancement. The results show that prosody and word order vary probabilistically in relation to information structure in read-aloud narrative, suggesting a complex interaction of prosody, word order, and information structure underlying the perception of prominence.


Author(s):  
A. M. Devine ◽  
Laurence D. Stephens

Latin is often described as a free word order language, but in general each word order encodes a particular information structure: in that sense, each word order has a different meaning. This book provides a descriptive analysis of Latin information structure based on detailed philological evidence and elaborates a syntax-pragmatics interface that formalizes the informational content of the various word orders. The book covers a wide range of issues including broad scope focus, narrow scope focus, double focus, topicalization, tails, focus alternates, association with focus, scrambling, information structure inside the noun phrase and hyperbaton (discontinuous constituency). Using a slightly adjusted version of the structured meanings theory, the book shows how the pragmatic meanings matching the different word orders arise naturally and spontaneously out of the compositional process as an integral part of a single semantic derivation covering denotational and informational meaning at one and the same time.


2011 ◽  
Vol 64 (2) ◽  
Author(s):  
Stavros Skopeteas

Classical Latin is a free word order language, i.e., the order of the constituents is determined by information structure rather than by syntactic rules. This article presents a corpus study on the word order of locative constructions and shows that the choice between a Theme-first and a Locative-first order is influenced by the discourse status of the referents. Furthermore, the corpus findings reveal a striking impact of the syntactic construction: complements of motion verbs do not have the same ordering preferences as complements of static verbs and adjuncts. This finding supports the view that the influence of discourse status on word order is indirect, i.e., it is mediated by information structural domains.


1986 ◽  
Vol 1 (3) ◽  
pp. 123-128
Author(s):  
K.-S. CHOI

2013 ◽  
Vol 4 (2) ◽  
pp. 174-184 ◽  
Author(s):  
Deniz Zeyrek ◽  
Işın Demirşahin ◽  
Ayışığı B. Sevdik Çallı

This paper briefly describes the Turkish Discourse Bank, the first publicly available annotated discourse resource for Turkish. It focuses on the challenges posed by annotating Turkish, a free word order language with rich inflectional and derivational morphology. It shows the usefulness of PDTB-style annotation but points out the need to adapt this annotation style to the needs of the target language.
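To make the notion of PDTB-style annotation concrete, the sketch below models one explicit discourse relation with its connective, two argument spans and a sense label. The field names and the Turkish example are assumptions for demonstration and do not reproduce the Turkish Discourse Bank's actual file format or sense inventory.

```python
from dataclasses import dataclass

# Illustrative sketch of a PDTB-style discourse relation record.
# Field names and the example are assumptions, not the TDB format.
@dataclass
class DiscourseRelation:
    conn_type: str    # "Explicit" or "Implicit"
    connective: str   # the discourse connective token(s), if explicit
    arg1: str         # text span of the first argument
    arg2: str         # text span of the argument bound to the connective
    sense: str        # sense label, e.g. "Contingency.Cause"

rel = DiscourseRelation(
    conn_type="Explicit",
    connective="çünkü",               # Turkish "because"
    arg1="Toplantıya gelemedi",       # "He/she could not come to the meeting"
    arg2="hastaydı",                  # "he/she was ill"
    sense="Contingency.Cause",
)
print(rel)
```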


2020 ◽  
Vol 10 (2) ◽  
pp. 255
Author(s):  
Zafar Iqbal Bhatti ◽  
Muhammad Asad Habib ◽  
Tamsila Naeem

The aim of this paper is to explore the number system in Thali, a variety of Punjabi spoken by natives of the Thal desert. There are three possible number categories (singular, dual, and plural), but all modern Indo-Aryan languages have only singular and plural (Bashir & Kazmi, 2012, p. 119). Thali is one of the indigenous languages of Pakistan from the Lahnda group, as described by Grierson (1919) in his landmark Linguistic Survey of India. Layyah is one of the prominent areas of the Thal region, and native speakers of Thali use this sub-dialect of Saraiki in their household and professional life. The linguistic boundaries of the present Saraiki belt have changed under different principles of linguistic variation as described by Labov (1963), Trudgill (2004), Eckert (2002) and Meyerhoff (2008). There are many differences between Thali and Saraiki at the phonological, morphological and orthographical levels. Husain (2017) has pointed out linguistic differences between Saraiki and Lahnda, and Thali is one of the widely spoken Lahnda varieties in different parts of the Thal region. According to local language activists, Thali has been greatly influenced by Saraiki and Punjabi. The lexicon of Thali is composed of 20% Punjabi, 45% Saraiki, and 5% loan words, particularly from English. Another particularity is that Perso-Arabic characters are used to write Thali. The most distinguishing characteristics of Thali are its parts of speech, word order, case marking, verb conjugation and, finally, the use of grammatical categories in terms of number, person, tense, voice and gender. From this perspective, number marking is the area of focus within noun morphology, specifically the recognition of the number system in Thali nouns. The analysis of linguistic systems including grammar, lexicon, and phonology provides sound justification for number marking systems in the languages of the world (Chohan & García, 2019).


Part-of-speech tagging is the initial step in the development of NLP (natural language processing) applications. POS tagging is a sequence labelling task in which we assign a part-of-speech tag (Ti) to every word (Wi) in a sentence, producing a labelled sequence (W1/T1 … Wn/Tn). In this research project, part-of-speech tagging is performed on Hindi. Hindi has the fourth-largest number of native speakers in the world and is spoken by several hundred million people across the globe. It is a free word-order and morphologically rich language, which makes part-of-speech tagging a challenging task. In this paper, we describe the development of a POS tagger for Hindi using a neural approach.
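The abstract does not specify the network architecture, so the sketch below shows one common neural approach to sequence labelling: a small bidirectional LSTM tagger in PyTorch. The layer sizes, vocabulary size and tag set are assumptions for illustration, not the configuration of the cited system.

```python
import torch
import torch.nn as nn

# Illustrative BiLSTM POS tagger: one common neural approach to sequence
# labelling. Vocabulary size, tag set and hyperparameters are assumptions.
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, word_ids):          # word_ids: (batch, seq_len)
        emb = self.embedding(word_ids)    # (batch, seq_len, emb_dim)
        out, _ = self.lstm(emb)           # (batch, seq_len, 2*hidden_dim)
        return self.fc(out)               # per-token tag scores

# Training-step sketch: cross-entropy over per-token tag predictions.
model = BiLSTMTagger(vocab_size=20000, tagset_size=30)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

word_ids = torch.randint(0, 20000, (8, 12))   # dummy batch of 8 sentences
gold_tags = torch.randint(0, 30, (8, 12))
scores = model(word_ids)                       # (8, 12, 30)
loss = loss_fn(scores.view(-1, 30), gold_tags.view(-1))
loss.backward()
optimizer.step()
```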


2018 ◽  
Vol 9 (1) ◽  
pp. 23-32
Author(s):  
Surjya Kanta Daimary ◽  
Vishal Goyal ◽  
Madhumita Barbora ◽  
Umrinderpal Singh

This article presents work on a Part-of-Speech tagger for Assamese based on the Hidden Markov Model (HMM). Over the years, many language processing tasks have been addressed for Western and South-Asian languages; however, very little work has been done for the Assamese language. With this in view, a POS tagger for Assamese using a stochastic approach is being developed. Assamese is a free word-order, highly agglutinative and morphologically rich language, so developing a POS tagger with good accuracy will help the development of other NLP tasks for Assamese. For this work, an annotated corpus of 271,890 words with a BIS tagset consisting of 38 tag labels is used. The model is trained on 256,690 words and the remaining words are used for testing. The system obtained an accuracy of 89.21% and is compared with other existing stochastic models.
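The core machinery of an HMM tagger is Viterbi decoding over transition and emission probabilities estimated from the tagged corpus. The sketch below is a minimal, generic Viterbi decoder with toy probability tables; it does not reproduce the cited system's tagset, smoothing or estimation details.

```python
import math

# Toy Viterbi decoder for an HMM POS tagger. Transition and emission
# probabilities would normally be estimated from the annotated corpus;
# the tiny tables below are illustrative assumptions.
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under the HMM."""
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-8)), [t])
          for t in tags}]
    for word in words[1:]:
        layer = {}
        for t in tags:
            best_prev, (best_score, best_path) = max(
                ((p, V[-1][p]) for p in tags),
                key=lambda kv: kv[1][0] + math.log(trans_p[kv[0]][t]))
            score = (best_score + math.log(trans_p[best_prev][t])
                     + math.log(emit_p[t].get(word, 1e-8)))
            layer[t] = (score, best_path + [t])
        V.append(layer)
    return max(V[-1].values(), key=lambda sp: sp[0])[1]

# Tiny toy model (all probabilities are made up for illustration).
tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"fish": 0.6, "dogs": 0.4}, "V": {"fish": 0.3, "bark": 0.7}}
print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))  # -> ['N', 'V']
```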

