A NOVEL AND EFFICIENT METHOD FOR PARSING UNRESTRICTED TEXTS OF QUASI FREE WORD ORDER LANGUAGES

1995 ◽  
Vol 04 (03) ◽  
pp. 301-321 ◽  
Author(s):  
S.E. MICHOS ◽  
N. FAKOTAKIS ◽  
G. KOKKINAKIS

This paper addresses the problems that arise in parsing long sentences of quasi-free word order languages. Given the word-order freedom of a large family of languages that includes Greek, and the limitations of rule-based grammar parsers on unrestricted texts of such languages, we propose a flexible and effective method for parsing long sentences that combines heuristic information with pattern-matching techniques at early processing levels. The method is characterized by its simplicity and robustness. Although it has been developed and tested for Greek, its theoretical background, implementation algorithm and results are language independent and can be of considerable value for many practical natural language processing (NLP) applications that involve parsing unrestricted texts.
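The abstract does not spell out the heuristics, which are tailored to Greek; the Python sketch below only illustrates the general two-stage shape of such a method, with invented splitting heuristics and POS patterns standing in for the real ones.

```python
# Illustrative sketch only: invented heuristics and patterns, not the
# paper's actual rules for Greek.
import re

# Stage 1: heuristic segmentation of a long sentence into clause-like
# chunks, splitting at commas and (hypothetical) coordinating conjunctions.
CONJUNCTIONS = {"and", "but", "or", "because", "although"}

def segment(sentence):
    chunks, current = [], []
    for token in sentence.replace(",", " , ").split():
        if token == "," or token.lower() in CONJUNCTIONS:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(token)
    if current:
        chunks.append(current)
    return chunks

# Stage 2: shallow pattern matching over the POS tags of each chunk
# (tags would come from a tagger; here the caller supplies them).
def match_patterns(tag_sequence, patterns):
    joined = " ".join(tag_sequence)
    return [name for name, pat in patterns.items() if re.search(pat, joined)]

patterns = {"NP": r"\bDET NOUN\b", "VP": r"\bVERB (NOUN|PRON)\b"}
print(segment("the dog barked, and the cat ran"))
print(match_patterns(["DET", "NOUN", "VERB"], patterns))  # -> ['NP']
```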

Author(s):  
Pragya Katyayan ◽  
Nisheeth Joshi

Hindi is the third most-spoken language in the world (615 million speakers) and has the fourth-largest number of native speakers (341 million). It is an inflectionally rich, relatively free word-order language with an immense vocabulary. Despite being such a widely spoken language, very few Natural Language Processing (NLP) applications and tools have been developed to support it computationally, and most of the existing ones are not efficient enough, owing to a lack of semantic information (contextual knowledge). Hindi grammar is based on Paninian grammar and derives most of its rules from it, and Paninian grammar strongly emphasizes the role of karaka theory in free word-order languages. In this article, we present an application that extracts all possible karakas from simple Hindi sentences with an accuracy of 84.2% and an F1 score of 88.5%. We consider features such as part-of-speech tags, post-position markers (vibhaktis), semantic tags for nouns, and syntactic structure to capture the context in different-sized word windows within a sentence. With these features, we built a rule-based inference engine to extract karakas from a sentence. The application takes as input a text file of clean (punctuation-free) simple Hindi sentences and writes the karaka-tagged sentences to a separate text file.
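The abstract does not detail the inference engine, so the following is a minimal, hypothetical Python sketch of its vibhakti-driven core. The vibhakti-to-karaka mapping is a deliberate simplification: these markers are ambiguous in practice, which is exactly why the richer features named above (semantic noun tags, syntactic structure, word windows) are needed.

```python
# Hypothetical sketch: common vibhakti -> karaka correspondences, heavily
# simplified. Real usage is ambiguous (e.g. "ko" can mark sampradana) and
# unmarked karakas need the article's contextual features to resolve.
VIBHAKTI_RULES = {
    "ne": "karta",        # agent
    "ko": "karma",        # patient
    "se": "karana",       # instrument
    "me": "adhikarana",   # locus (in)
    "par": "adhikarana",  # locus (on)
}

def tag_karakas(tokens):
    """tokens: list of (word, pos) pairs for one simple Hindi sentence."""
    out = []
    for i, (word, pos) in enumerate(tokens):
        label = None
        # Look one token ahead for a post-position marker after a noun.
        if pos == "NOUN" and i + 1 < len(tokens):
            label = VIBHAKTI_RULES.get(tokens[i + 1][0])
        out.append((word, pos, label))
    return out

sent = [("Ram", "NOUN"), ("ne", "PSP"), ("phal", "NOUN"), ("khaya", "VERB")]
print(tag_karakas(sent))  # 'Ram' is tagged karta via the following 'ne'
```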


2018 ◽  
Vol 9 (1) ◽  
pp. 23-32
Author(s):  
Surjya Kanta Daimary ◽  
Vishal Goyal ◽  
Madhumita Barbora ◽  
Umrinderpal Singh

This article presents a part-of-speech (POS) tagger for Assamese based on a Hidden Markov Model (HMM). Over the years, many language-processing tasks have been addressed for Western and South Asian languages, but very little work has been done for Assamese. With this in view, a POS tagger for Assamese using a stochastic approach was developed. Assamese is a free word-order, highly agglutinative and morphologically rich language, so a POS tagger with good accuracy will aid the development of other NLP tasks for Assamese. For this work, an annotated corpus of 271,890 words with a BIS tagset of 38 tag labels is used. The model is trained on 256,690 words and the remaining words are used for testing. The system obtained an accuracy of 89.21% and is compared with other existing stochastic models.
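For readers unfamiliar with HMM tagging, the sketch below shows a compact Viterbi decoder, the standard decoding step of an HMM tagger. The transition and emission log-probabilities would be estimated from the 256,690-word training portion; the dictionary-based layout is an illustrative choice, not the authors' implementation.

```python
# Minimal Viterbi decoding for an HMM POS tagger (illustrative sketch).
import math

def viterbi(words, tags, trans, emit, start):
    """start[t], trans[(t1, t2)], emit[(t, w)] are log-probabilities
    estimated from the annotated training corpus."""
    # Initialize with the first word's start + emission scores.
    V = [{t: start.get(t, -math.inf) + emit.get((t, words[0]), -math.inf)
          for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            # Best previous tag for transitioning into t.
            best = max(tags, key=lambda p: V[-1][p] + trans.get((p, t), -math.inf))
            col[t] = (V[-1][best] + trans.get((best, t), -math.inf)
                      + emit.get((t, w), -math.inf))
            ptr[t] = best
        V.append(col)
        back.append(ptr)
    # Backtrace from the best final tag.
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```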


2009 ◽  
Vol 35 (5) ◽  
pp. 563-570 ◽  
Author(s):  
Bolanle Adefowoke Ojokoh ◽  
Olumide Sunday Adewale ◽  
Samuel Oluwole Falaki

Web documents are available in various forms, most of which carry no additional semantics. This paper presents a model for general document metadata extraction. The model, which combines segmentation by keywords with pattern-matching techniques, was implemented using PHP, MySQL, JavaScript and HTML. The system was tested on 40 randomly selected PDF documents (mainly theses) and evaluated using the standard measures of precision, recall, accuracy and F-measure. The results show that the model is relatively effective for metadata extraction, especially for theses and dissertations. Combining machine learning with these rule-based methods will be explored in future work for better results.
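The actual system was written in PHP; the Python fragment below is only a hypothetical miniature of the "segmentation by keywords plus pattern matching" idea, with invented section keywords and a naive author heuristic.

```python
# Illustrative sketch: keyword segmentation + regex heuristics, not the
# authors' actual rules.
import re

SECTION_KEYWORDS = ["abstract", "keywords", "introduction"]

def split_front_matter(text):
    """Return (front matter, rest), cutting at the first section keyword."""
    low = text.lower()
    cut = min((low.find(k) for k in SECTION_KEYWORDS if k in low),
              default=len(text))
    return text[:cut], text[cut:]

def extract_metadata(text):
    front, _ = split_front_matter(text)
    lines = [l.strip() for l in front.splitlines() if l.strip()]
    meta = {"title": lines[0] if lines else None}
    # Naive author heuristic: a line of comma-separated capitalized names.
    for l in lines[1:]:
        if re.fullmatch(r"([A-Z][\w.\-]*\s*)+(,\s*([A-Z][\w.\-]*\s*)+)*", l):
            meta["authors"] = l
            break
    return meta

print(extract_metadata("A Model for X\nJane Doe, John Roe\nAbstract\n..."))
```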


2017 ◽  
Vol 6 (2) ◽  
pp. 102
Author(s):  
Elvis Bramo

In the article “Syntax overview at units’ level: Syntagma, sentence, phrase, and some correlations with the order of their Greek-Albanian constituents in the tri-lingual Talking Dictionary of Th. Mitko”, the author, Elvis Bramo, lecturer in Modern Greek at the University of Tirana, takes the syntactic level of language as the main topic of this research: starting from the syntagma (as a building unit), moving through different types of sentences and phrases with predicative components, and examining bilingual Albanian-Greek segments in order to identify several peculiarities of word order. This comparative study between the two languages (the Talking Dictionary was compiled in three languages) aims at partial conclusions about the construction of syntagmas, their types with respect to the ways they are syntactically linked, and the valences that merge them into word classes. It also aims to identify the types of sentences built with interrogative elements, question words and negation markers, as well as the characteristics of verbs as the heart of syntactic organization in the communicated unit, the phrase. Regarding the phrase (period), Bramo points out how phrasal components merge and function together, along with their thematic and rhematic roles, on the basis of the Prague School. This linguistic study of the work of Th. Mitko, one of the most famous Albanian folklorists, also compares models of phrasal and compound syntactic structures, showing that although Greek and Albanian are natural languages with free word order (SVO), they exhibit parametric differences in the constituent parts of the sentence, particularly in connoted constructions.


2015 ◽  
Vol 22 (5) ◽  
pp. 980-986 ◽  
Author(s):  
Frank Meng ◽  
Craig Morioka

Objective: Many tasks in natural language processing rely on lexical pattern-matching techniques, including information extraction (IE), negation identification, and syntactic parsing. However, it is generally difficult to derive patterns that achieve acceptable levels of recall while also remaining highly precise.
Materials and Methods: We present a multiple sequence alignment (MSA)-based technique that automatically generates patterns, thereby leveraging language usage to determine the context of words that influence a given target. MSAs capture the commonalities among word sequences and reveal areas of linguistic stability and variation. In this way, MSAs provide a systematic approach to generating generalizable lexical patterns, which both increases recall and maintains high precision.
Results: The MSA-generated patterns exhibited consistent F1, F0.5, and F2 scores compared with two baseline techniques for IE across four different tasks. Both baseline techniques performed well for some tasks and less well for others, whereas MSA consistently performed at a high level on all four.
Discussion: The performance of MSA on the four extraction tasks indicates the method’s versatility. The results show that MSA-based patterns are able to handle the extraction of individual data elements as well as relations between two concepts without the need for large amounts of manual intervention.
Conclusion: We presented an MSA-based framework for generating lexical patterns that showed consistently high levels of performance and recall over four different extraction tasks when compared with baseline methods.
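As a rough intuition for how alignment yields patterns, the sketch below aligns just two token sequences with Python's difflib and generalizes the variable spans into wildcard slots. The paper's method aligns many sequences at once, so this illustrates the principle, not the authors' algorithm; the clinical-style example sentences are invented.

```python
# Two-sequence stand-in for MSA-based pattern generation: stable regions
# are kept verbatim, variable regions become wildcard slots.
from difflib import SequenceMatcher

def align_to_pattern(seq_a, seq_b, wildcard="<*>"):
    sm = SequenceMatcher(a=seq_a, b=seq_b, autojunk=False)
    pattern = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            pattern.extend(seq_a[i1:i2])       # linguistic stability
        elif pattern[-1:] != [wildcard]:
            pattern.append(wildcard)           # linguistic variation
    return pattern

a = "no evidence of acute fracture".split()
b = "no definite evidence of fracture".split()
print(align_to_pattern(a, b))
# -> ['no', '<*>', 'evidence', 'of', '<*>', 'fracture']
```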


2017 ◽  
Vol 9 (1) ◽  
pp. 19-24 ◽  
Author(s):  
David Domarco ◽  
Ni Made Satvika Iswari

Technological development has affected many areas of life, especially entertainment. One of the fastest-growing entertainment industries is anime, which has evolved into a trend and a hobby, especially in the regions of Asia. The number of anime fans grows every year, and they try to dig up as much information as possible about their favorite anime. Therefore, a chatbot application was developed in this study as an anime information retrieval medium using the regular expression pattern-matching method. This application is intended to make it easier for anime fans to search for information about the anime they like. With this application, users gain a convenient and interactive way of retrieving anime data that cannot be had when searching for information via search engines. The chatbot application successfully met the standards of an information retrieval engine with very good results: 72% precision and 100% recall, for a harmonic mean (F1) of 83.7%. As a hedonic application, the chatbot already influences Behavioral Intention to Use by 83% and Immersion by 82%.
Index Terms: anime, chatbot, information retrieval, Natural Language Processing (NLP), Regular Expression Pattern Matching
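A regular-expression chatbot of this kind can be reduced to a list of (pattern, handler) pairs tried in order. The sketch below is a hypothetical miniature with an invented two-field database, not the application's actual patterns or data.

```python
# Illustrative regex intent matching for a retrieval chatbot.
import re

ANIME_DB = {"one piece": {"episodes": 1000, "genre": "adventure"}}

INTENTS = [
    (re.compile(r"how many episodes does (?P<title>.+?) have\??$", re.I),
     lambda m: str(ANIME_DB.get(m["title"].lower(), {}).get("episodes", "unknown"))),
    (re.compile(r"what genre is (?P<title>.+?)\??$", re.I),
     lambda m: ANIME_DB.get(m["title"].lower(), {}).get("genre", "unknown")),
]

def reply(message):
    # First matching pattern wins; its handler builds the answer.
    for pattern, handler in INTENTS:
        m = pattern.match(message.strip())
        if m:
            return handler(m)
    return "Sorry, I don't know that one yet."

print(reply("How many episodes does One Piece have?"))  # -> 1000
```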


Author(s):  
G Deena ◽  
K Raja ◽  
K Kannan

In this competitive world, education has become part of everyday life. Imparting knowledge to the learner through education is the core idea of the Teaching-Learning Process (TLP). An assessment is one way to identify the learner’s weak spots in the area under discussion, and assessment questions carry the greatest weight in judging the learner’s skill. Manually prepared questions are not assured of excellence or fairness in assessing the learner’s cognitive skill. Question generation is the most important part of the teaching-learning process, and generating the test questions is clearly its toughest part.
Methods: We propose an Automatic Question Generation (AQG) system that automatically and dynamically generates assessment questions from an input file.
Objective: The proposed system generates test questions mapped to Bloom’s taxonomy to determine the learner’s cognitive level. Cloze-type questions are generated using part-of-speech tags and a random function. Rule-based approaches and Natural Language Processing (NLP) techniques are implemented to generate procedural questions at the lowest of Bloom’s cognitive levels.
Analysis: The outputs are dynamic in nature, creating a different set of questions at each execution. Here, input paragraphs are selected from the computer science domain and output efficiency is measured using precision and recall.
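A minimal sketch of cloze generation from POS tags plus a random choice, in the spirit described above; it uses NLTK's off-the-shelf tokenizer and tagger rather than the authors' pipeline, and the example sentence is invented.

```python
# Cloze-question sketch: blank out a randomly chosen noun.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import random
import nltk

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def make_cloze(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    # Candidate blanks: noun positions, per the POS-tag criterion.
    candidates = [i for i, (_, tag) in enumerate(tagged) if tag in NOUN_TAGS]
    if not candidates:
        return None
    i = random.choice(candidates)   # the "random function" picks the blank
    answer = tokens[i]
    tokens[i] = "_____"
    return " ".join(tokens), answer

print(make_cloze("A stack is a linear data structure that follows LIFO order."))
# e.g. ('A _____ is a linear data structure that follows LIFO order.', 'stack')
```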


Author(s):  
A. M. Devine ◽  
Laurence D. Stephens

Latin is often described as a free word order language, but in general each word order encodes a particular information structure: in that sense, each word order has a different meaning. This book provides a descriptive analysis of Latin information structure based on detailed philological evidence and elaborates a syntax-pragmatics interface that formalizes the informational content of the different word orders. The book covers a wide range of issues, including broad scope focus, narrow scope focus, double focus, topicalization, tails, focus alternates, association with focus, scrambling, informational structure inside the noun phrase, and hyperbaton (discontinuous constituency). Using a slightly adjusted version of the structured meanings theory, the book shows how the pragmatic meanings matching the different word orders arise naturally and spontaneously out of the compositional process, as an integral part of a single semantic derivation covering denotational and informational meaning at one and the same time.


Probus ◽  
2020 ◽  
Vol 32 (1) ◽  
pp. 93-127
Author(s):  
Bradley Hoot ◽  
Tania Leal

Linguists have keenly studied the realization of focus – the part of the sentence introducing new information – because it involves the interaction of different linguistic modules. Syntacticians have argued that Spanish uses word order for information-structural purposes, marking focused constituents via rightmost movement. However, recent studies have challenged this claim. To contribute sentence-processing evidence, we conducted a self-paced reading task and a judgment task with Mexican and Catalonian Spanish speakers. We found that movement to final position can signal focus in Spanish, in contrast to the aforementioned work. We contextualize our results within the literature, identifying three basic facts that theories of Spanish focus and theories of language processing should explain, and advance a fourth: that mismatches in information-structural expectations can induce processing delays. Finally, we propose that some differences in the existing experimental results may stem from methodological differences.

