scholarly journals Morphological Tagging and Lemmatization in the Albanian Language

SEEU Review ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. 3-16
Author(s):  
Diellza Nagavci Mati ◽  
Mentor Hamiti ◽  
Elissa Mollakuqe

Abstract An important element of Natural Language Processing is parts of speech tagging. With fine-grained word-class annotations, the word forms in a text can be enhanced and can also be used in downstream processes, such as dependency parsing. The improved search options that tagged data offers also greatly benefit linguists and lexicographers. Natural language processing research is becoming increasingly popular and important as unsupervised learning methods are developed. There are some aspects of the Albanian language that make the creation of a part-of-speech tag set challenging. This research provides a discussion of those issues linguistic phenomena and presents a proposal for a part-of-speech tag set that can adequately represent them. The corpus contains more than 250,000 tokens, each annotated with a medium-sized tag set. The Albanian language’s syntagmatic aspects are adequately represented. Additionally, in this paper are morphologically and part-of-speech tagged corpora for the Albanian language, as well as lemmatize and neural morphological tagger trained on these corpora. Based on the held-out evaluation set, the model achieves 93.65% accuracy on part-of-speech tagging, The morphological tagging rate was 85.31 % and the lemmatization rate was 88.95%. Furthermore, the TF-IDF technique weighs terms and with the scores are highlighted words that have additional information for the Albanian corpus.

Author(s):  
Rahul Sharan Renu ◽  
Gregory Mocko

The objective of this research is to investigate the requirements and performance of parts-of-speech tagging of assembly work instructions. Natural Language Processing of assembly work instructions is required to perform data mining with the objective of knowledge reuse. Assembly work instructions are key process engineering elements that allow for predictable assembly quality of products and predictable assembly lead times. Authoring of assembly work instructions is a subjective process. It has been observed that most assembly work instructions are not grammatically complete sentences. It is hypothesized that this can lead to false parts-of-speech tagging (by Natural Language Processing tools). To test this hypothesis, two parts-of-speech taggers are used to tag 500 assembly work instructions (obtained from the automotive industry). The first parts-of-speech tagger is obtained from Natural Language Processing Toolkit (nltk.org) and the second parts-of-speech tagger is obtained from Stanford Natural Language Processing Group (nlp.stanford.edu). For each of these taggers, two experiments are conducted. In the first experiment, the assembly work instructions are input to the each tagger in raw form. In the second experiment, the assembly work instructions are preprocessed to make them grammatically complete, and then input to the tagger. It is found that the Stanford Natural Language Processing tagger with the preprocessed assembly work instructions produced the least number of false parts-of-speech tags.


2020 ◽  
Vol 8 (5) ◽  
pp. 1061-1068

Now-a-days people interest to spend their time in social sites especially twitters to post lot of tweets in every day. The posted tweets are used by many users to get the knowledge about the particular applications, products and other search engine queries. With the help of the posted tweets, their emotions and sentiments are derived which are used to get opinion about particular event. Lot of traditional sentiment detection system that has been developed but they failed to analyze huge volume of tweets and online contents with temporal patterns were also difficult to analyze. To overcome the above issues, the co-ranking multi-modal natural language processing based sentiment analysis system was developed to detect the emotions from the posted tweets. Initially, tweets of different events are collected from social sites which are processed by natural language procedures such as Stemming, Lemmatization, Part-of-speech tagging, word segmentation and parsing are applied to get the words related to posted tweets for deriving the sentiments. From the extracted emotions, co-ranking process is applied to get the opinion effectively related to particular event. Then the efficiency of the system is examined using experimental results and discussions. The introduced system recognize the sentiments from tweets with 98.80% of accuracy.


2015 ◽  
Author(s):  
Abraham G Ayana

Natural Language Processing (NLP) refers to Human-like language processing which reveals that it is a discipline within the field of Artificial Intelligence (AI). However, the ultimate goal of research on Natural Language Processing is to parse and understand language, which is not fully achieved yet. For this reason, much research in NLP has focused on intermediate tasks that make sense of some of the structure inherent in language without requiring complete understanding. One such task is part-of-speech tagging, or simply tagging. Lack of standard part of speech tagger for Afaan Oromo will be the main obstacle for researchers in the area of machine translation, spell checkers, dictionary compilation and automatic sentence parsing and constructions. Even though several works have been done in POS tagging for Afaan Oromo, the performance of the tagger is not sufficiently improved yet. Hence,the aim of this thesis is to improve Brill’s tagger lexical and transformation rule for Afaan Oromo POS tagging with sufficiently large training corpus. Accordingly, Afaan Oromo literatures on grammar and morphology are reviewed to understand nature of the language and also to identify possible tagsets. As a result, 26 broad tagsets were identified and 17,473 words from around 1100 sentences containing 6750 distinct words were tagged for training and testing purpose. From which 258 sentences are taken from the previous work. Since there is only a few ready made standard corpuses, the manual tagging process to prepare corpus for this work was challenging and hence, it is recommended that a standard corpus is prepared. Transformation-based Error driven learning are adapted for Afaan Oromo part of speech tagging. Different experiments are conducted for the rule based approach taking 20% of the whole data for testing. A comparison with the previously adapted Brill’s Tagger made. The previously adapted Brill’s Tagger shows an accuracy of 80.08% whereas the improved Brill’s Tagger result shows an accuracy of 95.6% which has an improvement of 15.52%. Hence, it is found that the size of the training corpus, the rule generating system in the lexical rule learner, and moreover, using Afaan Oromo HMM tagger as initial state tagger have a significant effect on the improvement of the tagger.


Author(s):  
Kiran Raj R

Today, everyone has a personal device to access the web. Every user tries to access the knowledge that they require through internet. Most of the knowledge is within the sort of a database. A user with limited knowledge of database will have difficulty in accessing the data in the database. Hence, there’s a requirement for a system that permits the users to access the knowledge within the database. The proposed method is to develop a system where the input be a natural language and receive an SQL query which is used to access the database and retrieve the information with ease. Tokenization, parts-of-speech tagging, lemmatization, parsing and mapping are the steps involved in the process. The project proposed would give a view of using of Natural Language Processing (NLP) and mapping the query in accordance with regular expression in English language to SQL.


2020 ◽  
Vol 26 (6) ◽  
pp. 595-612
Author(s):  
Marcos Zampieri ◽  
Preslav Nakov ◽  
Yves Scherrer

AbstractThere has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.


2015 ◽  
Author(s):  
Vijaykumar Yogesh Muley ◽  
Anne Hahn ◽  
Pravin Paikrao

Natural language processing continues to gain importance in a thriving scientific community that communicates its latest results in such a frequency that following up on the most recent developments even in a specific field cannot be managed by human readers alone. Here we summarize and compare the publishing activity of the previous years on a distinct topic across several countries, addressing not only publishing frequency and history, but also stylistic characteristics that are accessible by means of natural language processing. Though there are no profound differences in the sentence lengths or lexical diversity among different countries, writing styles approached by Part-Of-Speech tagging are similar among countries that share history or official language or those are spatially close.


Sign in / Sign up

Export Citation Format

Share Document