Development of Part of Speech Tagger for Assamese Using HMM

This article presents a Part-of-Speech (POS) tagger for Assamese based on the Hidden Markov Model (HMM). Over the years, many language processing tasks have been addressed for Western and South Asian languages, but very little work has been done for Assamese. With this in view, a POS tagger for Assamese is being developed using a stochastic approach. Assamese is a free word-order, highly agglutinative and morphologically rich language, so a POS tagger with good accuracy will help in the development of other NLP tasks for Assamese. For this work, an annotated corpus of 271,890 words with a BIS tagset consisting of 38 tag labels is used. The model is trained on 256,690 words, and the remaining 15,200 words are used for testing. The system obtained an accuracy of 89.21%, which is compared with that of other existing stochastic models.
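
As an illustration of the HMM approach described in this abstract, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the authors' implementation: the toy English corpus, the three-tag tagset, the add-one smoothing and all function names are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict

# Toy annotated corpus (hypothetical; stands in for a real tagged corpus).
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted words
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][t] = (log score of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(transitions["<s>"], t, n_tags)
                 + log_prob(emissions[t], words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

model = train_hmm(TOY_CORPUS)
print(viterbi(["a", "cat", "barks"], *model))
```

Working in log space avoids numeric underflow on long sentences, and the smoothing gives unseen words a small non-zero emission probability, which matters for a morphologically rich language where many word forms never appear in training.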

Natural Language Processing ◽

10.4018/978-1-7998-0951-7.ch054 ◽

2020 ◽

pp. 1139-1148

Author(s):

Surjya Kanta Daimary ◽

Vishal Goyal ◽

Madhumita Barbora ◽

Umrinderpal Singh

Keyword(s):

Language Processing ◽

Stochastic Models ◽

Hidden Markov ◽

Stochastic Approach ◽

Point Of View ◽

Part Of Speech ◽

Pos Tagger ◽

Asian Languages ◽

Free Word ◽

Assamese Language

Part-of-Speech Tagging via Deep Neural Networks for Northern-Ethiopic Languages

Information Technology And Control ◽

10.5755/j01.itc.49.4.26808 ◽

2020 ◽

Vol 49 (4) ◽

pp. 482-494

Author(s):

Jurgita Kapočiūtė-Dzikienė ◽

Senait Gebremichael Tesfagergish

Keyword(s):

Neural Network ◽

Neural Networks ◽

Language Processing ◽

Deep Neural Networks ◽

Short Term Memory ◽

Parameter Tuning ◽

Feed Forward Neural Network ◽

Pos Tagging ◽

Part Of Speech ◽

Pos Tagger

Deep Neural Networks (DNNs) have proven especially successful in Natural Language Processing (NLP), including Part-Of-Speech (POS) tagging, the process of mapping words to their corresponding POS labels depending on the context. Despite recent developments in language technologies, low-resourced languages (such as Tigrinya, an East African language) have received too little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language of the Northern-Ethiopic branch. We selected Tigrinya as the testbed and tested state-of-the-art DL approaches in order to build the most accurate POS tagger. We evaluated DNN classifiers (Feed Forward Neural Network (FFNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and Convolutional Neural Network (CNN)) on top of neural word2vec word embeddings, using a small training corpus known as the Nagaoka Tigrinya Corpus. To determine the best DNN classifier type, architecture, and hyper-parameter set, both manual and automatic hyper-parameter tuning were performed. The BiLSTM method proved the most suitable for our task: it achieved the highest accuracy, 92%, which is 65% above the random baseline.

Aspect-Based Sentiment Analysis of Online Product Reviews

Advances in Business Information Systems and Analytics - Handbook of Research on Advanced Data Mining Techniques and Applications for Business Intelligence ◽

10.4018/978-1-5225-2031-3.ch010 ◽

2017 ◽

pp. 175-191 ◽

Cited By ~ 1

Author(s):

Vinod Kumar Mishra ◽

Himanshu Tiruwa

Keyword(s):

Sentiment Analysis ◽

Computational Linguistics ◽

Language Processing ◽

Future Trend ◽

Product Reviews ◽

Customer Reviews ◽

Part Of Speech ◽

Lexical Approach ◽

Sentiment Score ◽

Pos Tagger

Sentiment analysis is a part of computational linguistics concerned with extracting sentiment and emotion from text. It is also considered a task of natural language processing and data mining. Sentiment analysis mainly concentrates on identifying whether a given text is subjective or objective and, if it is subjective, whether it is negative, positive or neutral. This chapter provides an overview of aspect-based sentiment analysis, with current and future research trends in the area. It also provides an aspect-based sentiment analysis of online customer reviews of the Nokia 6600. To perform aspect-based classification, we use a lexical approach on the Eclipse platform, which classifies a review as positive, negative or neutral on the basis of the features of the product. SentiWordNet is used as the lexical resource to calculate the overall sentiment score of each sentence, a POS tagger is used for part-of-speech tagging, a frequency-based method is used to extract the aspects/features, and negation handling is used to improve the accuracy of the system.
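
The lexical scoring with negation handling mentioned in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny `LEXICON` with hand-picked scores is hypothetical and stands in for SentiWordNet, the negation rule (flip the polarity of the next sentiment word) is one simple choice among many, and none of the names come from the chapter.

```python
# Hypothetical mini-lexicon; a real system would look scores up in SentiWordNet.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "poor": -1.0}
NEGATIONS = {"not", "no", "never"}

def sentence_score(sentence):
    """Sum lexicon scores, flipping the sign of a sentiment word after a negation."""
    score = 0.0
    negate = False
    for raw in sentence.lower().split():
        token = raw.strip(".,!?")
        if token in NEGATIONS:
            negate = True  # stays active until the next sentiment word
            continue
        if token in LEXICON:
            value = LEXICON[token]
            score += -value if negate else value
            negate = False
    return score

def classify(sentence):
    """Map the sentence score to a positive / negative / neutral label."""
    score = sentence_score(sentence)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("The battery is not good."))
```

An aspect-based system would additionally attach each score to the product feature mentioned in the sentence; that step is omitted here.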

A NOVEL AND EFFICIENT METHOD FOR PARSING UNRESTRICTED TEXTS OF QUASI FREE WORD ORDER LANGUAGES

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213095000152 ◽

1995 ◽

Vol 04 (03) ◽

pp. 301-321 ◽

Cited By ~ 3

Author(s):

S.E. MICHOS ◽

N. FAKOTAKIS ◽

G. KOKKINAKIS

Keyword(s):

Pattern Matching ◽

Language Processing ◽

Word Order ◽

Theoretical Background ◽

Greek Language ◽

Rule Based ◽

Early Processing ◽

Matching Techniques ◽

Free Word ◽

Large Category

This paper deals with the problems that arise when parsing long sentences in quasi free word order languages. Given the word order freedom of a large category of languages, including Greek, and the limitations of rule-based grammar parsers in parsing unrestricted texts of such languages, we propose a flexible and effective method for parsing long sentences that combines heuristic information and pattern-matching techniques at early processing levels. The method is characterized by its simplicity and robustness. Although it has been developed and tested for the Greek language, its theoretical background, implementation algorithm and results are language independent and can be of considerable value for many practical natural language processing (NLP) applications involving the parsing of unrestricted texts.

Development of Part of Speech Tagger using Deep Learning

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1531.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 3384-3391

Keyword(s):

Language Processing ◽

Initial Step ◽

Processing Application ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Order Language ◽

Popular Language ◽

Speech Tagging ◽

Free Word

Part-of-speech (POS) tagging is the initial step in developing natural language processing (NLP) applications. POS tagging is a sequence labelling task in which a part-of-speech tag (Ti) is assigned as a label to every word (Wi) in a sentence, giving a sequence such as (Wi/Ti … Wn/Tn). In this research project, POS tagging is performed on Hindi. Hindi is the fourth most popular language and is spoken by hundreds of millions of people across the globe. Hindi is a free word-order and morphologically rich language, which makes POS tagging a very challenging task. In this paper we show the development of a POS tagger using a neural approach.
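
The Wi/Ti notation used in this abstract corresponds to a common plain-text format for tagged sentences. As a small sketch (the English example sentence, the tag names and the function names are assumptions for illustration, not data from the paper), such a line can be converted to and from word/tag pairs like this:

```python
def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

def format_tagged(pairs):
    """Inverse of parse_tagged: join (word, tag) pairs back into one line."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)

pairs = parse_tagged("the/DET dog/NOUN barks/VERB")
words = [w for w, _ in pairs]  # input sequence W1..Wn
tags = [t for _, t in pairs]   # label sequence T1..Tn
print(words, tags)
```

A sequence labeller, neural or otherwise, is trained to predict the `tags` list from the `words` list, one label per token.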

Subject case alternation in constructions containing the Finnish verb löytyä (Löytyä-verbin konstruktioiden yhteydessä esiintyvä subjektin sijanvaihtelu)

Eesti ja soome-ugri keeleteaduse ajakiri Journal of Estonian and Finno-Ugric Linguistics ◽

10.12697/jeful.2021.12.1.01 ◽

2021 ◽

Vol 12 (1) ◽

pp. 11-39

Author(s):

Rodolfo Basile ◽

Ilmari Ivaska

Keyword(s):

Regression Analysis ◽

Word Order ◽

Random Variable ◽

Point Of View ◽

Case Alternation ◽

Part Of Speech ◽

Subject Number ◽

The Subject ◽

Semantic Variables ◽

The Relationship

This article examines the nominative-partitive subject alternation occurring in constructions containing the Finnish verb löytyä. The material is taken from corpora and consists of a random sample of 779 observations, analyzed both quantitatively, by means of statistical methods, and from a qualitative point of view. The research aims to identify which variables influence the case alternation of subjects in constructions containing the verb löytyä. The chosen variables are subject number, subject divisibility, subject part of speech, word order, tense, subject-verb agreement, and subject lemma, the only random variable. With the help of regression analysis, the subject case is predicted on the basis of these variables and the interactions between them. The qualitative analysis also discusses the relationship these morphosyntactic and semantic variables have with the existential interpretation of the clause and with subject quantity and definiteness.

Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3488381 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-24

Author(s):

Sunita Warjri ◽

Partha Pakray ◽

Saralin A. Lyngdoh ◽

Arnab Kumar Maji

Keyword(s):

Deep Learning ◽

Natural Language ◽

Computational Linguistics ◽

Language Processing ◽

Research Work ◽

Pos Tagging ◽

Part Of Speech ◽

Corpus Size ◽

Increase In Accuracy ◽

Pos Tagger

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of the particular language and large amounts of data or corpora for feature engineering, which can lead to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, on a combination of BiLSTM with CRF, and on character-based embedding with BiLSTM are presented. The main challenges anticipated in understanding and handling natural language for computational linguistics are also discussed. In the designed corpus, we have tried to resolve the ambiguities of words with respect to their usage in context, as well as the orthography problems that arise in the corpus. The designed Khasi corpus contains around 96,100 tokens and 6,616 distinct words. Initially, when running the first few sets of data, of around 41,000 tokens, the taggers were found to yield considerably accurate results. When the corpus size was increased to 96,100 tokens, the accuracy rate increased and the analyses became more pertinent. An accuracy of 96.81% is achieved for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based approach with LSTM. To place this work in the context of NLP research on Khasi, we also present some of the recent POS taggers and other NLP works on the Khasi language for comparative purposes.

Improving accuracy of Part-of-Speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i2.pp2023-2030 ◽

2020 ◽

Vol 10 (2) ◽

pp. 2023-2030

Author(s):

Dim Lam Cing ◽

Khin Mar Soe

Keyword(s):

Language Processing ◽

High Performance ◽

Markov Models ◽

Hidden Markov ◽

Morphological Structure ◽

Word Segmentation ◽

Pos Tagging ◽

Part Of Speech ◽

Improving Accuracy ◽

Pos Tagger

In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. POS information is also necessary as a preprocessing step in NLP applications such as machine translation (MT) and information retrieval (IR). Currently, many research efforts develop word segmentation and POS tagging separately, using different methods to achieve high performance and accuracy. For the Myanmar language, there are also separate word segmenters and POS taggers based on statistical approaches such as Neural Networks (NN) and Hidden Markov Models (HMMs). However, because of the complex morphological structure of the Myanmar language, the out-of-vocabulary (OOV) problem still exists. To avoid errors and improve segmentation by using POS information, segmentation and labeling should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and remove the ambiguity in sentences that arises from language structure. This paper focuses on developing word segmentation and a POS tagger for the Myanmar language, and compares separate word segmentation and POS tagging with joint word segmentation and POS tagging.

Punjabi Pos Tagger: Rule Based and HMM

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i7/0106 ◽

2017 ◽

Vol 7 (7) ◽

pp. 193

Author(s):

Umrinderpal Singh ◽

Vishal Goyal

Keyword(s):

Information Retrieval ◽

Language Processing ◽

State Of The Art ◽

Input Word ◽

Rule Based ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Unseen Data ◽

Pos Tagger ◽

Speech Tagging

A Part of Speech tagger assigns a tag to every input word in a given sentence. The tags may include the part-of-speech categories of a particular language, such as noun, pronoun, verb, adjective and conjunction, and may have subcategories of all these tags. Part of Speech tagging is a basic preprocessing task for most Natural Language Processing (NLP) applications, such as Information Retrieval, Machine Translation and Grammar Checking. The task belongs to a larger set of problems, namely sequence labeling problems. Part of Speech tagging for Punjabi is not a widely explored territory. We discuss Rule Based and HMM based Part of Speech taggers for Punjabi and compare the accuracies of the two approaches. The system is developed using 35 different standard part-of-speech tags. We evaluate our system on unseen data and obtain a state-of-the-art accuracy of 93.3%.
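
Accuracy figures such as the one reported above are usually computed per token on held-out data: the share of words whose predicted tag equals the gold tag. The abstract does not spell out its evaluation procedure, so the following is only a minimal sketch of that common metric, with hypothetical function and variable names and made-up tag sequences.

```python
def tagging_accuracy(gold_sentences, predicted_sentences):
    """Token-level accuracy: correctly tagged tokens divided by all tokens."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences differ in length")
        for gold_tag, predicted_tag in zip(gold, predicted):
            correct += gold_tag == predicted_tag  # True counts as 1
            total += 1
    return correct / total if total else 0.0

# Illustrative tags only; not taken from the Punjabi test set.
gold = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
predicted = [["DET", "NOUN", "NOUN"], ["PRON", "VERB"]]
print(f"{tagging_accuracy(gold, predicted):.1%}")
```

Running the same function on the output of a rule-based tagger and of an HMM tagger over one shared test set gives directly comparable numbers.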

Semantics and Pragmatics of Non-Canonical Word Order in South Asian Languages: <Verb-Left> of lag- ‘Begin’ as an Attitude-Marker in Hindi-Urdu

Lingua Posnaniensis ◽

10.2478/v10122-011-0010-9 ◽

2011 ◽

Vol 53 (2) ◽

pp. 25-34

Author(s):

Peter Hook

Keyword(s):

South Asian ◽

Word Order ◽

Final Word ◽

Semantics And Pragmatics ◽

Prime Factors ◽

Asian Languages ◽

Finite Verb

This paper examines possible motivations for departures from canonical clause-final word order observed for the finite verb in Hindi-Urdu and other modern Indo-Aryan languages. Depiction of speaker attitude in Premchand's novel godān and the imperatives of journalistic style in TV newscasts are shown to be prime factors. The emergence of V-2 word-order in Kashmiri and other Himalayan languages may have had a parallel history.
