Developing a POS Tagged Corpus of Urdu Tweets

Computers ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 90
Author(s):  
Amber Baig ◽  
Mutee U Rahman ◽  
Hameedullah Kazi ◽  
Ahsanullah Baloch

Processing social media text such as tweets is challenging for traditional Natural Language Processing (NLP) tools developed for well-edited text, owing to the noisy nature of such content. However, demand for tools and resources that correctly process noisy text has grown in recent years because of its usefulness in various applications. The literature reports various efforts to develop tools and resources for processing such noisy text in several languages, notably for part-of-speech (POS) tagging, an NLP task that directly affects the performance of successive text processing steps. Still, no attempt has been made to develop a POS tagger for Urdu social media content. The focus of this paper is therefore POS tagging of Urdu tweets. We introduce a new tagset for POS tagging of Urdu tweets along with a POS-tagged Urdu tweets corpus. We also investigated bootstrapping as a potential solution for overcoming the shortage of manually annotated data, and we present a supervised POS tagger that achieves 93.8% precision, 92.9% recall, and a 93.3% F-measure.
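The abstract does not spell out the bootstrapping procedure; as a rough, hypothetical illustration, self-training over a most-frequent-tag baseline (the tagset, toy data, and "every word is known" confidence heuristic are all invented for this sketch) could look like:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sents):
    """Most-frequent-tag baseline: map each word to its most common tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, sent, default="NOUN"):
    """Tag a sentence, falling back to a default tag for unknown words."""
    return [(w, model.get(w, default)) for w in sent]

def bootstrap(seed, unlabeled, rounds=2):
    """Self-training: auto-tag unlabeled sentences and fold the confident
    ones (here: all words already known) back into the training set."""
    train = list(seed)
    for _ in range(rounds):
        model = train_unigram_tagger(train)
        confident = [tag(model, s) for s in unlabeled
                     if all(w in model for w in s)]  # crude confidence proxy
        train = list(seed) + confident
    return train_unigram_tagger(train)

seed = [[("I", "PRON"), ("love", "VERB"), ("tea", "NOUN")]]
unlabeled = [["I", "love", "tea"], ["love", "tea", "now"]]
model = bootstrap(seed, unlabeled)
print(tag(model, ["I", "love", "tea"]))
```

A real system would use a stronger seed tagger and a probabilistic confidence threshold rather than the all-known-words shortcut above.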

2020 ◽  
Vol 49 (4) ◽  
pp. 482-494
Author(s):  
Jurgita Kapočiūtė-Dzikienė ◽  
Senait Gebremichael Tesfagergish

Deep Neural Networks (DNNs) have proven especially successful in Natural Language Processing (NLP), including Part-Of-Speech (POS) tagging, the process of mapping words to their corresponding POS labels depending on context. Despite recent developments in language technology, low-resourced languages (such as the East African language Tigrinya) have received too little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language of the Northern-Ethiopic branch. We selected Tigrinya as the testbed and tested state-of-the-art DL approaches, seeking to build the most accurate POS tagger. We evaluated DNN classifiers (Feed-Forward Neural Network, FFNN; Long Short-Term Memory, LSTM; Bidirectional LSTM; and Convolutional Neural Network, CNN) on top of neural word2vec word embeddings, using a small training corpus known as the Nagaoka Tigrinya Corpus. To determine the best DNN classifier type, its architecture, and its hyper-parameter set, both manual and automatic hyper-parameter tuning were performed. The BiLSTM method proved the most suitable for our task: it achieved the highest accuracy, equal to 92%, which is 65% above the random baseline.
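The abstract mentions both manual and automatic hyper-parameter tuning; the automatic part can be as simple as an exhaustive grid search over the candidate configurations. Below is a minimal, self-contained sketch; the search space and the deterministic dummy `evaluate` are invented (in practice `evaluate` would train a BiLSTM and return its dev-set accuracy):

```python
from itertools import product

# Hypothetical search space; the paper's actual ranges are not given here.
space = {
    "units": [64, 128],
    "dropout": [0.2, 0.5],
    "lr": [1e-3, 1e-2],
}

def evaluate(cfg):
    """Stand-in for training a BiLSTM and returning dev accuracy.
    Deterministic dummy so the sketch runs without a DL framework."""
    return 0.9 - cfg["dropout"] * 0.1 + (0.01 if cfg["units"] == 128 else 0.0)

def grid_search(space):
    """Try every combination and keep the best-scoring configuration."""
    keys = list(space)
    best_cfg, best_acc = None, -1.0
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        acc = evaluate(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

best, acc = grid_search(space)
print(best, acc)
```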


Author(s):  
Sunita Warjri ◽  
Partha Pakray ◽  
Saralin A. Lyngdoh ◽  
Arnab Kumar Maji

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language and large amounts of data, or corpora, for feature engineering, which can lead to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the present Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, BiLSTM combined with CRF, and character-based embedding with BiLSTM are presented. The main challenges of understanding and handling natural language that computational linguistics must confront are anticipated. In the present corpus, we have tried to resolve ambiguities of words with respect to their context of use, as well as the orthography problems that arise in the corpus. The designed Khasi corpus contains around 96,100 tokens and 6,616 distinct words. Initially, running the first set of around 41,000 tokens in our experiments, the taggers were found to yield considerably accurate results. When the corpus was increased to 96,100 tokens, accuracy rose and the analyses became more pertinent. As results, an accuracy of 96.81% is achieved for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for character-based embedding with BiLSTM. Concerning substantial research on Khasi from an NLP perspective, we also present some existing POS taggers and other NLP work on the Khasi language for comparison.


Author(s):  
Dim Lam Cing ◽  
Khin Mar Soe

In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. POS information is also necessary in preprocessing for NLP applications such as machine translation (MT) and information retrieval (IR). Currently, many research efforts develop word segmentation and POS tagging separately, with different methods, to achieve high performance and accuracy. For the Myanmar language, there are likewise separate word segmenters and POS taggers based on statistical approaches such as Neural Networks (NN) and Hidden Markov Models (HMMs). However, because of the Myanmar language's complex morphological structure, the out-of-vocabulary (OOV) problem persists. To avoid errors and improve segmentation by utilizing POS data, segmentation and labeling should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and remove ambiguity in sentences arising from language structure. This paper focuses on developing word segmentation and a POS tagger for the Myanmar language, and presents a comparison of separate word segmentation and POS tagging against joint word segmentation and POS tagging.
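Performing segmentation and labeling at the same time, as argued above, amounts to a dynamic program over a lattice of candidate words. The toy lexicon, strings, and probabilities below are invented for illustration (a real joint model for Myanmar would score segmentations with far richer features), but the sketch shows the core idea of picking the jointly best segmentation-plus-tags:

```python
import math

# Toy lexicon: word -> (tag, probability). All entries are invented.
LEX = {
    "ab": ("NOUN", 0.6),
    "a":  ("PART", 0.3),
    "b":  ("VERB", 0.2),
    "cd": ("VERB", 0.5),
    "c":  ("NOUN", 0.1),
    "d":  ("PART", 0.1),
}

def joint_segment_tag(text):
    """DP over all segmentations: best[i] holds the highest-scoring
    (log-prob, tagged-words) analysis of the prefix text[:i]."""
    best = {0: (0.0, [])}
    for i in range(1, len(text) + 1):
        for j in range(i):
            word = text[j:i]
            if j in best and word in LEX:
                tag, p = LEX[word]
                score = best[j][0] + math.log(p)
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [(word, tag)])
    return best.get(len(text), (float("-inf"), []))[1]

print(joint_segment_tag("abcd"))
```

Because the word score and the tag are chosen together, a segmentation that yields implausible tags loses to one whose words and tags are jointly likely.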


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Imad Zeroual ◽  
Abdelhak Lakhouaja

Recently, data-driven approaches have increasingly demanded multilingual parallel resources, primarily in cross-language studies. To meet these demands, building multilingual parallel corpora has become the focus of many Natural Language Processing (NLP) research groups. Unlike monolingual corpora, the number of available multilingual parallel corpora is limited. In this paper, MulTed, a corpus of subtitles extracted from TEDx talks, is introduced. It is multilingual, Part-of-Speech (PoS) tagged, and bilingually sentence-aligned with English as a pivot language. The corpus is designed for NLP applications where sentence alignment, PoS tagging, and corpus size are influential, such as statistical machine translation, language recognition, and bilingual dictionary generation. Currently, the corpus contains subtitles covering 1100 talks available in over 100 languages. The subtitles are classified by topic, such as Business, Education, and Sport. For PoS tagging, TreeTagger, a language-independent PoS tagger, is used; then, to make the tagging maximally useful, a mapping to a universal common tagset is performed. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the NLP and corpus linguistics literature, especially for under-resourced languages.
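The mapping to a universal common tagset described above is essentially a projection table from each tagger's native tags onto shared categories. A minimal sketch follows; the table rows are a hypothetical TreeTagger-style-to-Universal excerpt, not the paper's actual mapping:

```python
# Hypothetical mapping from fine-grained English tags to Universal POS
# categories; only a handful of illustrative rows are shown.
TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NP": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
}

def map_tags(tagged, table=TO_UNIVERSAL, default="X"):
    """Project language-specific tags onto the shared tagset;
    unmapped tags fall back to the universal 'X' (other) category."""
    return [(w, table.get(t, default)) for w, t in tagged]

print(map_tags([("dogs", "NNS"), ("barks", "VBZ"), ("loudly", "RB")]))
```

One table per source tagset is enough to make corpora tagged by different tools comparable, which is the point of the universal mapping step.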


2020 ◽  
Vol 7 (6) ◽  
pp. 1121
Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

Bahasa Madura is a regional language used not only on Madura Island but also in other areas, such as parts of Jember, Pasuruan, and Probolinggo. Today, Bahasa Madura is increasingly abandoned, especially among young people. Among the reasons are a sense of prestige and the difficulty of learning the language, which has a variety of dialects and speech levels. The declining use of Bahasa Madura could lead to its extinction as one of Indonesia's regional languages. Efforts are therefore needed to maintain and preserve it. One such effort is research on Bahasa Madura in the field of Natural Language Processing, so that in the future the language can be learned through digital media. Part-of-Speech (POS) tagging is the basis of text processing research, so a POS tagging application for Bahasa Madura needs to be built for use in other Natural Language Processing research. In this study, POS tagging is implemented with the Brill Tagger algorithm, using a corpus containing 10,535 Madurese words. The Brill Tagger assigns the appropriate word class to each word using lexical and contextual rules, and was chosen because it is among the most accurate algorithms when applied to English, Indonesian, and several other languages. In a series of experiments with several threshold values, the average accuracy without considering OOV (Out-Of-Vocabulary) words exceeded 80%, with a highest accuracy of 86.67%; testing that considered OOV words reached an average accuracy of 67.74%. It can therefore be concluded that the Brill Tagger can be used for Bahasa Madura with a good degree of accuracy.
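Brill tagging, as described above, runs a lexical stage (most frequent tag per word) followed by an ordered list of contextual rewrite rules. A minimal sketch with an invented lexicon and a single hand-written rule (a real Brill tagger *learns* its rules from a corpus by error-driven search):

```python
def initial_tag(words, lexicon, default="NOUN"):
    """Lexical stage: assign each word its most frequent tag."""
    return [(w, lexicon.get(w, default)) for w in words]

def apply_rules(tagged, rules):
    """Contextual stage: apply rules in order, Brill-style.
    Each rule is (from_tag, to_tag, required_previous_tag)."""
    tags = [t for _, t in tagged]
    for from_t, to_t, prev in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_t and tags[i - 1] == prev:
                tags[i] = to_t
    return list(zip((w for w, _ in tagged), tags))

# Invented toy data: "can" defaults to NOUN but becomes VERB after a PRON.
lexicon = {"I": "PRON", "can": "NOUN", "swim": "VERB"}
rules = [("NOUN", "VERB", "PRON")]
out = apply_rules(initial_tag(["I", "can", "swim"], lexicon), rules)
print(out)
```

The threshold mentioned in the abstract typically controls how many errors a candidate rule must fix before it is kept, which trades rule-list size against accuracy.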


Author(s):  
Kesavan Vadakalur Elumalai ◽  
Niladri Sekhar Das ◽  
Mufleh Salem M. Alqahtani ◽  
Anas Maktabi

Part-of-speech (POS) tagging is an indispensable method of text processing. Its main aim is to assign a part of speech to each word after considering its actual contextual syntactic-cum-semantic role in the text where it occurs (Siemund & Claridge 1997). This is a useful strategy in language processing, language technology, machine learning, machine translation, and computational linguistics, as it generates output that enables a system to work with natural language texts with greater accuracy and success. POS tagging is also known as 'grammatical annotation' and 'word category disambiguation' in some areas of linguistics where analysis of the form and function of words is an important avenue for better comprehension and application of texts. Since the primary task of POS tagging is to assign a tag to each word, manually or automatically, in a piece of natural language text, it must pay adequate attention to the contexts in which words are used. This is a tough challenge for a system, which normally fails to know how a word carries specific linguistic information in a text and what kind of larger syntactic frame it requires for its operation. The present paper takes up this issue and critically explores whether some well-known POS tagging systems are capable of handling this challenge, and whether they succeed in assigning appropriate POS tags to words without accessing information from extratextual domains. The novelty of the paper lies in its examination of POS tagging schemes proposed so far, to see whether these systems actually cope with the complexities involved in tagging words in texts. It also checks whether the performance of these systems is better than manual POS tagging, and whether the information and insights gathered from such enterprises are useful for enhancing our understanding of the identity and function of words used in texts. All of this is addressed with reference to some of the POS taggers available to us. Moreover, the paper examines how a POS-tagged text is useful in various applications, thereby raising awareness of the multifunctionality of tagged texts among language users.


Author(s):  
Anto Arockia Rosaline R. ◽  
Parvathi R.

Text analytics is the process of extracting high-quality information from text. A set of statistical, linguistic, and machine learning techniques is used to represent the information content of various textual sources for data analysis, research, or investigation. Text is the common mode of communication in social media. Understanding text involves a variety of tasks, including text classification and handling slang and other languages. Traditional Natural Language Processing (NLP) techniques require extensive pre-processing to handle such text. When the word "Amazon" occurs in social media text, there should be a meaningful approach to determine whether it refers to the forest or the Kindle. Much of the time, NLP techniques fail to handle slang and spelling variants correctly. Messages on Twitter are so short that it is difficult to build semantic connections between them. Some messages, such as "Gud nite", contain no real words at all but are still used for communication.


2021 ◽  
pp. 587-595
Author(s):  
Alebachew Chiche ◽  
Hiwot Kadi ◽  
Tibebu Bekele

Natural language processing plays a great role in providing an interface for human-computer communication, enabling people to interact with the computer in their own language rather than machine language. This study presents a part-of-speech (POS) tagger that can assign a word class to each word in a given sentence. Researchers have developed POS taggers for languages such as English, Amharic, Afan Oromo, and Tigrigna; many other languages, such as the Shekki'noono language, still lack one. A POS tagger is incorporated as a basic component in most natural language processing tools, such as machine translation and information extraction, so developing one for a language is a prerequisite for building advanced natural language applications, which in turn enhance machine-to-machine, machine-to-human, and human-to-human communication. However, a POS tagger for one language cannot be directly applied to another. To develop the Shekki'noono POS tagger, we used the stochastic Hidden Markov Model (HMM). For the study, we collected 1,500 sentences from sources such as newspapers (covering social, economic, and political topics), modules, textbooks, radio programs, and bulletins. The collected sentences were labeled by language experts, who assigned the appropriate part of speech to each word. In the experiments carried out, the POS tagger was trained on the training sets using the Hidden Markov Model. As the experiments showed, HMM-based POS tagging achieved 92.77% accuracy for Shekki'noono, and the model is compared with previous HMM-based experiments in related work. As future work, the proposed approach can be evaluated on a larger corpus.
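An HMM POS tagger of the kind described above decodes the most probable tag sequence with the Viterbi algorithm. A minimal sketch with invented transition and emission probabilities (a real model estimates them from the expert-labeled sentences):

```python
import math

# Toy HMM parameters, invented for illustration only.
TAGS = ["PRON", "NOUN", "VERB"]
TRANS = {("<s>", "PRON"): 0.6, ("<s>", "NOUN"): 0.4,
         ("PRON", "VERB"): 0.8, ("PRON", "NOUN"): 0.2,
         ("NOUN", "NOUN"): 0.5, ("NOUN", "VERB"): 0.5,
         ("VERB", "NOUN"): 0.7, ("VERB", "VERB"): 0.3}
EMIT = {("PRON", "he"): 0.9, ("VERB", "eats"): 0.6,
        ("NOUN", "fish"): 0.5, ("VERB", "fish"): 0.1}
FLOOR = 1e-8  # crude smoothing for unseen transitions/emissions

def viterbi(words):
    """Viterbi decoding: for each tag, keep the best-scoring path ending
    in that tag, extending all paths one word at a time."""
    best = {"<s>": (0.0, [])}
    for w in words:
        new = {}
        for tag in TAGS:
            emit = math.log(EMIT.get((tag, w), FLOOR))
            new[tag] = max(
                (score + math.log(TRANS.get((prev, tag), FLOOR)) + emit,
                 path + [tag])
                for prev, (score, path) in best.items())
        best = new
    return max(best.values())[1]

print(viterbi(["he", "eats", "fish"]))
```

Note how "fish" is tagged NOUN here even though it has a VERB reading: the transition probability from VERB favors a following NOUN, which is exactly the contextual disambiguation an HMM provides.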


2016 ◽  
Vol 26 (04) ◽  
pp. 1750060 ◽  
Author(s):  
Chengyao Lv ◽  
Huihua Liu ◽  
Yuanxing Dong ◽  
Fangyuan Li ◽  
Yuan Liang

In natural language processing (NLP), a crucial subsystem in a wide range of applications is the part-of-speech (POS) tagger, which labels (or classifies) unannotated words of natural language with POS labels corresponding to categories such as noun, verb, or adjective. This paper proposes a uniform-design gene expression programming (UGEP) model for POS tagging. UGEP is used to search for appropriate structures in the function space of POS tagging problems. After evolving the sequence of tags, GEP finds the best individual as the solution. Experiments on the Brown Corpus show that (1) in closed-lexicon tests, the UGEP model achieves an accuracy of 98.8%, much better than genetic algorithm models, neural networks, and hidden Markov model (HMM) approaches; and (2) in open-lexicon tests, the proposed model achieves an accuracy of 97.4%, with a high accuracy of 88.6% on unknown words.
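The closed- versus open-lexicon results above hinge on measuring accuracy separately on unknown words. A minimal, self-contained evaluation sketch (the tiny gold/predicted sequences and the nonsense word are invented):

```python
def accuracies(gold, predicted, known_words):
    """Overall token accuracy plus separate accuracy on unknown words,
    mirroring the closed- vs open-lexicon distinction."""
    total = correct = unk_total = unk_correct = 0
    for (w, g), (_, p) in zip(gold, predicted):
        total += 1
        correct += g == p
        if w not in known_words:  # word absent from the training lexicon
            unk_total += 1
            unk_correct += g == p
    overall = correct / total
    unknown = unk_correct / unk_total if unk_total else float("nan")
    return overall, unknown

gold = [("the", "DET"), ("flurb", "NOUN"), ("ran", "VERB")]
pred = [("the", "DET"), ("flurb", "VERB"), ("ran", "VERB")]
overall, unknown = accuracies(gold, pred, known_words={"the", "ran"})
print(overall, unknown)
```

In a closed-lexicon test every evaluation word is in `known_words`, so the unknown-word figure only becomes meaningful in the open-lexicon setting.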

