part of speech tagging Latest Research Papers

Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. 
All datasets and models employed in this paper are freely available as supplementary materials.
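The contrast between aggregated (hard) and soft labels described above can be sketched in a few lines. This is a minimal illustration, not the survey's code: the tagset, the five annotator judgments, the model prediction, and the helper names `soft_label`, `hard_label`, and `cross_entropy` are all assumptions made for the example.

```python
import math
from collections import Counter

def soft_label(judgments, tagset):
    """Turn annotator judgments into a probability distribution over the tagset."""
    counts = Counter(judgments)
    total = len(judgments)
    return [counts[tag] / total for tag in tagset]

def hard_label(judgments, tagset):
    """Majority-vote aggregation into a one-hot vector.

    Ties go to the tag that appears first in the tagset.
    """
    counts = Counter(judgments)
    winner = max(tagset, key=lambda tag: counts[tag])
    return [1.0 if tag == winner else 0.0 for tag in tagset]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Hypothetical item: five annotators disagree on the tag of one token.
TAGSET = ["NOUN", "VERB", "ADJ"]
judgments = ["NOUN", "NOUN", "ADJ", "NOUN", "ADJ"]

soft = soft_label(judgments, TAGSET)   # keeps the 3-to-2 disagreement
hard = hard_label(judgments, TAGSET)   # keeps only the majority tag

# Hypothetical model output for the same token.
prediction = [0.7, 0.1, 0.2]

print("loss against hard label:", cross_entropy(hard, prediction))
print("loss against soft label:", cross_entropy(soft, prediction))
```

Training against `soft` penalizes the model for ignoring the minority reading, whereas training against `hard` treats the majority tag as the only acceptable answer.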

Part-of-Speech Tagging with Rule-Based Data Preprocessing and Transformer

Electronics ◽

10.3390/electronics11010056 ◽

2021 ◽

Vol 11 (1) ◽

pp. 56

Author(s):

Hongwei Li ◽

Hongyan Mao ◽

Jingzi Wang

Keyword(s):

Language Processing ◽

Short Term Memory ◽

Contextual Information ◽

Data Preprocessing ◽

Experimental Result ◽

Rule Based ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Novel Approach

Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). POS tagging for a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging can be an upstream task for other NLP tasks, further improving their performance. Therefore, it is important to improve the accuracy of POS tagging. In POS tagging, bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used and achieves good performance. However, Bi-LSTM is not as powerful as Transformer in leveraging contextual information, since Bi-LSTM simply concatenates the contextual information from left-to-right and right-to-left. In this study, we propose a novel approach for POS tagging to improve the accuracy. For each token, all possible POS tags are obtained without considering context, and then rules are applied to prune out these possible POS tags, which we call rule-based data preprocessing. In this way, the number of possible POS tags of most tokens can be reduced to one, and they are considered to be correctly tagged. Finally, POS tags of the remaining tokens are masked, and a model based on Transformer is used to only predict the masked POS tags, which enables it to leverage bidirectional contexts. Our experimental result shows that our approach leads to better performance than other methods using Bi-LSTM.
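The preprocessing idea in this abstract (collect all candidate tags per token, prune them with rules, then mask whatever remains ambiguous) might look roughly like the sketch below. The lexicon, the single pruning rule, and the function names are hypothetical; the abstract does not list the paper's actual rules, and the Transformer that predicts the masked tags is not shown.

```python
MASK = "[MASK]"

# Hypothetical context-free lexicon: every tag a word form can take.
LEXICON = {
    "the": {"DET"},
    "run": {"NOUN", "VERB"},
    "ended": {"VERB"},
    "early": {"ADJ", "ADV"},
}

def candidate_tags(tokens):
    """Look up all possible tags for each token without using context."""
    return [set(LEXICON.get(token.lower(), set())) for token in tokens]

def prune_after_determiner(candidates):
    """Illustrative rule (an assumption): a token directly after an
    unambiguous determiner is not a verb or auxiliary."""
    pruned = [set(c) for c in candidates]
    for i in range(1, len(pruned)):
        if pruned[i - 1] == {"DET"}:
            remaining = pruned[i] - {"VERB", "AUX"}
            if remaining:  # never prune a token down to zero candidates
                pruned[i] = remaining
    return pruned

def mask_ambiguous(candidates):
    """Keep the tag when exactly one candidate is left, otherwise mask it."""
    return [next(iter(c)) if len(c) == 1 else MASK for c in candidates]

tokens = ["The", "run", "ended", "early"]
tags = mask_ambiguous(prune_after_determiner(candidate_tags(tokens)))
print(list(zip(tokens, tags)))
```

Only the positions still marked `[MASK]` would be passed to the model for prediction; the other tags are treated as already resolved.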

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Computational Linguistics ◽

10.1162/coli_a_00425 ◽

2021 ◽

pp. 1-38

Author(s):

Gözde Gül Şahin

Keyword(s):

Language Models ◽

Semantic Role ◽

Semantic Role Labeling ◽

Dependency Parsing ◽

Low Resource ◽

Part Of Speech Tagging ◽

High Resource ◽

Part Of Speech ◽

Augmentation Techniques ◽

Speech Tagging

Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One way to counter this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently witnessed a large number of textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies, which perform changes at the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families, using various models including architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the techniques we experimented with to be effective on morphologically rich languages in general rather than on analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT, especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements.
Finally, we discuss how the results depend most heavily on the task, the language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and the model type (e.g., token-level augmentation provides significant improvements for BPE, while character-level ones give generally higher scores for char and mBERT based models).
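Two of the augmentation operations named in this abstract, character swapping and random word insertion, can be illustrated with a short sketch. The function names, the sample sentence, the vocabulary, and the choice to keep the first and last characters fixed are assumptions for illustration only; the article's exact procedures, and how labels are handled for inserted words in tagging tasks, are not given in the abstract.

```python
import random

def swap_characters(word, rng):
    """Character-level augmentation: swap two adjacent inner characters.

    The first and last characters stay in place (an assumption made here);
    words shorter than four characters are returned unchanged.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def insert_random_word(tokens, vocabulary, rng):
    """Token-level augmentation: insert one vocabulary word at a random position."""
    position = rng.randrange(len(tokens) + 1)
    return tokens[:position] + [rng.choice(vocabulary)] + tokens[position:]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = ["augmentation", "creates", "synthetic", "training", "examples"]
vocabulary = ["new", "noisy", "extra"]  # hypothetical insertion vocabulary

swapped = [swap_characters(token, rng) for token in sentence]
inserted = insert_random_word(sentence, vocabulary, rng)
print(swapped)
print(inserted)
```

Character swapping leaves the token count unchanged, so existing per-token labels can be reused as they are; insertion changes the sequence length, so a label for the new token would have to be chosen separately.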

A Rule-Based Approach for Marathi Part-of-Speech Tagging

10.1007/978-981-16-4177-0_76 ◽

2021 ◽

pp. 773-785

Author(s):

P. Kadam Vaishali ◽

Khandale Kalpana ◽

C. Namrata Mahender

Keyword(s):

Rule Based ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Rule Based Approach ◽

Speech Tagging

A Template-Based Approach for Tagging Non-Vocalized Arabic Nouns

Academic Journal of Research and Scientific Publishing ◽

10.52132/ajrsp.e.2021.32.1 ◽

2021 ◽

Vol 3 (32) ◽

pp. 05-35

Author(s):

Hashem Alsharif

Keyword(s):

Linear Part ◽

Arabic Language ◽

Arabic Text ◽

Rule Based ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Pos Tagger ◽

Log Linear ◽

Speech Tagging

There exist no corpora of Arabic nouns. Furthermore, in any Arabic text, nouns can be found in different forms. By tagging nouns in an Arabic text, it becomes possible to determine whether each sentence begins with a noun or a verb. Part-of-speech (POS) tagging is the task of labeling each word in a sentence with its appropriate category, which is called a tag (Noun, Verb, or Article). In this thesis, we attempt to tag non-vocalized Arabic text. The proposed POS tagger for Arabic text is based on searching for each word of the text in our lists of verbs and articles. Nouns are found by eliminating verbs and articles: our hypothesis states that if a word in the text is not found in our lists, then it is a noun. These comparisons are made for each word in the text until all of them have been tagged. To apply our method, we prepared a list of verbs and articles in the Arabic language, 112 million entries in total, which we use in our comparisons to test our hypothesis. To evaluate the proposed method, we used 78,245 pre-tagged words from "The Quranic Arabic Corpus" and compared our template-based tagging approach with AraMorph, a rule-based tagging approach, and with the Stanford Log-linear Part-Of-Speech Tagger. AraMorph tagged 40% of the words correctly and the Stanford Log-linear Part-Of-Speech Tagger 68%, while our method tagged 68,501 words correctly (88%).
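The elimination procedure described here (look each word up in the verb and article lists and call it a noun if it appears in neither) reduces to a pair of set lookups. The sketch below uses a handful of transliterated placeholder words as stand-ins for the author's lists, which are not reproduced on this page; the function name `tag_by_elimination` is likewise an assumption. The last lines recompute the reported accuracy from the two counts given in the abstract.

```python
# Placeholder word lists (illustrative transliterations, not the author's data).
VERBS = {"kataba", "dhahaba"}
ARTICLES = {"fi", "min"}

def tag_by_elimination(tokens, verbs, articles):
    """Tag each token as Verb or Article if listed, otherwise as Noun."""
    tagged = []
    for token in tokens:
        if token in verbs:
            tagged.append((token, "Verb"))
        elif token in articles:
            tagged.append((token, "Article"))
        else:
            tagged.append((token, "Noun"))  # the default under the stated hypothesis
    return tagged

result = tag_by_elimination(["kataba", "kitab", "fi"], VERBS, ARTICLES)
print(result)

# Accuracy check using the counts reported in the abstract.
correct, total = 68_501, 78_245
accuracy_percent = 100 * correct / total
print(f"{accuracy_percent:.1f}%")  # about 87.5%, reported as 88% after rounding
```

Because every out-of-list word defaults to Noun, the coverage of the two lists directly bounds the tagger's accuracy on verbs and articles.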

Part-Of-Speech Tagging for Mizo Language Using Conditional Random Field

Computación y Sistemas ◽

10.13053/cys-25-4-4044 ◽

2021 ◽

Vol 25 (4) ◽

Author(s):

Morrel VL Nunsanga ◽

Partha Pakray ◽

C. Lallawmsanga ◽

L. Lolit Kumar Singh

Keyword(s):

Random Field ◽

Conditional Random Field ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Amazigh part-of-speech tagging with machine learning and deep learning

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v24.i3.pp1814-1822 ◽

2021 ◽

Vol 24 (3) ◽

pp. 1814

Author(s):

Otman Maarouf ◽

Rachid El Ayachi ◽

Mohamed Biniz

Keyword(s):

Decision Tree ◽

Language Processing ◽

Conditional Random Fields ◽

Short Term Memory ◽

Long Distance ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

French And English ◽

Speech Tagging

Natural language processing (NLP) is a branch of artificial intelligence that analyzes, understands, and transforms natural languages with computers in written and spoken settings. Part-of-speech (POS) tagging marks each word according to its grammatical role in the sentence. According to the literature, POS tagging has been applied to only a few languages, in particular French and English. This paper investigates attention-based long short-term memory (LSTM) networks and a simple recurrent neural network (RNN) for Tifinagh POS tagging, comparing them with conditional random fields (CRF) and a decision tree. The attractiveness of LSTM networks lies in their strength in modeling long-distance dependencies. The experimental results show that LSTM networks perform better than the RNN, CRF, and decision tree models, which come close in performance.

Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3464378 ◽

2021 ◽

Vol 20 (6) ◽

pp. 1-16

Author(s):

Hour Kaing ◽

Chenchen Ding ◽

Masao Utiyama ◽

Eiichiro Sumita ◽

Sethserey Sam ◽

...

Keyword(s):

Short Term Memory ◽

Conditional Random Field ◽

Support Vector ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Long Short Term Memory ◽

Syntactic Annotation ◽

Near Future ◽

Speech Tagging

As a highly analytic language, Khmer has considerable ambiguities in tokenization and part-of-speech (POS) tagging processing. This topic is investigated in this study. Specifically, a 20,000-sentence Khmer corpus with manual tokenization and POS-tagging annotation is released after a series of work over the last 4 years. This is the largest morphologically annotated Khmer dataset as of 2020, when this article was prepared. Based on the annotated data, experiments were conducted to establish a comprehensive benchmark on the automatic processing of tokenization and POS-tagging for Khmer. Specifically, a support vector machine, a conditional random field (CRF), a long short-term memory (LSTM)-based recurrent neural network, and an integrated LSTM-CRF model have been investigated and discussed. As a primary conclusion, processing at morpheme-level is satisfactory for the provided data. However, it is intrinsically difficult to identify further grammatical constituents of compounds or phrases because of the complex analytic features of the language. Syntactic annotation and automatic parsing for Khmer will be scheduled in the near future.

Universal Dependencies for Tweets in Brazilian Portuguese: Tokenization and Part of Speech Tagging

10.5753/eniac.2021.18273 ◽

2021 ◽

Author(s):

Emanuel Huber da Silva ◽

Thiago Alexandre Salgueiro Pardo ◽

Norton Trevisan Roman ◽

Ariani Di Fellipo

Keyword(s):

State Of The Art ◽

Preliminary Evidence ◽

Brazilian Portuguese ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Current State ◽

Language User ◽

Pos Tagger ◽

Speech Tagging

Automatically dealing with Natural Language User-Generated Content (UGC) is a challenging task of utmost importance, given the amount of information available over the web. We present in this paper an effort on building tokenization and Part of Speech (PoS) tagging systems for tweets in Brazilian Portuguese, following the guidelines of the Universal Dependencies (UD) project. We propose a rule-based tokenizer and the customization of current state-of-the-art UD-based tagging strategies for Portuguese, achieving a 98% f-score for tokenization, and a 95% f-score for PoS tagging. We also introduce DANTEStocks, the corpus of stock market tweets on which we base our work, presenting preliminary evidence of the multi-genre capacity of our PoS tagger.

part of speech tagging
Recently Published Documents

A Transformer-based Neural Model for Chinese Word Segmentation and Part-of-Speech Tagging

Learning from Disagreement: A Survey

Part-of-Speech Tagging with Rule-Based Data Preprocessing and Transformer

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

A Rule-Based Approach for Marathi Part-of-Speech Tagging

A Template-Based Approach for Tagging Non-Vocalized Arabic Nouns

Part-Of-Speech Tagging for Mizo Language Using Conditional Random Field

Amazigh part-of-speech tagging with machine learning and deep learning

Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion

Universal Dependencies for Tweets in Brazilian Portuguese: Tokenization and Part of Speech Tagging
