Weakly Supervised POS Taggers Perform Poorly on Truly Low-Resource Languages

2020 ◽  
Vol 34 (05) ◽  
pp. 8066-8073
Author(s):  
Katharina Kann ◽  
Ophélie Lacroix ◽  
Anders Søgaard

Part-of-speech (POS) taggers for low-resource languages that are based exclusively on various forms of weak supervision (e.g., cross-lingual transfer, type-level supervision, or a combination thereof) have been reported to perform almost as well as supervised ones. However, weakly supervised POS taggers are commonly evaluated only on languages that are very different from truly low-resource languages, and the taggers use sources of information, like high-coverage and almost error-free dictionaries, that are likely not available for resource-poor languages. We train and evaluate state-of-the-art weakly supervised POS taggers for a typologically diverse set of 15 truly low-resource languages. On these languages, given a realistic amount of resources, even our best model gets less than half of the words right. Our results highlight the need for new and different approaches to POS tagging for truly low-resource languages.

Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1372
Author(s):  
Sanjanasri JP ◽  
Vijay Krishna Menon ◽  
Soman KP ◽  
Rajendran S ◽  
Agnieszka Wolk

Linguists have long focused on qualitative comparison of the semantics of different languages. Evaluating semantic interpretation across a disparate language pair like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe, and FastText. A novel evaluation paradigm was devised for the generated embeddings to assess their effectiveness, using the original embeddings as ground truths. The transferability of the proposed model to other target languages was assessed via pre-trained Word2Vec embeddings for Hindi and Chinese. We empirically demonstrate that, with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of the generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that these are not the only possible applications.
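
The abstract does not spell out the deep transfer architecture, so purely as orientation, here is a minimal sketch of the classic linear baseline for cross-lingual embedding transfer: learn a matrix W that maps dictionary-paired source vectors onto their target counterparts. The random toy data, dimensions, and least-squares solver are illustrative assumptions, not the authors' models.

```python
import numpy as np

# Hypothetical toy data: each row pairs a source (English) vector with
# the vector of its dictionary translation in the target (Tamil) space.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))  # source vectors for ~1k dictionary words
Y = rng.normal(size=(1000, 300))  # target vectors of their translations

# Least-squares transfer function W minimizing ||XW - Y||_F.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def project(source_vec):
    """Map a source-language vector into the target embedding space."""
    return source_vec @ W

# A projected source vector can then be compared (e.g. by cosine
# similarity) against real target vectors, as in bilingual dictionary
# induction.
projected = project(X[0])
```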


Author(s):  
Željko Agić ◽  
Anders Johannsen ◽  
Barbara Plank ◽  
Héctor Martínez Alonso ◽  
Natalie Schluter ◽  
...  

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.
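
As a rough illustration of annotation projection (not the paper's exact multi-source method), the core step transfers tags from a tagged source sentence to an unlabeled target sentence through word-alignment links; a minimal sketch, assuming alignments are already computed:

```python
from collections import Counter

def project_pos_tags(source_tags, alignment, target_len):
    """Project POS tags from a tagged source sentence onto an unlabeled
    target sentence through word-alignment links.

    source_tags: list of tags for the source tokens
    alignment:   list of (src_idx, tgt_idx) alignment links
    target_len:  number of target tokens
    """
    votes = [Counter() for _ in range(target_len)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1
    # Majority vote per target token; unaligned tokens stay None and
    # would be filtered or filled in before training a target tagger.
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Toy example: English "the dog runs" aligned one-to-one to 3 tokens.
print(project_pos_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 3))
```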


2019 ◽  
Author(s):  
Lingjun Zhao ◽  
Rabih Zbib ◽  
Zhuolin Jiang ◽  
Damianos Karakos ◽  
Zhongqiang Huang

2021 ◽  
Author(s):  
David Sabiiti Bamutura

Current research in computational linguistics and NLP requires the existence of language resources. Whereas these resources are available for only a few well-resourced languages, many languages have been neglected. Among the neglected and/or under-resourced languages are Runyankore and Rukiga (henceforth referred to as Ry/Rk). In this paper, we report on Ry/Rk-Lex, a moderately large computational lexicon for Ry/Rk that we constructed from various existing data sources. Ry/Rk are two under-resourced Bantu languages with virtually no computational resources. About 9,400 lemmata have been entered so far. Ry/Rk-Lex has been enriched with syntactic and lexical semantic features, with the intent of providing a reference computational lexicon for Ry/Rk for (1) other NLP tasks, such as morphological analysis and generation, part-of-speech (POS) tagging, and named entity recognition (NER); and (2) applications, such as spell and grammar checking and cross-lingual information retrieval (CLIR). We have used Ry/Rk-Lex to dramatically increase the lexical coverage of previously developed computational resource grammars for Ry/Rk.
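
The abstract does not give the Ry/Rk-Lex schema; purely as an illustration of what an enriched lexicon entry might look like, here is a hypothetical record shape (the field names, noun-class field, and example entry are assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LexEntry:
    """Hypothetical lexicon record; the real Ry/Rk-Lex schema is not
    given in the abstract."""
    lemma: str
    pos: str                          # part of speech
    noun_class: Optional[str] = None  # Bantu noun class, if applicable
    features: dict = field(default_factory=dict)  # lexical semantic features

# Example entry (gloss and feature values are assumptions).
entry = LexEntry(lemma="omuntu", pos="NOUN", noun_class="1",
                 features={"gloss": "person", "animacy": "human"})
```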


2018 ◽  
Vol 25 (1) ◽  
pp. 43-67
Author(s):  
O. ZENNAKI ◽  
N. SEMMAR ◽  
L. BESACIER

This work focuses on the rapid development of linguistic annotation tools for low-resource languages (languages that have no labeled training data). We experiment with several cross-lingual annotation projection methods using recurrent neural network (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information; (b) it does not assume any knowledge about target languages (the one requirement is that source and target languages are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages; (c) it provides authentic multilingual taggers (one tagger for N languages). We investigate both uni- and bidirectional RNN models and propose a method to include external information (for instance, low-level information from part-of-speech tags) in the RNN to train higher-level taggers (for instance, Super Sense taggers). We demonstrate the validity and genericity of our model using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual part-of-speech and Super Sense taggers. We also use our approach in a weakly supervised context, and it shows excellent potential for very low-resource settings (less than 1k training utterances).
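
As a sketch of the general setup (not the authors' exact configuration), a bidirectional RNN tagger reads word representations and emits per-token tag scores; the layer sizes, the GRU choice, and the plain embedding stand-in are assumptions:

```python
import torch
import torch.nn as nn

class BiRNNTagger(nn.Module):
    """Minimal bidirectional RNN tagger; sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256, n_tags=17):
        super().__init__()
        # In the paper the input is a multilingual word representation
        # built from a parallel corpus; a plain embedding stands in here.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, bidirectional=True,
                          batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        # External low-level information (e.g. POS tags when training a
        # Super Sense tagger) could be concatenated to x at this point.
        h, _ = self.rnn(x)
        return self.out(h)  # per-token tag scores

tagger = BiRNNTagger(vocab_size=10_000)
scores = tagger(torch.randint(0, 10_000, (2, 12)))  # (batch, seq, n_tags)
```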


Author(s):  
Xilun Chen ◽  
Yu Sun ◽  
Ben Athiwaratkun ◽  
Claire Cardie ◽  
Kilian Weinberger

In recent years, great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data in a resource-rich source language to low-resource languages where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take input from a shared feature extractor to learn hidden representations that are simultaneously indicative for the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms state-of-the-art systems.
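
A minimal sketch of the two-branch adversarial idea, using the standard gradient-reversal trick; all sizes are illustrative assumptions, and the published ADAN's training objective differs in detail:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated gradient in the backward
    pass, the standard trick for adversarial feature learning."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class ADANSketch(nn.Module):
    """Rough shape of an adversarial deep averaging network: a shared
    averaging feature extractor feeding a sentiment classifier and a
    language discriminator."""
    def __init__(self, vocab_size, dim=300, hidden=128, n_classes=2):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean of word vectors
        self.features = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.sentiment = nn.Linear(hidden, n_classes)  # task branch
        self.language = nn.Linear(hidden, 2)           # adversary branch

    def forward(self, token_ids, lambd=1.0):
        f = self.features(self.embed(token_ids))
        return self.sentiment(f), self.language(GradReverse.apply(f, lambd))

model = ADANSketch(vocab_size=5000)
sent_logits, lang_logits = model(torch.randint(0, 5000, (4, 20)))
```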


2020 ◽  
Vol 8 ◽  
pp. 109-124
Author(s):  
Shuyan Zhou ◽  
Shruti Rijhwani ◽  
John Wieting ◽  
Jaime Carbonell ◽  
Graham Neubig

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in low-resource languages by utilizing resources in closely related languages, but their performance still lags far behind that of their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model in low-resource scenarios. The methods are simple but effective: we experiment with our approach on seven XEL datasets and find that it yields an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.
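
For orientation, candidate generation is the retrieval step the paper improves; a toy sketch that ranks KB entity names by string similarity to a mention (the authors' actual methods are more involved):

```python
from difflib import SequenceMatcher

def generate_candidates(mention, kb_entity_names, k=30):
    """Toy candidate generation: rank KB entity names by string
    similarity to the mention and keep the top k."""
    ranked = sorted(kb_entity_names,
                    key=lambda name: SequenceMatcher(
                        None, mention.lower(), name.lower()).ratio(),
                    reverse=True)
    return ranked[:k]

kb = ["Barack Obama", "Barack Obama Sr.", "Michelle Obama"]
print(generate_candidates("Obama", kb, k=2))
```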


2021 ◽  
Vol 13 (24) ◽  
pp. 5009
Author(s):  
Lingbo Huang ◽  
Yushi Chen ◽  
Xin He

In recent years, supervised learning-based methods have achieved excellent performance for hyperspectral image (HSI) classification. However, the collection of training samples with labels is not only costly but also time-consuming. This fact usually gives rise to weak supervision, including inaccurate supervision, where mislabeled samples exist, and incomplete supervision, where unlabeled samples exist. Focusing on inaccurate and incomplete supervision, the weakly supervised classification of HSI is investigated in this paper. For inaccurate supervision, complementary learning (CL) is first introduced for HSI classification. Then, a new method based on selective CL and a convolutional neural network (SeCL-CNN) is proposed for classification with noisy labels. For incomplete supervision, a data augmentation-based method that combines mixup and Pseudo-Label (Mix-PL) is proposed. Then, a classification method combining Mix-PL and CL (Mix-PL-CL) is designed for better semi-supervised classification of HSI. The proposed weakly supervised methods are evaluated on three widely used hyperspectral datasets (i.e., the Indian Pines, Houston, and Salinas datasets). The obtained results reveal that the proposed methods are competitive with the state of the art. For inaccurate supervision, the proposed SeCL-CNN outperforms the state-of-the-art method (i.e., SSDP-CNN) by 0.92%, 1.84%, and 1.75% in terms of OA on the three datasets when the noise ratio is 30%. For incomplete supervision, the proposed Mix-PL-CL outperforms the state-of-the-art method (i.e., AROC-DP) by 1.03%, 0.70%, and 0.82% in terms of OA on the three datasets with 25 training samples per class.
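
As a rough illustration of the two ingredients (not the exact SeCL-CNN or Mix-PL-CL formulations), a complementary-label loss and a mixup step might look like this; the loss form and the fixed mixing coefficient are assumptions:

```python
import numpy as np

def complementary_loss(probs, comp_label):
    """Complementary learning (CL): the label says which class a sample
    does NOT belong to, which is cheap and robust under label noise.
    One simple loss pushes down the probability of that class.
    (SeCL-CNN adds selective sample filtering on top of such a loss.)"""
    return -np.log(1.0 - probs[comp_label] + 1e-12)

def mixup(x1, y1, x2, y2, lam=0.7):
    """Mixup: a convex combination of two samples and their one-hot
    labels; in Mix-PL the labels of unlabeled pixels are pseudo-labels
    predicted by the current model. The fixed coefficient stands in for
    a draw from a Beta distribution."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Toy usage: 3-class probabilities, then two "pixels" with one-hot labels.
probs = np.array([0.7, 0.2, 0.1])
print(complementary_loss(probs, comp_label=2))
xm, ym = mixup(np.ones(200), np.array([1.0, 0.0]),
               np.zeros(200), np.array([0.0, 1.0]))
```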


Author(s):  
M. Bevza

We analyze neural network architectures that yield state-of-the-art results on the named entity recognition (NER) task and propose a number of new architectures for improving results even further. We have analyzed a number of ideas and approaches that researchers have used to achieve state-of-the-art results in a variety of NLP tasks. In this work, we present a few architectures that we consider most likely to improve the existing state-of-the-art solutions for the NER and part-of-speech (POS) tagging tasks. The architectures are inspired by recent developments in multi-task learning. This work tests the hypothesis that NER and POS tagging are related tasks, so adding information about POS tags as input to the network can help achieve better NER results and, vice versa, information about NER tags can help solve the POS tagging task. This work also contains the implementation of the network and the results of the experiments, together with conclusions and future work.
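
A minimal sketch of the multi-task idea, assuming a shared encoder with separate POS and NER heads; the LSTM encoder and layer sizes are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class JointTagger(nn.Module):
    """Shared encoder with separate POS and NER heads, so each task's
    training signal can inform the other."""
    def __init__(self, vocab_size, dim=100, hidden=128, n_pos=17, n_ner=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, hidden, bidirectional=True,
                               batch_first=True)
        self.pos_head = nn.Linear(2 * hidden, n_pos)
        self.ner_head = nn.Linear(2 * hidden, n_ner)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        # Joint training sums the POS and NER losses over these scores.
        return self.pos_head(h), self.ner_head(h)

model = JointTagger(vocab_size=8_000)
pos_scores, ner_scores = model(torch.randint(0, 8_000, (2, 15)))
```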


2021 ◽  
Vol 9 ◽  
pp. 1389-1406
Author(s):  
Shayne Longpre ◽  
Yi Lu ◽  
Joachim Daiber

Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on a heavily curated, language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to date for evaluating question answering. We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero-shot and translation settings. Results indicate this dataset is challenging even in English, but especially so in low-resource languages.

