Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

2021
Author(s):  
Takashi Wada ◽  
Tomoharu Iwata ◽  
Yuji Matsumoto ◽  
Timothy Baldwin ◽  
Jey Han Lau

2018
Vol 25 (1)
pp. 43-67
Author(s):  
O. ZENNAKI ◽  
N. SEMMAR ◽  
L. BESACIER

This work focuses on the rapid development of linguistic annotation tools for low-resource languages (languages that have no labeled training data). We experiment with several cross-lingual annotation projection methods using recurrent neural network (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between the source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information; (b) it does not assume any knowledge about the target languages (the one requirement is that the two languages, source and target, are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages; and (c) it provides authentic multilingual taggers (one tagger for N languages). We investigate both uni- and bidirectional RNN models and propose a method to include external information (for instance, low-level information from part-of-speech tags) in the RNN to train higher-level taggers (for instance, Super Sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual part-of-speech and Super Sense taggers. We also use our approach in a weakly supervised context, and it shows excellent potential for very low-resource settings (less than 1k training utterances).
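
The tagging architecture lends itself to a compact illustration. Below is a minimal sketch, assuming a PyTorch implementation with hypothetical module and parameter names (not the authors' released code): a bidirectional RNN tagger over a shared cross-lingual word representation, with an optional input channel for low-level features such as POS tags when training higher-level taggers.

    # Minimal sketch; all names and dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class BiRNNTagger(nn.Module):
        def __init__(self, vocab_size, emb_dim, pos_dim, hidden_dim, n_tags, use_pos=False):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)   # representation shared across languages
            self.use_pos = use_pos
            rnn_in = emb_dim + (pos_dim if use_pos else 0)
            self.rnn = nn.GRU(rnn_in, hidden_dim, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden_dim, n_tags)     # one tagger serving N languages

        def forward(self, word_ids, pos_feats=None):
            x = self.embed(word_ids)                          # (batch, seq, emb_dim)
            if self.use_pos and pos_feats is not None:
                x = torch.cat([x, pos_feats], dim=-1)         # inject low-level information (e.g. POS)
            h, _ = self.rnn(x)                                # (batch, seq, 2*hidden_dim)
            return self.out(h)                                # per-token tag scores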


Author(s):  
Mikel Artetxe ◽  
Holger Schwenk

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI data set), cross-lingual document classification (MLDoc data set), and parallel corpus mining (BUCC data set) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder, and the multilingual test set are available at https://github.com/facebookresearch/LASER.
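
As a rough illustration of the encoder just described, the sketch below assumes a PyTorch implementation with placeholder dimensions and layer count (not the published configuration): a BiLSTM over a shared byte-pair-encoded vocabulary, max-pooled over time to yield a fixed-size, language-agnostic sentence embedding. The auxiliary decoder used during training is omitted.

    # Minimal sketch; dimensions and depth are placeholders, not the released model.
    import torch
    import torch.nn as nn

    class SharedBiLSTMEncoder(nn.Module):
        def __init__(self, bpe_vocab_size, emb_dim=320, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(bpe_vocab_size, emb_dim)  # one BPE vocabulary for all languages
            self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

        def forward(self, bpe_ids):
            h, _ = self.lstm(self.embed(bpe_ids))               # (batch, seq, 2*hidden_dim)
            return h.max(dim=1).values                          # max-pool over time -> sentence embedding

A classifier trained only on English embeddings produced by such a shared encoder can then be applied unchanged to embeddings of any other covered language, which is the transfer setting the abstract describes.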


2017
Author(s):  
Oliver Adams ◽  
Adam Makarucha ◽  
Graham Neubig ◽  
Steven Bird ◽  
Trevor Cohn

Author(s):  
Liangchen Wei ◽  
Zhi-Hong Deng

Cross-language learning allows one to use training data from one language to build models for another language. Many traditional approaches require word-level alignments from parallel corpora; in this paper, we define a general bilingual training objective function that requires only a sentence-level parallel corpus. We propose a variational autoencoding approach for training bilingual word embeddings. The variational model introduces a continuous latent variable to explicitly model the underlying semantics of the parallel sentence pairs and to guide the generation of the sentence pairs. Our model restricts the bilingual word embeddings to represent words in exactly the same continuous vector space. Empirical results on the task of cross-lingual document classification show that our method is effective.
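
The objective can be written down compactly. The sketch below, assuming a PyTorch implementation with hypothetical function and variable names, is a negative ELBO for one parallel sentence pair: a shared diagonal-Gaussian latent code must reconstruct both sides of the pair, which pushes the two languages' word embeddings toward one common space. Encoder and decoder internals are left abstract.

    # Minimal sketch of a bilingual variational objective; names are illustrative.
    import torch
    import torch.nn as nn

    def bilingual_neg_elbo(mu, logvar, logits_x, x_ids, logits_y, y_ids):
        """Negative ELBO for one parallel sentence pair (lower is better)."""
        # Reconstruction of both sentences from the shared latent code z.
        ce = nn.CrossEntropyLoss(reduction='sum')
        recon = ce(logits_x.view(-1, logits_x.size(-1)), x_ids.view(-1)) \
              + ce(logits_y.view(-1, logits_y.size(-1)), y_ids.view(-1))
        # KL(q(z|x,y) || N(0, I)) for a diagonal Gaussian posterior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl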


2020
Vol 34 (05)
pp. 7944-7951
Author(s):  
Channy Hong ◽  
Jaeyeon Lee ◽  
Jungkwon Lee

As numerous modern NLP models demonstrate high performance on various tasks when trained on resource-rich language data sets such as those of English, attention has shifted to the idea of applying such learning to low-resource languages via zero-shot or few-shot cross-lingual transfer. While the most prominent previous efforts toward this goal entail the use of parallel corpora for sentence-alignment training, we seek to generalize further by assuming plausible scenarios in which such parallel data sets are unavailable. In this work, we present a novel architecture for training interlingual semantic representations on top of sentence embeddings in a completely unsupervised manner, and demonstrate its effectiveness in zero-shot cross-lingual transfer on the natural language inference task. Furthermore, we showcase a method for leveraging this framework in a few-shot scenario, and finally analyze the distributional and permutational alignment across languages of these interlingual semantic representations.
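
One way to picture the zero-shot setting described above: per-language mappers project language-specific sentence embeddings into a shared interlingual space, and a classifier trained only on English representations is applied unchanged to other languages. The sketch below assumes a PyTorch implementation; all module names, dimensions, and the premise/hypothesis feature combination are illustrative assumptions, not the paper's architecture.

    # Minimal sketch of zero-shot transfer over a shared interlingual space.
    import torch
    import torch.nn as nn

    class InterlingualMapper(nn.Module):
        """Maps a language-specific sentence embedding into the shared space."""
        def __init__(self, in_dim, shared_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, shared_dim), nn.ReLU(),
                                     nn.Linear(shared_dim, shared_dim))

        def forward(self, sent_emb):
            return self.net(sent_emb)

    class NLIClassifier(nn.Module):
        """Trained on English interlingual representations, applied as-is to other languages."""
        def __init__(self, shared_dim, n_classes=3):
            super().__init__()
            self.out = nn.Linear(4 * shared_dim, n_classes)

        def forward(self, premise, hypothesis):
            feats = torch.cat([premise, hypothesis,
                               torch.abs(premise - hypothesis),
                               premise * hypothesis], dim=-1)
            return self.out(feats)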


Digital
2021
Vol 1 (3)
pp. 145-161
Author(s):  
Kowshik Bhowmik ◽  
Anca Ralescu

This article presents a systematic literature review on quantifying the proximity between independently trained monolingual word embedding spaces. The search was carried out in the broader context of inducing bilingual lexicons from cross-lingual word embeddings, especially for low-resource languages, and the returned articles were then classified. Cross-lingual word embeddings have drawn the attention of researchers in the field of natural language processing (NLP). Although existing methods have yielded satisfactory results for resource-rich languages and languages related to them, some researchers have pointed out that the same is not true for low-resource and distant languages. In this paper, we report on methods proposed to provide better representations for low-resource and distant languages in the cross-lingual word embedding space.
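
For readers unfamiliar with the family of methods this review classifies, a common baseline is orthogonal Procrustes alignment: map one monolingual embedding space onto another from a small seed dictionary, then induce lexicon entries by nearest-neighbour search. The NumPy sketch below is an illustrative example of that baseline (variable names are hypothetical), not a method proposed in the article.

    # Illustrative Procrustes alignment and nearest-neighbour lexicon induction.
    import numpy as np

    def procrustes_align(X_src, Y_tgt):
        """Orthogonal W minimising ||X_src @ W - Y_tgt||_F; rows are seed-dictionary pairs."""
        U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
        return U @ Vt

    def induce_translation(src_vec, W, tgt_matrix):
        """Index of the target word whose embedding is closest (cosine) to the mapped source vector."""
        mapped = src_vec @ W
        sims = tgt_matrix @ mapped / (np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
        return int(np.argmax(sims))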

