Zero-Shot Neural Transfer for Cross-Lingual Entity Linking

Cross-lingual entity linking maps an entity mention in a source language to its corresponding entry in a structured knowledge base that is in a different (target) language. While previous work relies heavily on bilingual lexical resources to bridge the gap between the source and the target languages, these resources are scarce or unavailable for many low-resource languages. To address this problem, we investigate zero-shot cross-lingual entity linking, in which we assume no bilingual lexical resources are available in the source low-resource language. Specifically, we propose pivot-basedentity linking, which leverages information from a highresource “pivot” language to train character-level neural entity linking models that are transferred to the source lowresource language in a zero-shot manner. With experiments on 9 low-resource languages and transfer through a total of54 languages, we show that our proposed pivot-based framework improves entity linking accuracy 17% (absolute) on average over the baseline systems, for the zero-shot scenario.1 Further, we also investigate the use of language-universal phonological representations which improves average accuracy (absolute) by 36% when transferring between languages that use different scripts.

Download Full-text

Improving Candidate Generation for Low-resource Cross-lingual Entity Linking

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00303 ◽

2020 ◽

Vol 8 ◽

pp. 109-124

Author(s):

Shuyan Zhou ◽

Shruti Rijhwani ◽

John Wieting ◽

Jaime Carbonell ◽

Graham Neubig

Keyword(s):

State Of The Art ◽

Target Language ◽

Entity Linking ◽

Average Gain ◽

Source Language ◽

Low Resource ◽

High Resource ◽

Language Knowledge ◽

Cross Lingual ◽

Improved Model

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL. 1

Download Full-text

How to Parse Low-Resource Languages: Cross-Lingual Parsing, Target Language Annotation, or Both?

10.18653/v1/w19-7713 ◽

2019 ◽

Author(s):

Ailsa Meechan-Maddon ◽

Joakim Nivre

Keyword(s):

Target Language ◽

Low Resource ◽

Cross Lingual

Download Full-text

Massively Parallel Cross-Lingual Learning in Low-Resource Target Language Translation

10.18653/v1/w18-6324 ◽

2018 ◽

Cited By ~ 1

Author(s):

Zhong Zhou ◽

Matthias Sperber ◽

Alexander Waibel

Keyword(s):

Language Translation ◽

Target Language ◽

Massively Parallel ◽

Low Resource ◽

Cross Lingual

Download Full-text

Enhanced Meta-Learning for Cross-Lingual Named Entity Recognition with Minimal Resources

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6466 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9274-9281

Author(s):

Qianhui Wu ◽

Zijia Lin ◽

Guoxin Wang ◽

Hui Chen ◽

Börje F. Karlsson ◽

...

Keyword(s):

Learning Algorithm ◽

Named Entity Recognition ◽

Entity Recognition ◽

Target Language ◽

Test Case ◽

Named Entity ◽

Meta Learning ◽

Target Languages ◽

Cross Lingual ◽

The Given

For languages with no annotated resources, transferring knowledge from rich-resource languages is an effective solution for named entity recognition (NER). While all existing methods directly transfer from source-learned model to a target language, in this paper, we propose to fine-tune the learned model with a few similar examples given a test case, which could benefit the prediction by leveraging the structural and semantic information conveyed in such similar examples. To this end, we present a meta-learning algorithm to find a good model parameter initialization that could fast adapt to the given test case and propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. To further improve the model's generalization ability across different languages, we introduce a masking scheme and augment the loss function with an additional maximum term during meta-training. We conduct extensive experiments on cross-lingual named entity recognition with minimal resources over five target languages. The results show that our approach significantly outperforms existing state-of-the-art methods across the board.

Download Full-text

Deep Persian sentiment analysis: Cross-lingual training for low-resource languages

Journal of Information Science ◽

10.1177/0165551520962781 ◽

2020 ◽

pp. 016555152096278

Author(s):

Rouzbeh Ghasemi ◽

Seyed Arad Ashrafi Asli ◽

Saeedeh Momtazi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Low Resource ◽

Proposed Model ◽

Significant Difference ◽

Cross Lingual

With the advent of deep neural models in natural language processing tasks, having a large amount of training data plays an essential role in achieving accurate models. Creating valid training data, however, is a challenging issue in many low-resource languages. This problem results in a significant difference between the accuracy of available natural language processing tools for low-resource languages compared with rich languages. To address this problem in the sentiment analysis task in the Persian language, we propose a cross-lingual deep learning framework to benefit from available training data of English. We deployed cross-lingual embedding to model sentiment analysis as a transfer learning model which transfers a model from a rich-resource language to low-resource ones. Our model is flexible to use any cross-lingual word embedding model and any deep architecture for text classification. Our experiments on English Amazon dataset and Persian Digikala dataset using two different embedding models and four different classification networks show the superiority of the proposed model compared with the state-of-the-art monolingual techniques. Based on our experiment, the performance of Persian sentiment analysis improves 22% in static embedding and 9% in dynamic embedding. Our proposed model is general and language-independent; that is, it can be used for any low-resource language, once a cross-lingual embedding is available for the source–target language pair. Moreover, by benefitting from word-aligned cross-lingual embedding, the only required data for a reliable cross-lingual embedding is a bilingual dictionary that is available between almost all languages and the English language, as a potential source language.

Download Full-text

Multilingual Dependency Parsing: Using Machine Translated Texts Instead of Parallel Corpora

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0017 ◽

2014 ◽

Vol 102 (1) ◽

pp. 93-104

Author(s):

Ramasamy Loganathan ◽

Mareček David ◽

Žabokrtský Zdenčk

Keyword(s):

Machine Translation ◽

The Other ◽

Target Language ◽

Grammar Induction ◽

Language Resources ◽

Parallel Corpora ◽

Similar Performance ◽

Part Of Speech ◽

Target Languages ◽

Cross Lingual

Abstract This paper revisits the projection-based approach to dependency grammar induction task. Traditional cross-lingual dependency induction tasks one way or the other, depend on the existence of bitexts or target language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using only approximate resources, i.e., machine translated bitexts instead of manually created bitexts. We do this by obtaining the the source side of the text from a machine translation (MT) system and then apply transfer approaches to induce parser for the target languages. We further reduce the need for the availability of labeled target language resources by using unsupervised target tagger. We show that our approach consistently outperforms unsupervised parsers by a bigger margin (8.2% absolute), and results in similar performance when compared with delexicalized transfer parsers.

Download Full-text

A neural approach for inducing multilingual resources and natural language processing tools for low-resource languages

Natural Language Engineering ◽

10.1017/s1351324918000293 ◽

2018 ◽

Vol 25 (1) ◽

pp. 43-67

Author(s):

O. ZENNAKI ◽

N. SEMMAR ◽

L. BESACIER

Keyword(s):

Rapid Development ◽

Training Data ◽

External Information ◽

Parallel Corpora ◽

Low Resource ◽

Part Of Speech ◽

Linguistic Annotation ◽

Wide Range ◽

Target Languages ◽

Cross Lingual

AbstractThis work focuses on the rapid development of linguistic annotation tools for low-resource languages (languages that have no labeled training data). We experiment with several cross-lingual annotation projection methods using recurrent neural networks (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about target languages (one requirement is that the two languages (source and target) are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages, (c) it provides authentic multilingual taggers (one tagger forNlanguages). We investigate both uni and bidirectional RNN models and propose a method to include external information (for instance, low-level information from part-of-speech tags) in the RNN to train higher level taggers (for instance, Super Sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual part-of-speech and Super Sense taggers. We also use our approach in a weakly supervised context, and it shows an excellent potential for very low-resource settings (less than 1k training utterances).

Download Full-text

A classification approach for detecting cross-lingual biomedical term translations

Natural Language Engineering ◽

10.1017/s1351324915000431 ◽

2015 ◽

Vol 23 (1) ◽

pp. 31-51 ◽

Cited By ~ 3

Author(s):

H. HAKAMI ◽

D. BOLLEGALA

Keyword(s):

Machine Translation ◽

Feature Space ◽

Target Language ◽

Average Precision ◽

Common Features ◽

Target Languages ◽

Cross Lingual ◽

Translation Accuracy ◽

Translation Systems ◽

Technical Terms

AbstractFinding translations for technical terms is an important problem in machine translation. In particular, in highly specialized domains such as biology or medicine, it is difficult to find bilingual experts to annotate sufficient cross-lingual texts in order to train machine translation systems. Moreover, new terms are constantly being generated in the biomedical community, which makes it difficult to keep the translation dictionaries up to date for all language pairs of interest. Given a biomedical term in one language (source language), we propose a method for detecting its translations in a different language (target language). Specifically, we train a binary classifier to determine whether two biomedical terms written in two languages are translations. Training such a classifier is often complicated due to the lack of common features between the source and target languages. We propose several feature space concatenation methods to successfully overcome this problem. Moreover, we study the effectiveness of contextual and character n-gram features for detecting term translations. Experiments conducted using a standard dataset for biomedical term translation show that the proposed method outperforms several competitive baseline methods in terms of mean average precision and top-k translation accuracy.

Download Full-text

Reinforced Transformer with Cross-Lingual Distillation for Cross-Lingual Aspect Sentiment Classification

Electronics ◽

10.3390/electronics10030270 ◽

2021 ◽

Vol 10 (3) ◽

pp. 270

Author(s):

Hanqian Wu ◽

Zhike Wang ◽

Feng Qing ◽

Shoushan Li

Keyword(s):

General Purpose ◽

Sentiment Classification ◽

Training Data ◽

Target Language ◽

Source Language ◽

Domain Specific ◽

Novel Approach ◽

The Rich ◽

Target Languages ◽

Cross Lingual

Though great progress has been made in the Aspect-Based Sentiment Analysis(ABSA) task through research, most of the previous work focuses on English-based ABSA problems, and there are few efforts on other languages mainly due to the lack of training data. In this paper, we propose an approach for performing a Cross-Lingual Aspect Sentiment Classification (CLASC) task which leverages the rich resources in one language (source language) for aspect sentiment classification in a under-resourced language (target language). Specifically, we first build a bilingual lexicon for domain-specific training data to translate the aspect category annotated in the source-language corpus and then translate sentences from the source language to the target language via Machine Translation (MT) tools. However, most MT systems are general-purpose, it non-avoidably introduces translation ambiguities which would degrade the performance of CLASC. In this context, we propose a novel approach called Reinforced Transformer with Cross-Lingual Distillation (RTCLD) combined with target-sensitive adversarial learning to minimize the undesirable effects of translation ambiguities in sentence translation. We conduct experiments on different language combinations, treating English as the source language and Chinese, Russian, and Spanish as target languages. The experimental results show that our proposed approach outperforms the state-of-the-art methods on different target languages.

Download Full-text

Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

EURASIP Journal on Audio Speech and Music Processing ◽

10.1186/s13636-021-00225-4 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Zolzaya Byambadorj ◽

Ryota Nishimura ◽

Altangerel Ayush ◽

Kengo Ohta ◽

Norihide Kitaoka

Keyword(s):

Transfer Learning ◽

Data Augmentation ◽

Prediction Models ◽

Target Language ◽

Text To Speech ◽

Paired Data ◽

Low Resource ◽

Language Data ◽

Single Speaker ◽

Cross Lingual

AbstractDeep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.

Download Full-text