Data Augmentation for Low-Resource Neural Machine Translation

Language translation is essential to bring the world closer and plays a significant part in building a community among people of different linguistic backgrounds. Machine translation dramatically helps in removing the language barrier and allows easier communication among linguistically diverse communities. Due to the unavailability of resources, major languages of the world are accounted as low-resource languages. This leads to a challenging task of automating translation among various such languages to benefit indigenous speakers. This article investigates neural machine translation for the English–Assamese resource-poor language pair by tackling insufficient data and out-of-vocabulary problems. We have also proposed an approach of data augmentation-based NMT, which exploits synthetic parallel data and shows significantly improved translation accuracy for English-to-Assamese and Assamese-to-English translation and obtained state-of-the-art results.

Download Full-text

A Joint Back-Translation and Transfer Learning Method for Low-Resource Neural Machine Translation

Mathematical Problems in Engineering ◽

10.1155/2020/6140153 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Gong-Xu Luo ◽

Ya-Ting Yang ◽

Rui Dong ◽

Yan-Hong Chen ◽

Wen-Bo Zhang

Keyword(s):

Machine Translation ◽

Transfer Learning ◽

Large Scale ◽

Data Augmentation ◽

Training Methods ◽

Learning Method ◽

Neural Machine Translation ◽

Low Resource ◽

Parallel Data ◽

Back Translation

Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. It is widely recognized that data augmentation methods and transfer learning methods are both straight forward and effective ways for low-resource problems. However, existing methods, which utilize one of these methods alone, limit the capacity of NMT models for low-resource problems. In order to make full use of the advantages of existing methods and further improve the translation performance of low-resource languages, we propose a new method to perfectly integrate the back-translation method with mainstream transfer learning architectures, which can not only initialize the NMT model by transferring parameters of the pretrained models, but also generate synthetic parallel data by translating large-scale monolingual data of the target side to boost the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into the parent-child and the hierarchical transfer learning architecture. In addition, different preprocessing and training methods are explored to get better performance. Experimental results on Uygur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over the baselines that use single methods.

Download Full-text

Tackling neural machine translation in low-resource settings: a Portuguese case study

10.5753/stil.2021.17807 ◽

2021 ◽

Author(s):

Arthur T. Estrella ◽

João B. O. Souza Filho

Keyword(s):

Machine Translation ◽

Data Augmentation ◽

Word Embeddings ◽

Effective Solution ◽

Computational Power ◽

Limited Data ◽

Neural Machine Translation ◽

Low Resource

Neural machine translation (NMT) nowadays requires an increasing amount of data and computational power, so succeeding in this task with limited data and using a single GPU might be challenging. Strategies such as the use of pre-trained word embeddings, subword embeddings, and data augmentation solutions can potentially address some issues faced in low-resource experimental settings, but their impact on the quality of translations is unclear. This work evaluates some of these strategies on two low-resource experiments beyond just reporting BLEU: errors are categorized on the Portuguese-English pair with the help of a translator, considering semantic and syntactic aspects. The BPE subword approach has shown to be the most effective solution, allowing a BLEU increase of 59% p.p. compared to the standard Transformer.

Download Full-text

Extremely low-resource neural machine translation for Asian languages

Machine Translation ◽

10.1007/s10590-020-09258-6 ◽

2020 ◽

Vol 34 (4) ◽

pp. 347-382

Author(s):

Raphael Rubino ◽

Benjamin Marie ◽

Raj Dabre ◽

Atushi Fujita ◽

Masao Utiyama ◽

...

Keyword(s):

Machine Translation ◽

Data Augmentation ◽

Statistical Machine Translation ◽

Synthetic Data ◽

Parameter Tuning ◽

Data Generation ◽

Neural Machine Translation ◽

Low Resource ◽

Translation Quality ◽

Asian Languages

AbstractThis paper presents a set of effective approaches to handle extremely low-resource language pairs for self-attention based neural machine translation (NMT) focusing on English and four Asian languages. Starting from an initial set of parallel sentences used to train bilingual baseline models, we introduce additional monolingual corpora and data processing techniques to improve translation quality. We describe a series of best practices and empirically validate the methods through an evaluation conducted on eight translation directions, based on state-of-the-art NMT approaches such as hyper-parameter search, data augmentation with forward and backward translation in combination with tags and noise, as well as joint multilingual training. Experiments show that the commonly used default architecture of self-attention NMT models does not reach the best results, validating previous work on the importance of hyper-parameter tuning. Additionally, empirical results indicate the amount of synthetic data required to efficiently increase the parameters of the models leading to the best translation quality measured by automatic metrics. We show that the best NMT models trained on large amount of tagged back-translations outperform three other synthetic data generation approaches. Finally, comparison with statistical machine translation (SMT) indicates that extremely low-resource NMT requires a large amount of synthetic parallel data obtained with back-translation in order to close the performance gap with the preceding SMT approach.

Download Full-text

A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation

Information ◽

10.3390/info11050255 ◽

2020 ◽

Vol 11 (5) ◽

pp. 255

Author(s):

Yu Li ◽

Xiao Li ◽

Yating Yang ◽

Rui Dong

Keyword(s):

Machine Translation ◽

English Translation ◽

Data Augmentation ◽

Sampling Strategy ◽

Training Data ◽

Neural Machine Translation ◽

Low Resource ◽

Parallel Data ◽

Augmentation Strategy ◽

Diverse Data

One important issue that affects the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is not sufficient, which results in poor translation quality. In this paper, we propose a diversity data augmentation method that does not use extra monolingual data. We expand the training data by generating diversity pseudo parallel data on the source and target sides. To generate diversity data, the restricted sampling strategy is employed at the decoding steps. Finally, we filter and merge origin data and synthetic parallel corpus to train the final model. In the experiment, the proposed approach achieved 1.96 BLEU points in the IWSLT2014 German–English translation tasks, which was used to simulate a low-resource language. Our approach also consistently and substantially obtained 1.0 to 2.0 BLEU improvement in three other low-resource translation tasks, including English–Turkish, Nepali–English, and Sinhala–English translation tasks.

Download Full-text

Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach

10.18653/v1/2021.emnlp-main.669 ◽

2021 ◽

Author(s):

Víctor M. Sánchez-Cartagena ◽

Miquel Esplà-Gomis ◽

Juan Antonio Pérez-Ortiz ◽

Felipe Sánchez-Martínez

Keyword(s):

Machine Translation ◽

Data Augmentation ◽

Learning Approach ◽

Neural Machine Translation ◽

Low Resource ◽

Task Learning

Download Full-text

Neural machine translation with a polysynthetic low resource language

Machine Translation ◽

10.1007/s10590-020-09255-9 ◽

2020 ◽

Vol 34 (4) ◽

pp. 325-346

Author(s):

John E. Ortega ◽

Richard Castro Mamani ◽

Kyunghyun Cho

Keyword(s):

Machine Translation ◽

Neural Machine Translation ◽

Low Resource

Download Full-text

Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation

International Journal of Asian Language Processing ◽

10.1142/s2717554520500174 ◽

2021 ◽

pp. 2050017

Author(s):

Rashmini Naranpanawa ◽

Ravinga Perera ◽

Thilakshi Fonseka ◽

Uthayasanker Thayasivam

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Translation System ◽

Rare Word ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Word Level ◽

Morphologically Rich Languages

Neural machine translation (NMT) is a remarkable approach which performs much better than the Statistical machine translation (SMT) models when there is an abundance of parallel corpus. However, vanilla NMT is primarily based upon word-level with a fixed vocabulary. Therefore, low resource morphologically rich languages such as Sinhala are mostly affected by the out of vocabulary (OOV) and Rare word problems. Recent advancements in subword techniques have opened up opportunities for low resource communities by enabling open vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on English Sinhala language pair. Our models demonstrate that subword segmentation strategies along with the state-of-the-art NMT can perform remarkably when translating English sentences into a rich morphology language regardless of a large parallel corpus.

Download Full-text

Improving thai-lao neural machine translation with similarity lexicon

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-212236 ◽

2021 ◽

pp. 1-10

Author(s):

Zhiqiang Yu ◽

Yuxin Huang ◽

Junjun Guo

Keyword(s):

Machine Translation ◽

Semantic Information ◽

Neural Machine Translation ◽

Low Resource ◽

Translation Quality ◽

Decoder Architecture ◽

Baseline System ◽

Input Sentence ◽

Resource Conditions ◽

Language Pair

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions. Thai-Lao is a typical low-resource language pair of tiny parallel corpus, leading to suboptimal NMT performance on it. However, Thai and Lao have considerable similarities in linguistic morphology and have bilingual lexicon which is relatively easy to obtain. To use this feature, we first build a bilingual similarity lexicon composed of pairs of similar words. Then we propose a novel NMT architecture to leverage the similarity between Thai and Lao. Specifically, besides the prevailing sentence encoder, we introduce an extra similarity lexicon encoder into the conventional encoder-decoder architecture, by which the semantic information carried by the similarity lexicon can be represented. We further provide a simple mechanism in the decoder to balance the information representations delivered from the input sentence and the similarity lexicon. Our approach can fully exploit linguistic similarity carried by the similarity lexicon to improve translation quality. Experimental results demonstrate that our approach achieves significant improvements over the state-of-the-art Transformer baseline system and previous similar works.

Download Full-text

Simple measures of bridging lexical divergence help unsupervised neural machine translation for low-resource languages

Machine Translation ◽

10.1007/s10590-021-09292-y ◽

2022 ◽

Author(s):

Jyotsana Khatri ◽

Rudra Murthy ◽

Tamali Banerjee ◽

Pushpak Bhattacharyya

Keyword(s):

Machine Translation ◽

Neural Machine Translation ◽

Low Resource

Download Full-text