Improving Stylized Neural Machine Translation with Iterative Dual Knowledge Transfer

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/547 ◽

2021 ◽

Author(s):

Xuanxuan Wu ◽

Jian Liu ◽

Xinjie Li ◽

Jinan Xu ◽

Yufeng Chen ◽

...

Keyword(s):

Knowledge Transfer ◽

Machine Translation ◽

Large Scale ◽

Training Data ◽

Transfer Model ◽

Paired Data ◽

Style Transfer ◽

Neural Machine Translation ◽

Translation Model ◽

Parallel Data

Stylized neural machine translation (NMT) aims to translate sentences of one style into sentences of another style, which is essential for the application of machine translation in a real-world scenario. However, a major challenge in this task is the scarcity of high-quality parallel data which is stylized paired. To address this problem, we propose an iterative dual knowledge transfer framework that utilizes informal training data of machine translation and formality style transfer data to create large-scale stylized paired data, for the training of stylized machine translation model. Specifically, we perform bidirectional knowledge transfer between translation model and text style transfer model iteratively through knowledge distillation. Then, we further propose a data-refinement module to process the noisy synthetic parallel data generated during knowledge transfer. Experiment results demonstrate the effectiveness of our method, achieving an improvement over the existing best model by 5 BLEU points on MTFC dataset. Meanwhile, extensive analyses illustrate our method can also improve the accuracy of formality style transfer.

Download Full-text

The Development of a Comprehensive Data Set for Systematic Studies of Machine Translation

Multilingual Facilitation ◽

10.31885/9789515150257.22 ◽

2021 ◽

pp. 248-262

Author(s):

Jörg Tiedemann

Keyword(s):

Machine Translation ◽

Training Data ◽

Sufficient Information ◽

Data Set ◽

Neural Machine Translation ◽

High Resource ◽

Language Groups ◽

Comprehensive Data ◽

Parallel Data ◽

Initial Results

This paper presents our on-going efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants. We present the structure of the data set and demonstrate its use for systematic studies based on baseline experiments with multilingual neural machine translation between Finno-Ugric languages and other language groups. Our initial results show the capabilities of training effective multilingual translation models with skewed training data but also stress the shortcomings with low-resource settings and the difficulties to obtain sufficient information through straightforward transfer from related languages.

Download Full-text

A Joint Back-Translation and Transfer Learning Method for Low-Resource Neural Machine Translation

Mathematical Problems in Engineering ◽

10.1155/2020/6140153 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Gong-Xu Luo ◽

Ya-Ting Yang ◽

Rui Dong ◽

Yan-Hong Chen ◽

Wen-Bo Zhang

Keyword(s):

Machine Translation ◽

Transfer Learning ◽

Large Scale ◽

Data Augmentation ◽

Training Methods ◽

Learning Method ◽

Neural Machine Translation ◽

Low Resource ◽

Parallel Data ◽

Back Translation

Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. It is widely recognized that data augmentation methods and transfer learning methods are both straight forward and effective ways for low-resource problems. However, existing methods, which utilize one of these methods alone, limit the capacity of NMT models for low-resource problems. In order to make full use of the advantages of existing methods and further improve the translation performance of low-resource languages, we propose a new method to perfectly integrate the back-translation method with mainstream transfer learning architectures, which can not only initialize the NMT model by transferring parameters of the pretrained models, but also generate synthetic parallel data by translating large-scale monolingual data of the target side to boost the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into the parent-child and the hierarchical transfer learning architecture. In addition, different preprocessing and training methods are explored to get better performance. Experimental results on Uygur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over the baselines that use single methods.

Download Full-text

Neural machine translation of low-resource languages using SMT phrase pair injection

Natural Language Engineering ◽

10.1017/s1351324920000303 ◽

2020 ◽

pp. 1-22

Author(s):

Sukanta Sen ◽

Mohammed Hasanuzzaman ◽

Asif Ekbal ◽

Pushpak Bhattacharyya ◽

Andy Way

Keyword(s):

Machine Translation ◽

Large Scale ◽

Production Systems ◽

Statistical Machine Translation ◽

Training Data ◽

Original Training ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Better Than

Abstract Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires high-quality large-scale parallel corpus, and it is not always possible to have sufficiently large corpus as it requires time, money, and professionals. Hence, many existing large-scale parallel corpus are limited to the specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in low-resource scenario without using any additional data. Our approach aims at augmenting the original training data by means of parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English, Hindi–Bengali datasets for Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show the improvements in the range of 1.38–15.36 BiLingual Evaluation Understudy points over the baseline systems. Experiments show that transformer models perform better than GRU models in low-resource scenarios. In addition to that, we also find that our proposed method outperforms SMT—which is known to work better than the neural models in low-resource scenarios—for some translation directions. In order to further show the effectiveness of our proposed model, we also employ our approach to another interesting NMT task, for example, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.

Download Full-text

Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora

Applied Sciences ◽

10.3390/app9102036 ◽

2019 ◽

Vol 9 (10) ◽

pp. 2036

Author(s):

Jinyi Zhang ◽

Tadahiro Matsumoto

Keyword(s):

Machine Translation ◽

Scientific Paper ◽

Training Data ◽

Word Alignment ◽

Sentence Pair ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Quality ◽

Parallel Data ◽

Source Sentence

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method, which has two variations: one is for all language pairs, and the other is for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. The experiment results of the Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that can reproduce our proposed method.

Download Full-text

Neural machine translation system for the Kazakh language based on synthetic corpora

MATEC Web of Conferences ◽

10.1051/matecconf/201925203006 ◽

2019 ◽

Vol 252 ◽

pp. 03006

Author(s):

Ualsher Tukeyev ◽

Aidana Karibayeva ◽

Balzhan Abduali

Keyword(s):

Machine Translation ◽

Training Data ◽

Translation System ◽

Natural Languages ◽

Neural Machine Translation ◽

Translation Quality ◽

Parallel Data ◽

Machine Translation System ◽

Turkic Languages

The lack of big parallel data is present for the Kazakh language. This problem seriously impairs the quality of machine translation from and into Kazakh. This article considers the neural machine translation of the Kazakh language on the basis of synthetic corpora. The Kazakh language belongs to the Turkic languages, which are characterised by rich morphology. Neural machine translation of natural languages requires large training data. The article will show the model for the creation of synthetic corpora, namely the generation of sentences based on complete suffixes for the Kazakh language. The novelty of this approach of the synthetic corpora generation for the Kazakh language is the generation of sentences on the basis of the complete system of suffixes of the Kazakh language. By using generated synthetic corpora we are improving the translation quality in neural machine translation of Kazakh-English and Kazakh-Russian pairs.

Download Full-text

Tag-less Back-Translation

10.21203/rs.3.rs-465941/v1 ◽

2021 ◽

Author(s):

Idris Abdulmumin ◽

Bashir Shehu Galadanci ◽

Aliyu Garba

Keyword(s):

Machine Translation ◽

Domain Adaptation ◽

Fine Tuning ◽

Huge Amount ◽

Neural Machine Translation ◽

Translation Model ◽

Parallel Data ◽

Back Translation ◽

Authentic Data ◽

Target Side

Abstract An effective method to generate a large number of parallel sentences for training improved neural machine translation (NMT) systems is the use of the back-translations of the target-side monolingual data. The standard back-translation method has been shown to be unable to efficiently utilize the available huge amount of existing monolingual data because of the inability of translation models to differentiate between the authentic and synthetic parallel data during training. Tagging, or using gates, has been used to enable translation models to distinguish between synthetic and authentic data, improving standard back-translation and also enabling the use of iterative back-translation on language pairs that underperformed using standard back-translation. In this work, we approach back-translation as a domain adaptation problem, eliminating the need for explicit tagging. In the approach - tag-less back-translation - the synthetic and authentic parallel data are treated as out-of-domain and in-domain data respectively and, through pre-training and fine-tuning, the translation model is shown to be able to learn more efficiently from them during training. Experimental results have shown that the approach outperforms the standard and tagged back-translation approaches on low resource English-Vietnamese and English-German neural machine translation.

Download Full-text

Polygon-Net: A General Framework for Jointly Boosting Multiple Unsupervised Neural Machine Translation Models

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/739 ◽

2019 ◽

Cited By ~ 1

Author(s):

Chang Xu ◽

Tao Qin ◽

Gang Wang ◽

Tie-Yan Liu

Keyword(s):

Machine Translation ◽

Loss Function ◽

General Framework ◽

Large Scale ◽

Great Success ◽

Neural Machine Translation ◽

Low Resource ◽

Parallel Data ◽

Benchmark Datasets ◽

First Time

Neural machine translation (NMT) has achieved great success. However, collecting large-scale parallel data for training is costly and laborious. Recently, unsupervised neural machine translation has attracted more and more attention, due to its demand for monolingual corpus only, which is common and easy to obtain, and its great potentials for the low-resource or even zero-resource machine translation. In this work, we propose a general framework called Polygon-Net, which leverages multi auxiliary languages for jointly boosting unsupervised neural machine translation models. Specifically, we design a novel loss function for multi-language unsupervised neural machine translation. In addition, different from the literature that just updating one or two models individually, Polygon-Net enables multiple unsupervised models in the framework to update in turn and enhance each other for the first time. In this way, multiple unsupervised translation models are associated with each other for training to achieve better performance. Experiments on the benchmark datasets including UN Corpus and WMT show that our approach significantly improves over the two-language based methods, and achieves better performance with more languages introduced to the framework.

Download Full-text

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/685 ◽

2019 ◽

Cited By ~ 1

Author(s):

Shizhe Chen ◽

Qin Jin ◽

Jianlong Fu

Keyword(s):

Machine Translation ◽

Large Scale ◽

Learning Approach ◽

Model Learning ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Model ◽

Word Level ◽

Sentence Level ◽

Progressive Learning

The neural machine translation model has suffered from the lack of large-scale parallel corpora. In contrast, we humans can learn multi-lingual translations even without parallel texts by referring our languages to the external world. To mimic such human learning behavior, we employ images as pivots to enable zero-resource translation learning. However, a picture tells a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders the translation model learning. In this work, we propose a progressive learning approach for image-pivoted zero-resource machine translation. Since words are less diverse when grounded in the image, we first learn word-level translation with image pivots, and then progress to learn the sentence-level translation by utilizing the learned word translation to suppress noises in image-pivoted multi-lingual sentences. Experimental results on two widely used image-pivot translation datasets, IAPR-TC12 and Multi30k, show that the proposed approach significantly outperforms other state-of-the-art methods.

Download Full-text

A Survey on Low-Resource Neural Machine Translation

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/629 ◽

2021 ◽

Author(s):

Rui Wang ◽

Xu Tan ◽

Renqian Luo ◽

Tao Qin ◽

Tie-Yan Liu

Keyword(s):

Machine Translation ◽

Large Scale ◽

State Of The Art ◽

Neural Machine Translation ◽

Modal Data ◽

Low Resource ◽

Resource Setting ◽

Low Resource Setting ◽

Parallel Data ◽

Target Languages

Neural approaches have achieved state-of-the-art accuracy on machine translation but suffer from the high cost of collecting large scale parallel data. Thus, a lot of research has been conducted for neural machine translation (NMT) with very limited parallel data, i.e., the low-resource setting. In this paper, we provide a survey for low-resource NMT and classify related works into three categories according to the auxiliary data they used: (1) exploiting monolingual data of source and/or target languages, (2) exploiting data from auxiliary languages, and (3) exploiting multi-modal data. We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms, and help industry practitioners to choose appropriate algorithms for their applications.

Download Full-text

A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation

Information ◽

10.3390/info11050255 ◽

2020 ◽

Vol 11 (5) ◽

pp. 255

Author(s):

Yu Li ◽

Xiao Li ◽

Yating Yang ◽

Rui Dong

Keyword(s):

Machine Translation ◽

English Translation ◽

Data Augmentation ◽

Sampling Strategy ◽

Training Data ◽

Neural Machine Translation ◽

Low Resource ◽

Parallel Data ◽

Augmentation Strategy ◽

Diverse Data

One important issue that affects the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is not sufficient, which results in poor translation quality. In this paper, we propose a diversity data augmentation method that does not use extra monolingual data. We expand the training data by generating diversity pseudo parallel data on the source and target sides. To generate diversity data, the restricted sampling strategy is employed at the decoding steps. Finally, we filter and merge origin data and synthetic parallel corpus to train the final model. In the experiment, the proposed approach achieved 1.96 BLEU points in the IWSLT2014 German–English translation tasks, which was used to simulate a low-resource language. Our approach also consistently and substantially obtained 1.0 to 2.0 BLEU improvement in three other low-resource translation tasks, including English–Turkish, Nepali–English, and Sinhala–English translation tasks.

Download Full-text