Augmenting Neural Machine Translation through Round-Trip Training Approach

2019 ◽  
Vol 9 (1) ◽  
pp. 268-278 ◽  
Author(s):  
Benyamin Ahmadnia ◽  
Bonnie J. Dorr

The quality of Neural Machine Translation (NMT), as a data-driven approach, depends heavily on the quantity, quality, and relevance of the training dataset. Such approaches have achieved promising results in bilingually high-resource scenarios but are inadequate in low-resource conditions. NMT systems generally learn from millions of words of bilingual training data. However, the human labeling process is costly and time-consuming. In this paper, we describe a round-trip training approach to bilingual low-resource NMT that takes advantage of monolingual datasets to address the training data bottleneck, thus augmenting translation quality. We conduct detailed experiments on English-Spanish as a high-resource language pair and on Persian-Spanish as a low-resource language pair. Experimental results show that this competitive approach outperforms the baseline systems and improves translation quality.
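The abstract does not spell out the training loop, so the following is only a minimal sketch of the general round-trip idea: a monolingual sentence is translated to the other language and back, and how well the original is recovered serves as a signal that needs no reference translation. The word-for-word tables `EN_TO_ES` and `ES_TO_EN` and the function names are hypothetical stand-ins for trained NMT models, not the authors' implementation.

```python
# Toy word-for-word "translators" (assumption: real systems would be NMT models).
EN_TO_ES = {"the": "el", "cat": "gato", "sleeps": "duerme"}
ES_TO_EN = {"el": "the", "gato": "cat", "duerme": "sleeps"}


def translate(sentence, table):
    """Translate token by token; unknown tokens map to a placeholder."""
    return " ".join(table.get(tok, "<unk>") for tok in sentence.split())


def round_trip_score(sentence, forward, backward):
    """Fraction of source tokens recovered after source -> target -> source."""
    target = translate(sentence, forward)
    restored = translate(target, backward)
    src_tokens = sentence.split()
    hits = sum(a == b for a, b in zip(src_tokens, restored.split()))
    return hits / len(src_tokens) if src_tokens else 0.0
```

In a real system a score of this kind would be turned into a training signal for both translation directions; here it only illustrates that monolingual text alone can expose translation failures.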

2021 ◽  
pp. 1-10
Author(s):  
Zhiqiang Yu ◽  
Yuxin Huang ◽  
Junjun Guo

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions. Thai-Lao is a typical low-resource language pair with only a tiny parallel corpus, which leads to suboptimal NMT performance. However, Thai and Lao have considerable morphological similarities, and a bilingual lexicon for the pair is relatively easy to obtain. To exploit this, we first build a bilingual similarity lexicon composed of pairs of similar words. Then we propose a novel NMT architecture to leverage the similarity between Thai and Lao. Specifically, besides the prevailing sentence encoder, we introduce an extra similarity lexicon encoder into the conventional encoder-decoder architecture, by which the semantic information carried by the similarity lexicon can be represented. We further provide a simple mechanism in the decoder to balance the information representations delivered from the input sentence and the similarity lexicon. Our approach can fully exploit the linguistic similarity carried by the similarity lexicon to improve translation quality. Experimental results demonstrate that our approach achieves significant improvements over the state-of-the-art Transformer baseline system and previous similar works.


Information ◽  
2019 ◽  
Vol 10 (9) ◽  
pp. 273
Author(s):  
Rejwanul Haque ◽  
Mohammed Hasanuzzaman ◽  
Andy Way

Evaluating term translation quality in machine translation (MT), which is usually done by domain experts, is a time-consuming and expensive task. In fact, such manual evaluation is infeasible in an industrial setting where customised MT systems often need to be updated for many reasons (e.g., availability of new training data, leading MT techniques). To the best of our knowledge, there is as yet no publicly available solution for evaluating terminology translation in MT automatically. Hence, there is a genuine need for a faster and less expensive solution to this problem, which could help end-users to identify term translation problems in MT instantly. This study presents a faster and less expensive strategy for evaluating terminology translation in MT. High correlations of our evaluation results with human judgements demonstrate the effectiveness of the proposed solution. The paper also introduces a classification framework, TermCat, that can automatically classify term translation-related errors and expose specific problems in relation to terminology translation in MT. We carried out our experiments with a low-resource language pair, English–Hindi, and found that our classifier, whose accuracy varies across the translation directions, error classes, the morphological nature of the languages, and MT models, generally performs competently in the terminology translation classification task.


2021 ◽  
pp. 248-262
Author(s):  
Jörg Tiedemann

This paper presents our ongoing efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants. We present the structure of the data set and demonstrate its use for systematic studies based on baseline experiments with multilingual neural machine translation between Finno-Ugric languages and other language groups. Our initial results show that effective multilingual translation models can be trained with skewed training data, but they also highlight the shortcomings in low-resource settings and the difficulty of obtaining sufficient information through straightforward transfer from related languages.


Author(s):  
Raj Dabre ◽  
Atsushi Fujita

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves the sequence generation quality, it also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all layers, thereby obtaining a recurrently stacked sequence-to-sequence model. We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 different layers. We also show how our method can benefit from a prevalent way of improving NMT, i.e., extending training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attentions of models that use recurrently stacked layers and models that do not. Finally, we explore the limits of parameter sharing, where we share even the parameters between the encoder and decoder in addition to recurrent stacking of layers.
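The parameter saving from recurrent stacking can be illustrated with a small sketch. This is not the authors' model: the layer here is a plain `tanh(Wx + b)` feed-forward layer, and `make_layer`, `run_stack`, `count_parameters`, and the dimensions are all assumptions chosen for illustration. It only shows that applying one layer 6 times keeps the depth of computation while storing one layer's worth of parameters.

```python
import math
import random


def make_layer(dim, rng):
    """One feed-forward layer: a dim x dim weight matrix plus a bias vector."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return {"W": weights, "b": bias}


def apply_layer(layer, x):
    """Compute tanh(W x + b) for a single input vector x."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(layer["W"], layer["b"])
    ]


def run_stack(layers, x):
    """Feed x through the layers in order."""
    for layer in layers:
        x = apply_layer(layer, x)
    return x


def count_parameters(layers):
    """Count parameters, counting each distinct layer object only once."""
    unique = {id(layer): layer for layer in layers}
    return sum(len(l["W"]) * len(l["W"][0]) + len(l["b"]) for l in unique.values())


rng = random.Random(0)
dim, depth = 8, 6
vanilla = [make_layer(dim, rng) for _ in range(depth)]  # 6 different layers
shared = [make_layer(dim, rng)] * depth                 # 1 layer applied 6 times
```

With these toy sizes, `count_parameters(vanilla)` is six times `count_parameters(shared)`, while `run_stack` performs the same number of layer applications in both cases.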


2020 ◽  
pp. 1-22
Author(s):  
Sukanta Sen ◽  
Mohammed Hasanuzzaman ◽  
Asif Ekbal ◽  
Pushpak Bhattacharyya ◽  
Andy Way

Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and a sufficiently large corpus is not always available because building one requires time, money, and professionals. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data with parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BLEU (BiLingual Evaluation Understudy) points over the baseline systems. Experiments also show that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT (which is known to work better than neural models in low-resource scenarios) for some translation directions. To further show the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, namely old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation on this task shows significant improvement over the baseline NMT.


Author(s):  
Tetiana Korolova ◽  
Natalya Zhmayeva ◽  
Yulia Kolchah

The modern translation services industry singles out two levels of translation quality that can be reached through machine translation (MT) post-editing: good-enough quality, which conveys the main information of the source message while admitting stylistic, syntactic, and morphological flaws, and quality similar or equal to human translation, which is a fully polished version of the post-edited text, ready to be published. An overview of MT systems leads us to consider Google Neural Machine Translation (GNMT), which is based on the most modern training methods, to be the most powerful one. When analyzing texts translated by means of Google Translate, the following problems were identified: distortion of the referential meaning of the source message; incorrect choice of variant equivalents; lack of term harmonization; failure to render abbreviations; disagreement of linguistic units in person, number, and case; incorrect choice of functional correspondences when rendering absolute constructions and gerund and participial constructions; literal translation of phrases; and lack of transformations of the grammatical structure of the source message (additions, rearrangements). Taking into account the classified issues of machine translation as well as the levels of post-editing quality, post-editing of the texts translated by means of MT is carried out, and requirements and recommendations are provided for post-editing MT output within the language pair under analysis, with respect to the peculiarities of the specific MT system and the type of translated texts.


Author(s):  
Kamal Kumar Gupta ◽  
Sukanta Sen ◽  
Rejwanul Haque ◽  
Asif Ekbal ◽  
Pushpak Bhattacharyya ◽  
...  

2019 ◽  
Vol 9 (10) ◽  
pp. 2036
Author(s):  
Jinyi Zhang ◽  
Tadahiro Matsumoto

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method, which has two variations: one is for all language pairs, and the other is for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. Experimental results for Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that can reproduce our proposed method.
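Steps (1) to (3) can be sketched roughly as follows. This is a simplified illustration, not the authors' released code: it assumes the i-th source partial sentence aligns with the i-th target partial sentence (the paper instead determines split points from word alignment), and `back_translate` is a hypothetical callable standing in for a target-to-source translation model.

```python
import re

# Split after common Western and CJK clause-level punctuation marks.
_SPLIT = re.compile(r"(?<=[,;，、；])")


def split_partial(sentence):
    """Step (1): split a sentence into partial sentences, keeping punctuation."""
    return [part for part in _SPLIT.split(sentence) if part.strip()]


def augment_pair(source, target, back_translate):
    """Return pseudo-parallel pairs built from one long sentence pair.

    Assumes the i-th source partial aligns with the i-th target partial;
    returns an empty list when the partial counts differ or nothing splits.
    """
    src_parts = split_partial(source)
    tgt_parts = split_partial(target)
    if len(src_parts) != len(tgt_parts) or len(src_parts) < 2:
        return []
    pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        # Steps (2) and (3): back-translate one target partial and swap it in.
        pseudo = src_parts[:i] + [back_translate(tgt_part)] + src_parts[i + 1:]
        pairs.append(("".join(pseudo), target))
    return pairs
```

Each returned pair keeps the original target sentence and differs from the original source in exactly one partial sentence, so a pair with k partial sentences yields k pseudo-source sentences under this simplification.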


2019 ◽  
Vol 252 ◽  
pp. 03006
Author(s):  
Ualsher Tukeyev ◽  
Aidana Karibayeva ◽  
Balzhan Abduali

Large parallel corpora are lacking for the Kazakh language. This problem seriously impairs the quality of machine translation from and into Kazakh. This article considers neural machine translation of the Kazakh language on the basis of synthetic corpora. The Kazakh language belongs to the Turkic languages, which are characterised by rich morphology. Neural machine translation of natural languages requires large amounts of training data. The article presents a model for the creation of synthetic corpora, namely the generation of sentences based on the complete system of suffixes of the Kazakh language, which is the novelty of this approach. By using the generated synthetic corpora, we improve the translation quality of neural machine translation for the Kazakh-English and Kazakh-Russian pairs.


2021 ◽  
Vol 11 (7) ◽  
pp. 2948
Author(s):  
Lucia Benkova ◽  
Dasa Munkova ◽  
Ľubomír Benko ◽  
Michal Munk

This study compares phrase-based statistical machine translation (SMT) systems and neural machine translation (NMT) systems for the English-Slovak language pair, using automatic metrics for translation quality evaluation. As the statistical approach is the predecessor of neural machine translation, it was assumed that the neural network approach would generate results of better quality. An experiment was performed using residuals to compare the scores of automatic metrics of accuracy (BLEU_n) for statistical machine translation with those for neural machine translation. The results confirmed the assumption of better neural machine translation quality regardless of the system used. There were statistically significant differences between SMT and NMT in favor of NMT based on all BLEU_n scores. Neural machine translation achieved a better quality of translation of journalistic texts from English into Slovak, regardless of whether the system was trained on general texts, such as Google Translate, or on specific ones, such as the European Commission’s (EC’s) tool, which was trained on a specific domain.
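As a rough illustration of what an n-gram accuracy score measures, the sketch below computes the clipped (modified) n-gram precision that underlies BLEU for a single hypothesis and a single reference. The abstract does not state exactly how BLEU_n was computed, so treating it as plain clipped precision with whitespace tokenization, without a brevity penalty or corpus-level aggregation, is an assumption made here for brevity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference, which penalizes repeated words.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total
```

Comparing two systems then amounts to computing such scores for each system's output on the same sentences for n = 1 to 4 and testing whether the differences are statistically significant.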

