Lost in Back-Translation: Emotion Preservation in Neural Machine Translation

Parallel Corpora ◽

Translation Quality ◽

Sequence Generation ◽

Sequence Modeling ◽

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves the sequence generation quality, this also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all layers thereby leading to a recurrently stacked sequence-to-sequence model. We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single-layer 6 times, despite its significantly fewer parameters, approaches that of a model that stacks 6 different layers. We also show how our method can benefit from a prevalent way for improving NMT, i.e., extending training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attentions of models that use recurrently stacked layers and models that do not. Finally, we explore the limits of parameter sharing where we share even the parameters between the encoder and decoder in addition to recurrent stacking of layers.

Neural Machine Translation Based on Back-Translation for Multilingual Translation Evaluation Task

Communications in Computer and Information Science - Machine Translation ◽

10.1007/978-981-33-6162-1_13 ◽

2020 ◽

pp. 132-141

Author(s):

Siyu Lai ◽

Yueting Yang ◽

Jin’an Xu ◽

Yufeng Chen ◽

Hui Huang

Keyword(s):

Machine Translation ◽

Evaluation Task ◽

A Joint Back-Translation and Transfer Learning Method for Low-Resource Neural Machine Translation

Mathematical Problems in Engineering ◽

10.1155/2020/6140153 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Gong-Xu Luo ◽

Ya-Ting Yang ◽

Rui Dong ◽

Yan-Hong Chen ◽

Wen-Bo Zhang

Keyword(s):

Machine Translation ◽

Transfer Learning ◽

Large Scale ◽

Data Augmentation ◽

Training Methods ◽

Learning Method ◽

Low Resource ◽

Parallel Data ◽

Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. It is widely recognized that data augmentation methods and transfer learning methods are both straight forward and effective ways for low-resource problems. However, existing methods, which utilize one of these methods alone, limit the capacity of NMT models for low-resource problems. In order to make full use of the advantages of existing methods and further improve the translation performance of low-resource languages, we propose a new method to perfectly integrate the back-translation method with mainstream transfer learning architectures, which can not only initialize the NMT model by transferring parameters of the pretrained models, but also generate synthetic parallel data by translating large-scale monolingual data of the target side to boost the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into the parent-child and the hierarchical transfer learning architecture. In addition, different preprocessing and training methods are explored to get better performance. Experimental results on Uygur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over the baselines that use single methods.

Unsupervised Neural Machine Translation with SMT as Posterior Regularization

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.3301241 ◽

2019 ◽

Vol 33 ◽

pp. 241-248 ◽

Cited By ~ 3

Author(s):

Shuo Ren ◽

Zhirui Zhang ◽

Shujie Liu ◽

Ming Zhou ◽

Shuai Ma

Keyword(s):

Machine Translation ◽

Language Models ◽

Translation Process ◽

Weak Supervision ◽

Back Translation ◽

Negative Effect ◽

Model Training ◽

Cross Lingual ◽

Pseudo Data

Without real bilingual corpus available, unsupervised Neural Machine Translation (NMT) typically requires pseudo parallel data generated with the back-translation method for the model training. However, due to weak supervision, the pseudo data inevitably contain noises and errors that will be accumulated and reinforced in the subsequent training process, leading to bad translation performance. To address this issue, we introduce phrase based Statistic Machine Translation (SMT) models which are robust to noisy data, as posterior regularizations to guide the training of unsupervised NMT models in the iterative back-translation process. Our method starts from SMT models built with pre-trained language models and word-level translation tables inferred from cross-lingual embeddings. Then SMT and NMT models are optimized jointly and boost each other incrementally in a unified EM framework. In this way, (1) the negative effect caused by errors in the iterative back-translation process can be alleviated timely by SMT filtering noises from its phrase tables; meanwhile, (2) NMT can compensate for the deficiency of fluency inherent in SMT. Experiments conducted on en-fr and en-de translation tasks show that our method outperforms the strong baseline and achieves new state-of-the-art unsupervised machine translation performance.

Improving Neural Machine Translation by Filtering Synthetic Parallel Data

Entropy ◽

10.3390/e21121213 ◽

2019 ◽

Vol 21 (12) ◽

pp. 1213

Author(s):

Guanghao Xu ◽

Youngjoong Ko ◽

Jungyun Seo

Keyword(s):

Machine Translation ◽

Synthetic Data ◽

Similarity Score ◽

Target Language ◽

Novel Approach ◽

Parallel Data ◽

Back Translation ◽

Translation Errors ◽

Training State

Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 Bleu points for tst2016 and tst2017, respectively.

Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation

10.18653/v1/d18-1040 ◽

2018 ◽

Cited By ~ 2

Author(s):

Marzieh Fadaee ◽

Christof Monz

Keyword(s):

Machine Translation ◽

Communications in Computer and Information Science - Information and Communication Technology and Applications ◽

Enhanced Back-Translation for Low Resource Neural Machine Translation Using Self-training

10.1007/978-3-030-69143-1_28 ◽

2021 ◽

pp. 355-371

Author(s):

Idris Abdulmumin ◽

Bashir Shehu Galadanci ◽

Abubakar Isa

Keyword(s):

Machine Translation ◽

Low Resource ◽

More Data Is Better Only to Some Level, After Which It Is Harmful: Profiling Neural Machine Translation Self-learning with Back-Translation

10.1007/978-3-030-86230-5_57 ◽

2021 ◽

pp. 727-738

Author(s):

Rodrigo Santos ◽

João Silva ◽

António Branco

Keyword(s):

Machine Translation ◽

Back Translation ◽

Self Learning

PHƯƠNG PHÁP TĂNG CƯỜNG DỮ LIỆU HUẤN LUYỆN DỊCH MÁY THỐNG KÊ CẶP NGÔN NGỮ VIỆT-ANH BẰNG KỸ THUẬT BACK - TRANSLATION VÀ LỰA CHỌN THÍCH NGHI

Journal of Military Science and Technology ◽

10.54939/1859-1043.j.mst...23-32 ◽

2020 ◽

pp. 23-32

Author(s):

Đặng Thanh Quyền

Keyword(s):

Machine Translation ◽

Dịch ngược (Back-translation - BT) đã được sử dụng rộng rãi và trở thành một trong những kỹ thuật tiêu chuẩn để tăng cường dữ liệu trong dịch máy bằng nơ-ron (Neural Machine Translation - NMT). Việc sử dụng BT đã được chứng minh là có hiệu quả trong việc cải thiện hiệu suất dịch thuật, đặc biệt đối với các trường hợp tài nguyên hạn chế. Hiện nay, phần lớn các nghiên cứu liên quan đến BT chủ yếu tập trung vào các ngôn ngữ châu Âu, chỉ một số ít nghiên cứu về dịch thuật ngôn ngữ ở các khu vực khác trên thế giới. Trong bài báo này, chúng tôi nghiên cứu, áp dụng BT để tăng chất lượng dữ liệu huấn luyện cho dịch máy thống kê cặp ngôn ngữ Việt-Anh (là cặp ngôn ngữ có nguồn dữ liệu hạn chế). Phương pháp đề xuất sử dụng ngôn ngữ trung gian cho BT là tiếng Đức. Các câu tiếng Anh ở dữ liệu huấn luyện ban đầu được dịch sang tiếng Đức sau đó dịch trở lại từ tiếng Đức sang tiếng Anh để tạo ra các câu tiếng Anh mới có nghĩa tương đương với các câu gốc. Một số độ đo thích nghi được đề xuất để đánh giá tập câu tiếng Anh thu được, lựa chọn các câu được đánh giá “tốt” để thêm vào dữ liệu huấn luyện ban đầu. Kết quả thử nghiệm trên hệ thống dịch máy thống kê MOSES với cặp ngôn ngữ Việt-Anh cho thấy nếu thêm toàn bộ câu được sinh ra bởi BT vào tập dữ liệu huấn luyện mà không áp dụng việc lựa chọn dữ liệu thì kết quả không tăng lên so với việc sử dụng dữ liệu huấn luyện ban đầu. Trong khi đó, việc áp dụng các kỹ thuật lựa chọn dữ liệu thích nghi đạt được kết quả BLEU tăng lên với kết quả tốt nhất tăng thêm 0.8 điểm BLEU.

Neural machine translation of clinical texts between long distance languages

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocz110 ◽

2019 ◽

Vol 26 (12) ◽

pp. 1478-1487 ◽

Cited By ~ 1

Author(s):

Xabier Soto ◽

Olatz Perez-de-Viñaspre ◽

Gorka Labaka ◽

Maite Oronoz

Keyword(s):

Machine Translation ◽

Snomed Ct ◽

Long Distance ◽

Health Records ◽

Training Corpus ◽

Human Evaluation ◽

Back Translation ◽

Translation Systems ◽

Clinical Domain

Abstract Objective To analyze techniques for machine translation of electronic health records (EHRs) between long distance languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts or health records in Basque and Spanish. Materials and Methods We trained recurrent neural networks on an out-of-domain corpus with different hyperparameter values. Subsequently, we used the optimal configuration to evaluate machine translation of EHR templates between Basque and Spanish, using manual translations of the Basque templates into Spanish as a standard. We successively added to the training corpus clinical resources, including a Spanish-Basque dictionary derived from resources built for the machine translation of the Spanish edition of SNOMED CT into Basque, artificial sentences in Spanish and Basque derived from frequently occurring relationships in SNOMED CT, and Spanish monolingual EHRs. Apart from calculating bilingual evaluation understudy (BLEU) values, we tested the performance in the clinical domain by human evaluation. Results We achieved slight improvements from our reference system by tuning some hyperparameters using an out-of-domain bilingual corpus, obtaining 10.67 BLEU points for Basque-to-Spanish clinical domain translation. The inclusion of clinical terminology in Spanish and Basque and the application of the back-translation technique on monolingual EHRs significantly improved the performance, obtaining 21.59 BLEU points. This was confirmed by the human evaluation performed by 2 clinicians, ranking our machine translations close to the human translations. Discussion We showed that, even after optimizing the hyperparameters out-of-domain, the inclusion of available resources from the clinical domain and applied methods were beneficial for the described objective, managing to obtain adequate translations of EHR templates. Conclusion We have developed a system which is able to properly translate health record templates from Basque to Spanish without making use of any bilingual corpus of clinical texts or health records.