Sentence Augmentation for Language Translation Using GPT-2

Electronics, 2021, Vol 10 (24), pp. 3082
Author(s): Ranto Sawai, Incheon Paik, Ayato Kuwana

Data augmentation has recently become an important method for improving performance in deep learning. It is also a significant issue in machine translation, where various innovations such as back-translation and noising have been developed. In particular, state-of-the-art model architectures such as BERT-fused, as well as efficient data generation using the GPT model, provide good inspiration for improving translation performance. In this study, we propose generating additional data for neural machine translation (NMT) using a GPT-2 sentence generator that produces sentences with characteristics similar to the original data. The BERT-fused architecture and back-translation are employed for the translation architecture. In our experiments, the model produced BLEU scores of 27.50 for tatoebaEn-Ja, 30.14 for WMT14En-De, and 24.12 for WMT18En-Ch.
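
The abstract does not include code, but the core idea of sampling extra source-side sentences from GPT-2 can be sketched with the Hugging Face transformers library. This is a minimal sketch, assuming an off-the-shelf `gpt2` checkpoint; the fine-tuning step and decoding hyper-parameters the authors actually used are not specified here:

```python
# Minimal sketch: sampling additional English sentences from GPT-2 for
# augmentation. Assumes `pip install transformers torch`; the model name and
# decoding settings are illustrative, not the paper's configuration.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Seed the generator with the prefix of an existing training sentence so the
# output keeps characteristics similar to the original corpus.
prompt = "The weather in Tokyo"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

outputs = model.generate(
    input_ids,
    max_length=30,
    do_sample=True,          # sampling rather than greedy decoding, for diversity
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

The generated sentences would then need target-side counterparts (e.g., produced via back-translation) before being added to the NMT training data.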

2021, pp. 1-12
Author(s): Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, Sivaji Bandyopadhyay

Language translation is essential to bring the world closer and plays a significant part in building a community among people of different linguistic backgrounds. Machine translation dramatically helps in removing the language barrier and allows easier communication among linguistically diverse communities. Due to the unavailability of resources, major languages of the world are accounted as low-resource languages. This leads to a challenging task of automating translation among various such languages to benefit indigenous speakers. This article investigates neural machine translation for the English–Assamese resource-poor language pair by tackling insufficient data and out-of-vocabulary problems. We have also proposed an approach of data augmentation-based NMT, which exploits synthetic parallel data and shows significantly improved translation accuracy for English-to-Assamese and Assamese-to-English translation and obtained state-of-the-art results.
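
The general recipe behind synthetic parallel data of this kind can be illustrated in a few lines of pure Python. In this sketch, `translate_as_to_en` is a hypothetical stand-in for a trained Assamese-to-English model, not an API from the article:

```python
# Sketch of back-translation-based augmentation for a low-resource pair.
# `translate_as_to_en` is a hypothetical callable supplied by any NMT toolkit.

def back_translate(mono_assamese, translate_as_to_en):
    """Turn monolingual Assamese text into synthetic (English, Assamese) pairs."""
    synthetic_pairs = []
    for as_sentence in mono_assamese:
        en_synthetic = translate_as_to_en(as_sentence)   # machine-generated source
        synthetic_pairs.append((en_synthetic, as_sentence))  # genuine target side
    return synthetic_pairs

def build_training_set(real_bitext, synthetic_pairs):
    # The synthetic pairs are mixed with the real bitext so the
    # English-to-Assamese model trains on a larger corpus.
    return real_bitext + synthetic_pairs
```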


2020, Vol 2020, pp. 1-11
Author(s): Gong-Xu Luo, Ya-Ting Yang, Rui Dong, Yan-Hong Chen, Wen-Bo Zhang

Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. It is widely recognized that data augmentation and transfer learning are both straightforward and effective approaches to low-resource problems. However, existing methods that use either of them alone limit the capacity of NMT models on low-resource problems. To make full use of the advantages of existing methods and further improve the translation performance of low-resource languages, we propose a new method that integrates back-translation with mainstream transfer learning architectures. It can not only initialize the NMT model by transferring parameters from pretrained models, but also generate synthetic parallel data, by translating large-scale monolingual data on the target side, to boost the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into both the parent-child and the hierarchical transfer learning architecture. In addition, different preprocessing and training methods are explored to obtain better performance. Experimental results on Uygur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over baselines that use a single method.
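
The order of operations in a parent-child transfer setup combined with back-translation can be outlined in PyTorch. This is a sketch under stated assumptions: `NMTModel`, the checkpoint path, `backward_model`, and `train_fn` are all hypothetical placeholders, not the paper's implementation:

```python
# Sketch of parent-child transfer learning joined with back-translation.
import torch

def build_child_model(NMTModel, parent_ckpt="parent_high_resource.pt"):
    child = NMTModel()
    parent_state = torch.load(parent_ckpt, map_location="cpu")
    # strict=False lets the child keep its own embeddings for the
    # low-resource vocabulary while inheriting the parent's other weights.
    child.load_state_dict(parent_state, strict=False)
    return child

def train_joint(child, real_bitext, mono_target, backward_model, train_fn):
    # 1) Generate synthetic pairs from target-side monolingual data.
    synthetic = [(backward_model(t), t) for t in mono_target]
    # 2) Fine-tune the transferred child model on real plus synthetic data.
    train_fn(child, real_bitext + synthetic)
    return child
```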


2020, Vol 34 (4), pp. 347-382
Author(s): Raphael Rubino, Benjamin Marie, Raj Dabre, Atsushi Fujita, Masao Utiyama, ...

This paper presents a set of effective approaches to handling extremely low-resource language pairs for self-attention-based neural machine translation (NMT), focusing on English and four Asian languages. Starting from an initial set of parallel sentences used to train bilingual baseline models, we introduce additional monolingual corpora and data processing techniques to improve translation quality. We describe a series of best practices and empirically validate the methods through an evaluation conducted on eight translation directions, based on state-of-the-art NMT approaches such as hyper-parameter search, data augmentation with forward and backward translation in combination with tags and noise, as well as joint multilingual training. Experiments show that the commonly used default architecture of self-attention NMT models does not reach the best results, validating previous work on the importance of hyper-parameter tuning. Additionally, empirical results indicate the amount of synthetic data required to efficiently increase the model parameters, leading to the best translation quality as measured by automatic metrics. We show that the best NMT models, trained on a large amount of tagged back-translations, outperform three other synthetic data generation approaches. Finally, a comparison with statistical machine translation (SMT) indicates that extremely low-resource NMT requires a large amount of synthetic parallel data obtained with back-translation in order to close the performance gap with the preceding SMT approach.
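
One concrete technique evaluated here, tagged back-translation, amounts to prepending a reserved token to every synthetic source sentence so the model can distinguish synthetic from genuine input. A minimal sketch (the `<BT>` tag string and the example pairs are assumptions, not the paper's exact data):

```python
# Sketch of tagged back-translation: mark synthetic source sentences with a
# reserved token so the NMT model can treat them differently from real bitext.
BT_TAG = "<BT>"  # illustrative reserved token

def tag_synthetic_pairs(synthetic_pairs):
    """Prepend the back-translation tag to each synthetic source sentence."""
    return [(f"{BT_TAG} {src}", tgt) for src, tgt in synthetic_pairs]

real = [("a genuine sentence", "ein echter Satz")]
synthetic = [("a machine generated sentence", "ein maschinell erzeugter Satz")]
training_data = real + tag_synthetic_pairs(synthetic)
```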


Author(s): Raj Dabre, Atsushi Fujita

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves sequence generation quality, it also leads to a significant increase in the number of parameters. In this paper, we propose sharing parameters across all layers, leading to a recurrently stacked sequence-to-sequence model. We report on an extensive case study of neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer six times, despite having significantly fewer parameters, approaches that of a model that stacks six different layers. We also show how our method can benefit from a prevalent way of improving NMT, i.e., extending the training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attention of models that use recurrently stacked layers and models that do not. Finally, we explore the limits of parameter sharing, where we share even the parameters between the encoder and decoder in addition to recurrently stacking layers.
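
The recurrent stacking idea is easy to express in PyTorch: instantiate one Transformer layer and apply it repeatedly, instead of instantiating six distinct layers. A minimal sketch, with dimensions chosen for illustration rather than taken from the paper:

```python
# Sketch of a recurrently stacked encoder: a single Transformer layer whose
# parameters are reused at every depth step, giving depth without extra
# parameters.
import torch
import torch.nn as nn

class RecurrentlyStackedEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_steps=6):
        super().__init__()
        # Only one layer's worth of parameters, regardless of num_steps.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_steps = num_steps

    def forward(self, x):
        # Apply the same layer repeatedly.
        for _ in range(self.num_steps):
            x = self.layer(x)
        return x

encoder = RecurrentlyStackedEncoder()
tokens = torch.randn(2, 10, 512)   # (batch, sequence, d_model)
print(encoder(tokens).shape)       # torch.Size([2, 10, 512])
```

Compared with `nn.TransformerEncoder(layer, num_layers=6)`, which deep-copies the layer six times, this variant keeps roughly one sixth of the encoder parameters.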


Author(s): Srikanth Mujjiga, Vamsi Krishna, Kalyan Chakravarthi, Vijayananda J

Clinical documents are vital resources for radiologists when they have to consult or refer to similar cases. In large healthcare facilities where millions of reports are generated, searching for relevant documents is quite challenging. With abundant interchangeable words in the clinical domain, understanding the semantics of the words in clinical documents is vital to improving search results. This paper details an end-to-end semantic search application that addresses the large-scale information retrieval problem of clinical reports. It specifically focuses on the challenge of identifying semantics in clinical reports to facilitate search at the semantic level. Semantic search works by mapping documents into a concept space and performing the search in that concept space. A unique approach of framing the concept mapping problem as a language translation problem is proposed in this paper. The concept mapper is modelled using a neural machine translation (NMT) model based on an encoder-decoder architecture with attention. A regular-expression-based concept mapper takes approximately 3 seconds to extract UMLS concepts from a single document, whereas the trained NMT model does the same in approximately 30 milliseconds. The NMT-based model further enables the incorporation of negation detection to identify whether a concept is negated, facilitating search for negated queries.
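
Once reports and queries are both expressed as sets of concept identifiers, retrieval reduces to matching in concept space. The following pure-Python sketch shows that final step; `map_to_concepts` stands in for the trained NMT concept mapper, and no real UMLS concept IDs are assumed:

```python
# Sketch of retrieval in concept space: documents and queries are mapped to
# UMLS-style concept IDs, and matching happens on concepts rather than words.
from collections import defaultdict

def build_concept_index(documents, map_to_concepts):
    """Invert {doc_id: text} into {concept: {doc_ids}} via the concept mapper."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for concept in map_to_concepts(text):
            index[concept].add(doc_id)
    return index

def semantic_search(query, index, map_to_concepts):
    """Rank documents by the number of query concepts they share."""
    scores = defaultdict(int)
    for concept in map_to_concepts(query):
        for doc_id in index[concept]:
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)
```

Because "heart attack" and "myocardial infarction" would map to the same concept, documents using either phrasing match the same query, which is the point of searching in concept space.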


Author(s): Shuo Ren, Zhirui Zhang, Shujie Liu, Ming Zhou, Shuai Ma

Without a real bilingual corpus available, unsupervised neural machine translation (NMT) typically requires pseudo-parallel data generated with the back-translation method for model training. However, due to weak supervision, the pseudo data inevitably contain noise and errors that are accumulated and reinforced in the subsequent training process, leading to poor translation performance. To address this issue, we introduce phrase-based statistical machine translation (SMT) models, which are robust to noisy data, as posterior regularizations to guide the training of unsupervised NMT models in the iterative back-translation process. Our method starts from SMT models built with pre-trained language models and word-level translation tables inferred from cross-lingual embeddings. Then the SMT and NMT models are optimized jointly and boost each other incrementally in a unified EM framework. In this way, (1) the negative effect caused by errors in the iterative back-translation process can be alleviated in a timely manner, with SMT filtering noise from its phrase tables; meanwhile, (2) NMT can compensate for the deficiency in fluency inherent in SMT. Experiments conducted on en-fr and en-de translation tasks show that our method outperforms strong baselines and achieves new state-of-the-art unsupervised machine translation performance.
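
The overall control flow of such a joint loop can be sketched at a high level. Every object and method below (`nmt`, `smt`, `is_consistent`, and so on) is a hypothetical stand-in; the actual EM-based optimization in the paper is considerably more involved than this filtering loop:

```python
# High-level sketch of joint SMT/NMT iterative back-translation, with SMT
# acting as a posterior regularizer that filters noisy pseudo pairs.

def unsupervised_training_loop(nmt, smt, mono_src, mono_tgt, rounds=3):
    for _ in range(rounds):
        # Generate pseudo-parallel data in both directions from monolingual text.
        pseudo_s2t = [(s, nmt.translate(s, "src->tgt")) for s in mono_src]
        pseudo_t2s = [(nmt.translate(t, "tgt->src"), t) for t in mono_tgt]

        # Keep only the pseudo pairs consistent with the noise-robust SMT
        # model's phrase tables, discarding accumulated errors early.
        clean_s2t = [p for p in pseudo_s2t if smt.is_consistent(p)]
        clean_t2s = [p for p in pseudo_t2s if smt.is_consistent(p)]

        # Update both models on the filtered data so they boost each other.
        nmt.train(clean_s2t + clean_t2s)
        smt.update_phrase_tables(clean_s2t + clean_t2s)
    return nmt, smt
```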

