PARABANK: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-Constrained Neural Machine Translation

We present PARABANK, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of PARANMT (Wieting and Gimpel, 2018), we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that PARABANK’s paraphrases improve over PARANMT on both semantic similarity and fluency. Finally, we use PARABANK to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks.
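
As a rough illustration of how a lexical constraint can steer a decoder toward a different paraphrase, the following minimal sketch bans chosen words during greedy decoding. It is a simplification under stated assumptions, not the PARABANK system: the per-step score tables, the function name `constrained_greedy_decode`, and the use of negative (banned-word) constraints are all hypothetical, and a real NMT decoder would condition each step on earlier outputs and typically use beam search.

```python
from typing import Dict, List, Set


def constrained_greedy_decode(
    step_scores: List[Dict[str, float]], banned: Set[str]
) -> List[str]:
    """Pick the best-scoring token at each step, skipping banned tokens.

    step_scores is a toy stand-in for a model: one {token: score} table per
    output position (an assumption made purely for illustration).
    """
    output = []
    for scores in step_scores:
        # Apply the negative lexical constraint by filtering candidates.
        allowed = {tok: s for tok, s in scores.items() if tok not in banned}
        if not allowed:
            raise ValueError("every candidate token is banned at this step")
        output.append(max(allowed, key=allowed.get))
    return output


# Toy score tables for a three-token output.
steps = [
    {"the": 0.9, "a": 0.1},
    {"big": 0.6, "large": 0.4},
    {"dog": 0.8, "hound": 0.2},
]

unconstrained = constrained_greedy_decode(steps, set())
paraphrase = constrained_greedy_decode(steps, {"big"})
```

Decoding the same input under different banned sets gives several distinct outputs per source sentence, which is the general idea behind producing multiple paraphrases with constraints.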

Efficient and High-Quality Neural Machine Translation with OpenNMT

10.18653/v1/2020.ngt-1.25 ◽ 2020

Author(s): Guillaume Klein ◽ Dakun Zhang ◽ Clément Chouteau ◽ Josep Crego ◽ Jean Senellart

Keyword(s): Machine Translation ◽ High Quality ◽ Neural Machine Translation

A Joint Back-Translation and Transfer Learning Method for Low-Resource Neural Machine Translation

Mathematical Problems in Engineering ◽ 10.1155/2020/6140153 ◽ 2020 ◽ Vol 2020 ◽ pp. 1-11

Author(s): Gong-Xu Luo ◽ Ya-Ting Yang ◽ Rui Dong ◽ Yan-Hong Chen ◽ Wen-Bo Zhang

Keyword(s): Machine Translation ◽ Transfer Learning ◽ Large Scale ◽ Data Augmentation ◽ Training Methods ◽ Learning Method ◽ Neural Machine Translation ◽ Low Resource ◽ Parallel Data ◽ Back Translation

Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. It is widely recognized that data augmentation and transfer learning are both straightforward and effective approaches to low-resource problems. However, existing approaches use only one of these methods, which limits the capacity of NMT models for low-resource problems. To make full use of the advantages of existing methods and further improve translation performance for low-resource languages, we propose a new method that integrates back-translation with mainstream transfer learning architectures. The method not only initializes the NMT model by transferring the parameters of pretrained models, but also generates synthetic parallel data by translating large-scale target-side monolingual data, which boosts the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into the parent-child and the hierarchical transfer learning architectures. In addition, different preprocessing and training methods are explored to improve performance. Experimental results on Uygur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over baselines that use a single method.
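
The two ingredients named in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `reverse_translate` stands in for a trained target-to-source model, parameters are represented as plain lists keyed by name, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: List[str], reverse_translate: Callable[[str], str]
) -> List[Pair]:
    """Pair each real target sentence with a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in target_monolingual]


def build_training_set(
    real_pairs: List[Pair],
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Concatenate genuine parallel data with back-translated data."""
    return list(real_pairs) + back_translate(target_monolingual, reverse_translate)


def init_child_from_parent(
    parent_params: Dict[str, List[float]], child_params: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    """Copy parent values into the child wherever name and size match."""
    initialised = {}
    for name, value in child_params.items():
        parent_value = parent_params.get(name)
        if parent_value is not None and len(parent_value) == len(value):
            initialised[name] = list(parent_value)  # transferred
        else:
            initialised[name] = list(value)  # kept as initialised
    return initialised


def toy_reverse_translate(sentence: str) -> str:
    """Hypothetical stand-in for a target-to-source NMT model."""
    return "<synthetic> " + sentence


training_set = build_training_set(
    [("src one", "tgt one")], ["tgt two"], toy_reverse_translate
)
child = init_child_from_parent(
    {"encoder": [1.0, 2.0], "embedding": [0.5]},
    {"encoder": [0.0, 0.0], "embedding": [0.0, 0.0], "decoder": [9.0]},
)
```

In this sketch the child model would first be initialised from the parent and then trained on the combined real and synthetic pairs; how the two steps are ordered and weighted in the paper is not stated in the abstract.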

Translation of Medical Texts using Neural Networks

International Journal of Reliable and Quality E-Healthcare ◽ 10.4018/ijrqeh.2016100104 ◽ 2016 ◽ Vol 5 (4) ◽ pp. 51-66 ◽ Cited By ~ 5

Author(s): Krzysztof Wolk ◽ Krzysztof P. Marasek

Keyword(s): Machine Translation ◽ Statistical Machine Translation ◽ European Medicines Agency ◽ Translation System ◽ Training Methods ◽ Neural Machine Translation ◽ Machine Translation System ◽ Source Sentence ◽ Parallel Text ◽ Translation Systems

The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family, in which a source sentence is encoded into a fixed-length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English machine translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training neural network-based and statistical translation systems. The comparison and implementation of a medical translator are the main focus of our experiments.

Neural Machine Translation With Noisy Lexical Constraints

IEEE/ACM Transactions on Audio Speech and Language Processing ◽ 10.1109/taslp.2020.2999724 ◽ 2020 ◽ Vol 28 ◽ pp. 1864-1874

Author(s): Huayang Li ◽ Guoping Huang ◽ Deng Cai ◽ Lemao Liu

Keyword(s): Machine Translation ◽ Neural Machine Translation ◽ Lexical Constraints

Identifying Semantics in Clinical Reports Using Neural Machine Translation

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33019552 ◽ 2019 ◽ Vol 33 ◽ pp. 9552-9557

Author(s): Srikanth Mujjiga ◽ Vamsi Krishna ◽ Kalyan Chakravarthi ◽ Vijayananda J

Keyword(s): Machine Translation ◽ Large Scale ◽ Language Translation ◽ Semantic Search ◽ Neural Machine Translation ◽ Healthcare Facilities ◽ Semantic Level ◽ Concept Space ◽ Unique Approach ◽ Retrieval Problem

Clinical documents are vital resources for radiologists when they need to consult or refer to similar cases. In large healthcare facilities where millions of reports are generated, searching for relevant documents is quite challenging. Because the clinical domain has abundant interchangeable words, understanding the semantics of the words in clinical documents is vital to improving search results. This paper details an end-to-end semantic search application that addresses the large-scale information retrieval problem for clinical reports. The paper specifically focuses on the challenge of identifying semantics in clinical reports to facilitate search at the semantic level. The semantic search works by mapping the documents into the concept space, where the search is then performed. A unique approach of framing the concept mapping problem as a language translation problem is proposed in this paper. The concept mapper is modelled using a neural machine translation (NMT) model based on an encoder-decoder architecture with attention. The regular-expression-based concept mapper takes approximately 3 seconds to extract UMLS concepts from a single document, whereas the trained NMT model does the same in approximately 30 milliseconds, roughly 100 times faster. The NMT-based model further enables the incorporation of negation detection to identify whether a concept is negated, facilitating search for negated queries.
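
Search in a concept space with negation can be pictured as an inverted index keyed by (concept, negated) pairs. The sketch below is only an illustration under assumptions: the concept labels are made-up placeholders rather than real UMLS identifiers, the concept lists are assumed to have already been produced by some mapper (the NMT model in the paper), and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Concept = Tuple[str, bool]  # (concept label, True if the concept is negated)


def build_index(doc_concepts: Dict[str, List[Concept]]) -> Dict[Concept, Set[str]]:
    """Map each (concept, negated) pair to the documents that contain it."""
    index: Dict[Concept, Set[str]] = defaultdict(set)
    for doc_id, concepts in doc_concepts.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return index


def search(index: Dict[Concept, Set[str]], query: List[Concept]) -> Set[str]:
    """Return documents that contain every concept in the query."""
    if not query:
        return set()
    postings = [index.get(concept, set()) for concept in query]
    return set.intersection(*postings)


# Placeholder concept labels; real systems would use UMLS concept identifiers.
reports = {
    "report_1": [("EFFUSION", False)],
    "report_2": [("EFFUSION", True), ("NODULE", False)],
}
index = build_index(reports)
with_effusion = search(index, [("EFFUSION", False)])
without_effusion = search(index, [("EFFUSION", True)])
```

Keeping the negation flag inside the index key is one simple design choice; it lets a query such as "no effusion" match only reports where the concept was detected as negated.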

Synchronous Bidirectional Neural Machine Translation

Transactions of the Association for Computational Linguistics ◽ 10.1162/tacl_a_00256 ◽ 2019 ◽ Vol 7 ◽ pp. 91-105 ◽ Cited By ~ 8

Author(s): Long Zhou ◽ Jiajun Zhang ◽ Chengqing Zong

Keyword(s): Machine Translation ◽ Large Scale ◽ State Of The Art ◽ Target Language ◽ Single Model ◽ Neural Machine Translation ◽ German Translation ◽ Transformer Model ◽ Target Side ◽ Future Information

Existing approaches to neural machine translation (NMT) generate the target language sequence token-by-token from left to right. However, this kind of unidirectional decoding framework cannot make full use of the target-side future contexts which can be produced in a right-to-left decoding direction, and thus suffers from the issue of unbalanced outputs. In this paper, we introduce a synchronous bidirectional neural machine translation (SB-NMT) model that predicts its outputs using left-to-right and right-to-left decoding simultaneously and interactively, in order to leverage both history and future information at the same time. Specifically, we first propose a new algorithm that enables synchronous bidirectional decoding in a single model. Then, we present an interactive decoding model in which left-to-right (right-to-left) generation does not only depend on its previously generated outputs, but also relies on future contexts predicted by right-to-left (left-to-right) decoding. We extensively evaluate the proposed SB-NMT model on large-scale NIST Chinese-English, WMT14 English-German, and WMT18 Russian-English translation tasks. Experimental results demonstrate that our model achieves significant improvements over the strong Transformer model by 3.92, 1.49, and 1.04 BLEU points, respectively, and obtains state-of-the-art performance on Chinese-English and English-German translation tasks.

Marian: Cost-effective High-Quality Neural Machine Translation in C++

10.18653/v1/w18-2716 ◽ 2018

Author(s): Marcin Junczys-Dowmunt ◽ Kenneth Heafield ◽ Hieu Hoang ◽ Roman Grundkiewicz ◽ Anthony Aue

Keyword(s): Machine Translation ◽ Cost Effective ◽ High Quality ◽ Neural Machine Translation

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation

10.18653/v1/2021.emnlp-main.369 ◽ 2021

Author(s): Long Doan ◽ Linh The Nguyen ◽ Nguyen Luong Tran ◽ Thai Hoang ◽ Dat Quoc Nguyen

Keyword(s): Machine Translation ◽ Large Scale ◽ Benchmark Dataset ◽ High Quality

Neural machine translation of low-resource languages using SMT phrase pair injection

Natural Language Engineering ◽ 10.1017/s1351324920000303 ◽ 2020 ◽ pp. 1-22

Author(s): Sukanta Sen ◽ Mohammed Hasanuzzaman ◽ Asif Ekbal ◽ Pushpak Bhattacharyya ◽ Andy Way

Keyword(s): Machine Translation ◽ Large Scale ◽ Production Systems ◽ Statistical Machine Translation ◽ Training Data ◽ Original Training ◽ Neural Machine Translation ◽ Parallel Corpus ◽ Low Resource ◽ Better Than

Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and it is not always possible to obtain a sufficiently large corpus, as building one requires time, money, and professionals. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in low-resource scenarios without using any additional data. Our approach aims at augmenting the original training data by means of parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BLEU (Bilingual Evaluation Understudy) points over the baseline systems. Experiments also show that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, namely old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.
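
The augmentation step described in this abstract amounts to appending phrase pairs to the sentence-level training data. The sketch below shows only that step and rests on assumptions: the phrase pairs are taken as given (in the paper they come from an SMT system run on the same training data), and the function name, the `min_len` filter, and the de-duplication are hypothetical choices for illustration.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source text, target text)


def inject_phrase_pairs(
    corpus: List[Pair], phrase_pairs: List[Pair], min_len: int = 1
) -> List[Pair]:
    """Append phrase pairs to the corpus as extra training examples.

    Pairs that are too short on either side, or already present, are skipped.
    """
    seen = set(corpus)
    augmented = list(corpus)
    for src, tgt in phrase_pairs:
        pair = (src.strip(), tgt.strip())
        if len(pair[0].split()) < min_len or len(pair[1].split()) < min_len:
            continue  # drop empty or too-short phrases
        if pair in seen:
            continue  # avoid duplicating existing examples
        seen.add(pair)
        augmented.append(pair)
    return augmented


# Toy data; the "translations" are placeholders, not real Hindi.
corpus = [("good morning everyone", "tgt-good tgt-morning tgt-everyone")]
phrases = [
    ("good morning", "tgt-good tgt-morning"),
    ("good morning", "tgt-good tgt-morning"),  # duplicate, skipped
    ("", "tgt-everyone"),  # empty source side, skipped
]
augmented = inject_phrase_pairs(corpus, phrases)
```

Because the phrases are extracted from the original corpus, the augmented set adds no external data; it only exposes the model to shorter aligned units of text it has already seen.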

Incorporating Source-Side Phrase Structures into Neural Machine Translation

Computational Linguistics ◽ 10.1162/coli_a_00348 ◽ 2019 ◽ Vol 45 (2) ◽ pp. 267-292 ◽ Cited By ~ 4

Author(s): Akiko Eriguchi ◽ Kazuma Hashimoto ◽ Yoshimasa Tsuruoka

Keyword(s): Machine Translation ◽ Syntactic Structure ◽ Statistical Machine Translation ◽ Training Data ◽ Great Success ◽ Data Set ◽ Neural Machine Translation ◽ Proposed Model ◽ Source Sentence

Neural machine translation (NMT) has shown great success as a new alternative to the traditional Statistical Machine Translation model in multiple languages. Early NMT models are based on sequence-to-sequence learning that encodes a sequence of source words into a vector space and generates another sequence of target words from the vector. In those NMT models, sentences are simply treated as sequences of words without any internal structure. In this article, we focus on the role of the syntactic structure of source sentences and propose a novel end-to-end syntactic NMT model, which we call a tree-to-sequence NMT model, extending a sequence-to-sequence model with the source-side phrase structure. Our proposed model has an attention mechanism that enables the decoder to generate a translated word while softly aligning it with phrases as well as words of the source sentence. We have empirically compared the proposed model with sequence-to-sequence models in various settings on Chinese-to-Japanese and English-to-Japanese translation tasks. Our experimental results suggest that the use of syntactic structure can be beneficial when the training data set is small, but is not as effective as using a bi-directional encoder. As the size of the training data set increases, the benefits of using a syntactic tree tend to diminish.
