Encoder-Decoder Attention ≠ Word Alignment: Axiomatic Method of Learning Word Alignments for Neural Machine Translation

2020, Vol 27 (3), pp. 531-552
Author(s): Chunpeng Ma, Akihiro Tamura, Masao Utiyama, Tiejun Zhao, Eiichiro Sumita
2019, Vol 9 (10), pp. 2036
Author(s): Jinyi Zhang, Tadahiro Matsumoto

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method with two variations: one for all language pairs, and one specific to the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks, as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with "shared Chinese character rates" in segments of the sentence pairs. The experimental results of Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply code (see Supplementary Materials) that reproduces the proposed method.
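The three augmentation steps can be sketched as follows. This is a deliberately simplified toy: real split points are derived from word-alignment information (adjusted with shared-Chinese-character rates), whereas here segments are split at punctuation only, and `back_translate` is a hypothetical stand-in for a target-to-source NMT system.

```python
def split_on_punct(sentence, puncts=(",", "，")):
    """Step 1 (simplified): split a tokenized sentence into partial
    sentences at punctuation marks."""
    parts, current = [], []
    for tok in sentence.split():
        current.append(tok)
        if tok in puncts:
            parts.append(current)
            current = []
    if current:
        parts.append(current)
    return parts

def augment(src, tgt, back_translate):
    """Steps 2-3: back-translate each target partial sentence and
    substitute it for the corresponding source segment, producing one
    pseudo-source sentence per segment.  In the paper, matching segment
    counts are guaranteed by alignment-derived split points."""
    src_parts = split_on_punct(src)
    tgt_parts = split_on_punct(tgt)
    assert len(src_parts) == len(tgt_parts)  # parallel partial sentences
    pseudo_pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        replaced = src_parts[:i] + [back_translate(tgt_part)] + src_parts[i + 1:]
        pseudo_src = " ".join(tok for part in replaced for tok in part)
        pseudo_pairs.append((pseudo_src, tgt))  # pseudo-parallel pair
    return pseudo_pairs
```

With a two-segment sentence pair, `augment` yields two pseudo-parallel pairs, each sharing the original target side.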


2019
Author(s): Xintong Li, Guanlin Li, Lemao Liu, Max Meng, Shuming Shi

2020
Author(s): Yun Chen, Yang Liu, Guanhua Chen, Xin Jiang, Qun Liu

2014, Vol 2014, pp. 1-13
Author(s): Liang Tian, Derek F. Wong, Lidia S. Chao, Francisco Oliveira

In recent years, researchers have conducted several studies evaluating machine translation quality based on the relationship between word alignments and the phrase table. However, existing methods usually employ ad hoc heuristics without theoretical support; so far, no formula has been proposed to describe the relationship among word alignments, the phrase table, and machine translation performance. In this paper, on the one hand, we formulate such a relationship for estimating the number of extracted phrase pairs given one or more word alignment points. On the other hand, a corpus-motivated pruning technique is proposed to prune the default large phrase table. Experiments show that the deduced formula is feasible: it can not only predict the size of the phrase table but also serve as a valuable reference for investigating the relationship between translation performance and phrase tables built from different word alignment links. The corpus-motivated pruning results show that nearly 98% of the phrases can be pruned without any significant loss in translation quality.
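The phrase pairs being counted here are those consistent with the word alignment in the standard phrase-based SMT sense: every alignment link must fall either entirely inside or entirely outside the candidate span pair, with at least one link inside. A minimal sketch of that consistency check (the paper's own counting formula is not reproduced here):

```python
def consistent_phrase_pairs(alignment, n_src, n_tgt, max_len=7):
    """Enumerate phrase pairs ((i1, i2), (j1, j2)) consistent with a
    word alignment, the standard extraction criterion in phrase-based
    SMT.  `alignment` is a set of (src_index, tgt_index) links."""
    pairs = []
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            for j1 in range(n_tgt):
                for j2 in range(j1, min(j1 + max_len, n_tgt)):
                    inside = [(i, j) for (i, j) in alignment
                              if i1 <= i <= i2 and j1 <= j <= j2]
                    # A link "crosses" if it is inside the source span
                    # but outside the target span, or vice versa.
                    crossing = any((i1 <= i <= i2) != (j1 <= j <= j2)
                                   for (i, j) in alignment)
                    if inside and not crossing:
                        pairs.append(((i1, i2), (j1, j2)))
    return pairs
```

Enumerating pairs this way makes the abstract's premise concrete: the number of extractable phrase pairs is a direct function of where the alignment links sit, which is exactly the relationship the deduced formula estimates in closed form.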


Author(s): Ren Qing-Dao-Er-Ji, Yila Su, Nier Wu

With the development of natural language processing and neural machine translation (NMT), end-to-end (E2E) neural network models have gradually become the focus of research because of their high translation accuracy and strong semantic modeling. However, problems remain, such as limited vocabulary and low translation fidelity. In this paper, a discriminant method and a Conditional Random Field (CRF) model were used to segment and label the stems and affixes of Mongolian in the preprocessing stage of the Mongolian-Chinese bilingual corpus. To address the low-fidelity problem, a decoding model combining a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) was constructed, with the GRU performing target-language decoding. A global attention model was used to obtain bilingual word alignment information during alignment processing. Finally, translation quality was evaluated with Bilingual Evaluation Understudy (BLEU) and Perplexity (PPL) values. The improved model yields a BLEU value of 25.13 and a PPL value of [Formula: see text]. The experimental results show that the E2E Mongolian-Chinese NMT model improves translation quality and reduces semantic confusion compared with traditional statistical methods and machine translation models based on Recurrent Neural Networks (RNNs).
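The global attention mentioned above produces a soft alignment: a normalized weight over every source position for each decoding step. A pure-Python sketch of the standard dot-product variant (the abstract does not specify the scoring function, so the dot score here is an assumption, and the vectors would in practice be learned encoder/decoder hidden states):

```python
import math

def global_attention(decoder_state, encoder_states):
    """Global attention with a dot-product score.  Returns the
    per-source-position weights (the soft word-alignment information)
    and the resulting context vector."""
    # score(h_t, h_s) = h_t . h_s for every encoder state h_s
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    # softmax over source positions -> alignment weights
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context vector = attention-weighted sum of encoder states
    dim = len(decoder_state)
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context
```

Because the weights form a distribution over source positions, reading off the maximum-weight position at each decoding step is one common way such a model exposes bilingual word-alignment information.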


2010, Vol 36 (2), pp. 247-277
Author(s): Wei Wang, Jonathan May, Kevin Knight, Daniel Marcu

This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.

