Using Word Embeddings to Enforce Document-Level Lexical Consistency in Machine Translation

Abstract We integrate new mechanisms in a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each word translation given its context. Second, we extend the decoder with a new stochastic mechanism that, at translation time, allows to introduce changes in the translation oriented to improve its lexical consistency. We evaluate our system on English–Spanish document translation, and we conduct automatic and manual assessments of its quality. The automatic evaluation metrics, applied mainly at sentence level, do not reflect significant variations. On the contrary, the manual evaluation shows that the system dealing with lexical consistency is preferred over both a standard sentence-level and a standard document-level phrase-based MT systems.

Download Full-text

rgbF: An Open Source Tool for n-gram Based Automatic Evaluation of Machine Translation Output

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-012-0012-y ◽

2012 ◽

Vol 98 (1) ◽

pp. 99-108 ◽

Cited By ~ 2

Author(s):

Maja Popović

Keyword(s):

Open Source ◽

Machine Translation ◽

Arithmetic Mean ◽

Automatic Evaluation ◽

Open Source Tool ◽

On Demand ◽

Sentence Level ◽

N Gram ◽

Document Level

rgbF: An Open Source Tool for n-gram Based Automatic Evaluation of Machine Translation Output We describe RGBF, a tool for automatic evaluation of machine translation output based on n-gram precision and recall. The tool calculates the F-score averaged on all n-grams of an arbitrary set of distinct units such as words, morphemes, POS tags, etc. The arithmetic mean is used for n-gram averaging. As input, the tool requires reference translation(s) and hypothesis, both containing the same combination of units. The default output is the document level 4-gram F-score of the desired unit combination. The scores at the sentence level can be obtained on demand, as well as precision and/or recall scores, separate unit scores and separate n-gram scores. In addition, weights can be introduced both for n-grams and for units, as well as the desired n-gram order n.

Download Full-text

Enhancing Lexical Translation Consistency for Document-Level Neural Machine Translation

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3485469 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-21

Author(s):

Xiaomian Kang ◽

Yang Zhao ◽

Jiajun Zhang ◽

Chengqing Zong

Keyword(s):

Machine Translation ◽

English Translation ◽

Test Set ◽

Neural Machine Translation ◽

Global Context ◽

Translation Quality ◽

Sentence Level ◽

Document Level

Document-level neural machine translation (DocNMT) has yielded attractive improvements. In this article, we systematically analyze the discourse phenomena in Chinese-to-English translation, and focus on the most obvious ones, namely lexical translation consistency. To alleviate the lexical inconsistency, we propose an effective approach that is aware of the words which need to be translated consistently and constrains the model to produce more consistent translations. Specifically, we first introduce a global context extractor to extract the document context and consistency context, respectively. Then, the two types of global context are integrated into a encoder enhancer and a decoder enhancer to improve the lexical translation consistency. We create a test set to evaluate the lexical consistency automatically. Experiments demonstrate that our approach can significantly alleviate the lexical translation inconsistency. In addition, our approach can also substantially improve the translation quality compared to sentence-level Transformer.

Download Full-text

STD: An Automatic Evaluation Metric for Machine Translation Based on Word Embeddings

IEEE/ACM Transactions on Audio Speech and Language Processing ◽

10.1109/taslp.2019.2922845 ◽

2019 ◽

Vol 27 (10) ◽

pp. 1497-1506 ◽

Cited By ~ 1

Author(s):

Pairui Li ◽

Chuan Chen ◽

Wujie Zheng ◽

Yuetang Deng ◽

Fanghua Ye ◽

...

Keyword(s):

Machine Translation ◽

Automatic Evaluation ◽

Word Embeddings ◽

Evaluation Metric

Download Full-text

Discourse Structure in Machine Translation Evaluation

Computational Linguistics ◽

10.1162/coli_a_00298 ◽

2017 ◽

Vol 43 (4) ◽

pp. 683-722 ◽

Cited By ~ 1

Author(s):

Shafiq Joty ◽

Francisco Guzmán ◽

Lluís Màrquez ◽

Preslav Nakov

Keyword(s):

Machine Translation ◽

Similarity Measures ◽

Discourse Structure ◽

System Level ◽

Structure Theory ◽

Evaluation Metrics ◽

Machine Translation Evaluation ◽

Sentence Level ◽

Relation Type ◽

Parse Trees

In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with the Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics regarding correlation with human judgments both at the segment level and at the system level. This suggests that discourse information is complementary to the information used by many of the existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTK party. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference RST tree is positively correlated with translation quality.

Download Full-text

Predicting Human Assessment of Machine Translation Quality by Combining Automatic Evaluation Metrics using Binary Classifiers

International Journal of Computer Applications ◽

10.5120/9581-4062 ◽

2012 ◽

Vol 59 (10) ◽

pp. 1-7

Author(s):

Michael Paul ◽

Andrew Finch ◽

Eiichiro Sumita

Keyword(s):

Machine Translation ◽

Evaluation Metrics ◽

Automatic Evaluation ◽

Translation Quality ◽

Binary Classifiers ◽

Human Assessment

Download Full-text

Statistical Analysis of Machine Translation Evaluation Systems for English- Hindi Language Pair

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190716100145 ◽

2020 ◽

Vol 13 (5) ◽

pp. 864-870

Author(s):

Pooja Malik ◽

Y. Mrudula ◽

Anurag S. Baghel

Keyword(s):

Statistical Analysis ◽

Machine Translation ◽

Automatic Machine ◽

Evaluation Metrics ◽

Machine Translation Evaluation ◽

Evaluation Systems ◽

Evaluation Scores ◽

Sentence Level ◽

Set Up ◽

Language Pair

Background: Automatic Machine Translation (AMT) Evaluation Metrics have become popular in the Machine Translation Community in recent times. This is because of the popularity of Machine Translation engines and Machine Translation as a field itself. Translator is a very important tool to break barriers between communities especially in countries like India, where people speak 22 different languages and their many variations. With the onset of Machine Translation engines, there is a need for a system that evaluates how well these are performing. This is where machine translation evaluation enters. Objective: This paper discusses the importance of Automatic Machine Translation Evaluation and compares various Machine Translation Evaluation metrics by performing Statistical Analysis on various metrics and human evaluations to find out which metric has the highest correlation with human scores. Methods: The correlation between the Automatic and Human Evaluation Scores and the correlation between the five Automatic evaluation scores are examined at the sentence level. Moreover, a hypothesis is set up and p-values are calculated to find out how significant these correlations are. Results: The results of the statistical analysis of the scores of various metrics and human scores are shown in the form of graphs to see the trend of the correlation between the scores of Automatic Machine Translation Evaluation metrics and human scores. Conclusion: Out of the five metrics considered for the study, METEOR shows the highest correlation with human scores as compared to the other metrics.

Download Full-text

Document-Level Machine Translation Evaluation Metrics Enhanced with Simplified Lexical Chain

Natural Language Processing and Chinese Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-319-25207-0_35 ◽

2015 ◽

pp. 396-403

Author(s):

Zhengxian Gong ◽

Guodong Zhou

Keyword(s):

Machine Translation ◽

Evaluation Metrics ◽

Machine Translation Evaluation ◽

Lexical Chain ◽

Document Level

Download Full-text

Lexical Chains meet Word Embeddings in Document-level Statistical Machine Translation

10.18653/v1/w17-4813 ◽

2017 ◽

Cited By ~ 4

Author(s):

Laura Mascarell

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Word Embeddings ◽

Lexical Chains ◽

Document Level

Download Full-text

Unsupervised Word Translation with Adversarial Autoencoder

Computational Linguistics ◽

10.1162/coli_a_00374 ◽

2020 ◽

Vol 46 (2) ◽

pp. 257-288

Author(s):

Tasnim Mohiuddin ◽

Shafiq Joty

Keyword(s):

Machine Translation ◽

Superior Performance ◽

Data Sets ◽

Word Embeddings ◽

Shared Space ◽

Parallel Data ◽

Adversarial Training ◽

Word Translation ◽

Input Reconstruction ◽

Adversarial Model

Crosslingual word embeddings learned from monolingual embeddings have a crucial role in many downstream tasks, ranging from machine translation to transfer learning. Adversarial training has shown impressive success in learning crosslingual embeddings and the associated word translation task without any parallel data by mapping monolingual embeddings to a shared space. However, recent work has shown superior performance for non-adversarial methods in more challenging language pairs. In this article, we investigate adversarial autoencoder for unsupervised word translation and propose two novel extensions to it that yield more stable training and improved results. Our method includes regularization terms to enforce cycle consistency and input reconstruction, and puts the target encoders as an adversary against the corresponding discriminator. We use two types of refinement procedures sequentially after obtaining the trained encoders and mappings from the adversarial training, namely, refinement with Procrustes solution and refinement with symmetric re-weighting. Extensive experimentations with high- and low-resource languages from two different data sets show that our method achieves better performance than existing adversarial and non-adversarial approaches and is also competitive with the supervised system. Along with performing comprehensive ablation studies to understand the contribution of different components of our adversarial model, we also conduct a thorough analysis of the refinement procedures to understand their effects.

Download Full-text

The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification

Computational Linguistics ◽

10.1162/coli_a_00418 ◽

2021 ◽

pp. 1-29

Author(s):

Fernando Alva-Manchego ◽

Carolina Scarton ◽

Lucia Specia

Keyword(s):

Three Dimensions ◽

Evaluation Metrics ◽

Automatic Evaluation ◽

System Type ◽

Text Simplification ◽

Sentence Level ◽

Complex Words ◽

Existing Data

Abstract In order to simplify sentences, several rewriting operations can be performed such as replacing complex words per simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgements on the simplicity achieved by executing specific operations (e.g. simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics can assess sentence-level simplifications where multiple operations may have been applied and which, therefore, require more general simplicity judgements. For that, we first collect a new and more reliable dataset for evaluating the correlation of metrics and human judgements of overall simplicity. Second, we conduct the first meta-evaluation of automatic metrics in Text Simplification, using our new dataset (and other existing data) to analyse the variation of the correlation between metrics’ scores and human judgements across three dimensions: the perceived simplicity level, the system type and the set of references used for computation. We show that these three aspects affect the correlations and, in particular, highlight the limitations of commonly-used operation-specific metrics. Finally, based on our findings, we propose a set of recommendations for automatic evaluation of multi-operation simplifications, suggesting which metrics to compute and how to interpret their scores.

Download Full-text