Automated MT evaluation metrics and their limitations

Author(s):
Bogdan Babych


Author(s):
Nora Aranberri-Monasterio ◽
Sharon O'Brien

-ing forms in English are reported to be problematic for Machine Translation and are often the focus of rules in Controlled Language rule sets. We investigated how problematic -ing forms are for an RBMT system, translating into four target languages in the IT domain. Constituent-based human evaluation was used and the results showed that, in general, -ing forms do not deserve their bad reputation. A comparison with the results of five automated MT evaluation metrics showed promising correlations. Some issues prevail, however, and can vary from target language to target language. We propose different strategies for dealing with these problems, such as Controlled Language rules, semi-automatic post-editing, source text tagging and “post-editing” the source text.
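
A minimal sketch of the kind of comparison described above, assuming per-segment human judgments and scores from one automatic metric are available; the data, scale, and metric values below are invented for illustration and do not reproduce the study's figures.

```python
# Sketch: correlating segment-level human judgments of -ing constituents
# with scores from an automatic MT metric. All values are hypothetical;
# the study compared five automatic metrics against constituent-based
# human evaluation.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-segment scores (same segments, same order).
human_scores = [4, 3, 5, 2, 4, 5, 3, 1]                            # e.g. adequacy, 1-5
metric_scores = [0.62, 0.48, 0.81, 0.30, 0.66, 0.77, 0.52, 0.25]   # e.g. a 0-1 metric

r, r_p = pearsonr(human_scores, metric_scores)
rho, rho_p = spearmanr(human_scores, metric_scores)
print(f"Pearson r   = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```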


2017 ◽  
Author(s):  
Miloš Stanojević ◽  
Khalil Sima'an

2019 ◽  
Vol 45 (3) ◽  
pp. 515-558
Author(s):  
Marina Fomicheva ◽  
Lucia Specia

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
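
The contrast between a single global correlation coefficient and quality-dependent metric behaviour can be illustrated with a small synthetic experiment. The sketch below is not the paper's local dependency measure; the quality bands and noise model are assumptions chosen only to show how a global coefficient can hide poor performance on low-quality translations.

```python
# Synthetic illustration: a metric that tracks human scores well overall but
# is noisier on low-quality output shows a high global correlation while the
# correlation inside the low-quality band is much weaker. This is NOT the
# paper's local dependency measure, only a motivating example.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
human = rng.uniform(0, 100, 1000)              # synthetic human scores (0-100)
noise = np.where(human < 40, 25.0, 5.0)        # assume metric is noisier on poor MT
metric = human + rng.normal(0, noise)          # synthetic metric scores

print("global        r =", round(pearsonr(human, metric)[0], 3))
for lo, hi in [(0, 40), (40, 70), (70, 100)]:  # crude quality bands (assumption)
    band = (human >= lo) & (human < hi)
    r = pearsonr(human[band], metric[band])[0]
    print(f"band {lo:>3}-{hi:<3}  r = {r:.3f}")
```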


Author(s):  
Paula Estrella ◽  
Andrei Popescu-Belis ◽  
Maghi King

A large number of evaluation metrics exist for machine translation (MT) systems, but depending on the intended context of use of such a system, not all metrics are equally relevant. Based on the ISO/IEC 9126 and 14598 standards for software evaluation, the Framework for the Evaluation of Machine Translation in ISLE (FEMTI) provides guidelines for the selection of quality characteristics to be evaluated depending on the expected task, users, and input characteristics of an MT system. This approach to contextual evaluation was implemented as a web-based application which helps its users design evaluation plans. In addition, FEMTI offers experts in evaluation the possibility to enter and share their knowledge using a dedicated web-based tool, tested in several evaluation exercises.
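
FEMTI itself is a web-based resource built on the ISO/IEC 9126 and 14598 quality models; the toy mapping below only illustrates its core idea of deriving evaluation characteristics from a declared context of use. The contexts and characteristic labels are invented for illustration and merely echo ISO/IEC 9126 terminology.

```python
# Toy illustration of FEMTI's core idea: selecting quality characteristics
# from a declared context of use. The mapping below is invented; FEMTI's
# actual guidelines are far richer and maintained as a web-based resource.
CONTEXT_TO_CHARACTERISTICS = {
    "gisting":     ["suitability (comprehensibility)", "efficiency (speed)"],
    "publication": ["accuracy (fidelity)", "suitability (fluency)", "terminology"],
    "dialogue":    ["efficiency (speed)", "reliability", "usability"],
}

def evaluation_plan(context_of_use: str) -> list[str]:
    """Return the quality characteristics to evaluate for a given task."""
    return CONTEXT_TO_CHARACTERISTICS.get(context_of_use, ["accuracy (fidelity)"])

print(evaluation_plan("publication"))
```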


2018 ◽  
Vol 34 (5) ◽  
pp. 3225-3233 ◽  
Author(s):  
Michal Munk ◽  
Dasa Munkova ◽  
Lubomir Benko

2021 ◽  
Vol 15 (1) ◽  
Author(s):  
Ch Ram Anirudh ◽  
Kavi Narayana Murthy

Machine-translated texts are often far from perfect, and post-editing is essential to achieve publishable quality. Post-editing may not always be a pleasant task. However, modern machine translation (MT) approaches like Statistical MT (SMT) and Neural MT (NMT) seem to hold greater promise. In this work, we present a quantitative method for scoring translations and computing the post-editability of MT system outputs. We show that the scores we get correlate well with MT evaluation metrics as well as with the actual time and effort required for post-editing. We compare the outputs of three modern MT systems, namely phrase-based SMT (PBMT), NMT, and Google Translate, for their post-editability for English-to-Hindi translation. Further, we explore the effect of various kinds of errors in MT outputs on post-editing time and effort. Including an Indian language in this kind of post-editability study and analyzing the influence of errors on post-editing time and effort for NMT are highlights of this work.
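
The paper's exact scoring method is not reproduced here; as a generic sketch, an HTER-style quantity (word-level edit distance between the raw MT output and its post-edited version, normalised by the post-edit length) is one kind of score that can be correlated with post-editing time. The segments and timings below are hypothetical.

```python
# Generic HTER-style post-editability proxy: word-level edit distance between
# the MT output and its post-edited version, normalised by post-edit length.
# Not the paper's method; only illustrates the kind of score one can
# correlate with post-editing time. All segments and timings are invented.
from scipy.stats import pearsonr

def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over word tokens."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def hter(mt_output: str, post_edit: str) -> float:
    mt, pe = mt_output.split(), post_edit.split()
    return word_edit_distance(mt, pe) / max(len(pe), 1)

# Hypothetical (MT output, post-edited output, post-editing time in seconds).
segments = [
    ("yah ek achha anuvaad hai", "yah ek achha anuvaad hai", 5),
    ("vah ghar gaya tha kal",    "vah kal ghar gaya tha",    18),
    ("main pasand khana",        "mujhe khana pasand hai",   40),
]
scores = [hter(mt, pe) for mt, pe, _ in segments]
times = [t for _, _, t in segments]
print("HTER-style scores:", [round(s, 2) for s in scores])
print("correlation with post-editing time:", round(pearsonr(scores, times)[0], 3))
```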

