Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

2021 ◽  
Vol 9 ◽  
pp. 1460-1474
Author(s):  
Markus Freitag ◽  
George Foster ◽  
David Grangier ◽  
Viresh Ratnakar ◽  
Qijun Tan ◽  
...  

Abstract Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
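To make the error-analysis-based scoring concrete, here is a minimal sketch of how segment- and system-level MQM-style scores can be aggregated from annotated errors. The severity weights below are illustrative assumptions for demonstration, not necessarily the weighting used in the study.

```python
# Illustrative severity weights; the exact MQM weighting scheme used in the
# study may differ (these values are assumptions for demonstration only).
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}

def mqm_segment_score(error_annotations):
    """Sum weighted error penalties for one translated segment.

    error_annotations: list of (category, severity) tuples from an annotator,
    e.g. [("accuracy/mistranslation", "major")]. Lower scores are better.
    """
    return sum(SEVERITY_WEIGHTS.get(sev, 0.0) for _, sev in error_annotations)

def mqm_system_score(segments):
    """Average per-segment penalty across a system's scored outputs."""
    scores = [mqm_segment_score(errs) for errs in segments]
    return sum(scores) / len(scores) if scores else 0.0

# Example: one segment with a major accuracy error and a minor fluency error,
# and one error-free segment.
print(mqm_system_score([[("accuracy/mistranslation", "major"),
                         ("fluency/grammar", "minor")], []]))  # -> 3.0
```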

2015 ◽  
Vol 23 (1) ◽  
pp. 3-30 ◽  
Author(s):  
YVETTE GRAHAM ◽  
TIMOTHY BALDWIN ◽  
ALISTAIR MOFFAT ◽  
JUSTIN ZOBEL

Abstract Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own individual assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of the Workshop on Statistical Machine Translation shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results for measurement based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. Comparison of results produced by the relative preference approach and the direct estimate method described here demonstrates that the direct estimate method has a substantially increased ability to identify significant differences between translation systems.
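A common step in direct-assessment pipelines of this kind is to standardize each worker's raw 0–100 scores before averaging per system. The sketch below shows that per-judge z-score standardization only; the paper's full procedure (consistency checks against a worker's own repeat items, significance testing) involves further steps that are omitted here.

```python
import statistics
from collections import defaultdict

def standardize_by_worker(assessments):
    """Convert raw 0-100 adequacy scores into per-worker z-scores and
    average them per system.

    assessments: list of (worker_id, system_id, raw_score) tuples.
    Returns a dict mapping system_id -> mean standardized score.
    """
    by_worker = defaultdict(list)
    for worker, _, score in assessments:
        by_worker[worker].append(score)
    # Per-worker mean and standard deviation (guard against zero spread).
    stats = {w: (statistics.mean(s), statistics.pstdev(s) or 1.0)
             for w, s in by_worker.items()}

    by_system = defaultdict(list)
    for worker, system, score in assessments:
        mean, sd = stats[worker]
        by_system[system].append((score - mean) / sd)
    return {system: statistics.mean(z) for system, z in by_system.items()}

print(standardize_by_worker([("w1", "sysA", 80), ("w1", "sysB", 60),
                             ("w2", "sysA", 55), ("w2", "sysB", 45)]))
# -> {'sysA': 1.0, 'sysB': -1.0}
```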


2019 ◽  
Vol 28 (3) ◽  
pp. 455-464 ◽  
Author(s):  
M. Anand Kumar ◽  
B. Premjith ◽  
Shivkaran Singh ◽  
S. Rajendran ◽  
K. P. Soman

Abstract In recent years, the amount of multilingual content on the internet has grown exponentially together with the evolution of the internet. Regional language users are largely excluded from this content because of the language barrier, so machine translation between languages is the only practical way to make it available to them. Machine translation is the process of translating a text from one language to another. Machine translation systems have already been investigated extensively for English and other European languages; however, they are still at a nascent stage for Indian languages. This paper presents an overview of the Machine Translation in Indian Languages shared task conducted on September 7–8, 2017, at Amrita Vishwa Vidyapeetham, Coimbatore, India. This machine translation shared task in Indian languages is mainly focused on the development of English-Tamil, English-Hindi, English-Malayalam and English-Punjabi language pairs. The shared task aims at the following objectives: (a) to examine state-of-the-art machine translation systems when translating from English to Indian languages; (b) to investigate the challenges faced in translating between English and Indian languages; (c) to create an open-source parallel corpus for Indian languages, which is currently lacking. Evaluating machine translation output is another challenging task, especially for Indian languages. In this shared task, we have evaluated the participants' outputs with the help of human annotators. As far as we know, this is the first shared task which depends completely on human evaluation.


2019 ◽  
Vol 26 (12) ◽  
pp. 1478-1487 ◽  
Author(s):  
Xabier Soto ◽  
Olatz Perez-de-Viñaspre ◽  
Gorka Labaka ◽  
Maite Oronoz

Abstract
Objective: To analyze techniques for machine translation of electronic health records (EHRs) between long-distance languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts or health records in Basque and Spanish.
Materials and Methods: We trained recurrent neural networks on an out-of-domain corpus with different hyperparameter values. Subsequently, we used the optimal configuration to evaluate machine translation of EHR templates between Basque and Spanish, using manual translations of the Basque templates into Spanish as a standard. We successively added to the training corpus clinical resources, including a Spanish-Basque dictionary derived from resources built for the machine translation of the Spanish edition of SNOMED CT into Basque, artificial sentences in Spanish and Basque derived from frequently occurring relationships in SNOMED CT, and Spanish monolingual EHRs. Apart from calculating bilingual evaluation understudy (BLEU) values, we tested the performance in the clinical domain by human evaluation.
Results: We achieved slight improvements over our reference system by tuning some hyperparameters on an out-of-domain bilingual corpus, obtaining 10.67 BLEU points for Basque-to-Spanish clinical domain translation. The inclusion of clinical terminology in Spanish and Basque and the application of the back-translation technique on monolingual EHRs significantly improved performance, obtaining 21.59 BLEU points. This was confirmed by the human evaluation performed by 2 clinicians, who ranked our machine translations close to the human translations.
Discussion: We showed that, even after optimizing the hyperparameters out of domain, the inclusion of available resources from the clinical domain and the applied methods were beneficial for the described objective, yielding adequate translations of EHR templates.
Conclusion: We have developed a system which is able to properly translate health record templates from Basque to Spanish without making use of any bilingual corpus of clinical texts or health records.
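The back-translation step described above can be summarized in a short sketch: monolingual Spanish EHR sentences are machine-translated into Basque by a reverse-direction model, and the synthetic Basque is paired with the authentic Spanish to augment the Basque-to-Spanish training data. The function `translate_es_to_eu` below is a hypothetical placeholder for such a reverse model, not part of the authors' system.

```python
def back_translate(monolingual_spanish, translate_es_to_eu):
    """Create synthetic (Basque, Spanish) pairs from Spanish-only EHR sentences.

    translate_es_to_eu: a hypothetical Spanish->Basque translation function
    standing in for the reverse-direction model.
    """
    synthetic_pairs = []
    for es_sentence in monolingual_spanish:
        eu_synthetic = translate_es_to_eu(es_sentence)       # machine-generated source side
        synthetic_pairs.append((eu_synthetic, es_sentence))  # authentic target side
    return synthetic_pairs

def augment_training_data(parallel_corpus, monolingual_spanish, translate_es_to_eu):
    """Mix authentic bilingual data with back-translated pairs for eu->es training."""
    return parallel_corpus + back_translate(monolingual_spanish, translate_es_to_eu)
```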


2009 ◽  
Vol 35 (4) ◽  
pp. 559-595 ◽  
Author(s):  
Liang Huang ◽  
Hao Zhang ◽  
Daniel Gildea ◽  
Kevin Knight

Systems based on synchronous grammars and tree transducers promise to improve the quality of statistical machine translation output, but are often very computationally intensive. The complexity is exponential in the size of individual grammar rules due to arbitrary re-orderings between the two languages. We develop a theory of binarization for synchronous context-free grammars and present a linear-time algorithm for binarizing synchronous rules when possible. In our large-scale experiments, we found that almost all rules are binarizable and the resulting binarized rule set significantly improves the speed and accuracy of a state-of-the-art syntax-based machine translation system. We also discuss the more general, and computationally more difficult, problem of finding good parsing strategies for non-binarizable rules, and present an approximate polynomial-time algorithm for this problem.
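As a rough illustration of what binarizability means for a synchronous rule, the sketch below tests whether the permutation of nonterminals between the two languages can be reduced to a single span by repeatedly merging source-adjacent spans whose target positions are also contiguous. This is a simple shift-reduce illustration under that formulation, not the paper's algorithm, which develops the full linear-time treatment for synchronous rules.

```python
def is_binarizable(permutation):
    """Test whether a rule's nonterminal reordering admits binarization.

    permutation[i] is the target-side position of the i-th source-side
    nonterminal. The rule is treated as binarizable iff the permutation
    reduces to one span by merging adjacent, target-contiguous spans.
    """
    stack = []  # each entry is a (min, max) target span already covered
    for pos in permutation:
        stack.append((pos, pos))
        # Greedily merge the top two spans while their union is contiguous.
        while len(stack) >= 2:
            lo2, hi2 = stack[-1]
            lo1, hi1 = stack[-2]
            merged_size = max(hi1, hi2) - min(lo1, lo2) + 1
            if merged_size == (hi1 - lo1 + 1) + (hi2 - lo2 + 1):
                stack.pop(); stack.pop()
                stack.append((min(lo1, lo2), max(hi1, hi2)))
            else:
                break
    return len(stack) == 1

print(is_binarizable([1, 0, 3, 2]))  # True: binarizable reordering
print(is_binarizable([1, 3, 0, 2]))  # False: the classic non-binarizable pattern
```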


2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Jing Ning ◽  
Haidong Ban

With the development of linguistics and the improvement of computer performance, machine translation quality has improved steadily and machine translation is now widely used. Phrase-based Chinese-English machine translation takes phrases as the basic translation unit and makes full use of phrase order; compared with word-based statistical machine translation methods, this greatly improves translation quality, and the performance of machine translation continues to improve. This article studies the design of phrase-based automatic machine translation systems by introducing machine translation methods and Chinese-English phrase translation, explores the design and testing of a machine translation system based on Chinese-English phrase combinations, and explains the role of automatic machine translation in promoting the development of translation. By combining machine translation experiments with system design methods, the design and testing of automatic machine translation systems based on Chinese-English phrase combinations are studied, which helps deepen our understanding of language, knowledge, and intelligence, supports other language processing tasks, and promotes the development of corpus linguistics. The experimental results in this article show that when the Chinese-English phrase translation probability table is changed from 82% to 51%, the BLEU evaluation score of the combined Chinese-English phrase system improves. Automatic machine translation saves time and effort in translation work, and its advantages include a short development cycle and easy processing of large-scale corpora.
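To illustrate the basic idea of translating with phrases as the unit, here is a toy greedy longest-match lookup over a phrase table. The table entries are hypothetical and the method is deliberately simplified; real phrase-based systems score many segmentations, reorderings and language-model features rather than taking the first match.

```python
PHRASE_TABLE = {  # hypothetical entries for illustration only
    ("wo", "men"): "we",
    ("xi", "huan"): "like",
    ("ji", "qi", "fan", "yi"): "machine translation",
}

def greedy_phrase_translate(source_tokens, phrase_table, max_len=4):
    """Translate by matching the longest known source phrase at each position."""
    output = []
    i = 0
    while i < len(source_tokens):
        for length in range(min(max_len, len(source_tokens) - i), 0, -1):
            phrase = tuple(source_tokens[i:i + length])
            if phrase in phrase_table:
                output.append(phrase_table[phrase])
                i += length
                break
        else:
            output.append(source_tokens[i])  # copy unknown tokens through
            i += 1
    return " ".join(output)

print(greedy_phrase_translate(
    ["wo", "men", "xi", "huan", "ji", "qi", "fan", "yi"], PHRASE_TABLE))
# -> "we like machine translation"
```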


Author(s):  
Anna Fernández Torné ◽  
Anna Matamala

This article aims to compare three machine translation systems with a focus on human evaluation. The systems under analysis are a domain-adapted statistical machine translation system, a domain-adapted neural machine translation system and a generic machine translation system. The comparison is carried out on translation from Spanish into German with industrial documentation of machine tool components and processes. The focus is on the human evaluation of the machine translation output, specifically on: fluency, adequacy and ranking at the segment level; fluency, adequacy, need for post-editing, ease of post-editing, and mental effort required in post-editing at the document level; productivity (post-editing speed and post-editing effort) and attitudes. Emphasis is placed on human factors in the evaluation process.


2013 ◽  
Vol 99 (1) ◽  
pp. 17-38
Author(s):  
Matthias Huck ◽  
Erik Scharwächter ◽  
Hermann Ney

Abstract Standard phrase-based statistical machine translation systems generate translations based on an inventory of continuous bilingual phrases. In this work, we extend a phrase-based decoder with the ability to make use of phrases that are discontinuous in the source part. Our dynamic programming beam search algorithm supports separate pruning of coverage hypotheses per cardinality and of lexical hypotheses per coverage, as well as coverage constraints that impose restrictions on the possible reorderings. In addition to investigating these aspects, which are related to the decoding procedure, we also concentrate our attention on the question of how to obtain source-side discontinuous phrases from parallel training data. Two approaches (hierarchical and discontinuous extraction) are presented and compared. On a large-scale Chinese→English translation task, we conduct a thorough empirical evaluation in order to study a number of system configurations with source-side discontinuous phrases, and to compare them to setups which employ continuous phrases only.
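The pruning organization described above can be pictured as a two-level grouping of hypotheses: first by how many source positions are covered (cardinality), then by the exact coverage bitmask, with separate limits at each level. The sketch below shows only that structural idea; scoring, phrase expansion, and the coverage constraints are stubbed out, and the limits are arbitrary illustrative values, so this is not the authors' decoder.

```python
from collections import defaultdict
import heapq

def prune_stacks(hypotheses, max_coverages_per_card=20, max_hyps_per_coverage=10):
    """hypotheses: list of (score, coverage_bitmask, state) tuples.

    Coverage hypotheses are pruned per cardinality, and lexical hypotheses
    are pruned per coverage vector, mirroring the two-level scheme above.
    """
    by_cardinality = defaultdict(lambda: defaultdict(list))
    for score, coverage, state in hypotheses:
        cardinality = bin(coverage).count("1")  # number of covered source words
        by_cardinality[cardinality][coverage].append((score, coverage, state))

    pruned = []
    for cardinality, coverages in by_cardinality.items():
        # Keep only the best-scoring coverage vectors for this cardinality ...
        best_coverages = heapq.nlargest(
            max_coverages_per_card, coverages.items(),
            key=lambda item: max(h[0] for h in item[1]))
        for _, hyps in best_coverages:
            # ... and only the best lexical hypotheses per coverage vector.
            pruned.extend(heapq.nlargest(max_hyps_per_coverage, hyps,
                                         key=lambda h: h[0]))
    return pruned
```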


2012 ◽  
Vol 97 (1) ◽  
pp. 43-53
Author(s):  
Patrik Lambert ◽  
Rafael Banchs

BIA: a Discriminative Phrase Alignment Toolkit
In most statistical machine translation systems, bilingual segments are extracted via word alignment. However, word alignment is performed independently from the requirements of the machine translation task. Furthermore, although phrase-based translation models replaced word-based translation models nearly ten years ago, word-based models are still widely used for word alignment. In this paper we present the BIA (BIlingual Aligner) toolkit, a suite consisting of a discriminative phrase-based word alignment decoder based on linear alignment models, along with training and tuning tools. In the training phase, relative link probabilities are calculated based on an initial alignment. The tuning of the model weights may be performed directly according to machine translation metrics. We give implementation details and report results of experiments conducted on the Spanish-English Europarl task (with three corpus sizes), on the Chinese-English FBIS task, and on the Chinese-English BTEC task. The BLEU score obtained with BIA alignment is always as good as or better than the one obtained with the initial alignment used to train the BIA models. In addition, in four out of the five tasks, the BIA toolkit yields the best BLEU score of a collection of ten alignment systems. Finally, usage guidelines are presented.
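In the spirit of the linear alignment models mentioned above, a candidate link set can be scored as a weighted sum of feature functions, with the weights left free to be tuned against a downstream translation metric. The feature definitions and weights below are illustrative assumptions for a minimal sketch, not BIA's actual feature set.

```python
def score_alignment(links, features, weights):
    """links: set of (source_index, target_index) pairs.
    features: functions mapping a link set to a real value.
    weights: one weight per feature, e.g. tuned toward BLEU."""
    return sum(w * f(links) for w, f in zip(weights, features))

def link_probability_feature(link_probs):
    """Sum of relative link probabilities estimated from an initial alignment."""
    return lambda links: sum(link_probs.get(link, 0.0) for link in links)

def distortion_feature(links):
    """Penalize links that diverge from the monotone diagonal."""
    return -sum(abs(i - j) for i, j in links)

link_probs = {(0, 0): 0.9, (1, 2): 0.7, (2, 1): 0.6}  # toy initial estimates
features = [link_probability_feature(link_probs), distortion_feature]
print(score_alignment({(0, 0), (1, 2), (2, 1)}, features, [1.0, 0.1]))  # -> 2.0
```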


2018 ◽  
Vol 32 (3) ◽  
pp. 195-215 ◽  
Author(s):  
Filip Klubička ◽  
Antonio Toral ◽  
Víctor M. Sánchez-Cartagena

2020 ◽  
Vol 67 ◽  
Author(s):  
Samuel Läubli ◽  
Sheila Castilho ◽  
Graham Neubig ◽  
Rico Sennrich ◽  
Qinlan Shen ◽  
...  

The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese-to-English news translation, showing that the finding of human–machine parity was due to weaknesses in the evaluation design, which is currently considered best practice in the field. We show that the professional human translations contained significantly fewer errors, and that perceived quality in human evaluation depends on the choice of raters, the availability of linguistic context, and the creation of reference translations. Our results call for revisiting current best practices to assess strong machine translation systems in general and human–machine parity in particular, for which we offer a set of recommendations based on our empirical findings.

