Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian

2018 ◽  
Vol 32 (3) ◽  
pp. 195-215 ◽  
Author(s):  
Filip Klubička ◽  
Antonio Toral ◽  
Víctor M. Sánchez-Cartagena

2019 ◽ 
Vol 28 (3) ◽  
pp. 455-464 ◽  
Author(s):  
M. Anand Kumar ◽  
B. Premjith ◽  
Shivkaran Singh ◽  
S. Rajendran ◽  
K. P. Soman

Abstract In recent years, multilingual content on the internet has grown exponentially. Regional language users are often excluded from this content because of the language barrier, so machine translation, the process of translating text from one language to another, is the only practical way to make it available to them. Machine translation has been investigated thoroughly for English and other European languages, but it is still at a nascent stage for Indian languages. This paper presents an overview of the Machine Translation in Indian Languages shared task conducted on September 7–8, 2017, at Amrita Vishwa Vidyapeetham, Coimbatore, India. The shared task focused on the development of systems for the English-Tamil, English-Hindi, English-Malayalam and English-Punjabi language pairs, with the following objectives: (a) to examine state-of-the-art machine translation systems when translating from English to Indian languages; (b) to investigate the challenges faced in translating from English to Indian languages; and (c) to create an open-source parallel corpus for Indian languages, which is currently lacking. Evaluating machine translation output is itself a challenging task, especially for Indian languages. In this shared task, we evaluated the participants' outputs with the help of human annotators. As far as we know, this is the first shared task that relies entirely on human evaluation.
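
To make the aggregation step concrete, here is a minimal sketch, assuming hypothetical column names and a 1-5 rating scale, of how annotator judgements could be averaged into a per-system leaderboard for each language pair; it is an illustration, not the shared task's actual scoring pipeline.

```python
# Minimal sketch: averaging human ratings per system and language pair.
# Column names, systems and the 1-5 scale are illustrative assumptions.
import pandas as pd

# Hypothetical annotation records: one row per (annotator, segment) judgement.
ratings = pd.DataFrame([
    {"lang_pair": "en-ta", "system": "team_A", "annotator": "a1", "adequacy": 4, "fluency": 3},
    {"lang_pair": "en-ta", "system": "team_B", "annotator": "a1", "adequacy": 3, "fluency": 4},
    {"lang_pair": "en-hi", "system": "team_A", "annotator": "a2", "adequacy": 5, "fluency": 4},
    {"lang_pair": "en-hi", "system": "team_B", "annotator": "a2", "adequacy": 2, "fluency": 3},
])

# Mean adequacy and fluency per language pair and system, used to rank systems.
leaderboard = (
    ratings.groupby(["lang_pair", "system"])[["adequacy", "fluency"]]
    .mean()
    .reset_index()
    .sort_values(["lang_pair", "adequacy"], ascending=[True, False])
)
print(leaderboard)
```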


2021 ◽  
Vol 9 ◽  
pp. 1460-1474
Author(s):  
Markus Freitag ◽  
George Foster ◽  
David Grangier ◽  
Viresh Ratnakar ◽  
Qijun Tan ◽  
...  

Abstract Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
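
As an illustration of MQM-style scoring, the sketch below converts explicit error annotations into an average penalty per segment. The severity weights (minor = 1, major = 5) follow one common convention and are an assumption here; the study's exact weighting scheme may differ.

```python
# Minimal sketch of MQM-style scoring from explicit error annotations.
# Severity weights and the annotation records are illustrative assumptions.
from collections import defaultdict

# Hypothetical annotations: (system, segment_id, severity) per marked error.
annotations = [
    ("sys1", 0, "minor"), ("sys1", 0, "major"),
    ("sys1", 1, "minor"),
    ("sys2", 0, "major"), ("sys2", 1, "major"), ("sys2", 1, "minor"),
]
SEVERITY_WEIGHT = {"minor": 1.0, "major": 5.0}
NUM_SEGMENTS = 2  # segments evaluated per system

penalties = defaultdict(float)
for system, _segment, severity in annotations:
    penalties[system] += SEVERITY_WEIGHT[severity]

# Average penalty per evaluated segment; lower is better.
for system, total in sorted(penalties.items()):
    print(system, total / NUM_SEGMENTS)
```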


2017 ◽  
Author(s):  
Iacer Calixto ◽  
Daniel Stein ◽  
Evgeny Matusov ◽  
Sheila Castilho ◽  
Andy Way

2019 ◽  
Vol 26 (12) ◽  
pp. 1478-1487 ◽  
Author(s):  
Xabier Soto ◽  
Olatz Perez-de-Viñaspre ◽  
Gorka Labaka ◽  
Maite Oronoz

Abstract Objective: To analyze techniques for machine translation of electronic health records (EHRs) between long distance languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts or health records in Basque and Spanish. Materials and Methods: We trained recurrent neural networks on an out-of-domain corpus with different hyperparameter values. Subsequently, we used the optimal configuration to evaluate machine translation of EHR templates between Basque and Spanish, using manual translations of the Basque templates into Spanish as a standard. We successively added to the training corpus clinical resources, including a Spanish-Basque dictionary derived from resources built for the machine translation of the Spanish edition of SNOMED CT into Basque, artificial sentences in Spanish and Basque derived from frequently occurring relationships in SNOMED CT, and Spanish monolingual EHRs. Apart from calculating bilingual evaluation understudy (BLEU) values, we tested the performance in the clinical domain by human evaluation. Results: We achieved slight improvements from our reference system by tuning some hyperparameters using an out-of-domain bilingual corpus, obtaining 10.67 BLEU points for Basque-to-Spanish clinical domain translation. The inclusion of clinical terminology in Spanish and Basque and the application of the back-translation technique on monolingual EHRs significantly improved the performance, obtaining 21.59 BLEU points. This was confirmed by the human evaluation performed by 2 clinicians, ranking our machine translations close to the human translations. Discussion: We showed that, even after optimizing the hyperparameters out-of-domain, the inclusion of available resources from the clinical domain and applied methods were beneficial for the described objective, managing to obtain adequate translations of EHR templates. Conclusion: We have developed a system which is able to properly translate health record templates from Basque to Spanish without making use of any bilingual corpus of clinical texts or health records.
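
Corpus-level BLEU scores of the kind reported above can be computed with the sacrebleu library; the sketch below uses invented placeholder sentences rather than real EHR text, and omits the back-translation step, which would require a trained reverse model.

```python
# Minimal sketch of corpus-level BLEU scoring with sacrebleu.
# The example sentences are invented placeholders, not clinical data.
import sacrebleu

hypotheses = [
    "the patient was discharged without complications",
    "treatment with antibiotics was started",
]
references = [
    "the patient was discharged with no complications",
    "antibiotic treatment was started",
]

# corpus_bleu takes the system outputs and a list of reference sets.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```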


Author(s):  
Anna Fernández Torné ◽  
Anna Matamala

This article aims to compare three machine translation systems with a focus on human evaluation. The systems under analysis are a domain-adapted statistical machine translation system, a domain-adapted neural machine translation system and a generic machine translation system. The comparison is carried out on translation from Spanish into German with industrial documentation of machine tool components and processes. The focus is on the human evaluation of the machine translation output, specifically on: fluency, adequacy and ranking at the segment level; fluency, adequacy, need for post-editing, ease of post-editing, and mental effort required in post-editing at the document level; productivity (post-editing speed and post-editing effort) and attitudes. Emphasis is placed on human factors in the evaluation process.
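
Two of the measures listed above, post-editing speed and post-editing effort, can be approximated as in the sketch below; the numbers and the use of a character-level similarity ratio as an effort proxy are illustrative assumptions, not the study's actual instrumentation.

```python
# Minimal sketch of post-editing productivity measures.
# Inputs and the difflib-based effort proxy are illustrative assumptions.
from difflib import SequenceMatcher

def postediting_speed(num_source_words: int, seconds_spent: float) -> float:
    """Source words post-edited per minute."""
    return num_source_words / (seconds_spent / 60.0)

def edit_effort(mt_output: str, post_edited: str) -> float:
    """Rough effort proxy: 1 - similarity between MT output and its post-edit."""
    return 1.0 - SequenceMatcher(None, mt_output, post_edited).ratio()

mt = "Die Maschine Werkzeug Komponente ist installiert."
pe = "Die Werkzeugmaschinenkomponente ist installiert."
print(postediting_speed(num_source_words=7, seconds_spent=95))  # words per minute
print(edit_effort(mt, pe))  # 0 = unchanged, 1 = completely rewritten
```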


2019 ◽  
Vol 55 (2) ◽  
pp. 491-515 ◽  
Author(s):  
Krzysztof Jassem ◽  
Tomasz Dwojak

Abstract Neural Machine Translation (NMT) has recently achieved promising results for a number of language pairs. Although the method requires larger volumes of data and more computational power than Statistical Machine Translation (SMT), it is believed to become dominant in the near future. In this paper we evaluate SMT and NMT models trained on a domain-specific English-Polish corpus of moderate size (1,200,000 segments). The experiment shows that both solutions significantly outperform a general-domain online translator. The SMT model achieves a slightly better BLEU score than the NMT model; on the other hand, decoding is noticeably faster in NMT. Human evaluation carried out on a sizeable sample of translations (2,000 pairs) reveals the superiority of the NMT approach, particularly in terms of output fluency.
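
A pairwise human evaluation of this kind can be checked for statistical significance with a simple sign test, as sketched below; the preference counts are invented for illustration and are not the paper's reported figures.

```python
# Minimal sketch of a sign test over pairwise human preferences.
# The counts below are hypothetical, not the study's actual results.
from scipy.stats import binomtest

nmt_preferred = 1150   # hypothetical: segments where annotators preferred NMT
smt_preferred = 850    # hypothetical: segments where annotators preferred SMT
decisive = nmt_preferred + smt_preferred  # ties excluded

# Null hypothesis: both systems are preferred equally often (p = 0.5).
result = binomtest(nmt_preferred, decisive, p=0.5, alternative="greater")
print(f"NMT preferred in {nmt_preferred}/{decisive} decisive comparisons, "
      f"p = {result.pvalue:.2e}")
```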


2015 ◽  
Vol 23 (1) ◽  
pp. 3-30 ◽  
Author(s):  
YVETTE GRAHAM ◽  
TIMOTHY BALDWIN ◽  
ALISTAIR MOFFAT ◽  
JUSTIN ZOBEL

Abstract Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of the Workshop on Statistical Machine Translation (WMT) shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results for measurement based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. Comparison of results produced by the relative preference approach and the direct estimate method described here demonstrates that the direct estimate method has a substantially increased ability to identify significant differences between translation systems.
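
The core of the quality-control idea, standardising each worker's raw scores against that worker's own mean and spread before averaging per system, can be sketched as follows; the data frame is an invented example, not the evaluation's actual data.

```python
# Minimal sketch of per-worker z-score standardisation of direct assessments.
# Worker IDs, systems and raw 0-100 scores are invented for illustration.
import pandas as pd

scores = pd.DataFrame([
    {"worker": "w1", "system": "sysA", "raw": 78},
    {"worker": "w1", "system": "sysB", "raw": 62},
    {"worker": "w2", "system": "sysA", "raw": 95},
    {"worker": "w2", "system": "sysB", "raw": 88},
    {"worker": "w2", "system": "sysA", "raw": 91},
])

# Standardise within each worker to remove individual scoring-scale differences.
scores["z"] = scores.groupby("worker")["raw"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)

# Average standardised score per system gives the system-level ranking.
print(scores.groupby("system")["z"].mean().sort_values(ascending=False))
```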


2017 ◽  
Vol 108 (1) ◽  
pp. 121-132 ◽  
Author(s):  
Filip Klubička ◽  
Antonio Toral ◽  
Víctor M. Sánchez-Cartagena

Abstract We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems’ outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and results show that the best performing system (neural) reduces the errors produced by the worst system (phrase-based) by 54%.
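
Inter-annotator agreement on error labels is commonly measured with Cohen's kappa, and the reported 54% figure is a relative error reduction; the sketch below illustrates both calculations on invented label sequences and error counts.

```python
# Minimal sketch: Cohen's kappa between two annotators and a relative
# error-reduction calculation. All labels and counts are invented.
from sklearn.metrics import cohen_kappa_score

# Error-type labels assigned by two annotators to the same ten error spans.
annotator_1 = ["mistranslation", "omission", "grammar", "grammar", "omission",
               "mistranslation", "grammar", "omission", "grammar", "grammar"]
annotator_2 = ["mistranslation", "omission", "grammar", "omission", "omission",
               "mistranslation", "grammar", "omission", "grammar", "mistranslation"]
print(f"kappa = {cohen_kappa_score(annotator_1, annotator_2):.2f}")

# Relative error reduction between the worst and best system.
errors_phrase_based, errors_neural = 520, 239   # hypothetical error counts
reduction = 1 - errors_neural / errors_phrase_based
print(f"error reduction = {reduction:.0%}")     # roughly the 54% figure reported
```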


2020 ◽  
Vol 67 ◽  
Author(s):  
Samuel Läubli ◽  
Sheila Castilho ◽  
Graham Neubig ◽  
Rico Sennrich ◽  
Qinlan Shen ◽  
...  

The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese to English news translation, showing that the finding of human–machine parity was owed to weaknesses in the evaluation design—which is currently considered best practice in the field. We show that the professional human translations contained significantly fewer errors, and that perceived quality in human evaluation depends on the choice of raters, the availability of linguistic context, and the creation of reference translations. Our results call for revisiting current best practices to assess strong machine translation systems in general and human–machine parity in particular, for which we offer a set of recommendations based on our empirical findings.  

