Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian

2018 ◽  
Vol 32 (3) ◽  
pp. 195-215 ◽  
Author(s):  
Filip Klubička ◽  
Antonio Toral ◽  
Víctor M. Sánchez-Cartagena

2019 ◽ 
Vol 28 (3) ◽  
pp. 455-464 ◽  
Author(s):  
M. Anand Kumar ◽  
B. Premjith ◽  
Shivkaran Singh ◽  
S. Rajendran ◽  
K. P. Soman

Abstract In recent years, multilingual content on the internet has grown exponentially. Regional language users are often excluded from this content because of the language barrier, so machine translation, the process of translating text from one language to another, is the only practical way to make it available to them. Machine translation has been investigated thoroughly for English and other European languages, but it is still at a nascent stage for Indian languages. This paper presents an overview of the Machine Translation in Indian Languages shared task conducted on September 7–8, 2017, at Amrita Vishwa Vidyapeetham, Coimbatore, India. The shared task focused on the development of systems for the English-Tamil, English-Hindi, English-Malayalam and English-Punjabi language pairs, with the following objectives: (a) to examine state-of-the-art machine translation systems when translating from English to Indian languages; (b) to investigate the challenges faced in translating from English to Indian languages; and (c) to create an open-source parallel corpus for Indian languages, which is currently lacking. Evaluating machine translation output is itself a challenging task, especially for Indian languages. In this shared task, we evaluated the participants' outputs with the help of human annotators. As far as we know, this is the first shared task that relies entirely on human evaluation.
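
To make the aggregation step concrete, here is a minimal sketch, assuming hypothetical column names and a 1-5 rating scale, of how annotator judgements could be averaged into a per-system leaderboard for each language pair; it is an illustration, not the shared task's actual scoring pipeline.

```python
# Minimal sketch: averaging human ratings per system and language pair.
# Column names, systems and the 1-5 scale are illustrative assumptions.
import pandas as pd

# Hypothetical annotation records: one row per (annotator, segment) judgement.
ratings = pd.DataFrame([
    {"lang_pair": "en-ta", "system": "team_A", "annotator": "a1", "adequacy": 4, "fluency": 3},
    {"lang_pair": "en-ta", "system": "team_B", "annotator": "a1", "adequacy": 3, "fluency": 4},
    {"lang_pair": "en-hi", "system": "team_A", "annotator": "a2", "adequacy": 5, "fluency": 4},
    {"lang_pair": "en-hi", "system": "team_B", "annotator": "a2", "adequacy": 2, "fluency": 3},
])

# Mean adequacy and fluency per language pair and system, used to rank systems.
leaderboard = (
    ratings.groupby(["lang_pair", "system"])[["adequacy", "fluency"]]
    .mean()
    .reset_index()
    .sort_values(["lang_pair", "adequacy"], ascending=[True, False])
)
print(leaderboard)
```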


2021 ◽  
Vol 9 ◽  
pp. 1460-1474
Author(s):  
Markus Freitag ◽  
George Foster ◽  
David Grangier ◽  
Viresh Ratnakar ◽  
Qijun Tan ◽  
...  

Abstract Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
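
As an illustration of MQM-style scoring, the sketch below converts explicit error annotations into an average penalty per segment. The severity weights (minor = 1, major = 5) follow one common convention and are an assumption here; the study's exact weighting scheme may differ.

```python
# Minimal sketch of MQM-style scoring from explicit error annotations.
# Severity weights and the annotation records are illustrative assumptions.
from collections import defaultdict

# Hypothetical annotations: (system, segment_id, severity) per marked error.
annotations = [
    ("sys1", 0, "minor"), ("sys1", 0, "major"),
    ("sys1", 1, "minor"),
    ("sys2", 0, "major"), ("sys2", 1, "major"), ("sys2", 1, "minor"),
]
SEVERITY_WEIGHT = {"minor": 1.0, "major": 5.0}
NUM_SEGMENTS = 2  # segments evaluated per system

penalties = defaultdict(float)
for system, _segment, severity in annotations:
    penalties[system] += SEVERITY_WEIGHT[severity]

# Average penalty per evaluated segment; lower is better.
for system, total in sorted(penalties.items()):
    print(system, total / NUM_SEGMENTS)
```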


2017 ◽  
Author(s):  
Iacer Calixto ◽  
Daniel Stein ◽  
Evgeny Matusov ◽  
Sheila Castilho ◽  
Andy Way

2019 ◽  
Vol 26 (12) ◽  
pp. 1478-1487 ◽  
Author(s):  
Xabier Soto ◽  
Olatz Perez-de-Viñaspre ◽  
Gorka Labaka ◽  
Maite Oronoz

Abstract Objective: To analyze techniques for machine translation of electronic health records (EHRs) between long distance languages, using Basque and Spanish as a reference. We studied distinct configurations of neural machine translation systems and used different methods to overcome the lack of a bilingual corpus of clinical texts or health records in Basque and Spanish. Materials and Methods: We trained recurrent neural networks on an out-of-domain corpus with different hyperparameter values. Subsequently, we used the optimal configuration to evaluate machine translation of EHR templates between Basque and Spanish, using manual translations of the Basque templates into Spanish as a standard. We successively added to the training corpus clinical resources, including a Spanish-Basque dictionary derived from resources built for the machine translation of the Spanish edition of SNOMED CT into Basque, artificial sentences in Spanish and Basque derived from frequently occurring relationships in SNOMED CT, and Spanish monolingual EHRs. Apart from calculating bilingual evaluation understudy (BLEU) values, we tested the performance in the clinical domain by human evaluation. Results: We achieved slight improvements from our reference system by tuning some hyperparameters using an out-of-domain bilingual corpus, obtaining 10.67 BLEU points for Basque-to-Spanish clinical domain translation. The inclusion of clinical terminology in Spanish and Basque and the application of the back-translation technique on monolingual EHRs significantly improved the performance, obtaining 21.59 BLEU points. This was confirmed by the human evaluation performed by 2 clinicians, ranking our machine translations close to the human translations. Discussion: We showed that, even after optimizing the hyperparameters out-of-domain, the inclusion of available resources from the clinical domain and applied methods were beneficial for the described objective, managing to obtain adequate translations of EHR templates. Conclusion: We have developed a system which is able to properly translate health record templates from Basque to Spanish without making use of any bilingual corpus of clinical texts or health records.
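
Corpus-level BLEU scores of the kind reported above can be computed with the sacrebleu library; the sketch below uses invented placeholder sentences rather than real EHR text, and omits the back-translation step, which would require a trained reverse model.

```python
# Minimal sketch of corpus-level BLEU scoring with sacrebleu.
# The example sentences are invented placeholders, not clinical data.
import sacrebleu

hypotheses = [
    "the patient was discharged without complications",
    "treatment with antibiotics was started",
]
references = [
    "the patient was discharged with no complications",
    "antibiotic treatment was started",
]

# corpus_bleu takes the system outputs and a list of reference sets.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```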


Author(s):  
Anna Fernández Torné ◽  
Anna Matamala

This article aims to compare three machine translation systems with a focus on human evaluation. The systems under analysis are a domain-adapted statistical machine translation system, a domain-adapted neural machine translation system and a generic machine translation system. The comparison is carried out on translation from Spanish into German with industrial documentation of machine tool components and processes. The focus is on the human evaluation of the machine translation output, specifically on: fluency, adequacy and ranking at the segment level; fluency, adequacy, need for post-editing, ease of post-editing, and mental effort required in post-editing at the document level; productivity (post-editing speed and post-editing effort) and attitudes. Emphasis is placed on human factors in the evaluation process.
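
Two of the measures listed above, post-editing speed and post-editing effort, can be approximated as in the sketch below; the numbers and the use of a character-level similarity ratio as an effort proxy are illustrative assumptions, not the study's actual instrumentation.

```python
# Minimal sketch of post-editing productivity measures.
# Inputs and the difflib-based effort proxy are illustrative assumptions.
from difflib import SequenceMatcher

def postediting_speed(num_source_words: int, seconds_spent: float) -> float:
    """Source words post-edited per minute."""
    return num_source_words / (seconds_spent / 60.0)

def edit_effort(mt_output: str, post_edited: str) -> float:
    """Rough effort proxy: 1 - similarity between MT output and its post-edit."""
    return 1.0 - SequenceMatcher(None, mt_output, post_edited).ratio()

mt = "Die Maschine Werkzeug Komponente ist installiert."
pe = "Die Werkzeugmaschinenkomponente ist installiert."
print(postediting_speed(num_source_words=7, seconds_spent=95))  # words per minute
print(edit_effort(mt, pe))  # 0 = unchanged, 1 = completely rewritten
```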


2019 ◽  
Vol 55 (2) ◽  
pp. 491-515 ◽  
Author(s):  
Krzysztof Jassem ◽  
Tomasz Dwojak

Abstract Neural Machine Translation (NMT) has recently achieved promising results for a number of language pairs. Although the method requires larger volumes of data and more computational power than Statistical Machine Translation (SMT), it is believed to become dominant in the near future. In this paper we evaluate SMT and NMT models trained on a domain-specific English-Polish corpus of moderate size (1,200,000 segments). The experiment shows that both solutions significantly outperform a general-domain online translator. The SMT model achieves a slightly better BLEU score than the NMT model; on the other hand, decoding is noticeably faster in NMT. Human evaluation carried out on a sizeable sample of translations (2,000 pairs) reveals the superiority of the NMT approach, particularly in terms of output fluency.
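
A pairwise human evaluation of this kind can be checked for statistical significance with a simple sign test, as sketched below; the preference counts are invented for illustration and are not the paper's reported figures.

```python
# Minimal sketch of a sign test over pairwise human preferences.
# The counts below are hypothetical, not the study's actual results.
from scipy.stats import binomtest

nmt_preferred = 1150   # hypothetical: segments where annotators preferred NMT
smt_preferred = 850    # hypothetical: segments where annotators preferred SMT
decisive = nmt_preferred + smt_preferred  # ties excluded

# Null hypothesis: both systems are preferred equally often (p = 0.5).
result = binomtest(nmt_preferred, decisive, p=0.5, alternative="greater")
print(f"NMT preferred in {nmt_preferred}/{decisive} decisive comparisons, "
      f"p = {result.pvalue:.2e}")
```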


2015 ◽  
Vol 23 (1) ◽  
pp. 3-30 ◽  
Author(s):  
YVETTE GRAHAM ◽  
TIMOTHY BALDWIN ◽  
ALISTAIR MOFFAT ◽  
JUSTIN ZOBEL

Abstract Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of the Workshop on Statistical Machine Translation (WMT) shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results for measurement based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. Comparison of results produced by the relative preference approach and the direct estimate method described here demonstrates that the direct estimate method has a substantially increased ability to identify significant differences between translation systems.
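
The core of the quality-control idea, standardising each worker's raw scores against that worker's own mean and spread before averaging per system, can be sketched as follows; the data frame is an invented example, not the evaluation's actual data.

```python
# Minimal sketch of per-worker z-score standardisation of direct assessments.
# Worker IDs, systems and raw 0-100 scores are invented for illustration.
import pandas as pd

scores = pd.DataFrame([
    {"worker": "w1", "system": "sysA", "raw": 78},
    {"worker": "w1", "system": "sysB", "raw": 62},
    {"worker": "w2", "system": "sysA", "raw": 95},
    {"worker": "w2", "system": "sysB", "raw": 88},
    {"worker": "w2", "system": "sysA", "raw": 91},
])

# Standardise within each worker to remove individual scoring-scale differences.
scores["z"] = scores.groupby("worker")["raw"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)

# Average standardised score per system gives the system-level ranking.
print(scores.groupby("system")["z"].mean().sort_values(ascending=False))
```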


2017 ◽  
Vol 108 (1) ◽  
pp. 121-132 ◽  
Author(s):  
Filip Klubička ◽  
Antonio Toral ◽  
Víctor M. Sánchez-Cartagena

Abstract We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems’ outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and results show that the best performing system (neural) reduces the errors produced by the worst system (phrase-based) by 54%.
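
Inter-annotator agreement on error labels is commonly measured with Cohen's kappa, and the reported 54% figure is a relative error reduction; the sketch below illustrates both calculations on invented label sequences and error counts.

```python
# Minimal sketch: Cohen's kappa between two annotators and a relative
# error-reduction calculation. All labels and counts are invented.
from sklearn.metrics import cohen_kappa_score

# Error-type labels assigned by two annotators to the same ten error spans.
annotator_1 = ["mistranslation", "omission", "grammar", "grammar", "omission",
               "mistranslation", "grammar", "omission", "grammar", "grammar"]
annotator_2 = ["mistranslation", "omission", "grammar", "omission", "omission",
               "mistranslation", "grammar", "omission", "grammar", "mistranslation"]
print(f"kappa = {cohen_kappa_score(annotator_1, annotator_2):.2f}")

# Relative error reduction between the worst and best system.
errors_phrase_based, errors_neural = 520, 239   # hypothetical error counts
reduction = 1 - errors_neural / errors_phrase_based
print(f"error reduction = {reduction:.0%}")     # roughly the 54% figure reported
```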


2020 ◽  
Vol 67 ◽  
Author(s):  
Samuel Läubli ◽  
Sheila Castilho ◽  
Graham Neubig ◽  
Rico Sennrich ◽  
Qinlan Shen ◽  
...  

The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese to English news translation, showing that the finding of human–machine parity was owed to weaknesses in the evaluation design—which is currently considered best practice in the field. We show that the professional human translations contained significantly fewer errors, and that perceived quality in human evaluation depends on the choice of raters, the availability of linguistic context, and the creation of reference translations. Our results call for revisiting current best practices to assess strong machine translation systems in general and human–machine parity in particular, for which we offer a set of recommendations based on our empirical findings.  

