COSTA MT Evaluation Tool: An Open Toolkit for Human Machine Translation Evaluation

2013
Vol 100 (1)
pp. 83-89
Author(s):  
Konstantinos Chatzitheodorou

Abstract: A hotly debated topic in machine translation is human evaluation. On the one hand, it is extremely costly and time-consuming; on the other, it is an important and unfortunately inevitable part of any system. This paper describes the COSTA MT Evaluation Tool, an open, stand-alone tool for human evaluation of machine translation. It is a Java program that can be used to manually evaluate the quality of machine translation output. It is simple to use and designed to let potential machine translation users and developers analyze their systems in a friendly environment. It enables segment-by-segment ranking of the quality of machine translation output for a particular language pair. The benefits of this tool are multiple. Firstly, it is a rich repository of commonly used industry criteria (fluency, adequacy and translation error classification). Secondly, it is freely available to anyone and provides results that can be further analyzed. Thirdly, it estimates the time needed to evaluate each sentence. Finally, it gives suggestions based on fuzzy matching of the candidate translations.
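
The fuzzy matching mentioned in the abstract can be pictured with a short sketch. The snippet below is a hypothetical, simplified word-level fuzzy match score; it is not COSTA's actual Java implementation, whose internals the abstract does not describe.

```python
# A minimal, hypothetical sketch of a fuzzy match score between a candidate
# translation and a stored segment, in the spirit of the fuzzy matching
# suggestions COSTA reports. NOT the tool's actual Java implementation.
from difflib import SequenceMatcher

def fuzzy_match(candidate: str, segment: str) -> float:
    """Word-level similarity ratio in [0, 1] between two segments."""
    return SequenceMatcher(None, candidate.split(), segment.split()).ratio()

# Example: a near-match scores high, an unrelated segment scores low.
print(fuzzy_match("the cat sat on the mat", "a cat sat on the mat"))   # ~0.83
print(fuzzy_match("the cat sat on the mat", "stocks fell on Monday"))  # low
```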

2019
Vol 26 (2)
pp. 137-161
Author(s):  
Eirini Chatzikoumi

Abstract: This article presents the most up-to-date and influential automated, semi-automated and human metrics used to evaluate the quality of machine translation (MT) output, and provides the necessary background for MT evaluation projects. Evaluation is, as is widely acknowledged, highly relevant for the improvement of MT. This article is divided into three parts: the first is dedicated to automated metrics; the second, to human metrics; and the last, to the challenges posed by neural machine translation (NMT) regarding its evaluation. The first part covers reference translation-based metrics; confidence or quality estimation (QE) metrics, which serve as alternatives for quality assessment; and diagnostic evaluation based on linguistic checkpoints. Human evaluation metrics are classified according to whether human judges directly express a subjective evaluation judgment, such as ‘good’ or ‘better than’, or not, as is the case in error classification. The former methods are based on directly expressed judgment (DEJ) and are therefore called ‘DEJ-based evaluation methods’, while the latter are called ‘non-DEJ-based evaluation methods’. In the DEJ-based evaluation section, tasks such as fluency and adequacy annotation, ranking and direct assessment (DA) are presented, whereas in the non-DEJ-based evaluation section, tasks such as error classification and postediting are detailed, with definitions and guidelines, thus rendering this article a useful guide for evaluation projects. Following the detailed presentation of these metrics, the specificities of NMT are set forth along with suggestions for its evaluation, according to the latest studies. As human translators are the most adequate judges of the quality of a translation, emphasis is placed on the human metrics seen from a translator-judge perspective, to provide useful methodological tools for interdisciplinary research groups that evaluate MT systems.
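
As an illustration of the direct assessment (DA) task mentioned above, the sketch below standardizes raw 0-100 DA judgments per annotator, a common practice in WMT-style campaigns, before averaging them per segment. The annotators and scores are invented for demonstration.

```python
# Illustrative sketch of per-annotator standardization of direct assessment
# (DA) scores; the annotators and raw 0-100 scores below are invented.
from statistics import mean, stdev

raw = {  # annotator -> list of (segment_id, raw adequacy score out of 100)
    "annotator_1": [("s1", 70), ("s2", 90), ("s3", 80)],
    "annotator_2": [("s1", 40), ("s2", 60), ("s3", 50)],
}

# Z-normalize each annotator's scores to remove individual scoring bias.
per_segment = {}
for annotator, judgments in raw.items():
    scores = [score for _, score in judgments]
    mu, sigma = mean(scores), stdev(scores)
    for segment_id, score in judgments:
        per_segment.setdefault(segment_id, []).append((score - mu) / sigma)

# A segment's DA score is the mean of its standardized judgments.
for segment_id, z_scores in sorted(per_segment.items()):
    print(segment_id, round(mean(z_scores), 2))
```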


2018
Vol 11 (1)
pp. 82-102
Author(s):  
Thiago Blanch Pires

ABSTRACT: This article proposes an interdisciplinary approach involving the areas of Multimodality and Evaluation of Machine Translation to explore new configurations of text-image semantic relations generated by machine translation output. The methodology consists of a brief contextualization of the research problem, followed by the presentation and study of concepts and possibilities of Multimodality and Evaluation of Machine Translation, with an emphasis on the notion of intersemiotic texture proposed by Liu and O'Halloran (2009) and the machine translation error classification proposed by Vilar et al. (2006). Finally, the article suggests some potentialities and limitations of combining both areas of investigation.

KEYWORDS: multimodality; machine translation; evaluation of machine translation; intersemiotic mismatches.


2020
Vol 30 (01)
pp. 2050002
Author(s):  
Taichi Aida
Kazuhide Yamamoto

Current neural machine translation methods may generate sentences of varying quality. Methods for automatically evaluating machine translation output can be broadly classified into two types: those that train an evaluation model on human post-edited translations, and those that use a correct reference translation during evaluation. On the one hand, post-edited translations are difficult to prepare because each word must be tagged against the original translated sentence. On the other hand, users who actually employ a machine translation system do not have a correct reference translation. We therefore propose a method that trains the evaluation model without human post-edited sentences and, at test time, estimates the quality of output sentences without reference translations. We define several indices and predict translation quality with a regression model. As the quality signal for translated sentences, we employ the BLEU score calculated from the number of word n-gram matches between the translated sentence and the reference translation. We then compute the correlation between the quality scores predicted by our method and the BLEU scores actually computed from references. According to the experimental results, the correlation with BLEU is highest when XGBoost uses all the indices. Moreover, examining each index, we find that sentence log-likelihood and model uncertainty, which are based on the joint probability of generating the translated sentence, are important for BLEU estimation.
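
The reference-free estimation idea can be sketched as follows: a regression model (XGBoost, as named in the abstract) maps decoder-side indices such as sentence log-likelihood and model uncertainty to sentence-level BLEU. All feature values and BLEU targets below are invented placeholders; in the paper they come from the NMT system and from references available only at training time.

```python
# Invented-data sketch of the reference-free quality estimation idea above.
import numpy as np
from xgboost import XGBRegressor

# Indices per translated sentence: [log-likelihood, model uncertainty, length].
X_train = np.array([[-5.2, 0.10, 12], [-9.8, 0.35, 20], [-3.1, 0.05,  8],
                    [-7.4, 0.22, 15], [-11.0, 0.40, 25], [-4.0, 0.08, 10]])
y_train = np.array([0.62, 0.28, 0.75, 0.41, 0.19, 0.70])  # sentence BLEU

model = XGBRegressor(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)

# At test time, quality is estimated from the indices alone: no references.
X_test = np.array([[-6.0, 0.15, 13], [-10.2, 0.38, 22]])
print(model.predict(X_test))
```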


2016
Vol 6 (1)
pp. 30-45
Author(s):  
Pankaj K. Goswami
Sanjay K. Dwivedi
C. K. Jha

English-to-Hindi translation of computer-science e-content generated through freely available online machine translation engines may not be technically correct. The target translation should be as fluent as intended for native learners, and the meaning of the source e-content should be conveyed properly. The authors have designed and integrated a Multi-Engine Machine Translation for English to Hindi Language (MEMTEHiL) framework as a translation solution for computer-science domain e-content, made possible by combining well-tested machine translation approaches. The human-judged and widely accepted metrics of fluency and adequacy (F&A) were used to assess the best translation quality for the English-Hindi language pair. Besides these human-judged metrics, the well-tested interactive version of the Bilingual Evaluation Understudy metric (iBLEU) was used for evaluation. The authors incorporated both parameters (F&A and iBLEU) for assessing the quality of translations regenerated by the designed MEMTEHiL.
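
One way to picture the multi-engine selection problem is sketched below: each engine's candidate is scored against a reference with sentence-level BLEU (via sacrebleu here) and the best is kept. The engine names, outputs and reference are placeholders; MEMTEHiL's actual selection logic, and its use of human F&A judgments, are not reproduced.

```python
# Simplified sketch of one multi-engine idea: keep the candidate with the
# highest sentence-level BLEU against a reference. Placeholder data only.
import sacrebleu

candidates = {  # hypothetical outputs from different MT engines
    "engine_a": "संगणक विज्ञान एक व्यापक क्षेत्र है",
    "engine_b": "कंप्यूटर विज्ञान एक व्यापक क्षेत्र है",
}
reference = "कंप्यूटर विज्ञान एक व्यापक क्षेत्र है"

scores = {name: sacrebleu.sentence_bleu(hyp, [reference]).score
          for name, hyp in candidates.items()}
best_engine = max(scores, key=scores.get)
print(scores, "->", best_engine)
```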


Author(s):  
Candy Lalrempuii
Badal Soni
Partha Pakray

Machine translation is an effort to bridge language barriers and avoid misinterpretation, making communication more convenient through the automatic translation of languages. The quality of translations produced by corpus-based approaches predominantly depends on the availability of a large parallel corpus. Although machine translation of many Indian languages has progressively gained attention, there is very limited research on machine translation and the challenges of applying various machine translation techniques to a low-resource language such as Mizo. In this article, we have implemented and compared statistical approaches with modern neural approaches for the English–Mizo language pair. We have experimented with different tokenization methods, architectures, and configurations. The performance of translations predicted by the trained models has been evaluated using automatic and human evaluation measures. Furthermore, we have analyzed the prediction errors of the models and the quality of predictions with respect to variations in sentence length, and compared the model performance with existing baselines.
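
The sentence-length analysis mentioned above can be illustrated with a minimal sketch: test triples are bucketed by source length and each bucket is scored separately with corpus-level BLEU. The data, the bucket threshold and the use of sacrebleu are assumptions for demonstration only.

```python
# Invented-data sketch of evaluating translation quality by sentence length.
import sacrebleu

data = [  # (source, hypothesis, reference) placeholder triples
    ("he reads a book", "he reads a book", "he is reading a book"),
    ("the weather is nice", "the weather is good today", "the weather is nice today"),
    ("a considerably longer source sentence that lands in the long bucket",
     "a considerably longer output sentence that lands in the long bucket",
     "a considerably longer reference sentence that lands in the long bucket"),
]

buckets = {}
for source, hypothesis, reference in data:
    key = "short" if len(source.split()) <= 10 else "long"
    hyps, refs = buckets.setdefault(key, ([], []))
    hyps.append(hypothesis)
    refs.append(reference)

for key, (hyps, refs) in buckets.items():
    print(key, round(sacrebleu.corpus_bleu(hyps, [refs]).score, 1))
```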


2021
Vol 11 (7)
pp. 2948
Author(s):  
Lucia Benkova
Dasa Munkova
Ľubomír Benko
Michal Munk

This study focuses on the comparison of phrase-based statistical machine translation (SMT) systems and neural machine translation (NMT) systems, using automatic metrics of translation quality for the English-Slovak language pair. As the statistical approach is the predecessor of neural machine translation, it was assumed that the neural approach would generate results of better quality. An experiment was performed using residuals to compare the automatic accuracy scores (BLEU_n) of the statistical machine translation with those of the neural machine translation. The results confirmed the assumption of better neural machine translation quality regardless of the system used: there were statistically significant differences between SMT and NMT, in favor of NMT, on all BLEU_n scores. Neural machine translation achieved better quality in translating journalistic texts from English into Slovak, regardless of whether the system was trained on general texts, such as Google Translate, or on a specific domain, such as the European Commission's (EC's) tool.
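
The paired SMT-versus-NMT comparison can be illustrated with a minimal significance-test sketch. The per-segment BLEU scores below are invented, and the Wilcoxon signed-rank test stands in for the study's richer residual-based analysis.

```python
# Minimal sketch of a paired SMT-versus-NMT comparison on invented
# per-segment BLEU scores; a stand-in for the study's residual analysis.
from scipy.stats import wilcoxon

smt_bleu = [0.31, 0.28, 0.45, 0.22, 0.38, 0.30, 0.27, 0.41]
nmt_bleu = [0.39, 0.33, 0.47, 0.30, 0.42, 0.36, 0.29, 0.48]

stat, p_value = wilcoxon(smt_bleu, nmt_bleu)
print(f"W={stat:.1f}, p={p_value:.4f}")  # a small p suggests a real difference
```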


2021
Vol 2 (1)
pp. 1-9
Author(s):  
Sajad Hussain Wani

Machine translation (MT), as a sub-field of computational linguistics, represents one of the most advanced and applied dimensions of translation research. Translation divergence occurs when "structurally similar sentences of the source language do not translate into sentences that are similar in structure in the target language" (Dorr, 1993). Sophistication in the MT domain depends mainly on the identification of divergence patterns in a language pair. Many researchers in the MT field, including Dorr (1990, 1994), have emphasized that the best quality in MT is achieved when an individual language pair in a particular context is described in detail. This paper explores the divergence patterns that characterize the translation of Kashmiri pronouns into English. The analysis is restricted to the class of personal and possessive pronouns. Kashmiri is richly inflected, and its pronouns are marked for case, number, tense and gender and show complex agreement patterns. The paper identifies and outlines a wide variety of divergence patterns that characterize the Kashmiri-English language pair. These divergence patterns are identified and summarized in order to improve the quality of any MT system that may be developed for the Kashmiri-English language pair in the near future, and can also be utilized for other language pairs with similar structure and typological features.

