On Post-Editability of Machine Translated Texts

Machine-translated texts are often far from perfect, and post-editing is essential to reach publishable quality. Post-editing may not always be a pleasant task. However, modern machine translation (MT) approaches like Statistical MT (SMT) and Neural MT (NMT) seem to hold greater promise. In this work, we present a quantitative method for scoring translations and computing the post-editability of MT system outputs. We show that the scores we get correlate well with MT evaluation metrics as well as with the actual time and effort required for post-editing. We compare the outputs of three modern MT systems, namely phrase-based SMT (PBMT), NMT, and Google Translate, for their post-editability in English-to-Hindi translation. Further, we explore the effect of various kinds of errors in MT outputs on post-editing time and effort. Including an Indian language in this kind of post-editability study and analyzing the influence of errors on post-editing time and effort for NMT are highlights of this work.

Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments

Computational Linguistics ◽

10.1162/coli_a_00356 ◽

2019 ◽

Vol 45 (3) ◽

pp. 515-558

Author(s):

Marina Fomicheva ◽

Lucia Specia

Keyword(s):

Evaluation Study ◽

Evaluation Metrics ◽

Local Dependency ◽

Translation Quality ◽

Global Correlation ◽

Mt Evaluation ◽

Statistical Mt ◽

Wide Range ◽

The Difference ◽

Different Levels

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than the traditional statistical MT systems. Finally, we show that the difference in the evaluation accuracy for different metrics is maintained even if the gold standard scores are based on different criteria.
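The contrast between a single global correlation and quality-dependent behavior can be illustrated with a small sketch. It is not the local dependency measure proposed in the article, whose definition the abstract does not give; it only splits segments into contiguous quality bands by human score and computes Pearson correlation per band. The function names and the toy scores are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_by_quality_band(human, metric, n_bands):
    """Sort segments by human score, cut them into n_bands contiguous
    bands, and return the Pearson correlation inside each band
    (lowest-quality band first)."""
    pairs = sorted(zip(human, metric))
    size = len(pairs) // n_bands
    if size < 2:
        raise ValueError("each band needs at least 2 segments")
    results = []
    for b in range(n_bands):
        start = b * size
        # The last band absorbs any remainder.
        end = len(pairs) if b == n_bands - 1 else start + size
        band = pairs[start:end]
        results.append(pearson([h for h, _ in band], [m for _, m in band]))
    return results


# Hypothetical scores: the metric is noisy on low-quality segments
# and tracks the human score closely on high-quality ones.
human_scores = [1, 2, 3, 4, 5, 6, 7, 8]
metric_scores = [0.30, 0.10, 0.40, 0.20, 0.50, 0.60, 0.70, 0.80]

global_r = pearson(human_scores, metric_scores)
low_r, high_r = correlation_by_quality_band(human_scores, metric_scores, 2)
print(f"global={global_r:.2f} low band={low_r:.2f} high band={high_r:.2f}")
```

On this toy data the global coefficient looks healthy even though the metric carries no signal in the low-quality band, which is the kind of effect a global coefficient alone would hide.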

The FEMTI guidelines for contextual MT evaluation: principles and resources

Linguistica Antverpiensia, New Series – Themes in Translation Studies ◽

10.52034/lanstts.v8i.244 ◽

2021 ◽

Vol 8 ◽

Author(s):

Paula Estrella ◽

Andrei Popescu-Belis ◽

Maghi King

Keyword(s):

Machine Translation ◽

Quality Characteristics ◽

Evaluation Metrics ◽

Design Evaluation ◽

Software Evaluation ◽

Web Based ◽

Mt Evaluation ◽

Context Of Use ◽

Contextual Evaluation ◽

Selection Of

A large number of evaluation metrics exist for machine translation (MT) systems, but depending on the intended context of use of such a system, not all metrics are equally relevant. Based on the ISO/IEC 9126 and 14598 standards for software evaluation, the Framework for the Evaluation of Machine Translation in ISLE (FEMTI) provides guidelines for the selection of quality characteristics to be evaluated depending on the expected task, users, and input characteristics of an MT system. This approach to contextual evaluation was implemented as a web-based application which helps its users design evaluation plans. In addition, FEMTI offers experts in evaluation the possibility to enter and share their knowledge using a dedicated web-based tool, tested in several evaluation exercises.

Using Word Embeddings to Enforce Document-Level Lexical Consistency in Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0011 ◽

2017 ◽

Vol 108 (1) ◽

pp. 85-96 ◽

Cited By ~ 2

Author(s):

Eva Martínez Garcia ◽

Carles Creus ◽

Cristina España-Bonet ◽

Lluís Màrquez

Keyword(s):

Machine Translation ◽

Evaluation Metrics ◽

Automatic Evaluation ◽

Word Embeddings ◽

Standard Document ◽

Sentence Level ◽

Word Translation ◽

Stochastic Mechanism ◽

Document Level

We integrate new mechanisms in a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each word translation given its context. Second, we extend the decoder with a new stochastic mechanism that, at translation time, can introduce changes into the translation aimed at improving its lexical consistency. We evaluate our system on English–Spanish document translation, and we conduct automatic and manual assessments of its quality. The automatic evaluation metrics, applied mainly at sentence level, do not reflect significant variations. In contrast, the manual evaluation shows that the system dealing with lexical consistency is preferred over both a standard sentence-level and a standard document-level phrase-based MT system.
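One way to picture an embedding-based adequacy score is sketched below. This is an assumption-laden illustration, not the feature described in the paper: it scores each competing translation of a word by the cosine similarity between its embedding and the mean embedding of the surrounding target words. The two-dimensional vectors, the word list, and the function names are all hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def score_candidates(candidates, context_words, embeddings):
    """Return {candidate: similarity to the mean context embedding}."""
    context = mean_vector([embeddings[w] for w in context_words])
    return {c: cosine(embeddings[c], context) for c in candidates}


# Toy, hand-made embeddings (hypothetical; real ones would be learned).
toy_embeddings = {
    "banco": [1.0, 0.0],   # financial sense
    "orilla": [0.0, 1.0],  # riverside sense
    "rio": [0.1, 0.9],
    "agua": [0.2, 0.8],
}

# Suppose one source word was rendered as both "banco" and "orilla"
# within a document whose local context mentions "rio" and "agua".
scores = score_candidates(["banco", "orilla"], ["rio", "agua"], toy_embeddings)
best = max(scores, key=scores.get)
print(best, scores)
```

A decoder could use such a score to prefer one form consistently, although how the actual system combines it with its other features is not stated in the abstract.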

MTIL2017: Machine Translation Using Recurrent Neural Network on Statistical Machine Translation

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0016 ◽

2019 ◽

Vol 28 (3) ◽

pp. 447-453 ◽

Cited By ~ 5

Author(s):

Sainik Kumar Mahata ◽

Dipankar Das ◽

Sivaji Bandyopadhyay

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Language Model ◽

Target Language ◽

Data Sets ◽

Shared Task ◽

Automatic Translation ◽

External Data ◽

Statistical Mt

Machine translation (MT) is the automatic translation of text from a source language into a target language by a computer system. In the current paper, we propose an approach that uses recurrent neural networks (RNNs) on top of traditional statistical MT (SMT). We compare the performance of the SMT phrase table with that of the proposed RNN and, in turn, improve the quality of the MT output. This work was done as part of the shared task organized by MTIL2017. We constructed the traditional MT model using the Moses toolkit and additionally enriched the language model using external data sets. Thereafter, we ranked the phrase tables using an RNN encoder-decoder module originally created as part of the GroundHog project of the LISA lab.

Application of modern machine translation systems in teaching foreign languages

Journal of Physics Conference Series ◽

10.1088/1742-6596/1399/3/033124 ◽

2019 ◽

Vol 1399 ◽

pp. 033124

Author(s):

E V Fibikh ◽

N V Kuznetsova

Keyword(s):

Machine Translation ◽

Foreign Languages ◽

Translation Systems ◽

Modern Machine

Function words in statistical machine-translated Chinese and original Chinese: A study into the translationese of machine translation systems

Digital Scholarship in the Humanities ◽

10.1093/llc/fqy050 ◽

2018 ◽

Vol 34 (4) ◽

pp. 752-771

Author(s):

Chen-li Kuo

Keyword(s):

Machine Translation ◽

Attribute Selection ◽

Close Attention ◽

Function Words ◽

Rule Based ◽

Source Language ◽

Statistical Mt ◽

Chinese Texts ◽

Translation Systems

Statistical approaches have become the mainstream in machine translation (MT), for their potential in producing less rigid and more natural translations than rule-based approaches. However, on closer examination, the uses of function words between statistical machine-translated Chinese and the original Chinese are different, and such differences may be associated with translationese as discussed in translation studies. This article examines the distribution of Chinese function words in a comparable corpus consisting of MTs and the original Chinese texts extracted from Wikipedia. An attribute selection technique is used to investigate which types of function words are significant in discriminating between statistical machine-translated Chinese and the original texts. The results show that statistical MT overuses the most frequent function words, even when alternatives exist. To improve the quality of the end product, developers of MT should pay close attention to modelling Chinese conjunctions and adverbial function words. The results also suggest that machine-translated Chinese shares some characteristics with human-translated texts, including normalization and being influenced by the source language; however, machine-translated texts do not exhibit other characteristics of translationese such as explicitation.
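The kind of distributional comparison described here can be sketched as a relative-frequency count per function word. This is only a simplified illustration and does not reproduce the attribute selection technique used in the article; the function word list, the pre-tokenized toy sentences, and the function name are hypothetical.

```python
from collections import Counter


def relative_frequencies(tokens, function_words):
    """Share of all tokens taken up by each function word."""
    if not tokens:
        raise ValueError("token list is empty")
    counts = Counter(t for t in tokens if t in function_words)
    return {w: counts[w] / len(tokens) for w in function_words}


# Hypothetical function words and pre-tokenized toy texts.
function_words = ["的", "和", "跟"]
mt_tokens = ["我", "的", "书", "和", "你", "的", "笔"]
original_tokens = ["我", "的", "书", "跟", "你", "的", "笔"]

mt_freq = relative_frequencies(mt_tokens, function_words)
original_freq = relative_frequencies(original_tokens, function_words)

# Words used in the MT output but absent from the original text.
only_in_mt = [w for w in function_words
              if mt_freq[w] > 0 and original_freq[w] == 0]
print(mt_freq, original_freq, only_in_mt)
```

Per-word frequencies like these could then serve as input features for a classifier or an attribute selection step that separates machine-translated from original text.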

Real Time Machine Translation System for English to Indian language

2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) ◽

10.1109/icaccs48705.2020.9074265 ◽

2020 ◽

Author(s):

Raj Vyas ◽

Kirti Joshi ◽

Hitesh Sutar ◽

Tatwadarshi P. Nagarhalli

Keyword(s):

Real Time ◽

Machine Translation ◽

Translation System ◽

Time Machine ◽

Indian Language ◽

Machine Translation System

Deep Learning-based Roman-Urdu to Urdu Transliteration

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001421520017 ◽

2020 ◽

pp. 2152001

Author(s):

Mehreen Alam ◽

Sibt ul Hussain

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Research Problem ◽

Attention Mechanism ◽

Data Driven ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Source Language ◽

Data Driven Approach ◽

Modern Machine

Attention-based encoder-decoder models have superseded conventional techniques due to their unmatched performance on many neural machine translation problems. Usually, the encoder and decoder are two recurrent neural networks, where the decoder is directed to focus on relevant parts of the source language using an attention mechanism. This data-driven approach leads to generic and scalable solutions with no reliance on hand-crafted features. To the best of our knowledge, none of the modern machine translation approaches has been applied to the research problem of Urdu machine transliteration. Ours is the first attempt to apply a deep neural network-based encoder-decoder with an attention mechanism to this problem using a Roman-Urdu and Urdu parallel corpus. To this end, we present (i) the first ever Roman-Urdu to Urdu parallel corpus of 1.1 million sentences, (ii) three state-of-the-art encoder-decoder models, and (iii) a detailed empirical analysis of these three models on the Roman-Urdu to Urdu parallel corpus. Overall, the attention-based model gives state-of-the-art performance, setting a benchmark BLEU score of 70. Our qualitative experimental evaluation shows that our models generate coherent transliterations which are grammatically and logically correct.
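A single attention step of the kind referred to above can be sketched in plain Python. The abstract does not say which scoring function the models use, so dot-product scoring is an assumption here, and the state vectors and function names are hypothetical.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step.

    Scores are dot products between the decoder state and each encoder
    state (an assumed scoring function); the context vector is the
    weighted sum of the encoder states.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy encoder states for three source positions and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

The weights show which source positions the decoder focuses on at this step; here the second position scores lowest and contributes least to the context vector.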

Discourse Structure in Machine Translation Evaluation

Computational Linguistics ◽

10.1162/coli_a_00298 ◽

2017 ◽

Vol 43 (4) ◽

pp. 683-722 ◽

Cited By ~ 1

Author(s):

Shafiq Joty ◽

Francisco Guzmán ◽

Lluís Màrquez ◽

Preslav Nakov

Keyword(s):

Machine Translation ◽

Similarity Measures ◽

Discourse Structure ◽

System Level ◽

Structure Theory ◽

Evaluation Metrics ◽

Machine Translation Evaluation ◽

Sentence Level ◽

Relation Type ◽

Parse Trees

In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with the Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics regarding correlation with human judgments both at the segment level and at the system level. This suggests that discourse information is complementary to the information used by many of the existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTK party. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference RST tree is positively correlated with translation quality.

Domain Adaptation for Machine Translation with Instance Selection

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2015-0001 ◽

2015 ◽

Vol 103 (1) ◽

pp. 5-20 ◽

Cited By ~ 1

Author(s):

Ergun Biçici

Keyword(s):

Machine Translation ◽

Domain Adaptation ◽

Test Sample ◽

Instance Selection ◽

Test Set ◽

Sample Distribution ◽

Sampling Process ◽

Statistical Mt ◽

Research Questions ◽

And Training

Domain adaptation for machine translation (MT) can be achieved by selecting training instances close to the test set from a larger set of instances. We consider 7 different domain adaptation strategies and answer 7 research questions, which give us a recipe for domain adaptation in MT. We perform English to German statistical MT (SMT) experiments in a setting where test and training sentences can come from different corpora and one of our goals is to learn the parameters of the sampling process. Domain adaptation with training instance selection can obtain a 22% increase in target 2-gram recall and can gain up to 3.55 BLEU points compared with random selection. Domain adaptation with the feature decay algorithm (FDA) not only achieves the highest target 2-gram recall and BLEU performance but also perfectly learns the test sample distribution parameter, with correlation 0.99. Moses SMT systems built with 10K FDA-selected training sentences are able to obtain F1 results as good as the baselines that use up to 2M sentences. Moses SMT systems built with 50K FDA-selected training sentences are able to obtain better F1 results than the baselines.
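Target 2-gram recall can be illustrated with a minimal sketch. The definition assumed here, the share of distinct test-set target bigrams that also occur in the selected training sentences, is a guess at the measure's general shape and may differ from the paper's exact formulation. The sentences and function names are hypothetical, and the feature decay algorithm itself is not implemented.

```python
def bigrams(tokens):
    """Set of adjacent token pairs in a token list."""
    return set(zip(tokens, tokens[1:]))


def bigram_recall(selected_sentences, test_sentences):
    """Share of distinct test-set bigrams covered by the selection."""
    test_bigrams = set()
    for sentence in test_sentences:
        test_bigrams |= bigrams(sentence.split())
    if not test_bigrams:
        raise ValueError("test set contains no bigrams")
    covered = set()
    for sentence in selected_sentences:
        covered |= bigrams(sentence.split())
    return len(test_bigrams & covered) / len(test_bigrams)


# Hypothetical target-side (German) sentences.
test_set = ["das haus ist klein"]
close_selection = ["das haus ist gross", "ein hund"]
random_selection = ["ein hund bellt"]

recall_close = bigram_recall(close_selection, test_set)
recall_random = bigram_recall(random_selection, test_set)
print(f"close={recall_close:.2f} random={recall_random:.2f}")
```

A selection strategy that picks training instances close to the test set should score higher on this measure than random selection, which is the comparison the abstract reports.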
