Statistical versus neural machine translation – a case study for a medium size domain-specific bilingual corpus

2019 ◽  
Vol 55 (2) ◽  
pp. 491-515 ◽  
Author(s):  
Krzysztof Jassem ◽  
Tomasz Dwojak

Abstract Neural Machine Translation (NMT) has recently achieved promising results for a number of translation pairs. Although the method requires larger volumes of data and more computational power than Statistical Machine Translation (SMT), it is believed to become dominant in near future. In this paper we evaluate SMT and NMT models learned on a domain-specific English-Polish corpus of a moderate size (1,200,000 segments). The experiment shows that both solutions significantly outperform a general-domain online translator. The SMT model achieves a slightly better BLEU score than the NMT model. On the other hand, the process of decoding is noticeably faster in NMT. Human evaluation carried out on a sizeable sample of translations (2,000 pairs) reveals the superiority of the NMT approach, particularly in the aspect of output fluency.

2017 ◽  
Author(s):  
Iacer Calixto ◽  
Daniel Stein ◽  
Evgeny Matusov ◽  
Sheila Castilho ◽  
Andy Way

Author(s):  
Anna Fernández Torné ◽  
Anna Matamala

This article aims to compare three machine translation systems with a focus on human evaluation. The systems under analysis are a domain-adapted statistical machine translation system, a domain-adapted neural machine translation system and a generic machine translation system. The comparison is carried out on translation from Spanish into German with industrial documentation of machine tool components and processes. The focus is on the human evaluation of the machine translation output, specifically on: fluency, adequacy and ranking at the segment level; fluency, adequacy, need for post-editing, ease of post-editing, and mental effort required in post-editing at the document level; productivity (post-editing speed and post-editing effort) and attitudes. Emphasis is placed on human factors in the evaluation process.


2021 ◽  
Author(s):  
Arthur T. Estrella ◽  
João B. O. Souza Filho

Neural machine translation (NMT) nowadays requires an increasing amount of data and computational power, so succeeding in this task with limited data and using a single GPU might be challenging. Strategies such as the use of pre-trained word embeddings, subword embeddings, and data augmentation solutions can potentially address some issues faced in low-resource experimental settings, but their impact on the quality of translations is unclear. This work evaluates some of these strategies on two low-resource experiments beyond just reporting BLEU: errors are categorized on the Portuguese-English pair with the help of a translator, considering semantic and syntactic aspects. The BPE subword approach has shown to be the most effective solution, allowing a BLEU increase of 59% p.p. compared to the standard Transformer.


Author(s):  
Akshai Ramesh ◽  
Venkatesh Balavadhani Parthasarathy ◽  
Andy Way ◽  
Rejwanul Haque

Statistical machine translation (SMT) which was the dominant paradigm in machine translation (MT) research for nearly three decades has recently been superseded by the end-to-end deep learning approaches to MT. Although deep neural models produce state-of-the-art results in many translation tasks, they are found to under-perform on resource-poor scenarios. Despite some success, none of the present-day benchmarks that have tried to overcome this problem can be regarded as a universal solution to the problem of translation of many low-resource languages. In this work, we investigate the performance of phrase-based SMT (PB-SMT) and NMT on two rarely-tested low-resource language-pairs, English-to-Tamil and Hindi-to-Tamil, taking a specialised data domain (software localisation) into consideration. This paper demonstrates our findings including the identification of several issues of the current neural approaches to low-resource domain-specific text translation and rankings of our MT systems via a social media platform-based human evaluation scheme.


Author(s):  
Guanhua Chen ◽  
Yun Chen ◽  
Yong Wang ◽  
Victor O.K. Li

Leveraging lexical constraint is extremely significant in domain-specific machine translation and interactive machine translation. Previous studies mainly focus on extending beam search algorithm or augmenting the training corpus by replacing source phrases with the corresponding target translation. These methods either suffer from the heavy computation cost during inference or depend on the quality of the bilingual dictionary pre-specified by user or constructed with statistical machine translation. In response to these problems, we present a conceptually simple and empirically effective data augmentation approach in lexical constrained neural machine translation. Specifically, we make constraint-aware training data by first randomly sampling the phrases of the reference as constraints, and then packing them together into the source sentence with a separation symbol. Extensive experiments on several language pairs demonstrate that our approach achieves superior translation results over the existing systems, improving translation of constrained sentences without hurting the unconstrained ones.


Author(s):  
Rashmini Naranpanawa ◽  
Ravinga Perera ◽  
Thilakshi Fonseka ◽  
Uthayasanker Thayasivam

Neural machine translation (NMT) is a remarkable approach which performs much better than the Statistical machine translation (SMT) models when there is an abundance of parallel corpus. However, vanilla NMT is primarily based upon word-level with a fixed vocabulary. Therefore, low resource morphologically rich languages such as Sinhala are mostly affected by the out of vocabulary (OOV) and Rare word problems. Recent advancements in subword techniques have opened up opportunities for low resource communities by enabling open vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on English Sinhala language pair. Our models demonstrate that subword segmentation strategies along with the state-of-the-art NMT can perform remarkably when translating English sentences into a rich morphology language regardless of a large parallel corpus.


2017 ◽  
Vol 108 (1) ◽  
pp. 61-72
Author(s):  
Anita Ramm ◽  
Riccardo Superbo ◽  
Dimitar Shterionov ◽  
Tony O’Dowd ◽  
Alexander Fraser

AbstractWe present a multilingual preordering component tailored for a commercial Statistical Machine translation platform. In commercial settings, issues such as processing speed as well as the ability to adapt models to the customers’ needs play a significant role and have a big impact on the choice of approaches that are added to the custom pipeline to deal with specific problems such as long-range reorderings.We developed a fast and customisable preordering component, also available as an open-source tool, which comes along with a generic implementation that is restricted neither to the translation platform nor to the Machine Translation paradigm. We test preordering on three language pairs: English →Japanese/German/Chinese for both Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). Our experiments confirm previously reported improvements in the SMT output when the models are trained on preordered data, but they also show that preordering does not improve NMT.


2016 ◽  
Vol 5 (4) ◽  
pp. 51-66 ◽  
Author(s):  
Krzysztof Wolk ◽  
Krzysztof P. Marasek

The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family in which a source sentence is encoded into a fixed length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English Machine Translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training of neural and statistical network-based translation systems. A comparison and implementation of a medical translator is the main focus of our experiments.


IEEE Access ◽  
2018 ◽  
Vol 6 ◽  
pp. 38512-38523 ◽  
Author(s):  
Quang-Phuoc Nguyen ◽  
Anh-Dung Vo ◽  
Joon-Choul Shin ◽  
Cheol-Young Ock

Sign in / Sign up

Export Citation Format

Share Document