The Operation Sequence Model—Combining N-Gram-Based and Phrase-Based Statistical Machine Translation

In this article, we present a novel machine translation model, the Operation Sequence Model (OSM), which combines the benefits of phrase-based and N-gram-based statistical machine translation (SMT) and remedies their drawbacks. The model represents the translation process as a linear sequence of operations that includes not only translation operations but also reordering operations. As in N-gram-based SMT, the model (i) is based on minimal translation units, (ii) takes both source and target information into account, (iii) does not make a phrasal independence assumption, and (iv) avoids the spurious phrasal segmentation problem. As in phrase-based SMT, the model (i) has the ability to memorize lexical reordering triggers, (ii) builds the search graph dynamically, and (iii) decodes with large translation units during search. The unique properties of the model are (i) its strong coupling of reordering and translation, where translation and reordering decisions are conditioned on n previous translation and reordering decisions, and (ii) its ability to model local and long-range reorderings consistently. Using BLEU as a metric of translation accuracy, we found that our system performs significantly better than state-of-the-art phrase-based systems (Moses and Phrasal) and N-gram-based systems (Ncode) on standard translation tasks. We compare the reordering component of the OSM to the Moses lexical reordering model by integrating it into Moses. Our results show that OSM outperforms lexicalized reordering on all translation tasks. The translation quality is shown to be improved further by learning generalized representations with a POS-based OSM.

Machine Learning Approaches for Bangla Statistical Machine Translation

Technical Challenges and Design Issues in Bangla Language Processing ◽

10.4018/978-1-4666-3970-6.ch004 ◽

2013 ◽

pp. 79-95

Author(s):

Maxim Roy

Keyword(s):

Machine Learning ◽

Active Learning ◽

Machine Translation ◽

Language Processing ◽

Statistical Machine Translation ◽

Low Density ◽

Learning Approaches ◽

Translation Quality ◽

Selection Strategies ◽

Translation Accuracy

Machine Translation (MT) from Bangla to English has recently become a priority task for the Bangla Natural Language Processing (NLP) community. Statistical Machine Translation (SMT) systems require a large amount of bilingual data for a language pair to achieve significant translation accuracy. However, because Bangla is a low-density language, such resources are not available. In this chapter, the authors discuss how machine learning approaches can help to improve translation quality within an SMT system without requiring a huge increase in resources. They provide a novel semi-supervised learning and active learning framework for SMT, which utilizes both labeled and unlabeled data. The authors discuss sentence selection strategies in detail and perform detailed experimental evaluations on the sentence selection methods. In the semi-supervised setting, the reversed model approach outperformed all other approaches for Bangla-English SMT, and in the active learning setting, the geometric 4-gram and geometric phrase sentence selection strategies proved most useful based on BLEU score results over baseline approaches. Overall, the authors demonstrate that for a low-density language like Bangla, these machine learning approaches can improve translation quality.

Productivity and quality when editing machine translation and translation memory outputs: an empirical analysis of English to Welsh translation

Studia Celtica Posnaniensia ◽

10.1515/scp-2017-0007 ◽

2017 ◽

Vol 2 (119) ◽

pp. 142-24

Author(s):

Benjamin Screen

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Final Analysis ◽

Controlled Study ◽

Time Data ◽

Translation Process ◽

Translation Memory ◽

Translation Quality ◽

Speed Up ◽

Segment 8

This article reports on a controlled study carried out to examine the possible benefits of editing Machine Translation and Translation Memory outputs when translating from English to Welsh. Using software capable of timing the translation process per segment, 8 professional translators each translated 75 sentences of differing match percentage, and post-edited a further 25 segments of Machine Translation. Basing the final analysis on 800 sentences and 17,440 words, the use of Fuzzy Matches in the 70-99% match range, Exact Matches and Statistical Machine Translation was found to significantly speed up the translation process. Significant correlations were also found between the processing time data of Exact Matches and Machine Translation post-editing, rather than between Fuzzy Matches and Machine Translation as expected. Two experienced translators were then asked to rate all translations for fidelity, grammaticality and style, whereby it was found that the use of translation technology either did not negatively affect translation quality compared to manual translation, or its use actually improved final quality in some cases. As well as confirming the findings of research in relation to translation technology, these findings also contradict supposed similarities between translation quality in terms of style and post-editing Machine Translation.

Dynamic Models in Moses for Online Adaptation

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0001 ◽

2014 ◽

Vol 101 (1) ◽

pp. 7-28 ◽

Cited By ~ 2

Author(s):

Nicola Bertoldi

Keyword(s):

Machine Translation ◽

Dynamic Models ◽

Statistical Machine Translation ◽

Scoring Function ◽

External Information ◽

Computer Assisted ◽

Context Aware ◽

Translation Process ◽

Translation Quality ◽

Online Adaptation

A very hot issue for research and industry is how to effectively integrate machine translation (MT) within computer assisted translation (CAT) software. This paper focuses on this issue and, more generally, on how to dynamically adapt phrase-based statistical machine translation (SMT) by exploiting external knowledge, such as the post-editions from professional translators. We present an enhancement of the Moses SMT toolkit that is dynamically adaptable to external information, which becomes available during the translation process and which can depend on the previously translated text. We have equipped Moses with two new elements: a new phrase table implementation and a new LM-like feature. Both the phrase table and the LM-like feature can be dynamically modified by adding and removing entries and re-scoring them according to a time-decaying scoring function. The final goal of these two dynamically adaptable features is twofold: to create additional translation alternatives and to reward those which are composed of entries previously inserted therein. The implemented dynamic system is highly configurable, flexible and applicable to many tasks, such as online MT adaptation, interactive MT, and context-aware MT. When exploited in a real-world CAT scenario where online adaptation is applied to repetitive texts, it has proven itself very effective in improving translation quality and reducing post-editing effort.
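The idea of a dynamically modifiable table whose entries are re-scored by a time-decaying function can be illustrated with a small sketch. The abstract does not give the actual scoring function or the Moses interface, so everything below is an assumption made for illustration: the class name `DecayingCache`, the `half_life` parameter, and the exponential half-life decay are hypothetical and are not taken from the paper or from Moses.

```python
class DecayingCache:
    """Hypothetical cache of phrase pairs whose reward decays with age.

    Illustrative only: the decay shape (exponential half-life) is an
    assumption, not the scoring function used in the paper.
    """

    def __init__(self, half_life=10.0):
        self.half_life = half_life   # age at which the reward halves
        self.inserted_at = {}        # (source, target) -> insertion time
        self.clock = 0               # advances once per translated segment

    def tick(self):
        """Advance time, e.g. after each translated segment."""
        self.clock += 1

    def add(self, source, target):
        """Insert or refresh an entry, e.g. taken from a post-edit."""
        self.inserted_at[(source, target)] = self.clock

    def remove(self, source, target):
        """Delete an entry if it is present."""
        self.inserted_at.pop((source, target), None)

    def score(self, source, target):
        """Return a reward in (0, 1] for cached pairs, 0.0 otherwise."""
        t = self.inserted_at.get((source, target))
        if t is None:
            return 0.0
        age = self.clock - t
        return 0.5 ** (age / self.half_life)
```

In this sketch a freshly inserted pair receives the full reward and older pairs receive progressively less, which matches the stated goal of rewarding recently inserted entries; a real system would combine such a score with the other model features.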

Analysis Accuracy of Similar Word Based Clustering (EWSB) Algorithm on Machine Translator Bahasa Indonesia-Minang

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v3i3.241 ◽

2018 ◽

Vol 3 (3) ◽

Author(s):

Herry Sujaini

Keyword(s):

Machine Translation ◽

Clustering Algorithm ◽

Statistical Machine Translation ◽

Target Language ◽

Word Similarity ◽

Similar Word ◽

Word Clustering ◽

Translation Accuracy ◽

Bahasa Indonesia

Extended Word Similarity Based (EWSB) Clustering is a word clustering algorithm based on word similarity values computed from a corpus. One of the benefits of clustering with this algorithm is that it can improve the output of a statistical machine translation system. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of research using the EWSB algorithm in an Indonesian to Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The results show that the EWSB algorithm is quite effective when Minang is the target language, and indicate that the EWSB algorithm can improve translation accuracy by 6.36%.

A Survey on Document-level Neural Machine Translation

ACM Computing Surveys ◽

10.1145/3441691 ◽

2021 ◽

Vol 54 (2) ◽

pp. 1-36

Author(s):

Sameen Maruf ◽

Fahimeh Saleh ◽

Gholamreza Haffari

Keyword(s):

Machine Translation ◽

Language Processing ◽

Research Field ◽

Translation Process ◽

Future Directions ◽

Translation Quality ◽

Current State ◽

Evaluation Strategies ◽

Almost All ◽

Document Level

Machine translation (MT) is an important task in natural language processing (NLP), as it automates the translation process and reduces the reliance on human translators. With the resurgence of neural networks, the translation quality surpasses that of the translations obtained using statistical techniques for most language pairs. Up until a few years ago, almost all of the neural translation models translated sentences independently, without incorporating the wider document context and inter-dependencies among the sentences. The aim of this survey article is to highlight the major works that have been undertaken in the space of document-level machine translation after the neural revolution, so researchers can recognize the current state and future directions of this field. We provide an organization of the literature based on novelties in modelling and architectures as well as training and decoding strategies. In addition, we cover evaluation strategies that have been introduced to account for the improvements in document MT, including automatic metrics and discourse-targeted test sets. We conclude by presenting possible avenues for future exploration in this research field.

Word-Order Issues in English-to-Urdu Statistical Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-011-0007-0 ◽

2011 ◽

Vol 95 (1) ◽

pp. 87-106 ◽

Cited By ~ 3

Author(s):

Bushra Jawaid ◽

Daniel Zeman

Keyword(s):

Machine Translation ◽

Word Order ◽

Statistical Machine Translation ◽

Parse Tree ◽

Hard Problem ◽

Long Distance ◽

Translation Process ◽

English Sentence ◽

European Languages

We investigate phrase-based statistical machine translation between English and Urdu, two Indo-European languages that differ significantly in their word-order preferences. Reordering of words and phrases is thus a necessary part of the translation process. While local reordering is modeled nicely by phrase-based systems, long-distance reordering is known to be a hard problem. We perform experiments using the Moses SMT system and discuss reordering models available in Moses. We then present our novel, Urdu-aware, yet generalizable approach based on reordering phrases in the syntactic parse tree of the source English sentence. Our technique significantly improves the quality of English-Urdu translation with Moses, both in terms of BLEU score and of subjective human judgments.

Neural Networks Classifier for Data Selection in Statistical Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0027 ◽

2017 ◽

Vol 108 (1) ◽

pp. 283-294 ◽

Cited By ~ 1

Author(s):

Álvaro Peris ◽

Mara Chinea-Ríos ◽

Francisco Casacuberta

Keyword(s):

Neural Networks ◽

Machine Translation ◽

Domain Adaptation ◽

Statistical Machine Translation ◽

Data Selection ◽

Target Domain ◽

Translation Quality ◽

Bilingual Corpora ◽

Proper Estimation ◽

Adaptation Field

Corpora are precious resources, as they allow for a proper estimation of statistical machine translation models. Data selection is a variant of the domain adaptation field that aims to extract those sentences from an out-of-domain corpus that are the most useful to translate a different target domain. We address the data selection problem in statistical machine translation as a classification task. We present a new method, based on neural networks, able to deal with monolingual and bilingual corpora. Empirical results show that our data selection method provides slightly better translation quality than a state-of-the-art method (cross-entropy), while requiring substantially less data. Moreover, the results obtained are coherent across different language pairs, demonstrating the robustness of our proposal.

Generation of Compound Words in Statistical Machine Translation into Compounding Languages

Computational Linguistics ◽

10.1162/coli_a_00162 ◽

2013 ◽

Vol 39 (4) ◽

pp. 1067-1108 ◽

Cited By ~ 3

Author(s):

Sara Stymne ◽

Nicola Cancedda ◽

Lars Ahrenberg

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Training Data ◽

Translation Process ◽

New Methods ◽

Part Of Speech ◽

The Right ◽

Right Order ◽

Germanic Languages ◽

Direct Inspection

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
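The split-merge strategy described above can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' method: the vocabulary-lookup splitter, the `@@` marker on non-final parts, and the function names `split_compound` and `merge_compounds` are all assumptions introduced here, and real systems also handle linking elements, casing, and multi-part compounds.

```python
def split_compound(word, vocabulary, min_part=3):
    """Split a word into two known parts, if such a split exists.

    The non-final part is marked with '@@' (a hypothetical marker) so
    that merging can find it again after translation.
    """
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary and tail in vocabulary:
            return [head + "@@", tail]
    return [word]  # no split found: keep the word intact


def merge_compounds(tokens):
    """Rejoin marked parts with the token that directly follows them."""
    merged, pending = [], ""
    for token in tokens:
        if token.endswith("@@"):
            pending += token[:-2]        # accumulate marked parts
        else:
            merged.append(pending + token)
            pending = ""
    if pending:                          # dangling part at sentence end
        merged.append(pending)
    return merged


vocabulary = {"haus", "tür"}
parts = split_compound("haustür", vocabulary)   # ['haus@@', 'tür']
restored = merge_compounds(["die"] + parts)     # ['die', 'haustür']
```

Because the merge step only joins a marked part with its immediate successor, it works as intended only when the parts are translated into adjacent positions in the right order, which is the risk the abstract points out.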

Evaluating Machine Translation Quality Using Short Segments Annotations

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2015-0005 ◽

2015 ◽

Vol 103 (1) ◽

pp. 85-110

Author(s):

Matouš Macháček ◽

Ondřej Bojar

Keyword(s):

Machine Translation ◽

Evaluation Method ◽

Statistical Machine Translation ◽

Translation System ◽

Translation Quality ◽

Machine Translation System

We propose a manual evaluation method for machine translation (MT), in which annotators rank only translations of short segments instead of whole sentences. This results in an easier and more efficient annotation. We have conducted an annotation experiment and evaluated a set of MT systems using this method. The obtained results are very close to the official WMT14 evaluation results. We also use the collected database of annotations to automatically evaluate new, unseen systems and to tune parameters of a statistical machine translation system. The evaluation of unseen systems, however, does not work and we analyze the reasons.

Topic-Based Dissimilarity and Sensitivity Models for Translation Rule Selection

Journal of Artificial Intelligence Research ◽

10.1613/jair.4265 ◽

2014 ◽

Vol 50 ◽

pp. 1-30 ◽

Cited By ~ 3

Author(s):

M. Zhang ◽

X. Xiao ◽

D. Xiong ◽

Q. Liu

Keyword(s):

Machine Translation ◽

Topic Model ◽

Statistical Machine Translation ◽

Model Space ◽

Target Language ◽

Translation Quality ◽

Rule Selection ◽

Translation Rule ◽

Selection Experiments ◽

Target Side

Translation rule selection is the task of selecting appropriate translation rules for an ambiguous source-language segment. As translation ambiguities are pervasive in statistical machine translation, we introduce two topic-based models for translation rule selection that incorporate global topic information into translation disambiguation. We associate each synchronous translation rule with source- and target-side topic distributions. With these topic distributions, we propose a topic dissimilarity model to select desirable (less dissimilar) rules by imposing penalties for rules with a large value of dissimilarity of their topic distributions to those of given documents. In order to encourage the use of non-topic-specific translation rules, we also present a topic sensitivity model to balance translation rule selection between generic rules and topic-specific rules. Furthermore, we project target-side topic distributions onto the source-side topic model space so that we can benefit from topic information of both the source and target language. We integrate the proposed topic dissimilarity and sensitivity models into hierarchical phrase-based machine translation for synchronous translation rule selection. Experiments show that our topic-based translation rule selection model can substantially improve translation quality.
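To make the notion of penalizing rules by topic dissimilarity concrete, the sketch below compares a rule's topic distribution with a document's topic distribution. The abstract does not name the dissimilarity measure, so the Hellinger distance is used here purely as an assumed example of a bounded distance between probability distributions; the function names and the toy rule set are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1).

    Assumed measure for illustration; the abstract does not specify one.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def least_dissimilar_rule(rule_topics, document_topics):
    """Return the rule whose topic distribution is closest to the document's."""
    return min(
        rule_topics,
        key=lambda rule: hellinger(rule_topics[rule], document_topics),
    )


# Toy example with two topics (say, finance and nature); values are made up.
rule_topics = {
    "bank -> financial institution": [0.9, 0.1],
    "bank -> river side": [0.1, 0.9],
}
document_topics = [0.8, 0.2]
best = least_dissimilar_rule(rule_topics, document_topics)
```

In a decoder the distance would more plausibly enter as a penalty feature weighed against other model scores rather than as a hard selection, which is closer to the "imposing penalties" wording of the abstract.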
