An analysis of the effect of training data variation in English-Persian Statistical Machine Translation

Author(s):  
Mahsa Mohaghegh
Abdolhossein Sarrafzadeh
2010
Vol 17 (3)
pp. 101-122
Author(s):  
Eric Nichols
Francis Bond
D. Scott Appling
Yuji Matsumoto

2013
Vol 39 (4)
pp. 1067-1108
Author(s):  
Sara Stymne
Nicola Cancedda
Lars Ahrenberg

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order, and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and translations of at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech-based information in the translation process in order to handle compounds.
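To make the split-merge strategy concrete, the following is a minimal, illustrative Python sketch of how compounds might be split into marked parts before training and re-merged in the translation output afterwards. The greedy splitter, the hypothetical "<COMP>" marker, and the toy German example are assumptions for illustration only; they are not the merging algorithms or POS-based features proposed in the article.

    # Illustrative sketch of the split-merge idea for compound processing.
    # The "<COMP>" marker is a hypothetical convention, not the article's method.

    def split_compound(token, known_parts):
        """Greedy left-to-right split of a compound into known parts (toy heuristic)."""
        parts, rest = [], token
        while rest:
            for end in range(len(rest), 0, -1):
                if rest[:end].lower() in known_parts:
                    parts.append(rest[:end])
                    rest = rest[end:]
                    break
            else:
                return [token]          # give up: not splittable
        # mark all but the last part so they can be re-merged after translation
        return [p + "<COMP>" for p in parts[:-1]] + [parts[-1]]

    def merge_compounds(tokens):
        """Merge any token carrying the <COMP> marker with the following token."""
        merged, buffer = [], ""
        for tok in tokens:
            if tok.endswith("<COMP>"):
                buffer += tok[:-len("<COMP>")]
            else:
                merged.append(buffer + tok)
                buffer = ""
        if buffer:                       # dangling marker: keep it as a word
            merged.append(buffer)
        return merged

    if __name__ == "__main__":
        known = {"apfel", "saft", "flasche"}
        print(split_compound("Apfelsaftflasche", known))
        # ['Apfel<COMP>', 'saft<COMP>', 'flasche']
        print(merge_compounds(["eine", "Apfel<COMP>", "saft<COMP>", "flasche"]))
        # ['eine', 'Apfelsaftflasche']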


2020
pp. 1-22
Author(s):  
Sukanta Sen
Mohammed Hasanuzzaman
Asif Ekbal
Pushpak Bhattacharyya
Andy Way

Abstract Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and it is not always possible to obtain a sufficiently large corpus, as doing so requires time, money, and professional expertise. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in a low-resource scenario without using any additional data. Our approach aims at augmenting the original training data by means of parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BLEU (BiLingual Evaluation Understudy) points over the baseline systems, and that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that for some translation directions our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios. To further show the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7k sentences. For this task, we use publicly available old-modern English text that is approximately 1,000 years old. Evaluation for this task shows a significant improvement over the baseline NMT system.
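The core augmentation step can be pictured as appending SMT-extracted phrase pairs to the original bitext as extra pseudo-sentence pairs. The sketch below assumes a Moses-style phrase table (fields separated by " ||| ") and a simple probability threshold; both are illustrative assumptions rather than the paper's exact extraction settings.

    # Illustrative sketch: augment a bitext with phrase pairs from an SMT phrase table.
    # Phrase-table format and threshold are assumptions, not the paper's settings.

    def augment_with_phrases(src_sents, tgt_sents, phrase_table_path,
                             min_prob=0.5, max_pairs=50000):
        aug_src, aug_tgt = list(src_sents), list(tgt_sents)
        added = 0
        with open(phrase_table_path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split(" ||| ")
                if len(fields) < 3:
                    continue
                src_phrase, tgt_phrase, scores = fields[0], fields[1], fields[2]
                # assume the first score in the third field is a translation probability
                p = float(scores.split()[0])
                if p >= min_prob and src_phrase.strip() and tgt_phrase.strip():
                    aug_src.append(src_phrase)   # phrase pair becomes a pseudo-sentence pair
                    aug_tgt.append(tgt_phrase)
                    added += 1
                    if added >= max_pairs:
                        break
        return aug_src, aug_tgt

The augmented lists would then be fed to the usual GRU or transformer training pipeline in place of the original bitext.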


2019
Vol 45 (2)
pp. 267-292
Author(s):  
Akiko Eriguchi
Kazuma Hashimoto
Yoshimasa Tsuruoka

Neural machine translation (NMT) has shown great success as a new alternative to traditional statistical machine translation models in multiple languages. Early NMT models are based on sequence-to-sequence learning that encodes a sequence of source words into a vector space and generates another sequence of target words from the vector. In those NMT models, sentences are simply treated as sequences of words without any internal structure. In this article, we focus on the role of the syntactic structure of source sentences and propose a novel end-to-end syntactic NMT model, which we call a tree-to-sequence NMT model, extending a sequence-to-sequence model with the source-side phrase structure. Our proposed model has an attention mechanism that enables the decoder to generate a translated word while softly aligning it with phrases as well as words of the source sentence. We have empirically compared the proposed model with sequence-to-sequence models in various settings on Chinese-to-Japanese and English-to-Japanese translation tasks. Our experimental results suggest that the use of syntactic structure can be beneficial when the training data set is small, but is not as effective as using a bi-directional encoder. As the size of the training data set increases, the benefits of using a syntactic tree tend to diminish.
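The attention idea can be illustrated with a toy computation: the decoder state attends over an attention memory containing both word-level and phrase-node encodings of the source sentence. The dot-product scoring and the shapes below are simplifications for illustration, not the paper's exact architecture.

    # Toy illustration of attention over both source words and source phrase nodes.
    import numpy as np

    def tree_attention(decoder_state, word_states, phrase_states):
        """Soft-align a decoder state with source words AND source phrase nodes."""
        # stack word encodings and phrase-node encodings into one attention memory
        memory = np.vstack([word_states, phrase_states])   # (n_words + n_nodes, d)
        scores = memory @ decoder_state                     # dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                            # softmax over words and nodes
        context = weights @ memory                          # weighted context vector
        return context, weights

    if __name__ == "__main__":
        d = 4
        rng = np.random.default_rng(0)
        words = rng.normal(size=(5, d))    # 5 source word encodings
        nodes = rng.normal(size=(3, d))    # 3 phrase-node encodings from the parse tree
        ctx, w = tree_attention(rng.normal(size=d), words, nodes)
        print(w.round(3), ctx.round(3))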


2016
Vol 42 (1)
pp. 121-161
Author(s):  
Daniel Ortiz-Martínez

We present online learning techniques for statistical machine translation (SMT). The availability of large training data sets that grow constantly over time is becoming more and more common in the field of SMT, for example in the context of translation agencies or the daily translation of government proceedings. When new knowledge is to be incorporated into the SMT models, the use of batch learning techniques requires very time-consuming estimation processes over the whole training set that may take days or weeks to execute. By means of online learning, new training samples can instead be processed individually in real time. For this purpose, we define a state-of-the-art SMT model composed of a set of submodels, as well as a set of incremental update rules for each of these submodels. To test our techniques, we have studied two well-known SMT applications that can be used in translation agencies: post-editing and interactive machine translation. In both scenarios, the SMT system collaborates with the user to generate high-quality translations. These user-validated translations can be used to extend the SMT models by means of online learning. Empirical results in the two scenarios under consideration show the great impact of frequent updates on system performance. The time cost of such updates was also measured, comparing the efficiency of a batch learning SMT system with that of an online learning system; the comparison shows that online learning is able to work in real time, whereas the time cost of batch retraining quickly becomes prohibitive. Empirical results also show that the performance of online learning is comparable to that of batch learning. Moreover, the proposed techniques are able to learn either from previously estimated models or from scratch. We also propose two new measures to predict the effectiveness of online learning in SMT tasks. The translation system with online learning capabilities presented here is implemented in the open-source Thot toolkit for SMT.
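As a rough illustration of the online-learning idea, the sketch below keeps count-based sufficient statistics for phrase pairs and updates them incrementally each time a user-validated sentence pair arrives, so that translation probabilities stay current without batch retraining. The simple relative-frequency model is an assumption for illustration; Thot's actual submodels and incremental update rules are more elaborate.

    # Illustrative sketch: incremental update of phrase-pair statistics.
    from collections import defaultdict

    class IncrementalPhraseModel:
        def __init__(self):
            self.pair_counts = defaultdict(float)   # counts of (src, tgt) phrase pairs
            self.src_counts = defaultdict(float)    # marginal counts of src phrases

        def update(self, phrase_pairs):
            """Process one validated sentence pair, given its extracted phrase pairs."""
            for src, tgt in phrase_pairs:
                self.pair_counts[(src, tgt)] += 1.0
                self.src_counts[src] += 1.0

        def prob(self, src, tgt):
            """Relative-frequency estimate p(tgt | src), always up to date."""
            if self.src_counts[src] == 0.0:
                return 0.0
            return self.pair_counts[(src, tgt)] / self.src_counts[src]

    if __name__ == "__main__":
        model = IncrementalPhraseModel()
        model.update([("la casa", "the house"), ("casa", "house")])
        model.update([("la casa", "the home")])
        print(model.prob("la casa", "the house"))   # 0.5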


2016
Vol 106 (1)
pp. 159-168
Author(s):  
Julian Hitschler
Laura Jehl
Sariya Karimova
Mayumi Ohta
Benjamin Körner
...  

Abstract We present Otedama, a fast, open-source tool for rule-based syntactic pre-ordering, a well-established technique in statistical machine translation. Otedama implements both a learner for pre-ordering rules and a component for applying these rules to parsed sentences. Our system is compatible with several external parsers and capable of accommodating many source and all target languages in any machine translation paradigm that uses parallel training data. We demonstrate improvements on a patent translation task over a state-of-the-art English-Japanese hierarchical phrase-based machine translation system. We compare Otedama with an existing syntax-based pre-ordering system, showing comparable translation performance with a runtime speedup by a factor of 4.5-10.
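The rule-application component can be illustrated with a small sketch: a pre-ordering rule matches a parent label and an ordered pair of child labels in the parse tree, and swaps those children so that the source word order better resembles the target language. The tuple-based tree encoding and the rule format below are illustrative assumptions, not Otedama's actual representation.

    # Illustrative sketch of applying pre-ordering rules to a parsed sentence.

    def apply_preordering_rules(tree, rules):
        """tree: (label, children) where children are subtrees or word strings;
        rules: set of (parent_label, left_child_label, right_child_label) to swap."""
        label, children = tree
        children = [apply_preordering_rules(c, rules) if isinstance(c, tuple) else c
                    for c in children]

        def child_label(c):
            return c[0] if isinstance(c, tuple) else c

        # bounded number of passes so a cyclic rule set cannot loop forever
        for _ in range(len(children) + 1):
            swapped = False
            for i in range(len(children) - 1):
                if (label, child_label(children[i]), child_label(children[i + 1])) in rules:
                    children[i], children[i + 1] = children[i + 1], children[i]
                    swapped = True
            if not swapped:
                break
        return (label, children)

    if __name__ == "__main__":
        # "eat sushi" pre-ordered toward Japanese verb-final order: "sushi eat"
        tree = ("VP", [("V", ["eat"]), ("NP", ["sushi"])])
        rules = {("VP", "V", "NP")}
        print(apply_preordering_rules(tree, rules))
        # ('VP', [('NP', ['sushi']), ('V', ['eat'])])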


2013
Vol 284-287
pp. 3325-3329
Author(s):  
Long Yue Wang
Derek F. Wong
Lidia S. Chao

This paper presents an experimental Cross-Language Document Retrieval platform that integrates modules for training-data preprocessing, document translation, query generation, document retrieval, and precision evaluation. Given a document in the source language, it is translated into the target language by a statistical machine translation module trained on selected training data. The query generation module then selects the most relevant words in the translated version of the document as the search query. After all documents in the target language have been ranked by the document retrieval module, the system chooses the N-best documents as candidate target-language versions of the source document. Finally, the results can be evaluated by the precision evaluation module, which reflects the merits of the different strategies. Experimental results show that the platform is effective and achieves very good performance.
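The pipeline can be sketched end to end in a few functions: translate the source document (a stub stands in for the SMT module here), select the most distinctive terms of the translation as the query, and rank the target-language collection. TF-IDF term weighting and bag-of-words scoring are illustrative choices, not necessarily the platform's exact configuration.

    # Illustrative sketch of the translate -> query -> retrieve pipeline.
    import math
    from collections import Counter

    def translate(doc_src):
        """Placeholder for the SMT module; assumed to return target-language text."""
        return doc_src  # identity stub for this sketch

    def build_query(translated_doc, collection, top_k=5):
        """Pick the top_k highest TF-IDF terms of the translated document."""
        tf = Counter(translated_doc.lower().split())
        n_docs = len(collection)

        def idf(term):
            df = sum(1 for d in collection if term in d.lower().split())
            return math.log((n_docs + 1) / (df + 1)) + 1.0

        return sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)[:top_k]

    def retrieve(query, collection, n_best=3):
        """Rank target-language documents by how often they contain the query terms."""
        def score(doc):
            words = Counter(doc.lower().split())
            return sum(words[q] for q in query)
        return sorted(collection, key=score, reverse=True)[:n_best]

    if __name__ == "__main__":
        docs = ["the cat sat on the mat", "machine translation of documents",
                "retrieval of translated documents"]
        q = build_query(translate("translation of documents"), docs, top_k=2)
        print(q, retrieve(q, docs, n_best=2))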


2012
Vol 44
pp. 179-222
Author(s):  
P. Nakov
H. T. Ng

We propose a novel language-independent approach for improving machine translation for resource-poor languages by exploiting their similarity to resource-rich ones. More precisely, we improve the translation from a resource-poor source language X_1 into a resource-rich language Y given a bi-text containing a limited number of parallel sentences for X_1-Y and a larger bi-text for X_2-Y, where X_2 is a resource-rich language closely related to X_1. This is achieved by taking advantage of the vocabulary overlap and of the similarities between X_1 and X_2 in spelling, word order, and syntax: (1) we improve the word alignments for the resource-poor language, (2) we further augment it with additional translation options, and (3) we take care of potential spelling differences through appropriate transliteration. The evaluation for Indonesian->English using Malay, and for Spanish->English using Portuguese while pretending Spanish is resource-poor, shows an absolute gain of up to 1.35 and 3.37 BLEU points, respectively, which is an improvement over the best rivaling approaches, while using much less additional data. Overall, our method cuts the amount of necessary "real" training data by a factor of 2-5.
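Steps (2) and (3) can be pictured as merging the small X_1-Y phrase table with entries from the larger X_2-Y table whose source side has been transliterated toward X_1. The merging policy, the down-weighting penalty, and the single-character mapping in the example (Malay "wang" to Indonesian "uang") are illustrative assumptions, not the paper's actual method.

    # Illustrative sketch: augment a resource-poor phrase table with transliterated
    # entries from a related resource-rich language.

    def transliterate(word, char_map):
        """Apply systematic spelling correspondences between X_2 and X_1."""
        for src, tgt in char_map.items():
            word = word.replace(src, tgt)
        return word

    def merge_phrase_tables(poor_table, rich_table, char_map, penalty=0.5):
        """poor/rich tables: {(src_phrase, tgt_phrase): prob}. Poor entries take priority."""
        # transliterate the rich-language source side toward the poor language
        merged = {(" ".join(transliterate(w, char_map) for w in s.split()), t): p * penalty
                  for (s, t), p in rich_table.items()}
        merged.update(poor_table)   # original resource-poor entries win on overlap
        return merged

    if __name__ == "__main__":
        malay = {("wang", "money"): 0.8}          # stand-in for the larger X_2-Y table
        indonesian = {("uang", "money"): 0.9}     # stand-in for the small X_1-Y table
        print(merge_phrase_tables(indonesian, malay, {"w": "u"}))
        # {('uang', 'money'): 0.9}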


2013
Vol 99 (1)
pp. 17-38
Author(s):  
Matthias Huck
Erik Scharwächter
Hermann Ney

Abstract Standard phrase-based statistical machine translation systems generate translations based on an inventory of continuous bilingual phrases. In this work, we extend a phrase-based decoder with the ability to make use of phrases that are discontinuous in the source part. Our dynamic programming beam search algorithm supports separate pruning of coverage hypotheses per cardinality and of lexical hypotheses per coverage, as well as coverage constraints that impose restrictions on the possible reorderings. In addition to investigating these aspects, which are related to the decoding procedure, we also concentrate our attention on the question of how to obtain source-side discontinuous phrases from parallel training data. Two approaches (hierarchical and discontinuous extraction) are presented and compared. On a large-scale Chinese->English translation task, we conduct a thorough empirical evaluation in order to study a number of system configurations with source-side discontinuous phrases, and to compare them to setups which employ continuous phrases only.
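The two-level pruning described above can be sketched as grouping hypotheses first by the cardinality of their coverage vector and then by the coverage itself, with separate beam limits at each level. Hypothesis payloads and scores below are placeholders; this is an illustration of the organisation, not the decoder's actual data structures.

    # Illustrative sketch: prune coverage hypotheses per cardinality and
    # lexical hypotheses per coverage.
    from collections import defaultdict

    def prune(hypotheses, max_coverages_per_card=10, max_hyps_per_coverage=5):
        """hypotheses: list of (coverage_frozenset, score, payload) tuples."""
        by_card = defaultdict(lambda: defaultdict(list))
        for cov, score, payload in hypotheses:
            by_card[len(cov)][cov].append((score, payload))

        pruned = []
        for card, coverages in by_card.items():
            # coverage pruning: keep the best-scoring coverages within this cardinality
            best_covs = sorted(coverages,
                               key=lambda c: max(s for s, _ in coverages[c]),
                               reverse=True)[:max_coverages_per_card]
            for cov in best_covs:
                # lexical pruning: keep the best hypotheses sharing this coverage
                kept = sorted(coverages[cov], key=lambda h: h[0],
                              reverse=True)[:max_hyps_per_coverage]
                pruned.extend((cov, s, p) for s, p in kept)
        return pruned

    if __name__ == "__main__":
        hyps = [(frozenset({0, 1}), -2.0, "a b"), (frozenset({0, 1}), -3.5, "a c"),
                (frozenset({2}), -1.0, "d"), (frozenset({0, 2}), -2.2, "a d")]
        for h in prune(hyps, max_coverages_per_card=1, max_hyps_per_coverage=1):
            print(h)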


Author(s):  
Hoang Cuong
Khalil Sima’an
Ivan Titov

Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced subdomains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.
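Point (2) can be illustrated with a simple specialisation measure: the negated, normalised entropy of a phrase pair's occurrence distribution over the induced subdomains, so that a flat distribution signals a domain-invariant phrase and a peaked one a domain-specific phrase. The exact feature definitions in the paper may differ; this is an illustrative sketch only.

    # Illustrative sketch of a domain-specificity feature for a phrase pair.
    import math

    def domain_specificity(counts_per_subdomain):
        """counts_per_subdomain: occurrences of the phrase pair in each induced subdomain."""
        total = sum(counts_per_subdomain)
        if total == 0 or len(counts_per_subdomain) < 2:
            return 0.0
        probs = [c / total for c in counts_per_subdomain if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        max_entropy = math.log(len(counts_per_subdomain))
        return 1.0 - entropy / max_entropy    # 0 = fully invariant, 1 = fully specific

    if __name__ == "__main__":
        print(domain_specificity([10, 10, 10]))   # ~0.0: used evenly across subdomains
        print(domain_specificity([30, 0, 0]))     # 1.0: confined to one subdomain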

