Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages

We propose a novel language-independent approach for improving machine translation for resource-poor languages by exploiting their similarity to resource-rich ones. More precisely, we improve the translation from a resource-poor source language X_1 into a resource-rich language Y given a bi-text containing a limited number of parallel sentences for X_1-Y and a larger bi-text for X_2-Y for some resource-rich language X_2 that is closely related to X_1. This is achieved by taking advantage of the opportunities that vocabulary overlap and similarities between the languages X_1 and X_2 in spelling, word order, and syntax offer: (1) we improve the word alignments for the resource-poor language, (2) we further augment it with additional translation options, and (3) we take care of potential spelling differences through appropriate transliteration. The evaluation for Indonesian -> English using Malay and for Spanish -> English using Portuguese (pretending Spanish is resource-poor) shows an absolute gain of up to 1.35 and 3.37 BLEU points, respectively, which is an improvement over the best rivaling approaches, while using much less additional data. Overall, our method cuts the amount of necessary "real" training data by a factor of 2 to 5.


An Experimental Platform for Cross-Language Document Retrieval

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.284-287.3325 ◽

2013 ◽

Vol 284-287 ◽

pp. 3325-3329

Author(s):

Long Yue Wang ◽

Derek F. Wong ◽

Lidia S. Chao

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Document Retrieval ◽

Training Data ◽

Target Language ◽

Source Language ◽

Experimental Platform ◽

Precision Evaluation ◽

Query Generation ◽

Cross Language

This paper presents a Cross-Language Document Retrieval experimental platform that integrates modules for training-data preprocessing, document translation, query generation, document retrieval, and precision evaluation. A document in the source language is translated into the target language by a statistical machine translation module trained on selected training data. The query generation module then selects the most relevant words in the translated document as the search query. After the document retrieval module ranks all documents in the target language, the system chooses the N-best documents as the target-language versions of the source document. Finally, a precision evaluator scores the results, which reflects the merits of the strategies used. Experimental results showed that the platform was effective and achieved very good performance.
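The query generation, ranking, and precision steps described in this abstract can be illustrated with a minimal sketch. It assumes the document has already been translated, uses a simple TF-IDF weighting as a stand-in for whatever relevance measure the platform actually applies, and all function names (`generate_query`, `rank_documents`, `precision_at_n`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use proper preprocessing.
    return text.lower().split()


def idf_table(docs):
    # Smoothed inverse document frequency over the target-language collection.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def generate_query(translated_doc, idf, k=3):
    # Keep the k highest-weighted words that occur in the collection
    # (assumption: words unseen in the collection cannot help retrieval).
    tf = Counter(tokenize(translated_doc))
    scored = {w: c * idf[w] for w, c in tf.items() if w in idf}
    return sorted(scored, key=lambda w: (-scored[w], w))[:k]


def rank_documents(query, docs, idf, n_best=2):
    # Score each document by the summed TF-IDF of the query words it contains.
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(tokenize(doc))
        scores.append((sum(tf[w] * idf.get(w, 0.0) for w in query), i))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:n_best]]


def precision_at_n(retrieved, relevant):
    # Fraction of the retrieved documents that are truly relevant.
    if not retrieved:
        return 0.0
    return sum(1 for i in retrieved if i in relevant) / len(retrieved)


docs = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "the dog chased the cat",
]
idf = idf_table(docs)
query = generate_query("markets fell as stock prices dropped", idf)
top = rank_documents(query, docs, idf, n_best=1)
print(query, top, precision_at_n(top, {1}))
```

The toy collection and the translated sentence are invented for illustration; they are not data from the paper.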


Source Language Adaptation Approaches for Resource-Poor Machine Translation

Computational Linguistics ◽

10.1162/coli_a_00248 ◽

2016 ◽

Vol 42 (2) ◽

pp. 277-306 ◽

Cited By ~ 8

Author(s):

Pidong Wang ◽

Preslav Nakov ◽

Hwee Tou Ng

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Target Language ◽

Source Language ◽

World Languages ◽

Word Level ◽

Resource Poor ◽

Morphological Variants ◽

Cross Lingual ◽

Translation Systems

Most of the world's languages are resource-poor for statistical machine translation; still, many of them are actually related to some resource-rich language. Thus, we propose three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation. Specifically, we build improved statistical machine translation models from a resource-poor language POOR into a target language TGT by adapting and using a large bitext for a related resource-rich language RICH and the same target language TGT. We assume a small POOR–TGT bitext from which we learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language. Our work is of importance for resource-poor machine translation because it can provide a useful guideline for people building machine translation systems for resource-poor languages. Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bitext yields 7.26 BLEU points of improvement over the unadapted one and 3.09 BLEU points over the original small bitext. Moreover, combining the small POOR–TGT bitext with the adapted bitext outperforms the corresponding combinations with the unadapted bitext by 1.93–3.25 BLEU points. We also demonstrate the applicability of our approaches to other languages and domains.


Re-structuring, Re-labeling, and Re-aligning for Syntax-Based Machine Translation

Computational Linguistics ◽

10.1162/coli.2010.36.2.09054 ◽

2010 ◽

Vol 36 (2) ◽

pp. 247-277 ◽

Cited By ~ 11

Author(s):

Wei Wang ◽

Jonathan May ◽

Kevin Knight ◽

Daniel Marcu

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Syntactic Structure ◽

Statistical Machine Translation ◽

Training Data ◽

Word Alignment ◽

The Em Algorithm ◽

Rule Application ◽

Word Alignments ◽

Parse Trees

This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.
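As a rough illustration of how re-structuring can expose reusable substructures, the sketch below left-binarizes a flat parse-tree node by introducing intermediate nodes. Treating re-structuring as binarization is an assumption made here for illustration, as are the tuple-based tree encoding and the `-BAR` label suffix; the abstract does not spell out the exact transformation.

```python
def binarize(tree):
    """Left-binarize a tree given as (label, [children]); leaves are strings."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(c) for c in children]
    while len(children) > 2:
        # Fold the two leftmost children into a new intermediate node
        # (hypothetical label: parent label plus "-BAR").
        children = [(label + "-BAR", children[:2])] + children[2:]
    return (label, children)


flat_np = ("NP", ["the", "big", "red", "dog"])
print(binarize(flat_np))
```

After binarization, a smaller constituent such as the node covering "the big" exists in its own right, so a rule extracted for it could be reused in other contexts.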


Word-Order Issues in English-to-Urdu Statistical Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-011-0007-0 ◽

2011 ◽

Vol 95 (1) ◽

pp. 87-106 ◽

Cited By ~ 3

Author(s):

Bushra Jawaid ◽

Daniel Zeman

Keyword(s):

Machine Translation ◽

Word Order ◽

Statistical Machine Translation ◽

Parse Tree ◽

Hard Problem ◽

Long Distance ◽

Translation Process ◽

English Sentence ◽

European Languages

We investigate phrase-based statistical machine translation between English and Urdu, two Indo-European languages that differ significantly in their word-order preferences. Reordering of words and phrases is thus a necessary part of the translation process. While local reordering is modeled nicely by phrase-based systems, long-distance reordering is known to be a hard problem. We perform experiments using the Moses SMT system and discuss reordering models available in Moses. We then present our novel, Urdu-aware, yet generalizable approach based on reordering phrases in the syntactic parse tree of the source English sentence. Our technique significantly improves the quality of English-Urdu translation with Moses, both in terms of BLEU score and of subjective human judgments.


Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0025 ◽

2017 ◽

Vol 108 (1) ◽

pp. 257-269 ◽

Cited By ~ 4

Author(s):

Nasser Zalmout ◽

Nizar Habash

Keyword(s):

Machine Translation ◽

Performance Enhancement ◽

Statistical Machine Translation ◽

Target Language ◽

Source Language ◽

Context Variable ◽

Significant Performance ◽

Morphologically Rich Languages ◽

Target Languages ◽

Language Text

Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant performance enhancement of about 1.4 BLEU points.


Paraphrasing Training Data for Statistical Machine Translation

Journal of Natural Language Processing ◽

10.5715/jnlp.17.3_101 ◽

2010 ◽

Vol 17 (3) ◽

pp. 101-122 ◽

Cited By ~ 2

Author(s):

Eric Nichols ◽

Francis Bond ◽

D. Scott Appling ◽

Yuji Matsumoto

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Training Data


An analysis of the effect of training data variation in English-Persian Statistical Machine Translation

2009 International Conference on Innovations in Information Technology (IIT) ◽

10.1109/iit.2009.5413782 ◽

2009 ◽

Author(s):

Mahsa Mohaghegh ◽

Abdolhossein Sarrafzadeh

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Training Data ◽

Data Variation


Symmetric word alignments for statistical machine translation

10.3115/1220355.1220387 ◽

2004 ◽

Cited By ~ 15

Author(s):

Evgeny Matusov ◽

Richard Zens ◽

Hermann Ney

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Word Alignments


Generation of Compound Words in Statistical Machine Translation into Compounding Languages

Computational Linguistics ◽

10.1162/coli_a_00162 ◽

2013 ◽

Vol 39 (4) ◽

pp. 1067-1108 ◽

Cited By ~ 3

Author(s):

Sara Stymne ◽

Nicola Cancedda ◽

Lars Ahrenberg

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Training Data ◽

Translation Process ◽

New Methods ◽

Part Of Speech ◽

The Right ◽

Right Order ◽

Germanic Languages ◽

Direct Inspection

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
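The split-merge strategy described in this abstract can be sketched in a few lines. This is a toy version under stated assumptions: splitting is a greedy two-part lookup against a known vocabulary, non-final parts carry a hypothetical `#` marker, and merging simply concatenates marked parts with the next token. It ignores linking elements, casing, and the part-of-speech information the article relies on.

```python
MARK = "#"  # hypothetical marker attached to non-final compound parts


def split_compound(word, vocab, min_len=3):
    """Split a word into two known parts if possible; otherwise keep it whole."""
    if word in vocab:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab and tail in vocab:
            return [head + MARK, tail]
    return [word]


def split_sentence(tokens, vocab):
    # Applied to the training data before the SMT system is trained.
    out = []
    for token in tokens:
        out.extend(split_compound(token, vocab))
    return out


def merge_sentence(tokens):
    # Applied to the translation output: glue marked parts to what follows.
    out, pending = [], ""
    for token in tokens:
        if token.endswith(MARK):
            pending += token[: -len(MARK)]
        else:
            out.append(pending + token)
            pending = ""
    if pending:  # a dangling marked part at the end is kept as a plain word
        out.append(pending)
    return out


vocab = {"das", "haus", "boot"}
parts = split_sentence(["das", "hausboot"], vocab)
print(parts)
print(merge_sentence(parts))
print(merge_sentence(["boot#", "haus"]))
```

Because merging operates on parts rather than on whole words, the last call can produce a compound that never appeared in the input vocabulary. It also shows why the parts must come out contiguous and in the right order for merging to succeed.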


Knowledge Graphs Effectiveness in Neural Machine Translation Improvement

Computer Science ◽

10.7494/csci.2020.21.3.3701 ◽

2020 ◽

Vol 21 (3) ◽

Author(s):

Benyamin Ahmadnia ◽

Bonnie J. Dorr ◽

Parisa Kordjamshidi

Keyword(s):

Machine Translation ◽

Semantic Representation ◽

Language Translation ◽

Semantic Relations ◽

Training Data ◽

Target Language ◽

Neural Machine Translation ◽

Source Language ◽

Knowledge Graphs ◽

Unknown Words

Maintaining semantic relations between words during the translation process yields more accurate target-language output from Neural Machine Translation (NMT). Although difficult to achieve from training data alone, it is possible to leverage Knowledge Graphs (KGs) to retain source-language semantic relations in the corresponding target-language translation. The core idea is to use KG entity relations as embedding constraints to improve the mapping from source to target. This paper describes two embedding constraints, both of which employ Entity Linking (EL), that is, assigning a unique identity to entities, to associate words in training sentences with those in the KG: (1) a monolingual embedding constraint that supports an enhanced semantic representation of the source words through access to relations between entities in a KG; and (2) a bilingual embedding constraint that forces entity relations in the source language to be carried over to the corresponding entities in the target-language translation. The method is evaluated for English-Spanish translation exploiting Freebase as a source of knowledge. Our experimental results show that exploiting KG information not only decreases the number of unknown words in the translation but also improves translation quality.
