Improving syntactic rule extraction through deleting spurious links with translation span alignment

2013 ◽  
Vol 21 (2) ◽  
pp. 227-249 ◽  
Author(s):  
JINGBO ZHU ◽  
QIANG LI ◽  
TONG XIAO

Most statistical machine translation systems rely on word alignments to extract translation rules. This approach suffers from a practical problem: even one spurious word alignment link can prevent desirable translation rules from being extracted. To address this issue, this paper presents two approaches, referred to as sub-tree alignment and phrase-based forced decoding, for automatically learning translation span alignments from parallel data. We then improve translation rule extraction by deleting spurious links and inserting new links based on bilingual translation span correspondences. Comparison experiments demonstrate the effectiveness of the proposed approaches.
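The core operation of the paper's link-deletion step can be illustrated with a minimal sketch (our own toy code, not the authors' implementation): given a word alignment and a set of bilingual translation span correspondences, keep only the links that fall entirely inside some aligned span pair. Span boundaries here are inclusive word indices.

```python
def prune_links(links, span_pairs):
    """Keep a link (i, j) only if some aligned span pair covers both ends.

    links: set of (src_idx, tgt_idx) word alignment links
    span_pairs: list of ((s_lo, s_hi), (t_lo, t_hi)) inclusive spans
    """
    kept = set()
    for i, j in links:
        for (s_lo, s_hi), (t_lo, t_hi) in span_pairs:
            if s_lo <= i <= s_hi and t_lo <= j <= t_hi:
                kept.add((i, j))
                break
    return kept

links = {(0, 0), (1, 1), (1, 3)}              # (1, 3) is a spurious link
spans = [((0, 0), (0, 0)), ((1, 2), (1, 2))]  # learned span correspondences
print(prune_links(links, spans))              # keeps (0, 0) and (1, 1); drops (1, 3)
```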

2013 ◽  
Vol 48 ◽  
pp. 733-782 ◽  
Author(s):  
T. Xiao ◽  
J. Zhu

This article presents a probabilistic sub-tree alignment model and its application to tree-to-tree machine translation. Unlike previous work, we do not resort to surface heuristics or expensive annotated data, but instead derive an unsupervised model to infer the syntactic correspondence between two languages. More importantly, the developed model is syntactically motivated and does not rely on word alignments. As a by-product, our model outputs a sub-tree alignment matrix encoding a large number of diverse alignments between syntactic structures, from which machine translation systems can efficiently extract translation rules that are often filtered out due to errors in the 1-best alignment. Experimental results show that the proposed approach outperforms three state-of-the-art baseline approaches in both alignment accuracy and grammar quality. When applied to machine translation, our approach yields a +1.0 BLEU improvement and a 0.9-point TER reduction on the NIST machine translation evaluation corpora. With tree binarization and fuzzy decoding, it even outperforms a state-of-the-art hierarchical phrase-based system.
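One way such an alignment matrix might be consumed downstream is to read out all node pairs whose posterior clears a threshold, rather than committing to a single 1-best alignment. A minimal sketch under that assumption (the thresholding rule is our own illustration, not the paper's method):

```python
def aligned_node_pairs(posterior, threshold=0.5):
    """Extract sub-tree node pairs whose alignment posterior clears a
    threshold, yielding multiple diverse alignments instead of one 1-best."""
    pairs = []
    for s, row in enumerate(posterior):
        for t, p in enumerate(row):
            if p >= threshold:
                pairs.append((s, t, p))
    return pairs

# rows = source tree nodes, columns = target tree nodes
P = [[0.9, 0.1],
     [0.2, 0.7]]
print(aligned_node_pairs(P))  # [(0, 0, 0.9), (1, 1, 0.7)]
```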


2016 ◽  
Vol 22 (4) ◽  
pp. 549-573 ◽  
Author(s):  
SANJIKA HEWAVITHARANA ◽  
STEPHAN VOGEL

Mining parallel data from comparable corpora is a promising approach to overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach designed to align only the parallel sections of a sentence, bypassing the non-parallel sections. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches on a manually aligned data set and show that the proposed approach outperforms the other two. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, where they yield improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms that extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.
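Baseline (1), standard phrase extraction from a word alignment, can be sketched as follows: a source span and the target span its links project to form a phrase pair only if no link crosses the pair's boundaries (the usual consistency check). This is a generic illustration of that baseline, not the paper's code.

```python
def extract_phrases(n_src, links, max_len=4):
    """Standard consistency-based phrase extraction from one sentence pair.

    n_src: source sentence length
    links: set of (src_idx, tgt_idx) word alignment links
    """
    pairs = []
    for s1 in range(n_src):
        for s2 in range(s1, min(s1 + max_len, n_src)):
            tgt = [j for (i, j) in links if s1 <= i <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            # consistent only if every link into [t1, t2] starts inside [s1, s2]
            if all(s1 <= i <= s2 for (i, j) in links if t1 <= j <= t2):
                pairs.append(((s1, s2), (t1, t2)))
    return pairs

# toy 2x2 sentence pair with crossing links
print(extract_phrases(2, {(0, 1), (1, 0)}))
```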


2010 ◽  
Vol 36 (3) ◽  
pp. 303-339 ◽  
Author(s):  
Yang Liu ◽  
Qun Liu ◽  
Shouxun Lin

Word alignment plays an important role in many NLP tasks as it indicates the correspondence between words in a parallel text. Although widely used to align large bilingual corpora, generative models are hard to extend to incorporate arbitrary useful linguistic information. This article presents a discriminative framework for word alignment based on a linear model. Within this framework, all knowledge sources are treated as feature functions, which depend on a source language sentence, a target language sentence, and the alignment between them. We describe a number of features that could produce symmetric alignments. Our model is easy to extend and can be optimized with respect to evaluation metrics directly. The model achieves state-of-the-art alignment quality on three word alignment shared tasks for five language pairs with varying divergence and richness of resources. We further show that our approach improves translation performance for various statistical machine translation systems.
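The framework's central idea, scoring an alignment as a weighted sum of feature functions over the source sentence, target sentence, and alignment, can be sketched as follows. The two toy features are our own illustrations, not features from the article.

```python
def linear_score(src, tgt, links, weights, features):
    """Linear model: score(a) = w . f(src, tgt, a).

    Each feature function maps (src, tgt, links) to a real value;
    knowledge sources are added simply by appending feature functions.
    """
    return sum(w * f(src, tgt, links) for w, f in zip(weights, features))

# two illustrative features: number of links, number of identical-word links
features = [
    lambda s, t, a: len(a),
    lambda s, t, a: sum(1 for i, j in a if s[i] == t[j]),
]
print(linear_score(["la", "EU"], ["the", "EU"], {(0, 0), (1, 1)},
                   [0.5, 2.0], features))  # 0.5*2 + 2.0*1 = 3.0
```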


2012 ◽  
Vol 97 (1) ◽  
pp. 43-53
Author(s):  
Patrik Lambert ◽  
Rafael Banchs

BIA: a Discriminative Phrase Alignment Toolkit. In most statistical machine translation systems, bilingual segments are extracted via word alignment. However, word alignment is performed independently of the requirements of the machine translation task. Furthermore, although phrase-based translation models replaced word-based translation models nearly ten years ago, word-based models are still widely used for word alignment. In this paper we present the BIA (BIlingual Aligner) toolkit, a suite consisting of a discriminative phrase-based word alignment decoder based on linear alignment models, along with training and tuning tools. In the training phase, relative link probabilities are calculated from an initial alignment. The model weights may be tuned directly against machine translation metrics. We give implementation details and report results of experiments conducted on the Spanish–English Europarl task (with three corpus sizes), the Chinese–English FBIS task, and the Chinese–English BTEC task. The BLEU score obtained with BIA alignment is always as good as or better than that obtained with the initial alignment used to train the BIA models. In addition, in four out of five tasks, the BIA toolkit yields the best BLEU score among a collection of ten alignment systems. Finally, usage guidelines are presented.
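The training step, estimating relative link probabilities from an initial alignment, amounts to relative-frequency counting over the initial links. A minimal sketch of that idea (our own simplification, not the BIA code):

```python
from collections import Counter

def relative_link_probs(aligned_corpus):
    """Estimate p(t | s) by relative frequency over initial alignment links.

    aligned_corpus: list of (src_words, tgt_words, links) triples,
    where links is a set of (src_idx, tgt_idx).
    """
    link_counts = Counter()
    src_counts = Counter()
    for src, tgt, links in aligned_corpus:
        for i, j in links:
            link_counts[(src[i], tgt[j])] += 1
            src_counts[src[i]] += 1
    return {pair: c / src_counts[pair[0]] for pair, c in link_counts.items()}

corpus = [
    (["la", "casa"], ["the", "house"], {(0, 0), (1, 1)}),
    (["la", "UE"], ["the", "EU"], {(0, 0), (1, 1)}),
]
print(relative_link_probs(corpus)[("la", "the")])  # 1.0
```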


2010 ◽  
Vol 36 (2) ◽  
pp. 247-277 ◽  
Author(s):  
Wei Wang ◽  
Jonathan May ◽  
Kevin Knight ◽  
Daniel Marcu

This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.
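The re-structuring idea, changing parse-tree structure so substructures can be reused, is related to tree binarization. A minimal right-binarization sketch (a generic illustration under our own "-bar" virtual-label convention, not the article's EM-based method):

```python
def binarize(label, children):
    """Right-binarize an n-ary node, introducing virtual '-bar' labels,
    so that more substructures become shareable during rule extraction."""
    if len(children) <= 2:
        return (label, children)
    head, *rest = children
    return (label, [head, binarize(label + "-bar", rest)])

# NP -> DT JJ NN becomes NP -> DT NP-bar, NP-bar -> JJ NN
print(binarize("NP", ["DT", "JJ", "NN"]))
```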


2016 ◽  
Vol 106 (1) ◽  
pp. 125-146 ◽  
Author(s):  
Robert Östling ◽  
Jörg Tiedemann

Abstract We present EFMARAL, a new system for efficient and accurate word alignment using a Bayesian model with Markov Chain Monte Carlo (MCMC) inference. Through careful selection of data structures and model architecture we are able to surpass the fast_align system, commonly used for performance-critical word alignment, both in computational efficiency and alignment accuracy. Our evaluation shows that a phrase-based statistical machine translation (SMT) system produces translations of higher quality when using word alignments from EFMARAL than from fast_align, and that translation quality is on par with what is obtained using GIZA++, a tool requiring orders of magnitude more processing time. More generally we hope to convince the reader that Monte Carlo sampling, rather than being viewed as a slow method of last resort, should actually be the method of choice for the SMT practitioner and others interested in word alignment.
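The flavor of MCMC word alignment can be conveyed with a collapsed Gibbs sampler for an IBM-Model-1-style model: each target word links to one source word, and links are resampled one at a time from counts with add-alpha smoothing. This is a deliberately minimal sketch of the general technique, far simpler than EFMARAL's actual model.

```python
import random
from collections import Counter

def gibbs_align(corpus, iters=200, alpha=0.1, seed=0):
    """Collapsed Gibbs sampling for a toy IBM-1-style alignment model.

    corpus: list of (src_words, tgt_words) pairs.
    Returns one alignment (list of source indices) per sentence pair.
    """
    rng = random.Random(seed)
    counts = Counter()
    align = []
    for src, tgt in corpus:                      # random initialization
        a = [rng.randrange(len(src)) for _ in tgt]
        align.append(a)
        for j, i in enumerate(a):
            counts[(src[i], tgt[j])] += 1
    for _ in range(iters):
        for (src, tgt), a in zip(corpus, align):
            for j, t in enumerate(tgt):
                counts[(src[a[j]], t)] -= 1       # remove current link
                weights = [counts[(s, t)] + alpha for s in src]
                r = rng.random() * sum(weights)   # sample a new link
                for i, w in enumerate(weights):
                    r -= w
                    if r <= 0:
                        break
                a[j] = i
                counts[(src[i], t)] += 1          # add the sampled link
    return align

corpus = [(["la", "casa"], ["the", "house"]), (["la"], ["the"])]
print(gibbs_align(corpus, iters=50))
```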


2017 ◽  
Vol 37 (5) ◽  
pp. 307
Author(s):  
Karunesh Kumar Arora ◽  
Shyam Sunder Agrawal

Machine translation has great potential to expand the audience for ever-increasing digital collections. The success of data-driven machine translation systems is governed by the volume of parallel data on which they are modelled. For languages that lack such resources in large quantities, optimum utilisation can only be assured through data quality. A morphologically rich language like Hindi poses a further challenge, owing to the larger number of orthographic inflections for a given word and the presence of non-standard word spellings in the corpus. This increases the number of words that are unseen in the training corpus. In this paper, the objective is to reduce the redundancy of the available corpus and to utilise other resources as well, making the best use of what is available. The number of words unseen by the translation model is reduced through text noise removal, spell normalisation, and the use of English WordNet (EWN). The test case presented here is the English–Hindi language pair. The results achieved are promising and set an example for other morphologically rich languages in optimising resources to improve translation performance.
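The noise-removal and spell-normalisation steps can be sketched as a lookup against the training vocabulary backed by a spelling-variant table. The variant table and the English example below are hypothetical illustrations; the paper's pipeline targets Hindi spellings and EWN.

```python
import re

def normalise(token, vocab, variants):
    """Map a noisy or variant spelling onto a form seen in training.

    vocab: set of training-vocabulary words
    variants: hypothetical spelling-variant table {variant: canonical}
    """
    token = re.sub(r"[^\w]+", "", token.lower())  # strip punctuation noise
    if token in vocab:
        return token
    return variants.get(token, token)

print(normalise("Colour!!", {"color", "house"}, {"colour": "color"}))  # color
```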


2010 ◽  
Vol 36 (3) ◽  
pp. 295-302 ◽  
Author(s):  
Sujith Ravi ◽  
Kevin Knight

Word alignment is a critical procedure within statistical machine translation (SMT). Brown et al. (1993) have provided the most popular word alignment algorithm to date, one that has been implemented in the GIZA (Al-Onaizan et al., 1999) and GIZA++ (Och and Ney 2003) software and adopted by nearly every SMT project. In this article, we investigate whether this algorithm makes search errors when it computes Viterbi alignments, that is, whether it returns alignments that are sub-optimal according to a trained model.
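The notion of a search error can be made concrete with a toy example: GIZA++-style hill climbing changes one link at a time, so it can get stuck at a local optimum that exhaustive search would avoid. The scoring function below is a contrived illustration, not an IBM model.

```python
import itertools

def viterbi_exact(src, tgt, score):
    """Brute-force the highest-scoring alignment (each target word links
    to one source word). Exponential, but exact."""
    best, best_s = None, float("-inf")
    for a in itertools.product(range(len(src)), repeat=len(tgt)):
        s = score(a)
        if s > best_s:
            best, best_s = list(a), s
    return best, best_s

def hill_climb(src, tgt, score, start):
    """Local search in the style of GIZA++: move one link at a time while
    it helps. May return a local optimum, i.e. a search error."""
    a = list(start)
    improved = True
    while improved:
        improved = False
        for j in range(len(tgt)):
            for i in range(len(src)):
                cand = a[:]
                cand[j] = i
                if score(cand) > score(a):
                    a, improved = cand, True
    return a, score(a)

# a score with an interaction term that single-link moves cannot cross
score = lambda a: {(1, 1): 10.0, (0, 0): 5.0}.get(tuple(a), 0.0)
src, tgt = ["s0", "s1"], ["t0", "t1"]
print(viterbi_exact(src, tgt, score))       # ([1, 1], 10.0)
print(hill_climb(src, tgt, score, [0, 0]))  # ([0, 0], 5.0) -- a search error
```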

