Improving syntactic rule extraction through deleting spurious links with translation span alignment

2013 ◽  
Vol 21 (2) ◽  
pp. 227-249 ◽  
Author(s):  
JINGBO ZHU ◽  
QIANG LI ◽  
TONG XIAO

Most statistical machine translation systems rely on word alignments to extract translation rules. This approach suffers from a practical problem: even one spurious word alignment link can prevent desirable translation rules from being extracted. To address this issue, this paper presents two approaches, referred to as sub-tree alignment and phrase-based forced decoding, for automatically learning translation span alignments from parallel data. We then improve translation rule extraction by deleting spurious links and inserting new links based on bilingual translation span correspondences. Comparison experiments demonstrate the effectiveness of the proposed approaches.
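The core operation of the paper's link-deletion step can be illustrated with a minimal sketch (our own toy code, not the authors' implementation): given a word alignment and a set of bilingual translation span correspondences, keep only the links that fall entirely inside some aligned span pair. Span boundaries here are inclusive word indices.

```python
def prune_links(links, span_pairs):
    """Keep a link (i, j) only if some aligned span pair covers both ends.

    links: set of (src_idx, tgt_idx) word alignment links
    span_pairs: list of ((s_lo, s_hi), (t_lo, t_hi)) inclusive spans
    """
    kept = set()
    for i, j in links:
        for (s_lo, s_hi), (t_lo, t_hi) in span_pairs:
            if s_lo <= i <= s_hi and t_lo <= j <= t_hi:
                kept.add((i, j))
                break
    return kept

links = {(0, 0), (1, 1), (1, 3)}              # (1, 3) is a spurious link
spans = [((0, 0), (0, 0)), ((1, 2), (1, 2))]  # learned span correspondences
print(prune_links(links, spans))              # keeps (0, 0) and (1, 1); drops (1, 3)
```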

2013 ◽  
Vol 48 ◽  
pp. 733-782 ◽  
Author(s):  
T. Xiao ◽  
J. Zhu

This article presents a probabilistic sub-tree alignment model and its application to tree-to-tree machine translation. Unlike previous work, we do not resort to surface heuristics or expensive annotated data, but instead derive an unsupervised model to infer the syntactic correspondence between two languages. More importantly, the developed model is syntactically motivated and does not rely on word alignments. As a by-product, our model outputs a sub-tree alignment matrix encoding a large number of diverse alignments between syntactic structures, from which machine translation systems can efficiently extract translation rules that are often filtered out due to errors in the 1-best alignment. Experimental results show that the proposed approach outperforms three state-of-the-art baseline approaches in both alignment accuracy and grammar quality. When applied to machine translation, our approach yields a +1.0 BLEU improvement and a 0.9-point TER reduction on the NIST machine translation evaluation corpora. With tree binarization and fuzzy decoding, it even outperforms a state-of-the-art hierarchical phrase-based system.
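One way such an alignment matrix might be consumed downstream is to read out all node pairs whose posterior clears a threshold, rather than committing to a single 1-best alignment. A minimal sketch under that assumption (the thresholding rule is our own illustration, not the paper's method):

```python
def aligned_node_pairs(posterior, threshold=0.5):
    """Extract sub-tree node pairs whose alignment posterior clears a
    threshold, yielding multiple diverse alignments instead of one 1-best."""
    pairs = []
    for s, row in enumerate(posterior):
        for t, p in enumerate(row):
            if p >= threshold:
                pairs.append((s, t, p))
    return pairs

# rows = source tree nodes, columns = target tree nodes
P = [[0.9, 0.1],
     [0.2, 0.7]]
print(aligned_node_pairs(P))  # [(0, 0, 0.9), (1, 1, 0.7)]
```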


2016 ◽  
Vol 22 (4) ◽  
pp. 549-573 ◽  
Author(s):  
SANJIKA HEWAVITHARANA ◽  
STEPHAN VOGEL

Mining parallel data from comparable corpora is a promising approach to overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach designed to align only the parallel sections of a sentence, bypassing the non-parallel sections. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches on a manually aligned data set and show that the proposed approach outperforms the other two. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, where they yield improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms that extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.
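Baseline (1), standard phrase extraction from a word alignment, can be sketched as follows: a source span and the target span its links project to form a phrase pair only if no link crosses the pair's boundaries (the usual consistency check). This is a generic illustration of that baseline, not the paper's code.

```python
def extract_phrases(n_src, links, max_len=4):
    """Standard consistency-based phrase extraction from one sentence pair.

    n_src: source sentence length
    links: set of (src_idx, tgt_idx) word alignment links
    """
    pairs = []
    for s1 in range(n_src):
        for s2 in range(s1, min(s1 + max_len, n_src)):
            tgt = [j for (i, j) in links if s1 <= i <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            # consistent only if every link into [t1, t2] starts inside [s1, s2]
            if all(s1 <= i <= s2 for (i, j) in links if t1 <= j <= t2):
                pairs.append(((s1, s2), (t1, t2)))
    return pairs

# toy 2x2 sentence pair with crossing links
print(extract_phrases(2, {(0, 1), (1, 0)}))
```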


2010 ◽  
Vol 36 (3) ◽  
pp. 303-339 ◽  
Author(s):  
Yang Liu ◽  
Qun Liu ◽  
Shouxun Lin

Word alignment plays an important role in many NLP tasks as it indicates the correspondence between words in a parallel text. Although widely used to align large bilingual corpora, generative models are hard to extend to incorporate arbitrary useful linguistic information. This article presents a discriminative framework for word alignment based on a linear model. Within this framework, all knowledge sources are treated as feature functions, which depend on a source language sentence, a target language sentence, and the alignment between them. We describe a number of features that could produce symmetric alignments. Our model is easy to extend and can be optimized with respect to evaluation metrics directly. The model achieves state-of-the-art alignment quality on three word alignment shared tasks for five language pairs with varying divergence and richness of resources. We further show that our approach improves translation performance for various statistical machine translation systems.
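The framework's central idea, scoring an alignment as a weighted sum of feature functions over the source sentence, target sentence, and alignment, can be sketched as follows. The two toy features are our own illustrations, not features from the article.

```python
def linear_score(src, tgt, links, weights, features):
    """Linear model: score(a) = w . f(src, tgt, a).

    Each feature function maps (src, tgt, links) to a real value;
    knowledge sources are added simply by appending feature functions.
    """
    return sum(w * f(src, tgt, links) for w, f in zip(weights, features))

# two illustrative features: number of links, number of identical-word links
features = [
    lambda s, t, a: len(a),
    lambda s, t, a: sum(1 for i, j in a if s[i] == t[j]),
]
print(linear_score(["la", "EU"], ["the", "EU"], {(0, 0), (1, 1)},
                   [0.5, 2.0], features))  # 0.5*2 + 2.0*1 = 3.0
```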


2012 ◽  
Vol 97 (1) ◽  
pp. 43-53
Author(s):  
Patrik Lambert ◽  
Rafael Banchs

BIA: a Discriminative Phrase Alignment Toolkit. In most statistical machine translation systems, bilingual segments are extracted via word alignment. However, word alignment is performed independently of the requirements of the machine translation task. Furthermore, although phrase-based translation models replaced word-based translation models nearly ten years ago, word-based models are still widely used for word alignment. In this paper we present the BIA (BIlingual Aligner) toolkit, a suite consisting of a discriminative phrase-based word alignment decoder based on linear alignment models, along with training and tuning tools. In the training phase, relative link probabilities are calculated from an initial alignment. The model weights may be tuned directly against machine translation metrics. We give implementation details and report results of experiments conducted on the Spanish–English Europarl task (with three corpus sizes), the Chinese–English FBIS task, and the Chinese–English BTEC task. The BLEU score obtained with BIA alignment is always as good as or better than that obtained with the initial alignment used to train the BIA models. In addition, in four out of five tasks, the BIA toolkit yields the best BLEU score among a collection of ten alignment systems. Finally, usage guidelines are presented.
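The training step, estimating relative link probabilities from an initial alignment, amounts to relative-frequency counting over the initial links. A minimal sketch of that idea (our own simplification, not the BIA code):

```python
from collections import Counter

def relative_link_probs(aligned_corpus):
    """Estimate p(t | s) by relative frequency over initial alignment links.

    aligned_corpus: list of (src_words, tgt_words, links) triples,
    where links is a set of (src_idx, tgt_idx).
    """
    link_counts = Counter()
    src_counts = Counter()
    for src, tgt, links in aligned_corpus:
        for i, j in links:
            link_counts[(src[i], tgt[j])] += 1
            src_counts[src[i]] += 1
    return {pair: c / src_counts[pair[0]] for pair, c in link_counts.items()}

corpus = [
    (["la", "casa"], ["the", "house"], {(0, 0), (1, 1)}),
    (["la", "UE"], ["the", "EU"], {(0, 0), (1, 1)}),
]
print(relative_link_probs(corpus)[("la", "the")])  # 1.0
```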


2010 ◽  
Vol 36 (2) ◽  
pp. 247-277 ◽  
Author(s):  
Wei Wang ◽  
Jonathan May ◽  
Kevin Knight ◽  
Daniel Marcu

This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.
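The re-structuring idea, changing parse-tree structure so substructures can be reused, is related to tree binarization. A minimal right-binarization sketch (a generic illustration under our own "-bar" virtual-label convention, not the article's EM-based method):

```python
def binarize(label, children):
    """Right-binarize an n-ary node, introducing virtual '-bar' labels,
    so that more substructures become shareable during rule extraction."""
    if len(children) <= 2:
        return (label, children)
    head, *rest = children
    return (label, [head, binarize(label + "-bar", rest)])

# NP -> DT JJ NN becomes NP -> DT NP-bar, NP-bar -> JJ NN
print(binarize("NP", ["DT", "JJ", "NN"]))
```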


2016 ◽  
Vol 106 (1) ◽  
pp. 125-146 ◽  
Author(s):  
Robert Östling ◽  
Jörg Tiedemann

Abstract We present EFMARAL, a new system for efficient and accurate word alignment using a Bayesian model with Markov Chain Monte Carlo (MCMC) inference. Through careful selection of data structures and model architecture we are able to surpass the fast_align system, commonly used for performance-critical word alignment, both in computational efficiency and alignment accuracy. Our evaluation shows that a phrase-based statistical machine translation (SMT) system produces translations of higher quality when using word alignments from EFMARAL than from fast_align, and that translation quality is on par with what is obtained using GIZA++, a tool requiring orders of magnitude more processing time. More generally we hope to convince the reader that Monte Carlo sampling, rather than being viewed as a slow method of last resort, should actually be the method of choice for the SMT practitioner and others interested in word alignment.
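The flavor of MCMC word alignment can be conveyed with a collapsed Gibbs sampler for an IBM-Model-1-style model: each target word links to one source word, and links are resampled one at a time from counts with add-alpha smoothing. This is a deliberately minimal sketch of the general technique, far simpler than EFMARAL's actual model.

```python
import random
from collections import Counter

def gibbs_align(corpus, iters=200, alpha=0.1, seed=0):
    """Collapsed Gibbs sampling for a toy IBM-1-style alignment model.

    corpus: list of (src_words, tgt_words) pairs.
    Returns one alignment (list of source indices) per sentence pair.
    """
    rng = random.Random(seed)
    counts = Counter()
    align = []
    for src, tgt in corpus:                      # random initialization
        a = [rng.randrange(len(src)) for _ in tgt]
        align.append(a)
        for j, i in enumerate(a):
            counts[(src[i], tgt[j])] += 1
    for _ in range(iters):
        for (src, tgt), a in zip(corpus, align):
            for j, t in enumerate(tgt):
                counts[(src[a[j]], t)] -= 1       # remove current link
                weights = [counts[(s, t)] + alpha for s in src]
                r = rng.random() * sum(weights)   # sample a new link
                for i, w in enumerate(weights):
                    r -= w
                    if r <= 0:
                        break
                a[j] = i
                counts[(src[i], t)] += 1          # add the sampled link
    return align

corpus = [(["la", "casa"], ["the", "house"]), (["la"], ["the"])]
print(gibbs_align(corpus, iters=50))
```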


2017 ◽  
Vol 37 (5) ◽  
pp. 307
Author(s):  
Karunesh Kumar Arora ◽  
Shyam Sunder Agrawal

Machine translation has great potential to expand the audience for ever-increasing digital collections. The success of data-driven machine translation systems is governed by the volume of parallel data on which they are modelled. For languages that lack such resources in large quantities, optimum utilisation can only be assured through data quality. A morphologically rich language like Hindi poses a further challenge, owing to the larger number of orthographic inflections for a given word and the presence of non-standard word spellings in the corpus. This increases the number of words that are unseen in the training corpus. In this paper, the objective is to reduce the redundancy of the available corpus and to utilise other resources as well, making the best use of what is available. The number of words unseen by the translation model is reduced through text noise removal, spell normalisation, and the use of English WordNet (EWN). The test case presented here is the English–Hindi language pair. The results achieved are promising and set an example for other morphologically rich languages in optimising resources to improve translation performance.
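The noise-removal and spell-normalisation steps can be sketched as a lookup against the training vocabulary backed by a spelling-variant table. The variant table and the English example below are hypothetical illustrations; the paper's pipeline targets Hindi spellings and EWN.

```python
import re

def normalise(token, vocab, variants):
    """Map a noisy or variant spelling onto a form seen in training.

    vocab: set of training-vocabulary words
    variants: hypothetical spelling-variant table {variant: canonical}
    """
    token = re.sub(r"[^\w]+", "", token.lower())  # strip punctuation noise
    if token in vocab:
        return token
    return variants.get(token, token)

print(normalise("Colour!!", {"color", "house"}, {"colour": "color"}))  # color
```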


2010 ◽  
Vol 36 (3) ◽  
pp. 295-302 ◽  
Author(s):  
Sujith Ravi ◽  
Kevin Knight

Word alignment is a critical procedure within statistical machine translation (SMT). Brown et al. (1993) have provided the most popular word alignment algorithm to date, one that has been implemented in the GIZA (Al-Onaizan et al., 1999) and GIZA++ (Och and Ney 2003) software and adopted by nearly every SMT project. In this article, we investigate whether this algorithm makes search errors when it computes Viterbi alignments, that is, whether it returns alignments that are sub-optimal according to a trained model.
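The notion of a search error can be made concrete with a toy example: GIZA++-style hill climbing changes one link at a time, so it can get stuck at a local optimum that exhaustive search would avoid. The scoring function below is a contrived illustration, not an IBM model.

```python
import itertools

def viterbi_exact(src, tgt, score):
    """Brute-force the highest-scoring alignment (each target word links
    to one source word). Exponential, but exact."""
    best, best_s = None, float("-inf")
    for a in itertools.product(range(len(src)), repeat=len(tgt)):
        s = score(a)
        if s > best_s:
            best, best_s = list(a), s
    return best, best_s

def hill_climb(src, tgt, score, start):
    """Local search in the style of GIZA++: move one link at a time while
    it helps. May return a local optimum, i.e. a search error."""
    a = list(start)
    improved = True
    while improved:
        improved = False
        for j in range(len(tgt)):
            for i in range(len(src)):
                cand = a[:]
                cand[j] = i
                if score(cand) > score(a):
                    a, improved = cand, True
    return a, score(a)

# a score with an interaction term that single-link moves cannot cross
score = lambda a: {(1, 1): 10.0, (0, 0): 5.0}.get(tuple(a), 0.0)
src, tgt = ["s0", "s1"], ["t0", "t1"]
print(viterbi_exact(src, tgt, score))       # ([1, 1], 10.0)
print(hill_climb(src, tgt, score, [0, 0]))  # ([0, 0], 5.0) -- a search error
```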

