Evaluating a Pivot-Based Approach for Bilingual Lexicon Extraction

A pivot-based approach for bilingual lexicon extraction is based on the similarity of context vectors represented by words in a pivot language like English. In this paper, in order to show validity and usability of the pivot-based approach, we evaluate the approach in company with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word association between source words (resp., target words) and pivot words and the other estimates them from two parallel corpora based on word alignment tools for statistical machine translation. Empirical results on two language pairs (e.g., Korean-Spanish and Korean-French) have shown that the pivot-based approach is very promising for resource-poor languages and this approach observes its validity and usability. Furthermore, for words with low frequency, our method is also well performed.

Download Full-text

End-to-end statistical machine translation with zero or small parallel texts

Natural Language Engineering ◽

10.1017/s1351324916000127 ◽

2016 ◽

Vol 22 (4) ◽

pp. 517-548 ◽

Cited By ~ 5

Author(s):

ANN IRVINE ◽

CHRIS CALLISON-BURCH

Keyword(s):

Detailed Analysis ◽

Machine Translation ◽

Statistical Machine Translation ◽

Parallel Corpora ◽

Low Resource ◽

Bilingual Lexicon ◽

Orthographic Similarity ◽

Discriminative Model ◽

End To End ◽

Parallel Texts

AbstractWe use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually estimated features enhance low resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.

Download Full-text

Parallel Treebanking Spanish-Quechua

Linguistic Issues in Language Technology ◽

10.33011/lilt.v7i.1285 ◽

2012 ◽

Vol 7 ◽

Author(s):

Annette Rios ◽

Anne Göhring ◽

Martin Volk

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Word Segmentation ◽

Word Alignment ◽

Alignment Quality ◽

Important Prerequisite ◽

Bilingual Lexicon ◽

Preliminary Work ◽

First Impression ◽

Agglutinative Language

Parallel treebanking is greatly facilitated by automatic word alignment. We work on building a trilingual treebank for German, Spanish and Quechua. We ran different alignment experiments on parallel Spanish-Quechua texts, measured the alignment quality, and compared these results to the figures we obtained aligning a comparable corpus of Spanish-German texts. This preliminary work has shown us the best word segmentation to use for the agglutinative language Quechua with respect to alignment. We also acquired a first impression about how well Quechua can be aligned to Spanish, an important prerequisite for bilingual lexicon extraction, parallel treebanking or statistical machine translation.

Download Full-text

Translational equivalence in Statistical Machine Translation or meaning as co-occurrence

Linguistica Antverpiensia, New Series – Themes in Translation Studies ◽

10.52034/lanstts.v7i.215 ◽

2021 ◽

Vol 7 ◽

Author(s):

Lieve Macken ◽

Els Lefever

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Word Sense Disambiguation ◽

Statistical Machine Translation ◽

General Purpose ◽

Point Of View ◽

Word Alignment ◽

Word Sense ◽

Parallel Corpora ◽

Current State

In this paper, we will describe the current state-of-the-art of Statistical Machine Translation (SMT), and reflect on how SMT handles meaning. Statistical Machine Translation is a corpus-based approach to MT: it de-rives the required knowledge to generate new translations from corpora. General-purpose SMT systems do not use any formal semantic representa-tion. Instead, they directly extract translationally equivalent words or word sequences – expressions with the same meaning – from bilingual parallel corpora. All statistical translation models are based on the idea of word alignment, i.e., the automatic linking of corresponding words in parallel texts. The first generation SMT systems were word-based. From a linguistic point of view, the major problem with word-based systems is that the mean-ing of a word is often ambiguous, and is determined by its context. Current state-of-the-art SMT-systems try to capture the local contextual dependen-cies by using phrases instead of words as units of translation. In order to solve more complex ambiguity problems (where a broader text scope or even domain information is needed), a Word Sense Disambiguation (WSD) module is integrated in the Machine Translation environment.

Download Full-text

Does GIZA++ Make Search Errors?

Computational Linguistics ◽

10.1162/coli_a_00008 ◽

2010 ◽

Vol 36 (3) ◽

pp. 295-302 ◽

Cited By ~ 2

Author(s):

Sujith Ravi ◽

Kevin Knight

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Word Alignment ◽

Alignment Algorithm

Word alignment is a critical procedure within statistical machine translation (SMT). Brown et al. (1993) have provided the most popular word alignment algorithm to date, one that has been implemented in the GIZA (Al-Onaizan et al., 1999) and GIZA++ (Och and Ney 2003) software and adopted by nearly every SMT project. In this article, we investigate whether this algorithm makes search errors when it computes Viterbi alignments, that is, whether it returns alignments that are sub-optimal according to a trained model.

Download Full-text

Refining Kazakh Word Alignment Using Simulation Modeling Methods for Statistical Machine Translation

Natural Language Processing and Chinese Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-319-25207-0_38 ◽

2015 ◽

pp. 421-427

Author(s):

Amandyk Kartbayev

Keyword(s):

Machine Translation ◽

Simulation Modeling ◽

Statistical Machine Translation ◽

Word Alignment ◽

Modeling Methods

Download Full-text

Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia

International Journal of E-Adoption ◽

10.4018/ijea.2020010104 ◽

2020 ◽

Vol 12 (1) ◽

pp. 42-51

Author(s):

Vishal Goyal ◽

Ajit Kumar ◽

Manpreet Singh Lehal

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Parallel Corpora ◽

Comparable Corpora ◽

Potential Source ◽

Parallel Data

Comparable corpora come as an alternative to parallel corpora for the languages where the parallel corpora is scarce. The efficiency of the models trained on comparable corpora is comparatively less to that of the parallel corpora however it helps to compensate much to the machine translation. In this article, the authors have explored Wikipedia as a potential source and delineated the process of alignment of documents which will be further used for the extraction of parallel data. The parallel data thus extracted will help to enhance the performance of Statistical Machine translation.

Download Full-text

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics ◽

10.1162/089120105775299168 ◽

2005 ◽

Vol 31 (4) ◽

pp. 477-504 ◽

Cited By ~ 104

Author(s):

Dragos Stefan Munteanu ◽

Daniel Marcu

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Translation System ◽

Parallel Corpora ◽

Parallel Corpus ◽

Scarce Resources ◽

Parallel Data ◽

Machine Translation System ◽

Novel Method ◽

Arabic And English

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.

Download Full-text

Bayesian Word Alignment and Phrase Table Training for Statistical Machine Translation

IEICE Transactions on Information and Systems ◽

10.1587/transinf.e96.d.1536 ◽

2013 ◽

Vol E96.D (7) ◽

pp. 1536-1543

Author(s):

Zezhong LI ◽

Hideto IKEDA ◽

Junichi FUKUMOTO

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Word Alignment

Download Full-text

Multilingual Dependency Parsing: Using Machine Translated Texts Instead of Parallel Corpora

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0017 ◽

2014 ◽

Vol 102 (1) ◽

pp. 93-104

Author(s):

Ramasamy Loganathan ◽

Mareček David ◽

Žabokrtský Zdenčk

Keyword(s):

Machine Translation ◽

The Other ◽

Target Language ◽

Grammar Induction ◽

Language Resources ◽

Parallel Corpora ◽

Similar Performance ◽

Part Of Speech ◽

Target Languages ◽

Cross Lingual

Abstract This paper revisits the projection-based approach to dependency grammar induction task. Traditional cross-lingual dependency induction tasks one way or the other, depend on the existence of bitexts or target language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using only approximate resources, i.e., machine translated bitexts instead of manually created bitexts. We do this by obtaining the the source side of the text from a machine translation (MT) system and then apply transfer approaches to induce parser for the target languages. We further reduce the need for the availability of labeled target language resources by using unsupervised target tagger. We show that our approach consistently outperforms unsupervised parsers by a bigger margin (8.2% absolute), and results in similar performance when compared with delexicalized transfer parsers.

Download Full-text

Predicting and Using a Pragmatic Component of Lexical Aspect

Linguistic Issues in Language Technology ◽

10.33011/lilt.v13i.1389 ◽

2016 ◽

Vol 13 ◽

Author(s):

Sharid Loáiciga ◽

Cristina Grisot

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Target Language ◽

Translation System ◽

Automatic Annotation ◽

Parallel Corpora ◽

Linguistic Data ◽

Lexical Aspect ◽

Machine Translation System ◽

Automatic Systems

This paper proposes a method for improving the results of a statistical Machine Translation system using boundedness, a pragmatic component of the verbal phrase’s lexical aspect. First, the paper presents manual and automatic annotation experiments for lexical aspect in English-French parallel corpora. It will be shown that this aspectual property is identified and classified with ease both by humans and by automatic systems. Second, Statistical Machine Translation experiments using the boundedness annotations are presented. These experiments show that the information regarding lexical aspect is useful to improve the output of a Machine Translation system in terms of better choices of verbal tenses in the target language, as well as better lexical choices. Ultimately, this work aims at providing a method for the automatic annotation of data with boundedness information and at contributing to Machine Translation by taking into account linguistic data.

Download Full-text