Further Experiments in Bilingual Text Alignment

1998 ◽  
Vol 3 (1) ◽  
pp. 115-150 ◽  
Author(s):  
Harold Somers

We describe and experimentally evaluate an alternative algorithm for aligning and extracting vocabulary from parallel texts using recency vectors and a similarity measure based on Levenshtein distance. The work is largely inspired by Fung and McKeown's DK-vec, though we use a simpler algorithm. The technique is tested on two sets of parallel corpora involving English, French, German, Dutch, Spanish, and Japanese. We attempt to evaluate the importance of parameters such as frequency of words chosen as candidates, the effect of different language pairings, and differences between the two corpora.
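As a rough illustration of the idea in this abstract, the sketch below builds a recency vector for a word (the gaps between its successive positions in a text) and compares two such vectors with plain Levenshtein distance. The function names, the exact-match substitution cost, and the normalisation to a 0..1 similarity are assumptions made for the example; the abstract does not specify these details, so this should not be read as the paper's algorithm.

```python
def recency_vector(tokens, word):
    """Gaps between successive occurrences of `word` in `tokens`."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def levenshtein(a, b):
    """Standard edit distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1  # assumption: gaps must match exactly
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 means identical recency vectors."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

In use, one would compute a recency vector for each candidate word in each language and rank cross-language word pairs by `similarity`; how candidates are chosen (for example by frequency) is one of the parameters the abstract says is evaluated.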

2020 ◽  
Vol 56 (4) ◽  
pp. 629-650
Author(s):  
Filip Graliński ◽  
Krzysztof Jassem

The paper describes a method for finding diachronic spelling variants in a corpus that consists of historical and modern Polish texts. The procedure applies the Levenshtein distance and the similarity measure determined with a Word2vec model. The method was applied to both words and sub-word units. A sample of spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool and their contribution was evaluated. The paper also presents an analogous method for finding spelling variants that result from erroneous OCR. The obtained lists of OCR variants and rules may serve for the correction of OCR output.


2016 ◽  
Vol 22 (4) ◽  
pp. 517-548 ◽  
Author(s):  
ANN IRVINE ◽  
CHRIS CALLISON-BURCH

We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually estimated features enhance low-resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.


2019 ◽  
Vol 26 (2) ◽  
pp. 163-182 ◽  
Author(s):  
Serge Sharoff

Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This paper explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. The paper introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure which is based on the weighted Levenshtein distance. One of the outcomes of the experiments is a Panslavonic embedding space for nine Balto-Slavonic languages. The paper demonstrates that the resulting embedding space helps in such applications as morphological prediction, named-entity recognition and genre classification.
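To make the notion of a weighted Levenshtein distance concrete, here is a minimal sketch in which substitutions between certain character pairs cost less than a full edit. The function names and the cost table are hypothetical, chosen only for illustration; the abstract does not say how the weights are obtained.

```python
def weighted_levenshtein(a, b, sub_cost, indel_cost=1.0):
    """Edit distance where substitution cost depends on the character pair."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + indel_cost,             # deletion
                            curr[j - 1] + indel_cost,         # insertion
                            prev[j - 1] + sub_cost(x, y)))    # substitution
        prev = curr
    return prev[-1]


# Hypothetical weights: these vowel alternations are treated as cheap.
CHEAP_PAIRS = {frozenset("oa"): 0.3, frozenset("ie"): 0.3}


def example_sub_cost(x, y):
    """Illustrative cost function: 0 for a match, reduced cost for listed pairs."""
    if x == y:
        return 0.0
    return CHEAP_PAIRS.get(frozenset((x, y)), 1.0)
```

With these illustrative weights, transliterated cognates such as `"moloko"` and `"malako"` come out much closer than they would under unit costs, which is the kind of effect a lexical similarity measure for related languages is after.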


2003 ◽  
Vol 29 (3) ◽  
pp. 381-419 ◽  
Author(s):  
Wessel Kraaij ◽  
Jian-Yun Nie ◽  
Michel Simard

Although more and more language pairs are covered by machine translation (MT) services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag of words. The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this article, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.


1997 ◽  
Vol 2 (1) ◽  
pp. 1-22 ◽  
Author(s):  
Alexander Geyken

This paper addresses the question of the extent to which translations in bilingual parallel corpora match dictionary senses. Automatic matching of corpus translations with dictionary senses depends on the quality of the lexicographic knowledge used, the quality of corpus processing, the impact of statistics to filter relevant entries from the corpora, and finally the quality of the translations in the multilingual corpora. We focus on the influence that the latter variable has on the performance of the automatic matching. As in previous approaches, we relied on Machine Readable Dictionaries (MRDs), a part-of-speech tagger, and bilingual aligned corpora. Additionally, we used a shallow sentence parser for syntactic matching. Two case studies with two different corpora from different domains were conducted. Our test set was the intersection of 500 French communication verbs within the corpora. The results confirm that the performance of the automatic matching varies considerably with the translation quality of the parallel texts.


Author(s):  
EMILIA KUBICKA

The study considers the question of whether (and how) bilingual dictionaries may be improved. The information presented in dictionaries has been confronted with textual reality (i.e., with examples of actual translations), based on the German expression fassungslos and its Polish equivalents in parallel texts. The author assumes that bilingual dictionaries are mainly used by language learners, while professional translators may consider them as one of many possible sources. In teaching, multiplying the possible equivalents or suggesting ad hoc solutions is generally not recommended. Despite the attempts at objectivizing lexicographic descriptions, which are made possible by using language corpora, it often turns out that the decisions made by dictionary authors are (and need to be) arbitrary.


2013 ◽  
Vol 39 (4) ◽  
pp. 999-1023 ◽  
Author(s):  
Gennadi Lembersky ◽  
Noam Ordan ◽  
Shuly Wintner

Translation models used for statistical machine translation are compiled from parallel corpora that are manually translated. The common assumption is that parallel texts are symmetrical: The direction of translation is deemed irrelevant and is consequently ignored. Much research in Translation Studies indicates that the direction of translation matters, however, as translated language (translationese) has many unique properties. It has already been shown that phrase tables constructed from parallel corpora translated in the same direction as the translation task outperform those constructed from corpora translated in the opposite direction. We reconfirm that this is indeed the case, but emphasize the importance of also using texts translated in the “wrong” direction. We take advantage of information pertaining to the direction of translation in constructing phrase tables by adapting the translation model to the special properties of translationese. We explore two adaptation techniques: First, we create a mixture model by interpolating phrase tables trained on texts translated in the “right” and the “wrong” directions. The weights for the interpolation are determined by minimizing perplexity. Second, we define entropy-based measures that estimate the correspondence of target-language phrases to translationese, thereby eliminating the need to annotate the parallel corpus with information pertaining to the direction of translation. We show that incorporating these measures as features in the phrase tables of statistical machine translation systems results in consistent, statistically significant improvement in the quality of the translation.
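The first adaptation technique, a linear mixture whose weight is chosen by minimising perplexity, can be sketched as follows. This is a toy version under stated assumptions: each phrase table is reduced to a dictionary from (source, target) pairs to probabilities, perplexity is measured on a small held-out list of pairs, and the weight is found by grid search. The names `perplexity` and `best_weight`, the probability floor, and the grid search itself are assumptions for illustration, not the procedure described in the paper.

```python
import math


def perplexity(lam, p_right, p_wrong, dev_pairs):
    """Perplexity of the interpolated model on held-out phrase pairs."""
    log_sum = 0.0
    for pair in dev_pairs:
        p = lam * p_right.get(pair, 0.0) + (1.0 - lam) * p_wrong.get(pair, 0.0)
        log_sum += math.log(max(p, 1e-12))  # assumed floor to avoid log(0)
    return math.exp(-log_sum / len(dev_pairs))


def best_weight(p_right, p_wrong, dev_pairs, steps=100):
    """Grid search over lam in [0, 1] for the lowest held-out perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda lam: perplexity(lam, p_right, p_wrong, dev_pairs))
```

A weight near 1 means the held-out data is best explained by the table trained on the "right" direction alone; an intermediate weight indicates that the "wrong"-direction table contributes useful probability mass, which is the point the abstract emphasises.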

