Learning Curve with Machine Translation Based on Parallel, Bilingual Corpora

Author(s):  
Maciej Kowalski
2017 ◽  
Vol 108 (1) ◽  
pp. 283-294 ◽  
Author(s):  
Álvaro Peris ◽  
Mara Chinea-Ríos ◽  
Francisco Casacuberta

Corpora are precious resources, as they allow for a proper estimation of statistical machine translation models. Data selection is a variant of domain adaptation that aims to extract from an out-of-domain corpus those sentences that are most useful for translating a different target domain. We address the data selection problem in statistical machine translation as a classification task. We present a new method, based on neural networks, that is able to deal with monolingual and bilingual corpora. Empirical results show that our data selection method provides slightly better translation quality than a state-of-the-art method (cross-entropy) while requiring substantially less data. Moreover, the results obtained are consistent across different language pairs, demonstrating the robustness of our proposal.


Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1493
Author(s):  
Hanan A. Hosni Mahmoud ◽  
Hanan Abdullah Mengash

In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules that join phrases present in the source language with phrases present in the target language. A sequential scan of the corpus for such phrases would, however, increase the response time unpredictably. We therefore pre-process the bilingual corpus into a data structure called Corpus-Trie (CT), which renders a bilingual parallel corpus as a compact structure representing frequent itemsets. We also present algorithms that use the CT to respond to translation requests, and we explore novel techniques in exhaustive experiments. Experiments were performed on specific language pairs, although the proposed method is not restricted to any specific language. Moreover, the proposed Corpus-Trie can be extended from bilingual corpora to accommodate multi-language corpora. Experiments indicated that the response time of a translation request is logarithmic in the number of distinct phrases in the original bilingual corpus (and thus in the Corpus-Trie size). In practical situations, only 5–20% of the logarithm of the number of nodes has to be visited. The experimental results indicate that the BLEU score of the proposed CT system increases with the number of phrases in the CT, for both English-Arabic and English-French translation. The proposed CT system was shown to outperform both Omega-T and Apertium in translation quality once the corpus size exceeds 1,600,000 phrases for English-Arabic translation and 300,000 phrases for English-French translation.
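The idea of indexing a parallel corpus in a trie so that a translation request avoids a sequential scan can be illustrated with a minimal sketch. This is not the authors' Corpus-Trie: the names `TrieNode`, `PhraseTrie`, `insert`, and `lookup` are hypothetical, the sample phrase pairs are invented for illustration, and the sketch omits the association-rule mining and the frequent itemset compaction described in the abstract. It only shows that a lookup walks one node per source token instead of scanning every phrase pair.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps the next source-language token to the child node.
    children: dict = field(default_factory=dict)
    # Counts how often each target phrase was paired with the source
    # phrase that ends at this node.
    translations: Counter = field(default_factory=Counter)


class PhraseTrie:
    """Toy phrase index (an assumption, not the paper's data structure)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, source_phrase, target_phrase):
        """Record one aligned (source, target) phrase pair."""
        node = self.root
        for token in source_phrase.split():
            node = node.children.setdefault(token, TrieNode())
        node.translations[target_phrase] += 1

    def lookup(self, source_phrase):
        """Return the most frequent target phrase, or None if unseen."""
        node = self.root
        for token in source_phrase.split():
            node = node.children.get(token)
            if node is None:
                return None
        if not node.translations:
            return None
        return node.translations.most_common(1)[0][0]


# Invented English-French phrase pairs, for illustration only.
trie = PhraseTrie()
trie.insert("good morning", "bonjour")
trie.insert("good morning", "bonjour")
trie.insert("good morning", "bon matin")
trie.insert("good night", "bonne nuit")
print(trie.lookup("good morning"))
print(trie.lookup("good evening"))
```

In this sketch the lookup cost depends on the length of the requested phrase, not on the number of stored pairs; the logarithmic response time reported in the abstract is a property of the authors' structure and is not reproduced here.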


Author(s):  
Rajesh. K. S ◽  
Veena A Kumar ◽  
CH. Dayakar Reddy

Word alignment in bilingual corpora has been a very active research topic among machine translation research groups. In this paper, we describe an alignment system that aligns English-Malayalam texts at the word level in parallel sentences. Aligning translated segments with source segments is essential for building parallel corpora. Since word alignment research on Malayalam and English is still in its infancy, the task is not trivial for Malayalam-English text. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other. The main purpose of this system is thus to construct a word-aligned parallel corpus for use in Malayalam-English machine translation. The proposed approach is a hybrid, combining corpus-based and dictionary lookup approaches. The corpus-based approach relies on the first three IBM models and the Expectation Maximization (EM) algorithm. For the dictionary lookup approach, the proposed system uses a bilingual Malayalam-English dictionary.
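The corpus-based component mentioned above can be illustrated with a minimal sketch of EM training for IBM Model 1, the simplest of the IBM models. This is a textbook formulation, not the authors' system: it leaves out the NULL word, Models 2 and 3, and the dictionary lookup step. The function names `train_ibm_model1` and `align` are hypothetical, and the toy English-German corpus is invented for illustration rather than taken from the Malayalam-English data.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate lexical probabilities t[(f, e)] = t(f | e) with EM.

    sentence_pairs is a list of (english_tokens, foreign_tokens) tuples.
    Simplification: no NULL source word is added.
    """
    foreign_vocab = {f for _, fs in sentence_pairs for f in fs}
    uniform = 1.0 / len(foreign_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: distribute each foreign word over the English words
        # of its sentence, in proportion to the current t(f | e).
        for es, fs in sentence_pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts per English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def align(es, fs, t):
    """Link each foreign position j to its most probable English position."""
    return [
        (j, max(range(len(es)), key=lambda i: t[(fs[j], es[i])]))
        for j in range(len(fs))
    ]


# Invented toy corpus, for illustration only.
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_ibm_model1(corpus)
print(align(["the", "house"], ["das", "Haus"], t))
```

A hybrid system of the kind described could, in principle, use a dictionary to override or seed these probabilities; how the authors combine the two sources is not specified in the abstract.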


2015 ◽  
Vol 32 (1) ◽  
pp. 46-90 ◽  
Author(s):  
Víctor M. Sánchez-Cartagena ◽  
Juan Antonio Pérez-Ortiz ◽  
Felipe Sánchez-Martínez

2000 ◽  
Vol 5 (2) ◽  
pp. 199-230 ◽  
Author(s):  
Oliver Streiter ◽  
Leonid L. Iomdin

The research described in this paper is rooted in the endeavors to combine the advantages of corpus-based and rule-based MT approaches in order to improve the performance of MT systems, most importantly the quality of translation. The authors review the ongoing activities in the field and present a case study that shows how translation knowledge can be drawn from parallel corpora and compiled into the lexicon of a rule-based MT system. These data are obtained with the help of three procedures: (1) identification of hitherto unknown one-word translations, (2) statistical rating of the known one-word translations, and (3) extraction of new translations of multiword expressions (MWEs), followed by compilation steps that create new rules for the MT engine. As a result, the lexicon is enriched with translation equivalents attested for different subject domains, which facilitates the tuning of the MT system to a specific subject domain and improves the quality and adequacy of translation.


Author(s):  
SHARANBASAPPA HONNASHETTY ◽  
DR. M. HANUMANTHAPPA

Machine Translation has been a major focus of the NLP group since 1999. The principal focus of the Natural Language Processing group is to build a machine translation system that automatically learns translation mappings from bilingual corpora. This paper explores a novel approach to phrase-based machine translation from English to Kannada and from Kannada to English. The source text is first analyzed; simple sentences are then translated using the rules, while complex sentences are split into simple sentences before translation is performed.


2007 ◽  
Vol 177 (4S) ◽  
pp. 526-527 ◽  
Author(s):  
Michael Esposito ◽  
George Dakwar ◽  
Mutahar Ahmed ◽  
Vincent Lanteri

2006 ◽  
Vol 175 (4S) ◽  
pp. 348-348
Author(s):  
Edward M. Gong ◽  
Albert A. Mikhail ◽  
Alvaro Lucioni ◽  
Marcelo A. Orvieto ◽  
Arieh L. Shalhav ◽  
...  

2004 ◽  
Vol 171 (4S) ◽  
pp. 50-51
Author(s):  
Elan W. Salzhauer ◽  
Mark Horowitz
