Mining Parallel Knowledge from Comparable Patents

Author(s):  
Bin Lu ◽  
Benjamin K. Tsou ◽  
Tao Jiang ◽  
Jingbo Zhu ◽  
Oi Yee Kwong

The extracted parallel sentences and technical terms could be a good basis for further acquisition of term relations and the translation of monolingual ontologies, as well as for statistical machine translation systems and other cross-lingual information access applications.

2015 ◽  
Vol 23 (1) ◽  
pp. 31-51 ◽  
Author(s):  
H. HAKAMI ◽  
D. BOLLEGALA

Abstract: Finding translations for technical terms is an important problem in machine translation. In particular, in highly specialized domains such as biology or medicine, it is difficult to find bilingual experts to annotate sufficient cross-lingual texts in order to train machine translation systems. Moreover, new terms are constantly being generated in the biomedical community, which makes it difficult to keep the translation dictionaries up to date for all language pairs of interest. Given a biomedical term in one language (source language), we propose a method for detecting its translations in a different language (target language). Specifically, we train a binary classifier to determine whether two biomedical terms written in two languages are translations. Training such a classifier is often complicated due to the lack of common features between the source and target languages. We propose several feature space concatenation methods to successfully overcome this problem. Moreover, we study the effectiveness of contextual and character n-gram features for detecting term translations. Experiments conducted using a standard dataset for biomedical term translation show that the proposed method outperforms several competitive baseline methods in terms of mean average precision and top-k translation accuracy.
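
The two evaluation measures named in this abstract, mean average precision and top-k translation accuracy, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the toy rankings, and the choice to normalize average precision by the number of gold translations are all assumptions made for illustration.

```python
def average_precision(ranked, gold):
    """Average precision of one ranked candidate list.

    Assumption: the sum of precisions at each hit is divided by the
    number of gold translations (one common convention among several).
    """
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold) if gold else 0.0


def mean_average_precision(rankings, gold_sets):
    """Mean of the per-term average precisions."""
    scores = [average_precision(r, g) for r, g in zip(rankings, gold_sets)]
    return sum(scores) / len(scores)


def top_k_accuracy(rankings, gold_sets, k):
    """Fraction of source terms with a correct translation in the top k."""
    correct = sum(
        1 for r, g in zip(rankings, gold_sets) if any(c in g for c in r[:k])
    )
    return correct / len(rankings)


# Hypothetical toy data: two source terms, three ranked candidates each.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
gold_sets = [{"b"}, {"x", "z"}]

map_score = mean_average_precision(rankings, gold_sets)
top1 = top_k_accuracy(rankings, gold_sets, k=1)
top2 = top_k_accuracy(rankings, gold_sets, k=2)
```

Here each source term contributes one ranked list of candidate translations, which is how a classifier's confidence scores would typically be turned into a ranking before scoring.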


2016 ◽  
Vol 42 (2) ◽  
pp. 277-306 ◽  
Author(s):  
Pidong Wang ◽  
Preslav Nakov ◽  
Hwee Tou Ng

Most of the world's languages are resource-poor for statistical machine translation; still, many of them are actually related to some resource-rich language. Thus, we propose three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation. Specifically, we build improved statistical machine translation models from a resource-poor language POOR into a target language TGT by adapting and using a large bitext for a related resource-rich language RICH and the same target language TGT. We assume a small POOR–TGT bitext from which we learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language. Our work is of importance for resource-poor machine translation because it can provide a useful guideline for people building machine translation systems for resource-poor languages. Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bitext yields 7.26 BLEU points of improvement over the unadapted one and 3.09 BLEU points over the original small bitext. Moreover, combining the small POOR–TGT bitext with the adapted bitext outperforms the corresponding combinations with the unadapted bitext by 1.93–3.25 BLEU points. We also demonstrate the applicability of our approaches to other languages and domains.


Author(s):  
Michael Carl ◽  
Moritz Jonas Schaeffer

The paper develops a definition of translation literality that is based on the syntactic and semantic similarity of the source and the target texts. We provide theoretical and empirical evidence that absolute literal translations are easy to produce. Based on a multilingual corpus of alternative translations we investigate the effects of cross-lingual syntactic and semantic distance on translation production times and find that non-literality makes from-scratch translation and post-editing difficult. We show that statistical machine translation systems encounter even more difficulties with non-literality.


2018 ◽  
Author(s):  
Benjamin Marie ◽  
Rui Wang ◽  
Atsushi Fujita ◽  
Masao Utiyama ◽  
Eiichiro Sumita

2016 ◽  
Vol 5 (4) ◽  
pp. 51-66 ◽  
Author(s):  
Krzysztof Wolk ◽  
Krzysztof P. Marasek

The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family in which a source sentence is encoded into a fixed length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English Machine Translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training of neural and statistical network-based translation systems. A comparison and implementation of a medical translator is the main focus of our experiments.


2016 ◽  
Vol 22 (4) ◽  
pp. 549-573 ◽  
Author(s):  
SANJIKA HEWAVITHARANA ◽  
STEPHAN VOGEL

Abstract: Mining parallel data from comparable corpora is a promising approach for overcoming the data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach that is designed to align only the parallel sections, bypassing non-parallel sections of the sentence. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, (2) a binary classifier to detect parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two approaches. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, which resulted in improvements of up to 1.2 BLEU over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, (2) evaluating the utility of the extracted phrases by using them directly in the MT decoder.
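
For orientation, the first comparison method in this abstract, standard phrase extraction from a word alignment, can be sketched in simplified form. This is not the authors' novel alignment approach and not a full implementation of the standard algorithm: it keeps only the consistency check (no target word inside the phrase may align to a source word outside it) and omits the usual extension over unaligned boundary words. The function name, the `max_len` parameter, and the toy sentence pair are assumptions for illustration.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    src, tgt: lists of tokens.
    alignment: set of (source_index, target_index) links.
    Simplification: spans are not extended over unaligned words.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: every link touching the target span must
            # originate inside the source span.
            consistent = all(
                i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2
            )
            if consistent:
                pairs.append(
                    (" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1]))
                )
    return pairs


# Hypothetical toy sentence pair with a one-to-one alignment.
src = ["das", "haus"]
tgt = ["the", "house"]
alignment = {(0, 0), (1, 1)}
pairs = extract_phrase_pairs(src, tgt, alignment)
```

On comparable (rather than parallel) sentences, a procedure like this depends on the alignment covering the whole sentence pair, which is the limitation the abstract's proposed approach is designed to avoid.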

