Enhancing cross-language information retrieval by an automatic acquisition of bilingual terminology from comparable corpora

Bilingual machine-readable dictionaries are indispensable resources for cross-language information retrieval and machine translation. Recently, these cross-language activities have begun to focus on specific academic or technological domains. In this paper, we describe a bilingual dictionary acquisition system that extracts translations from non-parallel but comparable corpora in a specific academic domain and disambiguates the extracted translations. The proposed method has two stages. In the first stage, candidate terms are extracted from a Japanese corpus and an English corpus, respectively, and ranked according to their importance as terms. In the second stage, ambiguous translations are resolved by selecting the target-language translation that is nearest in rank to the source-language term. Finally, we evaluate the proposed method in an experiment.
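The two stages described in this abstract can be illustrated with a minimal sketch. The abstract does not say how term importance is measured, so raw frequency is used here purely as a stand-in assumption. The function names `rank_terms` and `pick_translation`, and the toy token lists, are hypothetical and do not come from the paper.

```python
from collections import Counter


def rank_terms(tokens):
    """Stage 1 (assumed scoring): rank candidate terms, 1 = most important.

    Raw frequency is a placeholder for the paper's unspecified
    term-importance measure.
    """
    counts = Counter(tokens)
    # Descending count, then alphabetical, so the order is deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {term: i + 1 for i, term in enumerate(ordered)}


def pick_translation(source_term, candidates, source_ranks, target_ranks):
    """Stage 2: choose the candidate whose rank is nearest the source rank."""
    src_rank = source_ranks[source_term]
    # Ignore candidates that never occur in the target-language corpus.
    ranked = [c for c in candidates if c in target_ranks]
    if not ranked:
        return None
    return min(ranked, key=lambda c: abs(target_ranks[c] - src_rank))


# Toy corpora (hypothetical data, already tokenized).
ja_ranks = rank_terms(["検索", "検索", "検索", "翻訳", "翻訳", "辞書"])
en_ranks = rank_terms(["retrieval"] * 3 + ["translation"] * 2 + ["dictionary"])

# An ambiguous dictionary entry: two possible translations for one term.
best = pick_translation("翻訳", ["retrieval", "translation"], ja_ranks, en_ranks)
print(best)
```

Comparing absolute rank positions only makes sense when the two term lists are of similar size; for corpora of very different sizes, a normalized rank would be a natural variation on this sketch.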

Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework

Information Processing & Management, Vol 52 (2), pp. 299-318, 2016. DOI: 10.1016/j.ipm.2015.08.001

Cited By ~ 12

Author(s): Razieh Rahimi, Azadeh Shakery, Irwin King

Keyword(s): Information Retrieval, Language Modeling, Modeling Framework, Comparable Corpora, Cross Language Information Retrieval, Cross Language

Effects of Comparable Corpora on Cross-Language Information Retrieval

Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science, 2010. DOI: 10.5220/0003029200530059

Keyword(s): Information Retrieval, Comparable Corpora, Cross Language Information Retrieval, Cross Language

Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora

Information Retrieval, Vol 16 (3), pp. 331-368, 2012. DOI: 10.1007/s10791-012-9200-5

Cited By ~ 26

Author(s): Ivan Vulić, Wim De Smet, Marie-Francine Moens

Keyword(s): Information Retrieval, Topic Models, Retrieval Models, Comparable Corpora, Cross Language Information Retrieval, Latent Topic, Cross Language

Using Comparable Corpora to Improve the Effectiveness of Cross-Language Information Retrieval

Advances in Natural Language Processing - Lecture Notes in Computer Science, pp. 320-331, 2010. DOI: 10.1007/978-3-642-14770-8_36

Cited By ~ 1

Author(s): Fatiha Sadat

Keyword(s): Information Retrieval, Comparable Corpora, Cross Language Information Retrieval, Cross Language

Research on Bilingual Corpus Based Machine Translation

Applied Mechanics and Materials, Vol 687-691, pp. 1683-1686, 2014. DOI: 10.4028/www.scientific.net/amm.687-691.1683

Author(s): Shuang Wang

Keyword(s): Information Retrieval, Machine Translation, Large Scale, Basic Information, High Quality, Main Method, Cross Language Information Retrieval, Automatic Acquisition, Cross Language, Different Characteristics

This thesis proposes several methods for acquiring bilingual corpora from different websites, including automatic acquisition from the "iciba" website, CNKI, and the Patent network. It introduces the methods and procedures used to acquire a variety of corpora. We propose different acquisition methods suited to the different characteristics of each site, and achieve fast and accurate automatic acquisition of a large-scale bilingual corpus. For the "iciba" website, the main method is the Nutch crawler, which performs relatively well, with accurate retrieval and good correlation. In addition, we abandon the idea of obtaining a bilingual corpus from the entire Internet and instead take a new approach: accessing the basic information of scholarly theses in CNKI to obtain a large-scale, high-quality English-Chinese bilingual corpus. In the end, we obtain a gigabyte-scale aligned bilingual corpus, which manual evaluation shows to be very accurate. The corpus lays the groundwork for further research on cross-language information retrieval.
