CLIReval: Evaluating Machine Translation as a Cross-Lingual Information Retrieval Task

The data-driven Bulgarian WordNet: BTBWN

Cognitive Studies | Études cognitives ◽

10.11649/cs.1713 ◽

2018 ◽

Author(s):

Petya Osenova ◽

Kiril Simov

Keyword(s):

Information Retrieval ◽

Machine Translation ◽

Semantic Information ◽

Data Driven ◽

Lexical Resources ◽

Multilingual Information Retrieval ◽

Cross Lingual ◽

Princeton Wordnet ◽

Word Senses

The paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both the syntactic and the lexical resources, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on identifying the senses used in BulTreeBank, but the missing senses of a lemma have also been covered by exploring bigger corpora. The identified senses have been organized in synsets for the Bulgarian WordNet and then aligned to the Princeton WordNet synsets. Various types of mappings between the two resources are considered from a cross-lingual perspective, with a view to ensuring maximum connectivity and the potential for incorporating language-specific concepts. The mapping between the two WordNets (English and Bulgarian) is a basis for applications such as machine translation and multilingual information retrieval.


Domain Adaptation of Statistical Machine Translation Models with Monolingual Data for Cross Lingual Information Retrieval

Lecture Notes in Computer Science - Advances in Information Retrieval ◽

10.1007/978-3-642-36973-5_80 ◽

2013 ◽

pp. 768-771 ◽

Cited By ~ 1

Author(s):

Vassilina Nikoulina ◽

Stéphane Clinchant

Keyword(s):

Information Retrieval ◽

Machine Translation ◽

Domain Adaptation ◽

Statistical Machine Translation ◽

Cross Lingual


Cross-lingual text similarity exploiting neural machine translation models

Journal of Information Science ◽

10.1177/0165551520912676 ◽

2020 ◽

pp. 016555152091267 ◽

Cited By ~ 1

Author(s):

Kazuhiro Seki

Keyword(s):

Machine Translation ◽

Learning To Rank ◽

Translation System ◽

Text Similarity ◽

Retrieval Task ◽

Neural Machine Translation ◽

Intermediate States ◽

Machine Translation System ◽

Cross Lingual ◽

Types Of Information

This article studies cross-lingual text similarity using neural machine translation models. A straightforward approach based on machine translation is to use translated text so as to make the problem monolingual. Another possible approach is to use intermediate states of machine translation models as recently proposed in the related work, which could avoid propagation of translation errors. We aim at improving both approaches independently and then combine the two types of information, that is, translations and intermediate states, in a learning-to-rank framework to compute cross-lingual text similarity. To evaluate the effectiveness and generalisability of our approach, we conduct empirical experiments on English–Japanese and English–Hindi translation corpora for a cross-lingual sentence retrieval task. It is demonstrated that our approach using translations and intermediate states outperforms other neural network–based approaches and is even comparable with a strong baseline based on a state-of-the-art machine translation system.


A Neural-Network-Based Approach to Chinese–Uyghur Organization Name Translation

Information ◽

10.3390/info11100492 ◽

2020 ◽

Vol 11 (10) ◽

pp. 492

Author(s):

Aishan Wumaier ◽

Cuiyun Xu ◽

Zaokere Kadeer ◽

Wenqi Liu ◽

Yingbo Wang ◽

...

Keyword(s):

Neural Network ◽

Information Retrieval ◽

Machine Translation ◽

Word Segmentation ◽

Translation System ◽

Attention Model ◽

Segmentation Approach ◽

Cross Lingual ◽

Agglutinative Languages ◽

Transformer Model

The recognition and translation of organization names (ONs) is challenging due to the complex structures and high variability involved. ONs consist not only of common generic words but also names, rare words, abbreviations and business and industry jargon. ONs are a sub-class of named entity (NE) phrases, which convey key information in text. As such, the correct translation of ONs is critical for machine translation and cross-lingual information retrieval. The existing Chinese–Uyghur neural machine translation systems have performed poorly when applied to ON translation tasks. As there are no publicly available Chinese–Uyghur ON translation corpora, an ON translation corpus is developed here, which includes 191,641 ON translation pairs. A word segmentation approach involving characterization, tagged characterization, byte pair encoding (BPE) and syllabification is proposed here for ON translation tasks. A recurrent neural network (RNN) attention framework and transformer are adapted here for ON translation tasks with different sequence granularities. The experimental results indicate that the transformer model not only outperforms the RNN attention model but also benefits from the proposed word segmentation approach. In addition, a Chinese–Uyghur ON translation system is developed here to automatically generate new translation pairs. This work significantly improves Chinese–Uyghur ON translation and can be applied to improve Chinese–Uyghur machine translation and cross-lingual information retrieval. It can also easily be extended to other agglutinative languages.


An Improvement in Statistical Machine Translation in Perspective of Hindi-English Cross-Lingual Information Retrieval

Computación y Sistemas ◽

10.13053/cys-22-4-3069 ◽

2018 ◽

Vol 22 (4) ◽

Author(s):

Vijay Kumar Sharma ◽

Namita Mittal

Keyword(s):

Information Retrieval ◽

Machine Translation ◽

Statistical Machine Translation ◽

Cross Lingual


On the memorability of icons in an information retrieval task

Behaviour and Information Technology ◽

10.1080/01449298808901869 ◽

1988 ◽

Vol 7 (2) ◽

pp. 131-151 ◽

Cited By ~ 20

Author(s):

M. W. Lansdale

Keyword(s):

Information Retrieval ◽

Retrieval Task


Merging Strategy for Cross-Lingual Information Retrieval Systems based on Learning Vector Quantization

Neural Processing Letters ◽

10.1007/s11063-005-2659-y ◽

2005 ◽

Vol 22 (2) ◽

pp. 149-161 ◽

Cited By ~ 1

Author(s):

M. T. Martín-Valdivia ◽

F. Martínez-Santiago ◽

L. A. Ureña-López

Keyword(s):

Information Retrieval ◽

Vector Quantization ◽

Learning Vector Quantization ◽

Retrieval Systems ◽

Information Retrieval Systems ◽

Cross Lingual ◽

Merging Strategy


A study of user profile representation for personalized cross-language information retrieval

Aslib Journal of Information Management ◽

10.1108/ajim-06-2015-0091 ◽

2016 ◽

Vol 68 (4) ◽

pp. 448-477 ◽

Cited By ~ 5

Author(s):

Dong Zhou ◽

Séamus Lawless ◽

Xuan Wu ◽

Wenyu Zhao ◽

Jianxun Liu

Keyword(s):

Information Retrieval ◽

Query Expansion ◽

User Profile ◽

User Profiles ◽

Content Type ◽

Cross Language Information Retrieval ◽

Cross Lingual ◽

Cross Language ◽

Representation Techniques ◽

Comprehensive Study

Purpose – With an increase in the amount of multilingual content on the World Wide Web, users are often striving to access information provided in a language of which they are non-native speakers. The purpose of this paper is to present a comprehensive study of user profile representation techniques and investigate their use in personalized cross-language information retrieval (CLIR) systems through the means of personalized query expansion.

Design/methodology/approach – The user profiles consist of weighted terms computed by using frequency-based methods such as tf-idf and BM25, as well as various latent semantic models trained on monolingual documents and cross-lingual comparable documents. This paper also proposes an automatic evaluation method for comparing various user profile generation techniques and query expansion methods.

Findings – Experimental results suggest that latent semantic-weighted user profile representation techniques are superior to frequency-based methods, and are particularly suitable for users with a sufficient amount of historical data. The study also confirmed that user profiles represented by latent semantic models trained on a cross-lingual level gained better performance than the models trained on a monolingual level.

Originality/value – Previous studies on personalized information retrieval systems have primarily investigated user profiles and personalization strategies on a monolingual level. The effect of utilizing such monolingual profiles for personalized CLIR remains unclear. The current study fills the gap by a comprehensive study of user profile representation for personalized CLIR and a novel personalized CLIR evaluation methodology to ensure repeatable and controlled experiments can be conducted.
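
As a rough illustration of the frequency-based profiles this abstract mentions, the sketch below builds a tf-idf weighted term profile from a user's reading history and uses it for a simple query expansion. It is not the authors' implementation: the function names, the toy documents, the smoothed idf variant, and the choice of appending the top-k profile terms are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_profile(user_docs, background_docs):
    """Return {term: weight} for one user (hypothetical helper).

    tf is counted over the user's own documents; idf is a smoothed
    log((1 + N) / (1 + df)) over a background collection.
    """
    n_docs = len(background_docs)
    # Document frequency: how many background documents contain each term.
    df = Counter(term for doc in background_docs for term in set(doc))
    # Term frequency over everything in the user's history.
    tf = Counter(term for doc in user_docs for term in doc)
    return {
        term: count * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }


def expand_query(query_terms, profile, k=2):
    """Append the k highest-weighted profile terms not already in the query."""
    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(profile, key=lambda t: (-profile[t], t))
    extra = [t for t in ranked if t not in query_terms][:k]
    return list(query_terms) + extra


# Toy, pre-tokenized data (assumed for illustration only).
background = [
    ["machine", "translation", "model"],
    ["information", "retrieval", "model"],
    ["neural", "translation", "model"],
    ["user", "profile", "retrieval"],
]
history = [["neural", "translation", "model"], ["neural", "retrieval"]]

profile = tfidf_profile(history, background)
expanded = expand_query(["translation"], profile)
print(expanded)
```

Terms that are frequent in the user's history but rare in the background collection receive the largest weights, so they are the first candidates for expansion. A BM25-style weighting would replace the raw term count with a saturating, length-normalized one.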


On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

10.18653/v1/2020.acl-main.151 ◽

2020 ◽

Author(s):

Wei Zhao ◽

Goran Glavaš ◽

Maxime Peyrard ◽

Yang Gao ◽

Robert West ◽

...

Keyword(s):

Machine Translation ◽

Machine Translation Evaluation ◽

Cross Lingual ◽

Free Machine


The Systran NLP Browser: An Application of Machine Translation Technology in Cross-Language Information Retrieval

Cross-Language Information Retrieval ◽

10.1007/978-1-4615-5661-9_9 ◽

1998 ◽

pp. 105-118 ◽

Cited By ~ 7

Author(s):

Denis A. Gachot ◽

Elke Lange ◽

Jin Yang

Keyword(s):

Information Retrieval ◽

Machine Translation ◽

Cross Language Information Retrieval ◽

Cross Language
