scholarly journals CLIReval: Evaluating Machine Translation as a Cross-Lingual Information Retrieval Task

Author(s):  
Shuo Sun ◽  
Suzanna Sia ◽  
Kevin Duh
Author(s):  
Petya Osenova ◽  
Kiril Simov

The data-driven Bulgarian WordNet: BTBWNThe paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both - syntactic and lexical resources, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on the identification of senses used in BulTreeBank, but the missing senses of a lemma also have been covered through exploration of bigger corpora. The identified senses have been organized in synsets for the Bulgarian WordNet. Then they have been aligned to the Princeton WordNet synsets. Various types of mappings are considered between both resources in a cross-lingual aspect and with respect to ensuring maximum connectivity and potential for incorporating the language specific concepts. The mapping between the two WordNets (English and Bulgarian) is a basis for applications such as machine translation and multilingual information retrieval. Oparty na danych WordNet bułgarski: BTBWNW artykule przedstawiono naszą pracę na rzecz jednoczesnej budowy opartego na danych wordnetu dla języka bułgarskiego oraz ręcznie oznaczonego informacjami semantycznymi banku drzew. Takie podejście wymaga uzgodnienia znaczeń słów zarówno w zasobach składniowych, jak i leksykalnych, bez ograniczania znaczeń umieszczanych w wordnecie do tych obecnych w korpusie, jak i odwrotnie. Nasza strategia koncentruje się na identyfikacji znaczeń stosowanych w BulTreeBank, przy czym brakujące znaczenia lematu zostały również zbadane przez zgłębienie większych korpusów. Zidentyfikowane znaczenia zostały zorganizowane w synsety bułgarskiego wordnetu, a następnie powiązane z synsetami Princeton WordNet. Rozmaite rodzaje rzutowań są rozpatrywane pomiędzy obydwoma zasobami w kontekście międzyjęzykowym, a także w odniesieniu do zapewnienia maksymalnej łączności i możliwości uwzględnienia pojęć specyficznych dla języka bułgarskiego. Rzutowanie między dwoma wordnetami (angielskim i bułgarskim) jest podstawą dla aplikacji, takich jak tłumaczenie maszynowe i wielojęzyczne wyszukiwanie informacji.


2020 ◽  
pp. 016555152091267 ◽  
Author(s):  
Kazuhiro Seki

This article studies cross-lingual text similarity using neural machine translation models. A straightforward approach based on machine translation is to use translated text so as to make the problem monolingual. Another possible approach is to use intermediate states of machine translation models as recently proposed in the related work, which could avoid propagation of translation errors. We aim at improving both approaches independently and then combine the two types of information, that is, translations and intermediate states, in a learning-to-rank framework to compute cross-lingual text similarity. To evaluate the effectiveness and generalisability of our approach, we conduct empirical experiments on English–Japanese and English–Hindi translation corpora for a cross-lingual sentence retrieval task. It is demonstrated that our approach using translations and intermediate states outperforms other neural network–based approaches and is even comparable with a strong baseline based on a state-of-the-art machine translation system.


Information ◽  
2020 ◽  
Vol 11 (10) ◽  
pp. 492
Author(s):  
Aishan Wumaier ◽  
Cuiyun Xu ◽  
Zaokere Kadeer ◽  
Wenqi Liu ◽  
Yingbo Wang ◽  
...  

The recognition and translation of organization names (ONs) is challenging due to the complex structures and high variability involved. ONs consist not only of common generic words but also names, rare words, abbreviations and business and industry jargon. ONs are a sub-class of named entity (NE) phrases, which convey key information in text. As such, the correct translation of ONs is critical for machine translation and cross-lingual information retrieval. The existing Chinese–Uyghur neural machine translation systems have performed poorly when applied to ON translation tasks. As there are no publicly available Chinese–Uyghur ON translation corpora, an ON translation corpus is developed here, which includes 191,641 ON translation pairs. A word segmentation approach involving characterization, tagged characterization, byte pair encoding (BPE) and syllabification is proposed here for ON translation tasks. A recurrent neural network (RNN) attention framework and transformer are adapted here for ON translation tasks with different sequence granularities. The experimental results indicate that the transformer model not only outperforms the RNN attention model but also benefits from the proposed word segmentation approach. In addition, a Chinese–Uyghur ON translation system is developed here to automatically generate new translation pairs. This work significantly improves Chinese–Uyghur ON translation and can be applied to improve Chinese–Uyghur machine translation and cross-lingual information retrieval. It can also easily be extended to other agglutinative languages.


2016 ◽  
Vol 68 (4) ◽  
pp. 448-477 ◽  
Author(s):  
Dong Zhou ◽  
Séamus Lawless ◽  
Xuan Wu ◽  
Wenyu Zhao ◽  
Jianxun Liu

Purpose – With an increase in the amount of multilingual content on the World Wide Web, users are often striving to access information provided in a language of which they are non-native speakers. The purpose of this paper is to present a comprehensive study of user profile representation techniques and investigate their use in personalized cross-language information retrieval (CLIR) systems through the means of personalized query expansion. Design/methodology/approach – The user profiles consist of weighted terms computed by using frequency-based methods such as tf-idf and BM25, as well as various latent semantic models trained on monolingual documents and cross-lingual comparable documents. This paper also proposes an automatic evaluation method for comparing various user profile generation techniques and query expansion methods. Findings – Experimental results suggest that latent semantic-weighted user profile representation techniques are superior to frequency-based methods, and are particularly suitable for users with a sufficient amount of historical data. The study also confirmed that user profiles represented by latent semantic models trained on a cross-lingual level gained better performance than the models trained on a monolingual level. Originality/value – Previous studies on personalized information retrieval systems have primarily investigated user profiles and personalization strategies on a monolingual level. The effect of utilizing such monolingual profiles for personalized CLIR remains unclear. The current study fills the gap by a comprehensive study of user profile representation for personalized CLIR and a novel personalized CLIR evaluation methodology to ensure repeatable and controlled experiments can be conducted.


2020 ◽  
Author(s):  
Wei Zhao ◽  
Goran Glavaš ◽  
Maxime Peyrard ◽  
Yang Gao ◽  
Robert West ◽  
...  

Sign in / Sign up

Export Citation Format

Share Document