A study of user profile representation for personalized cross-language information retrieval

2016 ◽  
Vol 68 (4) ◽  
pp. 448-477 ◽  
Author(s):  
Dong Zhou ◽  
Séamus Lawless ◽  
Xuan Wu ◽  
Wenyu Zhao ◽  
Jianxun Liu

Purpose – As the amount of multilingual content on the World Wide Web grows, users often need to access information written in a language in which they are not native speakers. The purpose of this paper is to present a comprehensive study of user profile representation techniques and to investigate their use in personalized cross-language information retrieval (CLIR) systems by means of personalized query expansion.
Design/methodology/approach – The user profiles consist of weighted terms computed using frequency-based methods such as tf-idf and BM25, as well as various latent semantic models trained on monolingual documents and on cross-lingual comparable documents. This paper also proposes an automatic evaluation method for comparing user profile generation techniques and query expansion methods.
Findings – Experimental results suggest that latent semantic-weighted user profile representation techniques are superior to frequency-based methods and are particularly suitable for users with a sufficient amount of historical data. The study also confirmed that user profiles represented by latent semantic models trained at the cross-lingual level performed better than those trained at the monolingual level.
Originality/value – Previous studies on personalized information retrieval systems have primarily investigated user profiles and personalization strategies at the monolingual level. The effect of using such monolingual profiles for personalized CLIR remains unclear. The current study fills this gap with a comprehensive study of user profile representation for personalized CLIR and a novel personalized CLIR evaluation methodology that allows repeatable and controlled experiments to be conducted.
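As a rough illustration of the frequency-based profiles mentioned in the abstract above, the following minimal Python sketch weights the terms in a user's search history by tf-idf and appends the top-weighted terms to a query. It is not the authors' implementation: the function names `build_profile` and `expand_query`, the smoothed idf formula, the toy documents and the cut-off `k` are all assumptions made for this example, and the cross-language translation step is left out entirely.

```python
import math
from collections import Counter


def build_profile(user_docs, background_docs):
    """Return {term: weight} for a user's history (hypothetical helper).

    Weight = term frequency in the user's documents times a smoothed
    inverse document frequency over a background collection.
    The smoothing (n + 1) / (df + 1) is an assumption of this sketch.
    """
    tf = Counter(term for doc in user_docs for term in doc)
    n = len(background_docs)
    df = Counter(term for doc in background_docs for term in set(doc))
    return {t: f * math.log((n + 1) / (df[t] + 1)) for t, f in tf.items()}


def expand_query(query_terms, profile, k=2):
    """Append the k highest-weighted profile terms not already in the query."""
    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    extra = [t for t, _ in ranked if t not in query_terms][:k]
    return list(query_terms) + extra


# Toy data, already tokenized; invented for illustration only.
user_docs = [["jaguar", "car", "engine"], ["car", "dealer"]]
background_docs = [["jaguar", "cat"], ["car", "engine"],
                   ["the", "dealer"], ["the", "car"]]

profile = build_profile(user_docs, background_docs)
expanded = expand_query(["jaguar"], profile, k=2)
print(expanded)
```

In this toy case the profile pushes the ambiguous query "jaguar" toward the automotive sense. A BM25-weighted or latent semantic profile would replace only the weighting inside `build_profile`; the expansion step would stay the same.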

Author(s):  
Vasudeva Varma ◽  
Aditya Mogadala

In this chapter, the authors begin by highlighting the importance of the Cross Lingual and Multilingual Information Retrieval and access research areas. They then discuss the distinctions between Cross Language Information Retrieval (CLIR), Multilingual Information Retrieval (MLIR), Cross Language Information Access (CLIA), and Multilingual Information Access (MLIA). Later sections outline the issues and challenges in these areas and discuss various approaches to multilingual information access, including machine learning-based and knowledge-based approaches. Drawing on their experience of building an MLIA system, the authors describe its architecture and its various subsystems, ranging from query processing to output generation. The chapter ends with a discussion of the evaluation of MLIA and CLIA systems.


2019 ◽  
Vol 71 (1) ◽  
pp. 72-89 ◽  
Author(s):  
Hengyi Fu

Purpose – With the increasing number of online multilingual resources, cross-language information retrieval (CLIR) has drawn much attention from the information retrieval (IR) research community. However, few studies have examined how and why multilingual searchers seek information in two or more languages, specifically how they switch and mix languages in queries to get satisfying results. The purpose of this paper is to focus on Chinese–English bilinguals’ intra-sentential code-switching behaviors in online searches. The study examined the scenarios and reasons for code-switching, the factors that may affect code-switching, the patterns of mixed language query formulation and reformulation, and how current IR systems and other search tools can support such information needs.
Design/methodology/approach – In-depth semi-structured interviews were used as the research method. In total, 30 participants were recruited based on their English proficiency, location and profession, using a purposive sampling method.
Findings – Four scenarios and four reasons for using Chinese–English mixed language queries to cover information needs were identified, and results suggest that linguistic and cultural/social factors are of equivalent importance in code-switching behaviors. English terms and Chinese terms in queries play different roles in searches, and mixed language queries cannot be replaced by either single language queries or other search facilitating features. Findings also suggest that current search engines and tools need greater emphasis on the user interface and that more user education is required.
Originality/value – This study presents a qualitative analysis of bilinguals’ code-switching behaviors in online searches. Findings are expected to advance the theoretical understanding of bilingual users’ search strategies and interactions with IR systems, and provide insights for designing more effective IR systems and tools to discover multilingual online resources, including cross-language controlled vocabularies, personalized CLIR tools and mixed language query assistants.


2016 ◽  
Vol 55 ◽  
pp. 249-281 ◽  
Author(s):  
Ahmad Khwileh ◽  
Debasis Ganguly ◽  
Gareth J. F. Jones

Recent years have seen significant efforts in the area of Cross Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has to date been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from the translation errors common to all CLIR tasks, but also from recognition errors introduced by the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the user-created metadata for each video. This work proposes and evaluates techniques to improve CLIR effectiveness for such noisy UGC. Our experimental investigation shows that different sources of evidence, e.g. the content from different fields of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that metadata fields vary in their robustness to query expansion (QE), so that expansion from some fields can have a negative impact on CLIR effectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion, and shows how this technique can be effective in improving CLIR effectiveness for UGC.

