Developing unsupervised knowledge-enhanced models to reduce the semantic gap in information retrieval

In this thesis we tackle the semantic gap, a long-standing problem in Information Retrieval (IR). The semantic gap can be described as the mismatch between users' queries and the way retrieval models answer such queries. Two main lines of work have emerged over the years to bridge the semantic gap: (i) the use of external knowledge resources to enhance the bag-of-words representations used by lexical models, and (ii) the use of semantic models to perform matching between the latent representations of queries and documents. To deal with this issue, we first perform an in-depth evaluation of lexical and semantic models through different analyses [Marchesin et al., 2019]. The objective of this evaluation is to understand what features lexical and semantic models share, whether their signals are complementary, and how they can be combined to effectively address the semantic gap. In particular, the evaluation focuses on (semantic) neural models and their critical aspects. Each analysis offers a different perspective on semantic models and their relation to lexical models. The outcomes of this evaluation highlight the differences between lexical and semantic signals, and the need to combine them at the early stages of the IR pipeline to effectively address the semantic gap. Then, we build on the insights of this evaluation to develop lexical and semantic models addressing the semantic gap. Specifically, we develop unsupervised models that integrate knowledge from external resources, and we evaluate them for the medical domain - a domain with a high social value, where the semantic gap is prominent, and where the wide availability of authoritative knowledge resources allows us to explore effective ways to address it. For lexical models, we investigate how - and to what extent - concepts and relations stored within knowledge resources can be integrated in query representations to improve the effectiveness of lexical models.
Thus, we propose and evaluate several knowledge-based query expansion and reduction techniques [Agosti et al., 2018, 2019; Di Nunzio et al., 2019]. These query reformulations are used to increase the probability of retrieving relevant documents by adding highly specific terms to, or removing them from, the original query. The experimental analyses on different test collections for Precision Medicine - a particular use case of Clinical Decision Support (CDS) - show the effectiveness of the proposed query reformulations. In particular, a specific subset of query reformulations allows lexical models to achieve top-performing results in all the considered collections. Regarding semantic models, we first analyze the limitations of the knowledge-enhanced neural models presented in the literature. Then, to overcome these limitations, we propose SAFIR [Agosti et al., 2020], an unsupervised knowledge-enhanced neural framework for IR. SAFIR integrates external knowledge in the learning process of neural IR models and it does not require labeled data for training. Thus, the representations learned within this framework are optimized for IR and encode linguistic features that are relevant to address the semantic gap. The evaluation on different test collections for CDS demonstrates the effectiveness of SAFIR when used to perform retrieval over the entire document collection or to retrieve documents for Pseudo Relevance Feedback (PRF) methods - that is, when it is used at the early stages of the IR pipeline. In particular, the quantitative and qualitative analyses highlight the ability of SAFIR to retrieve relevant documents affected by the semantic gap, as well as the effectiveness of combining lexical and semantic models at the early stages of the IR pipeline - where the complementary signals they provide can be used to obtain better answers to semantically hard queries.
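As a rough illustration of what combining lexical and semantic signals can look like, the sketch below fuses two score lists with min-max normalization and a weighted sum. This is a generic fusion scheme chosen for illustration; the abstract does not state which combination method the thesis uses, and the function names, the `weight` parameter, and the toy scores are all assumptions.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; all-equal scores map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(lexical: Dict[str, float],
         semantic: Dict[str, float],
         weight: float = 0.5) -> List[str]:
    """Rank documents by a weighted sum of normalized scores.

    `weight` (hypothetical parameter) is the share given to the lexical
    model; a document missing from one list contributes 0.0 for it.
    """
    lex, sem = min_max(lexical), min_max(semantic)
    docs = set(lex) | set(sem)
    fused = {
        doc: weight * lex.get(doc, 0.0) + (1 - weight) * sem.get(doc, 0.0)
        for doc in docs
    }
    # Sort by descending fused score, breaking ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Toy scores (invented for illustration only).
lexical_scores = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
semantic_scores = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
ranking = fuse(lexical_scores, semantic_scores)
```

In this toy case a document scored moderately by both models ("d2") ends up ahead of one scored highly by the lexical model alone ("d1"), which is the kind of complementary effect the abstract describes.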

Simple but Effective Knowledge-Based Query Reformulations for Precision Medicine Retrieval

Information ◽

10.3390/info12100402 ◽

2021 ◽

Vol 12 (10) ◽

pp. 402

Author(s):

Stefano Marchesin ◽

Giorgio Maria Di Nunzio ◽

Maristella Agosti

Keyword(s):

Precision Medicine ◽

Query Expansion ◽

Semantic Gap ◽

Bag Of Words ◽

Retrieval Models ◽

Knowledge Resources ◽

Test Collections ◽

Knowledge Based ◽

Reduction Techniques ◽

Specific Subset

In Information Retrieval (IR), the semantic gap represents the mismatch between users’ queries and how retrieval models answer these queries. In this paper, we explore how to use external knowledge resources to enhance bag-of-words representations and reduce the effect of the semantic gap between queries and documents. In this regard, we propose several simple but effective knowledge-based query expansion and reduction techniques, and we evaluate them for the medical domain. The proposed query reformulations are used to increase the probability of retrieving relevant documents by adding highly specific terms to, or removing them from, the original query. The experimental analyses on different test collections for Precision Medicine IR show the effectiveness of the developed techniques. In particular, a specific subset of query reformulations allows retrieval models to achieve top-performing results in all the considered test collections.
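The expansion and reduction idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: the toy knowledge resource, the list of overly specific terms, and the function names are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge resource: query term -> related terms.
TOY_KNOWLEDGE: Dict[str, List[str]] = {
    "melanoma": ["skin cancer"],
    "braf": ["b-raf"],
}

# Hypothetical set of terms assumed too specific to help matching.
TOO_SPECIFIC: Set[str] = {"v600e"}


def expand_query(terms: List[str],
                 knowledge: Dict[str, List[str]]) -> List[str]:
    """Append related terms from the knowledge resource, skipping duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in knowledge.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def reduce_query(terms: List[str], too_specific: Set[str]) -> List[str]:
    """Drop terms flagged as too specific, preserving the original order."""
    return [term for term in terms if term not in too_specific]


query = ["melanoma", "braf", "v600e"]
expanded = expand_query(query, TOY_KNOWLEDGE)
reduced = reduce_query(query, TOO_SPECIFIC)
```

Either reformulated term list would then be passed to a bag-of-words retrieval model in place of the original query.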

Neural generative models and representation learning for information retrieval

ACM SIGIR Forum ◽

10.1145/3458553.3458565 ◽

2019 ◽

Vol 53 (2) ◽

pp. 97-97

Author(s):

Qingyao Ai

Keyword(s):

Information Retrieval ◽

Theoretical Analysis ◽

Language Processing ◽

Ad Hoc ◽

Representation Learning ◽

Generative Models ◽

Neural Models ◽

Retrieval Models ◽

Types Of Information ◽

Text Images

Information Retrieval (IR) concerns the structure, analysis, organization, storage, and retrieval of information. Among the different retrieval models proposed in the past decades, generative retrieval models, especially those under the statistical probabilistic framework, are among the most popular techniques and have been widely applied to IR problems. While they are known for their well-grounded theory and good empirical performance in text retrieval, their applications in IR are often limited by their complexity and limited extensibility in the modeling of high-dimensional information. Recently, advances in deep learning techniques have provided new opportunities for representation learning and generative models for information retrieval. In contrast to statistical models, neural models have much more flexibility because they model information and data correlation in latent spaces without explicitly relying on any prior knowledge. Previous studies on pattern recognition and natural language processing have shown that semantically meaningful representations of text, images, and many types of information can be acquired with neural models through supervised or unsupervised training. Nonetheless, the effectiveness of neural models for information retrieval is mostly unexplored. In this thesis, we study how to develop new generative models and representation learning frameworks with neural models for information retrieval.
Specifically, our contributions include three main components: (1) Theoretical Analysis: we present the first theoretical analysis and adaptation of existing neural embedding models for ad-hoc retrieval tasks; (2) Design Practice: based on our experience and knowledge, we show how to design an embedding-based neural generative model for practical information retrieval tasks such as personalized product search; and (3) Generic Framework: we further generalize our proposed neural generative framework for complicated heterogeneous information retrieval scenarios that concern text, images, knowledge entities, and their relationships. Empirical results show that the proposed neural generative framework can effectively learn information representations and construct retrieval models that outperform the state-of-the-art systems in a variety of IR tasks.

Emergent Semantics

Managing Multimedia Semantics ◽

10.4018/978-1-59140-569-6.ch015 ◽

2011 ◽

pp. 351-362

Author(s):

Viranga Ratnaike ◽

Bala Srinivasan ◽

Surya Nepal

Keyword(s):

Information Retrieval ◽

Multimedia Information ◽

Semantic Gap ◽

Multimedia Information Retrieval ◽

Sensory Data ◽

The Past ◽

Multimedia Semantics ◽

Emergent Systems ◽

Semantic Models ◽

Emergent Semantics

The semantic gap is recognized as one of the major problems in managing multimedia semantics. It is the gap between sensory data and semantic models. Often the sensory data and associated context compose situations which have not been anticipated by system architects. Emergence is a phenomenon that can be employed to deal with such unanticipated situations. In the past, researchers and practitioners paid little attention to applying the concepts of emergence to multimedia information retrieval. Recently, there have been attempts to use emergent semantics as a way of dealing with the semantic gap. This chapter aims to provide an overview of the field as it applies to multimedia. We begin with the concepts behind emergence, cover the requirements of emergent systems, and survey the existing body of research.

Neural models for information retrieval without labeled data

ACM SIGIR Forum ◽

10.1145/3458553.3458569 ◽

2019 ◽

Vol 53 (2) ◽

pp. 104-105

Author(s):

Hamed Zamani

Keyword(s):

Neural Network ◽

Information Retrieval ◽

Performance Prediction ◽

Large Scale ◽

Deep Neural Networks ◽

State Of The Art ◽

Training Data ◽

Retrieval Model ◽

Neural Models ◽

Retrieval Models

Recent developments of machine learning models, and in particular deep neural networks, have yielded significant improvements on several computer vision, natural language processing, and speech recognition tasks. Progress with information retrieval (IR) tasks has been slower, however, due to the lack of large-scale training data as well as neural network models specifically designed for effective information retrieval [9]. In this dissertation, we address these two issues by introducing task-specific neural network architectures for a set of IR tasks and proposing novel unsupervised or weakly supervised solutions for training the models. The proposed learning solutions do not require labeled training data. Instead, in our weak supervision approach, neural models are trained on a large set of noisy and biased training data obtained from external resources, existing models, or heuristics. We first introduce relevance-based embedding models [3] that learn distributed representations for words and queries. We show that the learned representations can be effectively employed for a set of IR tasks, including query expansion, pseudo-relevance feedback, and query classification [1, 2]. We further propose a standalone learning to rank model based on deep neural networks [5, 8]. Our model learns a sparse representation for queries and documents. This enables us to perform efficient retrieval by constructing an inverted index in the learned semantic space. Our model outperforms state-of-the-art retrieval models, while performing as efficiently as term matching retrieval models. We additionally propose a neural network framework for predicting the performance of a retrieval model for a given query [7]. Inspired by existing query performance prediction models, our framework integrates several information sources, such as retrieval score distribution and term distribution in the top retrieved documents. 
This leads to state-of-the-art results for the performance prediction task on various standard collections. We finally bridge the gap between retrieval and recommendation models, as the two key components in most information systems. Search and recommendation often share the same goal: helping people get the information they need at the right time. Therefore, joint modeling and optimization of search engines and recommender systems could potentially benefit both systems [4]. In more detail, we introduce a retrieval model that is trained using user-item interactions (e.g., recommendation data), with no need for query-document relevance information for training [6]. Our solutions and findings in this dissertation smooth the path towards learning efficient and effective models for various information retrieval and related tasks, especially when large-scale training data is not available.
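The abstract above mentions building an inverted index over sparse learned representations. The sketch below shows the general mechanism only, assuming each document and query is already given as a sparse vector (a mapping from latent dimension to weight) and that relevance is scored by a dot product. The neural model that would produce such vectors is out of scope, and all names and values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]  # latent dimension -> non-zero weight


def build_index(
    doc_vectors: Dict[str, SparseVector]
) -> Dict[int, List[Tuple[str, float]]]:
    """Map each active latent dimension to the documents that use it."""
    index: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for dim, weight in vector.items():
            if weight != 0.0:
                index[dim].append((doc_id, weight))
    return index


def search(index: Dict[int, List[Tuple[str, float]]],
           query_vector: SparseVector) -> List[Tuple[str, float]]:
    """Score by dot product, visiting only the query's active dimensions."""
    scores: Dict[str, float] = defaultdict(float)
    for dim, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy sparse vectors (invented for illustration only).
docs = {"a": {0: 1.0, 3: 2.0}, "b": {3: 1.0}, "c": {5: 4.0}}
index = build_index(docs)
results = search(index, {3: 1.0, 0: 0.5})
```

Because only the posting lists for the query's active dimensions are visited, documents such as "c" that share no dimension with the query are never touched, which is where the efficiency of term-matching retrieval comes from.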

Test collections for electronic health record-based clinical information retrieval

JAMIA Open ◽

10.1093/jamiaopen/ooz016 ◽

2019 ◽

Vol 2 (3) ◽

pp. 360-368 ◽

Cited By ~ 7

Author(s):

Yanshan Wang ◽

Andrew Wen ◽

Sijia Liu ◽

William Hersh ◽

Steven Bedrick ◽

...

Keyword(s):

Information Retrieval ◽

Electronic Health Record ◽

Mayo Clinic ◽

Clinical Information ◽

Free Text ◽

Health Record ◽

Retrieval Model ◽

Retrieval Models ◽

Test Collections ◽

Electronic Health

Objectives: To create test collections for evaluating clinical information retrieval (IR) systems and advancing clinical IR research.

Materials and Methods: Electronic health record (EHR) data, including structured and free-text data, from 45 000 patients who are a part of the Mayo Clinic Biobank cohort was retrieved from the clinical data warehouse. The clinical IR system indexed a total of 42 million free-text EHR documents. The search queries consisted of 56 topics developed through a collaboration between Mayo Clinic and Oregon Health & Science University. We described the creation of test collections, including a to-be-evaluated document pool using five retrieval models, and human assessment guidelines. We analyzed the relevance judgment results in terms of human agreement and time spent, and results of three levels of relevance, and reported performance of five retrieval models.

Results: The two judges had a moderate overall agreement with a Kappa value of 0.49, spent a consistent amount of time judging the relevance, and were able to identify easy and difficult topics. The conventional retrieval model performed best on most topics while a concept-based retrieval model had better performance on the topics requiring conceptual level retrieval.

Discussion: IR can provide an alternate approach to leveraging clinical narratives for patient information discovery as it is less dependent on semantics. Our study showed the feasibility of test collections along with a few challenges.

Conclusion: The conventional test collections for evaluating the IR system show potential for successfully evaluating clinical IR systems with a few challenges to be investigated.
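For readers unfamiliar with the agreement statistic reported above, the sketch below computes Cohen's kappa for two judges. The abstract says only "Kappa", so treating it as Cohen's kappa is an assumption, and the labels used here are invented toy data, not the study's judgments.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(judge_a: Sequence[Hashable],
                judge_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("judges must label the same non-empty item set")
    n = len(judge_a)
    # Observed agreement: share of items given the same label.
    p_observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of each judge's marginal label rates.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if p_expected == 1.0:
        # Both judges used one identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy relevance labels (1 = relevant, 0 = not relevant), invented here.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the judges agree on three of four items (0.75 observed agreement), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.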

Emergent Semantics

Multimedia Technologies ◽

10.4018/978-1-59904-953-3.ch025 ◽

2008 ◽

pp. 305-315

Author(s):

Viranga Ratnaike ◽

Bala Srinivasan ◽

Surya Nepal

Keyword(s):

Information Retrieval ◽

Multimedia Information ◽

Semantic Gap ◽

Multimedia Information Retrieval ◽

Sensory Data ◽

The Past ◽

Multimedia Semantics ◽

Emergent Systems ◽

Semantic Models ◽

Emergent Semantics

A Survey on Information Retrieval Models, Techniques and Applications

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i7.90 ◽

2017 ◽

Vol 7 (7) ◽

pp. 16 ◽

Cited By ~ 1

Author(s):

Ndengabaganizi Tonny James ◽

Rajkumar Kannan

Keyword(s):

Information Retrieval ◽

Retrieval Models ◽

Knowledge Based ◽

Long Time

People have long recognized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information, and finding useful information in such collections became a necessity. Over the last forty years, Information Retrieval (IR) has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. IR is generally concerned with searching for and retrieving knowledge-based information from databases. In this paper, we discuss the various models and techniques for information retrieval. We also provide an overview of traditional IR models.
