Neural generative models and representation learning for information retrieval

Information Retrieval (IR) is concerned with the structure, analysis, organization, storage, and retrieval of information. Among the retrieval models proposed in the past decades, generative retrieval models, especially those under the statistical probabilistic framework, are among the most popular techniques and have been widely applied to information retrieval problems. While they are known for their well-grounded theory and good empirical performance in text retrieval, their applications in IR are often limited by their complexity and low extensibility in modeling high-dimensional information. Recently, advances in deep learning have provided new opportunities for representation learning and generative models for information retrieval. In contrast to statistical models, neural models have much more flexibility because they model information and data correlation in latent spaces without explicitly relying on any prior knowledge. Previous studies on pattern recognition and natural language processing have shown that semantically meaningful representations of text, images, and many other types of information can be acquired with neural models through supervised or unsupervised training. Nonetheless, the effectiveness of neural models for information retrieval is mostly unexplored. In this thesis, we study how to develop new generative models and representation learning frameworks with neural models for information retrieval.
Specifically, our contributions include three main components: (1) Theoretical Analysis: We present the first theoretical analysis and adaptation of existing neural embedding models for ad-hoc retrieval tasks; (2) Design Practice: Based on our experience and knowledge, we show how to design an embedding-based neural generative model for practical information retrieval tasks such as personalized product search; and (3) Generic Framework: We further generalize our proposed neural generative framework to complicated heterogeneous information retrieval scenarios that involve text, images, knowledge entities, and their relationships. Empirical results show that the proposed neural generative framework can effectively learn information representations and construct retrieval models that outperform state-of-the-art systems in a variety of IR tasks.

Methods and Trends in Information Retrieval in Big Data Genomic Research

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i1109.0789s219 ◽

2019 ◽

Vol 8 (9S2) ◽

pp. 515-523

Keyword(s):

Big Data ◽

Information Retrieval ◽

Language Processing ◽

Genomic Research ◽

Genomic Information ◽

Genome Research ◽

Exploratory Survey ◽

The Common ◽

Object Features ◽

Text Images

This paper describes information retrieval (IR) and the common methods of finding, extracting, and mining information in genomic research through text mining and natural language processing (NLP). A surge of genomic information from the literature, together with the production of genome datasets, has driven the development of several tools for analyzing and presenting newfound knowledge in biomedical and genome research. The paper presents recent research trends, surveys, reviews, experiments, and concepts in information retrieval applied to text, images, and object features in big data genomic research. The method is an exploratory survey of IR uses in genomic research that presents the concepts, methods, evaluation results, and next steps described by key researchers.

Information retrieval models for recommender systems

ACM SIGIR Forum ◽

10.1145/3458537.3458545 ◽

2019 ◽

Vol 53 (1) ◽

pp. 44-45

Author(s):

Daniel Valcarce

Keyword(s):

Information Retrieval ◽

Recommender Systems ◽

Relevance Feedback ◽

Information Needs ◽

Ad Hoc ◽

Information Overload ◽

Group Formation ◽

Retrieval Models ◽

Information Retrieval Evaluation ◽

Pseudo Relevance Feedback

Information retrieval addresses the information needs of users by delivering relevant pieces of information but requires users to convey their information needs explicitly. In contrast, recommender systems offer personalized suggestions of items automatically. Ultimately, both fields help users cope with information overload by providing them with relevant items of information. This thesis aims to explore the connections between information retrieval and recommender systems. Our objective is to devise recommendation models inspired by information retrieval techniques. We begin by borrowing ideas from the information retrieval evaluation literature to analyze evaluation metrics in recommender systems [2]. Second, we study the applicability of pseudo-relevance feedback models to different recommendation tasks [1]. We investigate the conventional top-N recommendation task [5, 4, 6, 7], but we also explore the recently formulated user-item group formation problem [3] and propose a novel task based on the liquidation of long tail items [8]. Third, we exploit ad hoc retrieval models to compute neighborhoods in a collaborative filtering scenario [9, 10, 12]. Fourth, we explore the opposite direction by adapting an effective recommendation framework to pseudo-relevance feedback [13, 11]. Finally, we discuss the results and present our conclusions. In summary, this doctoral thesis adapts a series of information retrieval models to recommender systems. Our investigation shows that many retrieval models can be adapted to different recommendation tasks. Moreover, we find that taking the opposite path is also possible. Exhaustive experimentation confirms that the proposed models are competitive. Finally, we also perform a theoretical analysis of some models to explain their effectiveness. Advisors: Álvaro Barreiro and Javier Parapar. Committee members: Gabriella Pasi, Pablo Castells and Fidel Cacheda.
The dissertation is available at: https://www.dc.fi.udc.es/~dvalcarce/thesis.pdf.

Neural models for information retrieval without labeled data

ACM SIGIR Forum ◽

10.1145/3458553.3458569 ◽

2019 ◽

Vol 53 (2) ◽

pp. 104-105

Author(s):

Hamed Zamani

Keyword(s):

Neural Network ◽

Information Retrieval ◽

Performance Prediction ◽

Large Scale ◽

Deep Neural Networks ◽

State Of The Art ◽

Training Data ◽

Retrieval Model ◽

Neural Models ◽

Retrieval Models

Recent developments of machine learning models, and in particular deep neural networks, have yielded significant improvements on several computer vision, natural language processing, and speech recognition tasks. Progress with information retrieval (IR) tasks has been slower, however, due to the lack of large-scale training data as well as neural network models specifically designed for effective information retrieval [9]. In this dissertation, we address these two issues by introducing task-specific neural network architectures for a set of IR tasks and proposing novel unsupervised or weakly supervised solutions for training the models. The proposed learning solutions do not require labeled training data. Instead, in our weak supervision approach, neural models are trained on a large set of noisy and biased training data obtained from external resources, existing models, or heuristics. We first introduce relevance-based embedding models [3] that learn distributed representations for words and queries. We show that the learned representations can be effectively employed for a set of IR tasks, including query expansion, pseudo-relevance feedback, and query classification [1, 2]. We further propose a standalone learning to rank model based on deep neural networks [5, 8]. Our model learns a sparse representation for queries and documents. This enables us to perform efficient retrieval by constructing an inverted index in the learned semantic space. Our model outperforms state-of-the-art retrieval models, while performing as efficiently as term matching retrieval models. We additionally propose a neural network framework for predicting the performance of a retrieval model for a given query [7]. Inspired by existing query performance prediction models, our framework integrates several information sources, such as retrieval score distribution and term distribution in the top retrieved documents. 
This leads to state-of-the-art results for the performance prediction task on various standard collections. We finally bridge the gap between retrieval and recommendation models, as the two key components in most information systems. Search and recommendation often share the same goal: helping people get the information they need at the right time. Therefore, joint modeling and optimization of search engines and recommender systems could potentially benefit both systems [4]. In more detail, we introduce a retrieval model that is trained using user-item interactions (e.g., recommendation data), with no need for query-document relevance information for training [6]. Our solutions and findings in this dissertation smooth the path towards learning efficient and effective models for various information retrieval and related tasks, especially when large-scale training data is not available.
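
The abstract's point that a sparse learned representation allows retrieval through an inverted index can be illustrated with a minimal sketch. It is not the dissertation's model: the latent dimension numbers, weights, and function names (`build_inverted_index`, `retrieve`) are hypothetical, and the sparse vectors are assumed to have been produced already by some trained encoder. Scoring here is a plain dot product over shared active dimensions.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each active latent dimension to the (doc_id, weight) pairs that use it."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for dim, weight in vector.items():
            if weight != 0.0:
                index[dim].append((doc_id, weight))
    return index


def retrieve(query_vector, index, top_k=10):
    """Score only documents that share an active dimension with the query."""
    scores = defaultdict(float)
    for dim, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight  # dot-product contribution
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical sparse vectors: {latent dimension: weight}.
doc_vectors = {
    "d1": {3: 0.9, 17: 0.4},
    "d2": {17: 0.7, 42: 0.5},
    "d3": {8: 1.0},
}
index = build_inverted_index(doc_vectors)
results = retrieve({17: 1.0, 42: 0.5}, index)
print(results)
```

Because `d3` shares no active dimension with the query, it is never touched during scoring, which is the property that makes this kind of lookup comparable in cost to term matching.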

A Data-Driven Strategy to Combine Word Embeddings in Information Retrieval

10.5121/csit.2021.110107 ◽

2021 ◽

Author(s):

Alfredo Silva ◽

Marcelo Mendoza

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Language Processing ◽

Ad Hoc ◽

Data Driven ◽

Word Embeddings ◽

Continuous Vector ◽

Benchmark Data ◽

Promising Line ◽

Vector Representations

Word embeddings are vital descriptors of words in unigram representations of documents for many tasks in natural language processing and information retrieval. The representation of queries has been one of the most critical challenges in this area because a query consists of only a few terms and has little descriptive capacity. Strategies such as average word embeddings can enrich the queries' descriptive capacity since they favor the identification of related terms from the continuous vector representations that characterize these approaches. We propose a data-driven strategy to combine word embeddings. We use IDF combinations of embeddings to represent queries, showing that these representations outperform the average word embeddings recently proposed in the literature. Experimental results on benchmark data show that our proposal performs well, suggesting that data-driven combinations of word embeddings are a promising line of research in ad-hoc information retrieval.
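
The contrast between average and IDF-based query embeddings can be sketched as follows. The abstract does not give the exact combination, so this assumes a generic IDF-weighted average with a smoothed IDF formula; the toy corpus, the two-dimensional vectors, and all function names are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """Smoothed IDF, log((N + 1) / (df + 1)) + 1, for each term in a tokenized corpus."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((n_docs + 1) / (count + 1)) + 1.0 for t, count in df.items()}


def average_embedding(query, vectors):
    """Unweighted mean of the embeddings of the query terms that have a vector."""
    terms = [t for t in query if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not terms:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in terms) / len(terms) for i in range(dim)]


def idf_weighted_embedding(query, vectors, idf):
    """Mean of query-term embeddings, each weighted by the term's IDF."""
    terms = [t for t in query if t in vectors]
    dim = len(next(iter(vectors.values())))
    total = sum(idf.get(t, 0.0) for t in terms)
    if total == 0.0:
        return [0.0] * dim
    return [sum(idf.get(t, 0.0) * vectors[t][i] for t in terms) / total
            for i in range(dim)]


# Toy corpus and toy 2-d embeddings (illustrative values only).
docs = [["the", "covid", "vaccine"], ["the", "vaccine", "trial"], ["the", "weather"]]
vectors = {"the": [1.0, 0.0], "covid": [0.0, 1.0], "vaccine": [0.2, 0.8]}
idf = inverse_document_frequency(docs)

query = ["the", "covid"]
avg = average_embedding(query, vectors)
weighted = idf_weighted_embedding(query, vectors, idf)
print(avg, weighted)
```

In this toy case the rare term "covid" receives a larger weight than the ubiquitous "the", so the weighted query vector moves toward the direction of "covid", whereas the plain average treats both terms equally.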

Information retrieval in an infodemic: the case of COVID-19 publications

10.1101/2021.01.29.428847 ◽

2021 ◽

Author(s):

Sohrab Ferdowsi ◽

Nikolay Borissov ◽

Elham Kashani ◽

David Vicente Alvarez ◽

Jenny Copara ◽

...

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Language Processing ◽

Exponential Growth ◽

Information Needs ◽

Scientific Literature ◽

The Other ◽

Retrieval Models ◽

Processing Algorithms ◽

Number Of Publications

In the context of searching for COVID-19 related scientific literature, we present an information retrieval methodology for effectively finding relevant publications for different information needs. We discuss different components of our architecture consisting of traditional information retrieval models, as well as modern neural natural language processing algorithms. We present recipes to better adapt these components to the case of an infodemic, where, on the one hand, the number of publications grows exponentially and, on the other hand, the topics of interest evolve as the pandemic progresses. The methodology was evaluated in the TREC-COVID challenge, achieving results competitive with the top-ranking teams participating in the competition. In retrospect, we provide additional insights from this challenge that may be useful in future work.

A Survey of Information Retrieval Models for Malayalam Language Processing

International Journal of Computer Applications ◽

10.5120/18820-0230 ◽

2014 ◽

Vol 107 (14) ◽

pp. 19-23 ◽

Cited By ~ 1

Author(s):

Arjun Babu ◽

Sindhu L.

Keyword(s):

Information Retrieval ◽

Language Processing ◽

Retrieval Models ◽

Malayalam Language

Developing unsupervised knowledge-enhanced models to reduce the semantic gap in information retrieval

ACM SIGIR Forum ◽

10.1145/3476415.3476433 ◽

2021 ◽

Vol 55 (1) ◽

pp. 1-2

Author(s):

Stefano Marchesin

Keyword(s):

Information Retrieval ◽

Semantic Gap ◽

Neural Models ◽

Retrieval Models ◽

Knowledge Resources ◽

Test Collections ◽

External Knowledge ◽

Early Stages ◽

Specific Subset ◽

Semantic Models

In this thesis we tackle the semantic gap, a long-standing problem in Information Retrieval (IR). The semantic gap can be described as the mismatch between users' queries and the way retrieval models answer such queries. Two main lines of work have emerged over the years to bridge the semantic gap: (i) the use of external knowledge resources to enhance the bag-of-words representations used by lexical models, and (ii) the use of semantic models to perform matching between the latent representations of queries and documents. To deal with this issue, we first perform an in-depth evaluation of lexical and semantic models through different analyses [Marchesin et al., 2019]. The objective of this evaluation is to understand what features lexical and semantic models share, whether their signals are complementary, and how they can be combined to effectively address the semantic gap. In particular, the evaluation focuses on (semantic) neural models and their critical aspects. Each analysis brings a different perspective on the understanding of semantic models and their relation with lexical models. The outcomes of this evaluation highlight the differences between lexical and semantic signals, and the need to combine them at the early stages of the IR pipeline to effectively address the semantic gap. Then, we build on the insights of this evaluation to develop lexical and semantic models addressing the semantic gap. Specifically, we develop unsupervised models that integrate knowledge from external resources, and we evaluate them for the medical domain - a domain with a high social value, where the semantic gap is prominent, and the large presence of authoritative knowledge resources allows us to explore effective ways to address it. For lexical models, we investigate how - and to what extent - concepts and relations stored within knowledge resources can be integrated into query representations to improve the effectiveness of lexical models.
Thus, we propose and evaluate several knowledge-based query expansion and reduction techniques [Agosti et al., 2018, 2019; Di Nunzio et al., 2019]. These query reformulations are used to increase the probability of retrieving relevant documents by adding highly specific terms to, or removing them from, the original query. The experimental analyses on different test collections for Precision Medicine - a particular use case of Clinical Decision Support (CDS) - show the effectiveness of the proposed query reformulations. In particular, a specific subset of query reformulations allows lexical models to achieve top-performing results in all the considered collections. Regarding semantic models, we first analyze the limitations of the knowledge-enhanced neural models presented in the literature. Then, to overcome these limitations, we propose SAFIR [Agosti et al., 2020], an unsupervised knowledge-enhanced neural framework for IR. SAFIR integrates external knowledge in the learning process of neural IR models and it does not require labeled data for training. Thus, the representations learned within this framework are optimized for IR and encode linguistic features that are relevant to address the semantic gap. The evaluation on different test collections for CDS demonstrates the effectiveness of SAFIR when used to perform retrieval over the entire document collection or to retrieve documents for Pseudo Relevance Feedback (PRF) methods - that is, when it is used at the early stages of the IR pipeline. In particular, the quantitative and qualitative analyses highlight the ability of SAFIR to retrieve relevant documents affected by the semantic gap, as well as the effectiveness of combining lexical and semantic models at the early stages of the IR pipeline - where the complementary signals they provide can be used to obtain better answers to semantically hard queries.

Applying Light Natural Language Processing to Ad-Hoc Cross Language Information Retrieval

Accessing Multilingual Information Repositories - Lecture Notes in Computer Science ◽

10.1007/11878773_19 ◽

2006 ◽

pp. 170-178 ◽

Cited By ~ 1

Author(s):

Christina Lioma ◽

Craig Macdonald ◽

Ben He ◽

Vassilis Plachouras ◽

Iadh Ounis

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Ad Hoc ◽

Cross Language Information Retrieval ◽

Cross Language

A Survey on Information Retrieval Models, Techniques and Applications

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i7.90 ◽

2017 ◽

Vol 7 (7) ◽

pp. 16 ◽

Cited By ~ 1

Author(s):

Ndengabaganizi Tonny James ◽

Rajkumar Kannan

Keyword(s):

Information Retrieval ◽

Retrieval Models ◽

Knowledge Based ◽

Long Time

People have long realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information, and finding useful information in such collections became a necessity. Over the last forty years, Information Retrieval (IR) has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. IR is generally concerned with searching for and retrieving knowledge-based information from databases. In this paper, we discuss the various models and techniques for information retrieval. We also provide an overview of traditional IR models.

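Design and Development of a Chatbot Application as an Anime Information Search Medium Using Regular Expression Pattern Matching (Rancang Bangun Aplikasi Chatbot Sebagai Media Pencarian Informasi Anime Menggunakan Regular Expression Pattern Matching)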

Jurnal ULTIMATICS ◽

10.31937/ti.v9i1.559 ◽

2017 ◽

Vol 9 (1) ◽

pp. 19-24 ◽

Cited By ~ 1

Author(s):

David Domarco ◽

Ni Made Satvika Iswari

Keyword(s):

Information Retrieval ◽

Expression Pattern ◽

Pattern Matching ◽

Language Processing ◽

Regular Expression ◽

Technology Development ◽

Data Retrieval ◽

Index Terms ◽

Retrieval Engine ◽

Behavioral Intention To Use

Technology development has affected many areas of life, especially entertainment. One of the fastest-growing entertainment industries is anime. Anime has evolved into a trend and a hobby, especially in Asia. The number of anime fans grows every year, and fans try to find as much information as possible about their favorite anime. Therefore, this study developed a chatbot application as an anime information retrieval medium using the regular expression pattern matching method. The application is intended to help anime fans search for information about the anime they like. With it, users gain convenient and interactive anime data retrieval that is not available when searching via search engines. The chatbot met the standards of an information retrieval engine with very good results: 72% precision and 100% recall, giving a harmonic mean of 83.7%. As a hedonic application, the chatbot influenced Behavioral Intention to Use by 83% and Immersion by 82%. Index Terms: anime, chatbot, information retrieval, Natural Language Processing (NLP), Regular Expression Pattern Matching
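
The harmonic mean reported in the abstract can be checked from the stated precision and recall. This is a small arithmetic sketch, assuming the harmonic mean is the usual balanced F-measure; the function name `harmonic_mean` is ours, not the paper's.

```python
def harmonic_mean(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract: 72% precision, 100% recall.
f1 = harmonic_mean(0.72, 1.00)
print(f"{f1:.1%}")  # prints 83.7%
```

With perfect recall the score reduces to 2P / (1 + P), so 0.72 precision gives 1.44 / 1.72, which rounds to the 83.7% stated above.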