Neural methods for effective, efficient, and exposure-aware information retrieval

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with query-document vocabulary mismatch problem, by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and to avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017] which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model that enables large-scale precomputation and the use of inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.

Download Full-text

Information Retrieval

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.022 ◽

2016 ◽

Author(s):

Qiaozhu Mei ◽

Dragomir Radev

Keyword(s):

Information Retrieval ◽

Digital Libraries ◽

Web Search ◽

Retrieval System ◽

Information Retrieval System ◽

Information Need ◽

System A ◽

Recent Developments ◽

Text Information ◽

Text Information Retrieval

This chapter is a basic introduction to text information retrieval. Information Retrieval (IR) refers to the activities of obtaining information resources (usually in the form of textual documents) from a much larger collection, which are relevant to an information need of the user (usually expressed as a query). Practical instances of an IR system include digital libraries and Web search engines. This chapter presents the typical architecture of an IR system, an overview of the methods corresponding to the design and the implementation of each major component of an information retrieval system, a discussion of evaluation methods for an IR system, and finally a summary of recent developments and research trends in the field of information retrieval.

Download Full-text

A Combination of Spatial Pyramid and Inverted Index for Large-Scale Image Retrieval

Computer Vision ◽

10.4018/978-1-5225-5204-8.ch054 ◽

2018 ◽

pp. 1307-1321

Author(s):

Vinh-Tiep Nguyen ◽

Thanh Duc Ngo ◽

Minh-Triet Tran ◽

Duy-Dinh Le ◽

Duc Anh Duong

Keyword(s):

Image Retrieval ◽

Large Scale ◽

Spatial Information ◽

Real Life ◽

Inverted Index ◽

Bag Of Words ◽

Visual Words ◽

Benchmark Datasets ◽

Large Scale Image Retrieval ◽

Inverted Indexing

Large-scale image retrieval has been shown remarkable potential in real-life applications. The standard approach is based on Inverted Indexing, given images are represented using Bag-of-Words model. However, one major limitation of both Inverted Index and Bag-of-Words presentation is that they ignore spatial information of visual words in image presentation and comparison. As a result, retrieval accuracy is decreased. In this paper, the authors investigate an approach to integrate spatial information into Inverted Index to improve accuracy while maintaining short retrieval time. Experiments conducted on several benchmark datasets (Oxford Building 5K, Oxford Building 5K+100K and Paris 6K) demonstrate the effectiveness of our proposed approach.

Download Full-text

A Combination of Spatial Pyramid and Inverted Index for Large-Scale Image Retrieval

International Journal of Multimedia Data Engineering and Management ◽

10.4018/ijmdem.2015040103 ◽

2015 ◽

Vol 6 (2) ◽

pp. 37-51 ◽

Cited By ~ 2

Author(s):

Vinh-Tiep Nguyen ◽

Thanh Duc Ngo ◽

Minh-Triet Tran ◽

Duy-Dinh Le ◽

Duc Anh Duong

Keyword(s):

Image Retrieval ◽

Large Scale ◽

Spatial Information ◽

Real Life ◽

Inverted Index ◽

Bag Of Words ◽

Visual Words ◽

Benchmark Datasets ◽

Large Scale Image Retrieval ◽

Inverted Indexing

Download Full-text

Large-scale image search with text for information retrieval

Journal of Innovations in Engineering Education ◽

10.3126/jiee.v4i1.35390 ◽

2021 ◽

Vol 4 (1) ◽

pp. 87-89

Author(s):

Janardan Bhatta

Keyword(s):

Information Retrieval ◽

Language Processing ◽

Large Scale ◽

Image Feature ◽

Image Search ◽

Search Results ◽

Retrieval Systems ◽

Information Retrieval Systems ◽

Text Features ◽

Text Query

Searching images in a large database is a major requirement in Information Retrieval Systems. Expecting image search results based on a text query is a challenging task. In this paper, we leverage the power of Computer Vision and Natural Language Processing in Distributed Machines to lower the latency of search results. Image pixel features are computed based on contrastive loss function for image search. Text features are computed based on the Attention Mechanism for text search. These features are aligned together preserving the information in each text and image feature. Previously, the approach was tested only in multilingual models. However, we have tested it in image-text dataset and it enabled us to search in any form of text or images with high accuracy.

Download Full-text

Application of Naive Bayes Classifier for Information Extraction

10.31219/osf.io/z7q2e ◽

2020 ◽

Author(s):

Nikhil Ranjan Nayak

Keyword(s):

Information Retrieval ◽

Web Search ◽

Naive Bayes ◽

Large Data ◽

Naïve Bayes ◽

Naive Bayes Classifier ◽

Bayes Classifier ◽

Information Need ◽

Naïve Bayes Classifier ◽

Access To Books

Information retrieval (IR) is the activity of obtaining information resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals and other documents; stores and manages those documents. Web Search Engines are the most visible IR applications.It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Download Full-text

Multi-Agent-Based Information Retrieval System Using Information Scent in Query Log Mining for Effective Web Search

Information Retrieval and Management ◽

10.4018/978-1-5225-5191-1.ch012 ◽

2018 ◽

pp. 266-291

Author(s):

Suruchi Chawla

Keyword(s):

Information Retrieval ◽

Web Search ◽

Multi Agent System ◽

Information Need ◽

Agent System ◽

Query Log ◽

Information Scent ◽

Multi Agent ◽

Log Mining ◽

Query Log Mining

This chapter explains the multi-agent system for effective information retrieval using information scent in query log mining. The precision of search results is low due to difficult to infer the information need of the small size search query and therefore information need of the user is not satisfied effectively. Information Scent is used for modeling the information need of user web search session and clustering is performed to identify the similar information need sessions. Hyper Link-Induced Topic Search (HITS) is executed on clusters to generate the Hubs and authorities for web page recommendations to users who search with similar intents. This multi-agent system based on clustered query sessions uses query operations like expansion and recommendation to infer the information need of user search queries and recommends Hubs and authorities for effective web search.

Download Full-text

Tuning to Real-Life Language Statistics: Online Processing of Multiword Sequences in Native and Non-Native Speakers Across Language Registers

10.31234/osf.io/v4tbu ◽

2020 ◽

Author(s):

Elma Kerz ◽

Daniel Wiechmann ◽

Felicity Frinsel ◽

Morten H. Christiansen

Keyword(s):

Language Processing ◽

Large Scale ◽

Native Speakers ◽

Large Body ◽

Real Life ◽

Learning Mechanisms ◽

Online Processing ◽

Real Language ◽

Language Statistics ◽

Language Input

A large body of research over the past two decades has demonstrated that children and adults are equipped with statistical learning mechanisms that facilitate their language processing and boost their acquisition. However, this research has been conducted primarily using artificial languages that are highly simplified relative to real language input. Here, we aimed to determine to what extent adult native and non-native speakers show sensitivity to real-life language statistics obtained from large-scale analyses of authentic language use. Through a within-subject design, we conducted a series of behavioral experiments geared towards assessing the sensitivity to two types of distributional statistics (frequency and entropy) during online processing of multiword sequences across four registers of English (spoken, fiction, news and academic language). Our results show that both native and non-native speakers are able to `tune to' multiple distributional statistics inherent in different types of real language input.

Download Full-text

Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/508 ◽

2020 ◽

Author(s):

Juntao Li ◽

Ruidan He ◽

Hai Ye ◽

Hwee Tou Ng ◽

Lidong Bing ◽

...

Keyword(s):

Language Processing ◽

Large Scale ◽

Language Model ◽

Language Models ◽

Low Resource ◽

Performance Improvements ◽

Domain Specific ◽

High Resource ◽

Significant Performance ◽

Cross Lingual

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements over various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of texts, cross-lingual language models have proven to be effective in leveraging high-resource languages to enhance low-resource language processing and outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting when a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.

Download Full-text

Dagstuhl seminar 19461 on conversational search

ACM SIGIR Forum ◽

10.1145/3451964.3451967 ◽

2020 ◽

Vol 54 (1) ◽

pp. 1-11

Author(s):

Avishek Anand ◽

Lawrence Cavedon ◽

Matthias Hagen ◽

Hideo Joho ◽

Mark Sanderson ◽

...

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Human Computer Interaction ◽

Language Processing ◽

Research Agenda ◽

Web Search ◽

Dialogue Systems ◽

Future Directions ◽

Working Groups

In the week of November 10--15, 2019, 44 researchers from the fields of information retrieval and Web search, natural language processing, human computer interaction, and dialogue systems met for the Dagstuhl Seminar 19461 "Conversational Search" to share the latest development in the area of conversational search and discuss its research agenda and future directions. The clear signal from the seminar is that research opportunities to advance conversational search are available to many areas and that collaboration in an interdisciplinary community is essential to achieve the goals. This report overviews the program and selected findings of the working groups.

Download Full-text