Evolving an Algorithm to Generate Sparse Inverted Index Using Hadoop and Pig

Author(s):  
Sonam Sharma ◽  
Shailendra Singh
2013 ◽  
Vol 34 (11) ◽  
pp. 2561-2567 ◽  
Author(s):  
Xue-yuan Zhang ◽  
Qian-hua He ◽  
Yan-xiong Li ◽  
Wan-ling Ye

2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling the relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from such collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions in a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework for incorporating query term independence [Mitra et al., 2019] into arbitrary deep models, enabling large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
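The query term independence idea mentioned in this abstract can be made concrete with a small sketch. The Python fragment below is an illustrative assumption, not the thesis code: the function names and the toy `term_scorer` callback are invented here. It shows the key consequence of the assumption: when the document score decomposes into a sum of per-term scores, those scores can be precomputed offline and served from an inverted index at query time.

```python
# Minimal sketch of query-term-independent scoring (illustrative names only).
from collections import defaultdict

def build_term_score_index(documents, term_scorer):
    """Precompute s(t, d) for every term of every document.

    `term_scorer(term, doc)` stands in for an arbitrary learned model that
    scores a single query term against a single document.
    """
    index = defaultdict(list)                 # term -> [(doc_id, score), ...]
    for doc_id, doc in documents.items():
        for term in set(doc.split()):
            index[term].append((doc_id, term_scorer(term, doc)))
    return index

def score_query(query, index):
    """Under query term independence, the document score is just the sum of
    precomputed per-term scores pulled from the inverted index."""
    totals = defaultdict(float)
    for term in query.split():
        for doc_id, score in index.get(term, []):
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```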


2020 ◽  
Vol 53 (6) ◽  
pp. 1-36
Author(s):  
Giulio Ermanno Pibiri ◽  
Rossano Venturini

2006 ◽  
Vol 3 (2) ◽  
pp. 153-185 ◽  
Author(s):  
Marcus Fontoura ◽  
Ronny Lempel ◽  
Runping Qi ◽  
Jason Zien

Author(s):  
Yuda Munarko ◽  
Agus Eko Minarno

This work aims to improve search speed in a CBIR system by creating an indexing structure. We utilise an inverted index, a structure usually used in text retrieval, with a modification. The modified inverted index is built from histogram data generated with the Multi Texton Histogram (MTH) and the Multi Texton Co-Occurrence Descriptor (MTCD) over 10,000 images of the Corel dataset. When building the inverted index, we normalise the value of each feature to a real number and keep only feature-value pairs owned by a limited number of images. In our investigation on MTCD histograms with 5,000 test data, we found that by considering histogram variable values owned by at most 12% of the images, the number of comparisons for each query can be reduced by 67.47%, the precision is 82.2%, and the rate of disk access is 32.83%. We name our approach the Histogram Inverted Index (HII).
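As a rough illustration of the indexing idea described in this abstract, the Python sketch below quantizes histogram bins to integer levels, posts feature-value pairs to an inverted index, and keeps only pairs that occur in a bounded fraction of the collection. The quantization scheme, function names, and parameter values are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a histogram inverted index (not the paper's code).
from collections import defaultdict

def build_histogram_inverted_index(histograms, levels=10, max_fraction=0.12):
    """histograms: dict of image_id -> list of real-valued bins in [0, 1]."""
    postings = defaultdict(set)               # (bin_index, level) -> {image_id}
    for image_id, hist in histograms.items():
        for bin_index, value in enumerate(hist):
            level = round(value * (levels - 1))   # quantize to an integer level
            postings[(bin_index, level)].add(image_id)
    limit = max_fraction * len(histograms)
    # Drop overly common pairs: they discriminate poorly and inflate the index.
    return {key: ids for key, ids in postings.items() if len(ids) <= limit}

def candidate_images(query_hist, index, levels=10):
    """Union of postings touched by the query histogram; only these candidates
    need a full histogram-distance comparison, reducing comparisons per query."""
    candidates = set()
    for bin_index, value in enumerate(query_hist):
        level = round(value * (levels - 1))
        candidates |= index.get((bin_index, level), set())
    return candidates
```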


Author(s):  
Budi Yulianto ◽  
Widodo Budiharto ◽  
Iman Herwidiana Kartowisastro

Boolean Retrieval (BR) and the Vector Space Model (VSM) are very popular information retrieval methods for creating an inverted index and querying terms. The BR method returns exact matches for a textual query without ranking the results, whereas the VSM method both searches and ranks the results. This study compares the two methods empirically, using a sample corpus obtained from Reuters. The experimental results show that the times required to produce an inverted index with the two methods are nearly the same; a difference exists, however, in querying the index. The results also show that the number of generated indexes, the sizes of the generated files, and the duration of reading and searching an index are proportional to the number and size of the files in the corpus.
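To make the contrast between the two methods concrete, the sketch below builds one inverted index and queries it both ways: Boolean Retrieval returns an unranked conjunctive match, while a simplified Vector Space Model ranks documents by summed TF-IDF weights. The code is an illustrative assumption, not the study's implementation, and omits document-length normalization for brevity.

```python
# Illustrative comparison of BR and a simplified VSM over one inverted index.
import math
from collections import defaultdict

def build_index(docs):                        # docs: {doc_id: text}
    index = defaultdict(dict)                 # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def boolean_and(query, index):
    """Exact conjunctive match: intersection of postings lists, no ranking."""
    postings = [set(index.get(t, {})) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

def vsm_rank(query, index, n_docs):
    """Rank documents by the sum of TF-IDF contributions of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```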

