Inverted Indexes: Latest Research Papers

Inverted indexes continue to be a mainstay of text search engines, allowing efficient querying of large document collections. While there are a number of possible organizations, document-ordered indexes are the most common, since they are amenable to various query types, support index updates, and allow for efficient dynamic pruning operations. One disadvantage of document-ordered indexes is that high-scoring documents can be distributed across the document identifier space, meaning that index traversal algorithms that terminate early might put search effectiveness at risk. The alternative is impact-ordered indexes, which primarily support top-k disjunctions but also allow for anytime query processing, where the search can be terminated at any time, with search quality improving as processing latency increases. Anytime query processing can be used to effectively reduce high-percentile tail latency, which is essential in operational scenarios where a service level agreement (SLA) imposes response time requirements. In this work, we show how document-ordered indexes can be organized such that they can be queried in an anytime fashion, enabling strict latency control with effective early termination. Our experiments show that processing document-ordered topical segments selected by a simple score estimator outperforms existing anytime algorithms, and allows query runtimes to be accurately limited to comply with SLA requirements.
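The idea of visiting document-ordered segments in order of an estimated score, and stopping when a budget runs out, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `anytime_search`, the segment layout, the estimator (sum of per-term maximum scores), and the use of a posting-count budget as a stand-in for a latency limit are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def anytime_search(segments, query, k, budget):
    """Return up to k (doc_id, score) pairs, best first.

    segments: list of dicts mapping term -> list of (doc_id, score),
              each list sorted by doc_id (a small document-ordered index).
    budget:   maximum number of postings to score; an assumed stand-in
              for a wall-clock latency limit.
    """
    def estimate(seg):
        # Assumed estimator: sum of the segment's per-term maximum scores.
        return sum(max((s for _, s in seg.get(t, [])), default=0.0) for t in query)

    # Visit the most promising segments first.
    order = sorted(range(len(segments)), key=lambda i: estimate(segments[i]), reverse=True)

    scores = defaultdict(float)
    used = 0
    for i in order:
        seg = segments[i]
        cost = sum(len(seg.get(t, [])) for t in query)
        if used + cost > budget:
            break  # budget exhausted: terminate early with what we have
        for t in query:
            for doc_id, s in seg.get(t, []):
                scores[doc_id] += s
        used += cost

    # Ties on score are broken in favour of the smaller doc_id.
    return heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], -kv[0]))
```

With a larger budget more segments are scored, so the result list can only be computed from more evidence; with a smaller budget the search still returns an answer from the highest-estimated segments.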

Toward optimal fingerprint indexing for large scale genomics

10.1101/2021.11.04.467355 ◽ 2021

Author(s): Clement Agret ◽ Bastien Cazaux ◽ Antoine Limasset

Keyword(s): Large Scale ◽ State Of The Art ◽ False Positive Rate ◽ Bacterial Genomes ◽ Genomic Databases ◽ Novel Structure ◽ Fingerprint Indexing ◽ Positive Rate ◽ Inverted Indexes ◽ User Friendly

Motivation: To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable given the growing number of documents to index. Results: We present NIQKI, a novel structure using well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, that can index any kind of fingerprint and provides optimal queries, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a matter of days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe that this approach can lead to tremendous improvement, allowing fast queries that scale to extensive genomic databases. Availability and implementation: We wrote the NIQKI index as an open-source C++ library under the AGPL3 license, available at https://github.com/Malfoy/NIQKI. It is designed as a user-friendly tool and comes with usage samples.
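To make the "inverted index over fingerprints" idea concrete, here is a minimal Python sketch. It is not NIQKI's code and does not implement (h,m)-HMH fingerprints: it assumes a plain one-permutation MinHash with m buckets, and the names `kmers`, `fingerprint`, and `FingerprintIndex` are hypothetical. The point it illustrates is that a query performs m lookups and touches only genomes that share a fingerprint value, rather than scanning every indexed sketch.

```python
import hashlib
from collections import Counter, defaultdict


def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fingerprint(kmer_set, m):
    """One-permutation MinHash (an assumed stand-in for (h,m)-HMH):
    keep the minimum hash value that falls into each of m buckets."""
    sketch = [None] * m
    for kmer in kmer_set:
        h = int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
        bucket, value = h % m, h // m
        if sketch[bucket] is None or value < sketch[bucket]:
            sketch[bucket] = value
    return sketch


class FingerprintIndex:
    def __init__(self, m):
        self.m = m
        # (bucket, fingerprint value) -> list of genome ids holding that value
        self.postings = defaultdict(list)

    def add(self, genome_id, kmer_set):
        for bucket, value in enumerate(fingerprint(kmer_set, self.m)):
            if value is not None:
                self.postings[(bucket, value)].append(genome_id)

    def query(self, kmer_set):
        """Count, per genome, how many buckets match the query's fingerprint."""
        hits = Counter()
        for bucket, value in enumerate(fingerprint(kmer_set, self.m)):
            if value is not None:
                hits.update(self.postings.get((bucket, value), ()))
        return hits
```

The returned counts can be divided by the number of non-empty query buckets to obtain a rough similarity estimate; genomes that share no fingerprint value never appear in the result.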

Efficient Indexing of Top-k Entities in Systems of Engagement with Extensions for Geo-tagged Entities

Data Science and Engineering ◽ 10.1007/s41019-021-00173-1 ◽ 2021

Author(s): Anirban Mondal ◽ Ayaan Kakkar ◽ Nilesh Padhariya ◽ Mukesh Mohania

Keyword(s): Response Times ◽ Recall Performance ◽ Single Parent ◽ Management Systems ◽ Enterprise Management ◽ Memory Consumption ◽ Tree Index ◽ Synthetic Datasets ◽ Inverted Indexes ◽ Candidate Set

Abstract: Next-generation enterprise management systems are beginning to be developed based on the Systems of Engagement (SOE) model. We visualize an SOE as a set of entities. Each entity is modeled by a single parent document with dynamic embedded links (i.e., child documents) that contain multi-modal information about the entity from various networks. Since entities in an SOE are generally queried using keywords, our goal is to efficiently retrieve the top-k entities related to a given keyword-based query by considering the relevance scores of both their parent and child documents. Furthermore, we extend the aforementioned problem to incorporate the case where the entities are geo-tagged. The main contributions of this work are four-fold. First, it proposes an efficient bitmap-based approach for quickly identifying the candidate set of entities whose parent documents contain all queried keywords. A variant of this approach is also proposed to reduce memory consumption by exploiting skews in keyword popularity. Second, it proposes the two-tier HI-tree index, which uses both hashing and inverted indexes, for efficient document relevance score lookups. Third, it proposes an R-tree-based approach to extend the aforementioned approaches to the case where the entities are geo-tagged. Fourth, it performs comprehensive experiments with both real and synthetic datasets to demonstrate that the proposed schemes are effective in providing good top-k result recall performance within acceptable query response times.
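The bitmap-based candidate step described above (find entities whose parent documents contain all queried keywords) can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's data structure: the class name `KeywordBitmaps`, its methods, and the use of Python integers as bitsets indexed by entity id are assumptions, and the memory-saving variant and the HI-tree are not shown.

```python
from collections import defaultdict


class KeywordBitmaps:
    def __init__(self):
        # keyword -> integer used as a bitset; bit i is set when the parent
        # document of entity i contains the keyword
        self.bitmaps = defaultdict(int)

    def add_parent(self, entity_id, keywords):
        for kw in keywords:
            self.bitmaps[kw] |= 1 << entity_id

    def candidates(self, query_keywords):
        """Entity ids whose parent document contains every query keyword."""
        query_keywords = list(query_keywords)
        if not query_keywords:
            return []
        result = -1  # all bits set, so the first AND yields the first bitmap
        for kw in query_keywords:
            result &= self.bitmaps.get(kw, 0)
            if result == 0:
                return []  # no entity can match; stop early
        return [i for i in range(result.bit_length()) if (result >> i) & 1]
```

Only the entities in the returned candidate set would then need relevance score lookups for their parent and child documents.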

Learning Passage Impacts for Inverted Indexes

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval ◽ 10.1145/3404835.3463030 ◽ 2021

Author(s): Antonio Mallia ◽ Omar Khattab ◽ Torsten Suel ◽ Nicola Tonellotto

Keyword(s): Inverted Indexes

Semantic Search in Offshore Engineering With Linguistics And Neural Processing Pipelines

10.1115/omae2021-62979 ◽ 2021

Author(s): Flavio Jaime Pol Gonçalves ◽ Vinicius Cleves de Oliveira Carmo ◽ Vinicius Toquetti de Melo ◽ Rodrigo da Silva Cunha ◽ Ismael H. F. Santos ◽ ...

Keyword(s): Neural Networks ◽ Semantic Search ◽ Viscous Damping ◽ Neural Processing ◽ Pipeline Architecture ◽ Offshore Engineering ◽ Semantic Techniques ◽ Inverted Indexes ◽ Under Construction ◽ Future Work

Abstract: This paper presents a computing pipeline architecture for semantic search in the domain of Offshore Engineering. The proposed system combines modules such as a document retriever, a passage retriever, and an answer extractor to produce textual responses to natural-language queries such as: “What FPSO motion is mostly affected by viscous damping?” Such responses are often needed in Offshore Engineering activities, and linguistic techniques such as those based on inverted indexes with a syntactic focus tend to perform poorly. Instead, this research explores semantic techniques that take into account the meaning of words in the domain of Offshore Engineering. This paper describes a Linguistic QA pipeline architecture that provides a way to retrieve answers instantly from a collection of 13,000 unstructured technical documents about Offshore Engineering, and reports the results achieved and future work. This paper also presents additional modules under construction that exploit neural-network and ontology-based approaches for semantic search in the domain of Offshore Engineering.

Relevance ranking for proximity full-text search based on additional indexes with multi-component keys

Vestnik Udmurtskogo Universiteta. Matematika. Mekhanika. Komp'yuternye Nauki ◽ 10.35634/vm210110 ◽ 2021 ◽ Vol 31 (1) ◽ pp. 132-148

Author(s): A.B. Veretennikov

Keyword(s): Full Text ◽ Search Query ◽ Text Search ◽ Relevance Ranking ◽ Full Text Search ◽ Search Results ◽ Search Speed ◽ Speed Up ◽ Inverted Indexes

The problem of proximity full-text search is considered. If a search query contains frequently occurring words, then multi-component key indexes improve search speed in comparison with ordinary inverted indexes. It has been shown that search speed can be increased by up to 130 times when queries consist of frequently occurring words. In this paper, we investigate how the multi-component key index architecture affects the quality of the search. We consider several well-known relevance ranking methods proposed by different authors. Using these methods, we perform the search first in the ordinary inverted index and then in the index enhanced with multi-component key indexes. The results show that, with multi-component key indexes, we obtain search results that are very close in terms of relevance ranking to those obtained with ordinary inverted indexes.
