inverted indexes
Recently Published Documents


TOTAL DOCUMENTS: 42 (FIVE YEARS 10)

H-INDEX: 7 (FIVE YEARS 2)

2022 ◽  
Vol 40 (1) ◽  
pp. 1-32
Author(s):  
Joel Mackenzie ◽  
Matthias Petri ◽  
Alistair Moffat

Inverted indexes continue to be a mainstay of text search engines, allowing efficient querying of large document collections. While there are a number of possible organizations, document-ordered indexes are the most common, since they are amenable to various query types, support index updates, and allow for efficient dynamic pruning operations. One disadvantage of document-ordered indexes is that high-scoring documents can be distributed across the document identifier space, meaning that index traversal algorithms that terminate early might put search effectiveness at risk. The alternative is impact-ordered indexes, which primarily support top-k disjunctions but also allow for anytime query processing, where the search can be terminated at any time, with search quality improving as processing latency increases. Anytime query processing can be used to effectively reduce high-percentile tail latency, which is essential in operational scenarios where a service level agreement (SLA) imposes response time requirements. In this work, we show how document-ordered indexes can be organized such that they can be queried in an anytime fashion, enabling strict latency control with effective early termination. Our experiments show that processing document-ordered topical segments selected by a simple score estimator outperforms existing anytime algorithms, and allows query runtimes to be accurately limited to comply with SLA requirements.
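The idea of visiting document-ordered segments in order of estimated usefulness can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-segment estimator (sum of per-term maximum scores), and the use of a posting count as a stand-in for a latency budget are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def build_segments(docs, segment_of):
    """docs: {doc_id: {term: score}}; segment_of: {doc_id: segment_id} (hypothetical inputs)."""
    segments = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):  # postings stay document-ordered inside each segment
        for term, score in docs[doc_id].items():
            segments[segment_of[doc_id]][term].append((doc_id, score))
    return segments


def estimate(segment, query):
    """Assumed estimator: upper bound on any document score in the segment."""
    return sum(max((s for _, s in segment.get(t, [])), default=0.0) for t in query)


def anytime_top_k(segments, query, k, posting_budget):
    """Score segments in decreasing estimate order until the budget is spent."""
    order = sorted(segments, key=lambda sid: estimate(segments[sid], query), reverse=True)
    scores = {}
    processed = 0
    for sid in order:
        if processed >= posting_budget:  # budget is checked only between segments
            break
        acc = defaultdict(float)
        for t in query:
            for doc_id, s in segments[sid].get(t, []):
                acc[doc_id] += s
                processed += 1
        scores.update(acc)  # each document lives in exactly one segment
    # Highest score first; ties broken by smaller document id.
    return heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], -kv[0]))
```

Raising `posting_budget` lets more segments be scored, so result quality can only improve as the budget grows, which is the anytime property the abstract describes.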


2021 ◽  
Author(s):  
Clement Agret ◽  
Bastien Cazaux ◽  
Antoine Limasset

Motivation: To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable given the growing number of documents to index. Results: We present NIQKI, a novel structure using well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, that is able to index any kind of fingerprint and that provides optimal queries, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a matter of days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe that this approach can lead to tremendous improvements, allowing fast queries that scale to extensive genomic databases. Availability and implementation: We wrote the NIQKI index as an open-source C++ library under the AGPL3 license, available at https://github.com/Malfoy/NIQKI. It is designed as a user-friendly tool and comes along with usage samples.
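A rough sketch of why an inverted index over fingerprints avoids reading the whole index: each (bucket, value) pair of a sketch becomes a key, so a query only touches the lists that actually match. The code below uses a plain partitioned MinHash as a stand-in, not NIQKI's (h,m)-HMH fingerprints, and every name and parameter in it is hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fingerprint(seq, k=4, m=8):
    """Partitioned MinHash (assumed stand-in): smallest hash seen in each of m buckets."""
    sketch = [None] * m
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        bucket, value = h % m, h // m
        if sketch[bucket] is None or value < sketch[bucket]:
            sketch[bucket] = value
    return sketch


def build_index(genomes, k=4, m=8):
    """Inverted index: (bucket, value) -> names of genomes with that fingerprint entry."""
    index = defaultdict(list)
    for name, seq in genomes.items():
        for bucket, value in enumerate(fingerprint(seq, k, m)):
            if value is not None:
                index[(bucket, value)].append(name)
    return index


def query(index, seq, k=4, m=8):
    """Count shared fingerprint entries per genome; work is linear in the matches found."""
    hits = Counter()
    for bucket, value in enumerate(fingerprint(seq, k, m)):
        if value is not None:
            hits.update(index.get((bucket, value), []))
    return hits
```

The hit count per genome, divided by the number of non-empty buckets, gives a crude similarity estimate; genomes sharing no fingerprint entry with the query are never visited.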


Author(s):  
Anirban Mondal ◽  
Ayaan Kakkar ◽  
Nilesh Padhariya ◽  
Mukesh Mohania

Next-generation enterprise management systems are beginning to be developed based on the Systems of Engagement (SOE) model. We visualize an SOE as a set of entities. Each entity is modeled by a single parent document with dynamic embedded links (i.e., child documents) that contain multi-modal information about the entity from various networks. Since entities in an SOE are generally queried using keywords, our goal is to efficiently retrieve the top-k entities related to a given keyword-based query by considering the relevance scores of both their parent and child documents. Furthermore, we extend the aforementioned problem to incorporate the case where the entities are geo-tagged. The main contributions of this work are four-fold. First, it proposes an efficient bitmap-based approach for quickly identifying the candidate set of entities whose parent documents contain all queried keywords. A variant of this approach is also proposed to reduce memory consumption by exploiting skews in keyword popularity. Second, it proposes the two-tier HI-tree index, which uses both hashing and inverted indexes, for efficient document relevance score lookups. Third, it proposes an R-tree-based approach to extend the aforementioned approaches for the case where the entities are geo-tagged. Fourth, it performs comprehensive experiments with both real and synthetic datasets to demonstrate that the proposed schemes are indeed effective in providing good top-k result recall performance within acceptable query response times.
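The bitmap-based candidate step can be pictured as one bitmap per keyword, with bit i set when entity i's parent document contains that keyword; a bitwise AND across the query keywords leaves exactly the entities containing all of them. The sketch below uses Python integers as bitmaps and hypothetical function names; it does not reproduce the paper's memory-saving variant or the HI-tree.

```python
def build_bitmaps(parent_docs):
    """parent_docs: list of keyword sets, indexed by entity id (assumed layout)."""
    bitmaps = {}
    for entity_id, keywords in enumerate(parent_docs):
        for kw in keywords:
            bitmaps[kw] = bitmaps.get(kw, 0) | (1 << entity_id)
    return bitmaps


def candidates(bitmaps, query_keywords, num_entities):
    """Entity ids whose parent document contains every queried keyword."""
    result = (1 << num_entities) - 1  # start with all entities as candidates
    for kw in query_keywords:
        result &= bitmaps.get(kw, 0)  # an unknown keyword clears every bit
        if result == 0:
            break
    return [i for i in range(num_entities) if (result >> i) & 1]
```

For example, with parent documents `[{"cloud", "storage"}, {"cloud"}, {"cloud", "storage", "backup"}]`, the query `["cloud", "storage"]` keeps entities 0 and 2.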


2021 ◽  
Author(s):  
Flavio Jaime Pol Gonçalves ◽  
Vinicius Cleves de Oliveira Carmo ◽  
Vinicius Toquetti de Melo ◽  
Rodrigo da Silva Cunha ◽  
Ismael H. F. Santos ◽  
...  

This paper presents a computing pipeline architecture for semantic search in the domain of Offshore Engineering. The proposed system combines modules such as a document retriever, a passage retriever, and an answer extractor to produce textual responses to natural-language queries such as: “What FPSO motion is mostly affected by viscous damping?” Such responses are often needed in Offshore Engineering activities, and linguistic techniques such as those based on inverted indexes with a syntactic focus tend to perform poorly. Instead, this research explores semantic techniques that take into account the meaning of words in the domain of Offshore Engineering. This paper describes a linguistic QA pipeline architecture that provides a way to retrieve answers instantly from a collection of 13,000 unstructured technical documents about Offshore Engineering, and reports the results achieved and future work. This paper also presents additional modules under construction that exploit neural network and ontology approaches for semantic search in the domain of Offshore Engineering.


Author(s):  
A.B. Veretennikov

The problem of proximity full-text search is considered. If a search query contains frequently occurring words, then multi-component key indexes improve the search speed in comparison with ordinary inverted indexes. It was shown that the search speed can be increased by up to 130 times when queries consist of frequently occurring words. In this paper, we investigate how the multi-component key index architecture affects the quality of the search. We consider several well-known relevance ranking methods proposed by different authors. Using these methods, we perform the search in the ordinary inverted index and then in the index that is enhanced with multi-component key indexes. The results show that with multi-component key indexes we obtain search results that are very close, in terms of relevance ranking, to the search results obtained by means of ordinary inverted indexes.
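To make the contrast with an ordinary inverted index concrete, here is a heavily simplified two-component illustration: besides the usual word-to-postings map, pairs of frequent words that occur within a fixed distance of each other get their own key, so a proximity query on two frequent words becomes a single lookup rather than an intersection of two long posting lists. The structure, names, and the `max_distance` parameter are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def build_indexes(docs, frequent, max_distance):
    """Build an ordinary index and a two-component key index over frequent words."""
    ordinary = defaultdict(list)    # word -> [(doc_id, position)]
    pair_index = defaultdict(list)  # (w1, w2) sorted -> [(doc_id, pos_i, pos_j)], pos_i < pos_j
    for doc_id, text in enumerate(docs):
        words = text.split()
        for i, w in enumerate(words):
            ordinary[w].append((doc_id, i))
            if w not in frequent:
                continue
            # Pair this frequent word with frequent words up to max_distance ahead.
            for j in range(i + 1, min(i + max_distance + 1, len(words))):
                if words[j] in frequent:
                    key = tuple(sorted((w, words[j])))
                    pair_index[key].append((doc_id, i, j))
    return ordinary, pair_index


def proximity_search(pair_index, w1, w2):
    """Documents where two frequent words occur within max_distance of each other."""
    postings = pair_index.get(tuple(sorted((w1, w2))), [])
    return sorted({doc_id for doc_id, _, _ in postings})
```

The trade-off in this sketch is index size: every nearby pair of frequent words is stored explicitly in exchange for shorter lists at query time.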


Author(s):  
Joel Mackenzie ◽  
Antonio Mallia ◽  
Matthias Petri ◽  
J. Shane Culpepper ◽  
Torsten Suel
