Smaller Compressed Suffix Arrays†

2020 ◽  
Author(s):  
Ekaterina Benza ◽  
Shmuel T Klein ◽  
Dana Shapira

Abstract An alternative to compressed suffix arrays is introduced, based on representing a sequence of integers using Fibonacci encodings, thereby reducing the space requirements of state-of-the-art implementations of the suffix array, while retaining the searching functionalities. Empirical tests support the theoretical space complexity improvements and show that there is no deterioration in the processing times.

2021 ◽  
Vol 25 (2) ◽  
pp. 283-303
Author(s):  
Na Liu ◽  
Fei Xie ◽  
Xindong Wu

Approximate multi-pattern matching is an important issue that is widely and frequently utilized, when the pattern contains variable-length wildcards. In this paper, two suffix array-based algorithms have been proposed to solve this problem. Suffix array is an efficient data structure for exact string matching in existing studies, as well as for approximate pattern matching and multi-pattern matching. An algorithm called MMSA-S is for the short exact characters in a pattern by dynamic programming, while another algorithm called MMSA-L deals with the long exact characters by the edit distance method. Experimental results of Pizza & Chili corpus demonstrate that these two newly proposed algorithms, in most cases, are more time-efficient than the state-of-the-art comparison algorithms.


2021 ◽  
Vol 14 (11) ◽  
pp. 2599-2612
Author(s):  
Nikolaos Tziavelis ◽  
Wolfgang Gatterbauer ◽  
Mirek Riedewald

We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k , the k top-ranked answers are returned in O ( n polylog n + k log k ) time. This is within a polylogarithmic factor of O ( n + k log k ), i.e., the best known complexity for equi-joins, and even of O ( n + k ), i.e., the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called "free-connex" queries and those that use bag semantics). Remarkably, they hold even when the number of join results is n ℓ for a join of ℓ relations. The key ingredient is a novel O ( n polylog n )-size factorized representation of the query output , which is constructed on-the-fly for a given query and database. In addition to providing the first nontrivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.


2013 ◽  
Vol Vol. 15 no. 2 (Discrete Algorithms) ◽  
Author(s):  
Pablo Barenbaum ◽  
Verónica Becher ◽  
Alejandro Deymonnaz ◽  
Melisa Halsband ◽  
Pablo Ariel Heiber

Discrete Algorithms International audience We consider two repeat finding problems relative to sets of strings: (a) Find the largest substrings that occur in every string of a given set; (b) Find the maximal repeats in a given string that occur in no string of a given set. Our solutions are based on the suffix array construction, requiring O(m) memory, where m is the length of the longest input string, and O(n &log;m) time, where n is the the whole input size (the sum of the length of each string in the input). The most expensive part of our algorithms is the computation of several suffix arrays. We give an implementation and experimental results that evidence the efficiency of our algorithms in practice, even for very large inputs.


10.29007/t3jg ◽  
2018 ◽  
Author(s):  
David Cerna ◽  
Wolfgang Schreiner

In previous work we presented an algorithmic procedure for analysing the space complexity of monitor specifications written in a fragment of predicate logic. These monitor specifications were developed for runtime monitoring of event streams. Our procedure provides accurate results for a large fragment of the possible specifications, but overestimates the space complexity of precisely those specifications which are more likely to be found in real world applications. Experiments hinted at a relationship between the extent our procedure over-approximates the space requirements of a specification and the quantifier structure of the specification. In this paper we provide a formalization of this relationship as approximation ratios, and are able to pinpoint ``good'' constructions, that is specifications using less memory. These results are first steps towards categorizing specifications based on memory efficiency.


2012 ◽  
Vol 263-266 ◽  
pp. 1398-1401
Author(s):  
Song Feng Lu ◽  
Hua Zhao

Document retrieval is the basic task of search engines, and seize amount of attention by the pattern matching community. In this paper, we focused on the dynamic version of this problem, in which the text insertion and deletion is allowable. By using the generalized suffix array and other data structure, we proposed a new index structure. Our scheme achieved better time complexity than the existing ones, and a bit more space overhead is needed as return.


2014 ◽  
Vol 14 (4-5) ◽  
pp. 509-524 ◽  
Author(s):  
ROBERTO AMADINI ◽  
MAURIZIO GABBRIELLI ◽  
JACOPO MAURO

AbstractWithin the context of constraint solving, a portfolio approach allows one to exploit the synergy between different solvers in order to create a globally better solver. In this paper we present SUNNY: a simple and flexible algorithm that takes advantage of a portfolio of constraint solvers in order to compute — without learning an explicit model — a schedule of them for solving a given Constraint Satisfaction Problem (CSP). Motivated by the performance reached by SUNNY vs. different simulations of other state of the art approaches, we developedsunny-csp, an effective portfolio solver that exploits the underlying SUNNY algorithm in order to solve a given CSP. Empirical tests conducted on exhaustive benchmarks of MiniZinc models show that the actual performance ofsunny-cspconforms to the predictions. This is encouraging both for improving the power of CSP portfolio solvers and for trying to export them to fields such as Answer Set Programming and Constraint Logic Programming.


2021 ◽  
Vol 11 (2) ◽  
pp. 283-302
Author(s):  
Paul Meurer

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as they are implemented in several popular corpus search engines are less than optimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from edges with lowest corpus counts. The implementation of the lexicon as suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.


Author(s):  
Robinson Jimenez-Moreno ◽  
Astrid Rubiano Fonseca ◽  
Jose Luis Ramirez

This paper exposes the use of recent deep learning techniques in the state of the art, little addressed in robotic applications, where a new algorithm based on Faster R-CNN and CNN regression is exposed. The machine vision systems implemented, tend to require multiple stages to locate an object and allow a robot to take it, increasing the noise in the system and the processing times. The convolutional networks based on regions allow one to solve this problem, it is used for it two convolutional architectures, one for classification and location of three types of objects and one to determine the grip angle for a robotic gripper. Under the establish virtual environment, the grip algorithm works up to 5 frames per second with a 100% object classification, and with the implementation of the Faster R-CNN, it allows obtain 100% accuracy in the classifications of the test database, and over a 97% of average precision locating the generated boxes in each element, gripping successfully the objects.


Author(s):  
Subhajit Sanfui ◽  
Deepak Sharma

Abstract This paper presents an efficient strategy to perform the assembly stage of finite element analysis (FEA) on general-purpose graphics processing units (GPU). This strategy involves dividing the assembly task by using symbolic and numeric kernels, and thereby reducing the complexity of the standard single-kernel assembly approach. Two sparse storage formats based on the proposed strategy are also developed by modifying the existing sparse storage formats with the intention of removing the degrees of freedom-based redundancies in the global matrix. The inherent problem of race condition is resolved through the implementation of coloring and atomics. The proposed strategy is compared with the state-of-the-art GPU-based and CPU-based assembly techniques. These comparisons reveal a significant number of benefits in terms of reducing storage space requirements and execution time and increasing performance (GFLOPS). Moreover, using the proposed strategy, it is found that the coloring method is more effective compared to the atomics-based method for the existing as well as the modified storage formats.


2009 ◽  
Vol 20 (06) ◽  
pp. 1109-1133 ◽  
Author(s):  
JIE LIN ◽  
YUE JIANG ◽  
DON ADJEROH

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.


Sign in / Sign up

Export Citation Format

Share Document