Efficient repeat finding in sets of strings via suffix arrays

2013 ◽  
Vol. 15 no. 2 (Discrete Algorithms) ◽  
Author(s):  
Pablo Barenbaum ◽  
Verónica Becher ◽  
Alejandro Deymonnaz ◽  
Melisa Halsband ◽  
Pablo Ariel Heiber

We consider two repeat finding problems relative to sets of strings: (a) Find the largest substrings that occur in every string of a given set; (b) Find the maximal repeats in a given string that occur in no string of a given set. Our solutions are based on the suffix array construction, requiring O(m) memory, where m is the length of the longest input string, and O(n log m) time, where n is the whole input size (the sum of the lengths of the strings in the input). The most expensive part of our algorithms is the computation of several suffix arrays. We give an implementation and experimental results that evidence the efficiency of our algorithms in practice, even for very large inputs.
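As a rough illustration of why suffix arrays suit repeat finding, the sketch below builds a suffix array by naive sorting and reads off the longest repeated substring of a single string from the common prefixes of lexicographically adjacent suffixes. It is not the authors' algorithm and does not achieve the stated O(m) memory or O(n log m) time; the function names `suffix_array`, `common_prefix_len`, and `longest_repeat` are hypothetical.

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order.

    Naive construction by sorting whole suffixes; fine for short strings only.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_repeat(s):
    """Return one longest substring that occurs at least twice in s.

    Any repeated substring is a common prefix of two suffixes, and the
    longest such prefix is found between neighbours in the suffix array.
    """
    sa = suffix_array(s)
    best = ""
    for i, j in zip(sa, sa[1:]):
        k = common_prefix_len(s[i:], s[j:])
        if k > len(best):
            best = s[i:i + k]
    return best
```

Problems (a) and (b) in the abstract concern sets of strings, which this single-string sketch does not cover.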

2020 ◽  
Author(s):  
Ekaterina Benza ◽  
Shmuel T Klein ◽  
Dana Shapira

An alternative to compressed suffix arrays is introduced, based on representing a sequence of integers using Fibonacci encodings, thereby reducing the space requirements of state-of-the-art implementations of the suffix array, while retaining the searching functionalities. Empirical tests support the theoretical space complexity improvements and show that there is no deterioration in the processing times.
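For readers unfamiliar with Fibonacci encodings, the following minimal sketch shows the standard Fibonacci code of a positive integer: its Zeckendorf representation written from the smallest Fibonacci number upward, followed by a terminating 1 bit. The abstract does not describe how the paper lays out suffix array entries, so this only illustrates the integer code itself; `fib_encode` and `fib_decode_stream` are hypothetical names.

```python
def fib_encode(n):
    """Return the Fibonacci codeword of a positive integer n as a bit string."""
    if n < 1:
        raise ValueError("Fibonacci codes are defined for positive integers")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number that exceeds n
    bits = []
    for f in reversed(fibs):  # greedy choice yields the Zeckendorf representation
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # Least significant Fibonacci number first, then the terminating 1.
    return "".join(reversed(bits)) + "1"


def fib_decode_stream(bits):
    """Decode a concatenation of Fibonacci codewords into a list of integers."""
    out = []
    value, cur, nxt, prev = 0, 1, 2, "0"
    for b in bits:
        if b == "1" and prev == "1":
            # Two consecutive 1s mark the end of a codeword.
            out.append(value)
            value, cur, nxt, prev = 0, 1, 2, "0"
            continue
        if b == "1":
            value += cur
        cur, nxt = nxt, cur + nxt
        prev = b
    return out
```

Because no codeword contains two adjacent 1s before its terminator, codewords can be concatenated and decoded without separators.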


2015 ◽  
Vol. 17 no. 1 (Discrete Algorithms) ◽  
Author(s):  
Gregory R. Maloney

A method is described for constructing, with computer assistance, planar substitution tilings that have n-fold rotational symmetry. This method uses as prototiles the set of rhombs with angles that are integer multiples of π/n, and includes various special cases that have already been constructed by hand for low values of n. An example constructed by this method for n = 11 is exhibited; this is the first substitution tiling with elevenfold symmetry appearing in the literature.


2015 ◽  
Vol. 17 no. 1 (Discrete Algorithms) ◽  
Author(s):  
Hossein Ghasemalizadeh ◽  
Mohammadreza Razzazi

In this paper we devise some output sensitive algorithms for a problem where a set of n points and a positive integer, m, are given and the goal is to cover a maximal number of these points with m disks. We introduce a parameter, ρ, as the maximum number of points that one disk can cover and we analyse the algorithms based on this parameter. First, we solve the problem for m = 1 in O(nρ) time, which improves the previous O(n^2) time algorithm for this problem. Then we solve the problem for m = 2 in O(nρ + ρ^3 log ρ) time, which improves the previous O(n^3 log n) algorithm for this problem. Our algorithms outperform the previous algorithms because ρ is much smaller than n in many cases. Finally, we extend the algorithm to any value of m and solve the problem in O(mnρ + (mρ)^(2m-1) log mρ) time. The previous algorithm for this problem runs in O(n^(2m-1) log n) time, and our algorithm usually runs faster because mρ is smaller than n in many cases. We obtain output sensitive algorithms by confining the areas that we should search for the result. The techniques used in this paper may be applicable in other covering problems to obtain faster algorithms.
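To make the m = 1 case concrete, here is a brute-force baseline, assuming all disks share a fixed radius r (the abstract does not state this, so it is an assumption). It relies on the standard observation that a best disk can be moved until it is either centred on an input point or has two input points on its boundary. It runs in O(n^3) time and is not the output sensitive algorithm of the paper; the name `max_points_in_disk` is hypothetical.

```python
import math


def max_points_in_disk(points, r):
    """Return the largest number of points coverable by one disk of radius r.

    Brute force over candidate centres: each input point, and the two centres
    of radius-r circles passing through each pair of points.
    """
    eps = 1e-9  # tolerance for points lying exactly on the boundary

    def count(cx, cy):
        return sum(1 for (x, y) in points if math.hypot(x - cx, y - cy) <= r + eps)

    # Disks centred on input points (also handles coincident points).
    best = max((count(x, y) for (x, y) in points), default=0)

    for i, (x1, y1) in enumerate(points):
        for (x2, y2) in points[i + 1:]:
            d = math.hypot(x2 - x1, y2 - y1)
            if d == 0 or d > 2 * r:
                continue  # no radius-r circle passes through both points
            mx, my = (x1 + x2) / 2, (y1 + y2) / 2
            h = math.sqrt(max(r * r - (d / 2) ** 2, 0.0))
            ux, uy = -(y2 - y1) / d, (x2 - x1) / d  # unit normal to the segment
            for sign in (1, -1):
                best = max(best, count(mx + sign * h * ux, my + sign * h * uy))
    return best
```

The value returned for a point set is exactly the parameter ρ of the abstract under the fixed-radius assumption.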


2015 ◽  
Vol. 17 no. 1 (Discrete Algorithms) ◽  
Author(s):  
Sergio Cabello ◽  
Maria Saumell

We present a randomized algorithm to compute a clique of maximum size in the visibility graph G of the vertices of a simple polygon P. The input of the problem consists of the visibility graph G, a Hamiltonian cycle describing the boundary of P, and a parameter δ ∈ (0,1) controlling the probability of error of the algorithm. The algorithm does not require the coordinates of the vertices of P. With probability at least 1-δ the algorithm runs in O((|E(G)|^2 / ω(G)) log(1/δ)) time and returns a maximum clique, where ω(G) is the number of vertices in a maximum clique in G. A deterministic variant of the algorithm takes O(|E(G)|^2) time and always outputs a maximum size clique. This compares well to the best previous algorithm by Ghosh et al. (2007) for the problem, which is deterministic and runs in O(|V(G)|^2 |E(G)|) time.


2012 ◽  
Vol 263-266 ◽  
pp. 1398-1401
Author(s):  
Song Feng Lu ◽  
Hua Zhao

Document retrieval is the basic task of search engines and has received a considerable amount of attention from the pattern matching community. In this paper, we focus on the dynamic version of this problem, in which text insertions and deletions are allowed. Using the generalized suffix array and other data structures, we propose a new index structure. Our scheme achieves better time complexity than the existing ones, at the cost of slightly more space overhead.


2021 ◽  
Vol 11 (2) ◽  
pp. 283-302
Author(s):  
Paul Meurer

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as they are implemented in several popular corpus search engines are less than optimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from edges with lowest corpus counts. The implementation of the lexicon as suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.


2013 ◽  
Vol. 15 no. 1 (Discrete Algorithms) ◽  
Author(s):  
Andrew R. Curtis ◽  
Min Chih Lin ◽  
Ross M. Mcconnell ◽  
Yahav Nussbaum ◽  
Francisco Juan Soulignac ◽  
...  

We give a linear-time algorithm that checks for isomorphism between two 0-1 matrices that obey the circular-ones property. Our algorithm is similar to the isomorphism algorithm for interval graphs of Lueker and Booth, but works on PC trees, which are unrooted and have a cyclic nature, rather than with PQ trees, which are rooted. This algorithm leads to linear-time isomorphism algorithms for related graph classes, including Helly circular-arc graphs, Γ circular-arc graphs, proper circular-arc graphs and convex-round graphs.


2014 ◽  
Vol. 16 no. 3 (Discrete Algorithms) ◽  
Author(s):  
Konstanty Junosza-Szaniawski ◽  
Pawel Rzazewski

The generalized list T-coloring is a common generalization of many graph coloring models, including classical coloring, L(p,q)-labeling, channel assignment and T-coloring. Every vertex from the input graph has a list of permitted labels. Moreover, every edge has a set of forbidden differences. We ask for a labeling of vertices of the input graph with natural numbers, in which every vertex gets a label from its list of permitted labels and the difference of labels of the endpoints of each edge does not belong to the set of forbidden differences of this edge. In this paper we present an exact algorithm solving this problem, running in time O*((τ+2)^n), where τ is the maximum forbidden difference over all edges of the input graph and n is the number of its vertices. Moreover, we show how to improve this bound if the input graph has some special structure, e.g. a bounded maximum degree, no big induced stars or a perfect matching.
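The problem statement above can be sketched as a small validity check plus an exhaustive search. This is only a baseline that tries every combination of list entries, not the O*((τ+2)^n) algorithm of the paper. It assumes that "difference" means absolute difference, and the names `is_valid_labeling` and `brute_force_t_coloring` as well as the dictionary-based input format are hypothetical.

```python
from itertools import product


def is_valid_labeling(labels, lists, forbidden):
    """Check a labeling against permitted lists and forbidden differences.

    labels:    dict vertex -> chosen label
    lists:     dict vertex -> set of permitted labels
    forbidden: dict (u, v) -> set of forbidden differences for edge uv
    Assumption: differences are taken as absolute values.
    """
    if any(labels[v] not in lists[v] for v in lists):
        return False
    return all(
        abs(labels[u] - labels[v]) not in diffs
        for (u, v), diffs in forbidden.items()
    )


def brute_force_t_coloring(lists, forbidden):
    """Return some valid labeling, or None if none exists (exhaustive search)."""
    vertices = sorted(lists)
    for choice in product(*(sorted(lists[v]) for v in vertices)):
        labels = dict(zip(vertices, choice))
        if is_valid_labeling(labels, lists, forbidden):
            return labels
    return None
```

Classical list coloring corresponds to giving every edge the forbidden set {0}.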


2009 ◽  
Vol DMTCS Proceedings vol. AK,... (Proceedings) ◽  
Author(s):  
Konstantinos Panagiotou

This work is devoted to the study of typical properties of random graphs from classes with structural constraints, such as planar graphs, with the additional restriction that the average degree is fixed. More precisely, within a general analytic framework, we provide sharp concentration results for the number of blocks (maximal biconnected subgraphs) in a random graph from the class in question. Among other results, we discover that essentially such a random graph belongs with high probability to only one of two possible types: it either has blocks of at most logarithmic size, or there is a giant block that contains linearly many vertices, and all other blocks are significantly smaller. This extends and generalizes the results in the previous work [K. Panagiotou and A. Steger. Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '09), pp. 432-440, 2009], where similar statements were shown without the restriction on the average degree.


2012 ◽  
Vol. 14 no. 1 (Discrete Algorithms) ◽  
Author(s):  
Zbigniew Lonc ◽  
Pawel Naroski

By an Euler walk in a 3-uniform hypergraph H we mean an alternating sequence v_0, ε_1, v_1, ε_2, v_2, ..., v_{m-1}, ε_m, v_m of vertices and edges in H such that each edge of H appears in this sequence exactly once and v_{i-1}, v_i ∈ ε_i, v_{i-1} ≠ v_i, for every i = 1, 2, ..., m. This concept is a natural extension of the graph theoretic notion of an Euler walk to the case of 3-uniform hypergraphs. We say that a 3-uniform hypergraph H is strongly connected if it has no isolated vertices and for each two edges e and f in H there is a sequence of edges starting with e and ending with f such that each two consecutive edges in this sequence have two vertices in common. In this paper we give an algorithm that constructs an Euler walk in a strongly connected 3-uniform hypergraph (it is known that such a walk in such a hypergraph always exists). The algorithm runs in time O(m), where m is the number of edges in the input hypergraph.
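The definition of an Euler walk can be turned directly into a checker. The sketch below only verifies a proposed walk against the definition; it does not construct one and is unrelated to the O(m) algorithm of the paper. It assumes H has no repeated edges and that vertices are mutually comparable (for example, integers); the name `is_euler_walk` is hypothetical.

```python
def is_euler_walk(vertices, walk_edges, hyperedges):
    """Check the Euler walk definition for a 3-uniform hypergraph.

    vertices:   the sequence v_0, ..., v_m
    walk_edges: the sequence of edges e_1, ..., e_m, each a set of 3 vertices
    hyperedges: all edges of H (assumed distinct)
    """
    m = len(walk_edges)
    if len(vertices) != m + 1:
        return False
    if any(len(set(e)) != 3 for e in hyperedges):
        return False  # H is not 3-uniform
    # Each edge of H must appear in the walk exactly once.
    if sorted(sorted(e) for e in walk_edges) != sorted(sorted(e) for e in hyperedges):
        return False
    for i in range(1, m + 1):
        prev_v, cur_v, edge = vertices[i - 1], vertices[i], walk_edges[i - 1]
        # Consecutive vertices must be distinct members of the edge between them.
        if prev_v == cur_v or prev_v not in edge or cur_v not in edge:
            return False
    return True
```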

