Pattern Matching on Grammar-Compressed Strings in Linear Time

Let [Formula: see text] be a string, with symbols from an alphabet. [Formula: see text] is said to be degenerate if for some positions, say [Formula: see text], [Formula: see text] can contain a subset of symbols from the symbol alphabet, rather than just one symbol. Given a text string [Formula: see text] and a pattern [Formula: see text], both with symbols from an alphabet [Formula: see text], the degenerate string matching problem is to find the positions in [Formula: see text] where [Formula: see text] occurs, where [Formula: see text], [Formula: see text], or both are allowed to be degenerate. Though some algorithms have been proposed, their huge computational cost poses a significant challenge to their practical use. In this work, we propose IDPM, an improved degenerate pattern matching algorithm based on an extension of the Boyer–Moore algorithm. In the preprocessing phase, the algorithm defines an alphabet-independent compatibility rule and computes the shift arrays using the respective variants of the bad character and good suffix heuristics. In the search phase, IDPM improves the matching speed by using the compatibility rule. On average, the proposed IDPM algorithm has linear time complexity with respect to the text size and to the overall size of the pattern. IDPM demonstrates significant performance improvements over state-of-the-art approaches. It can be used for fast practical degenerate pattern matching on large data sizes, with important applications in flexible and scalable searching of huge biological sequences.
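
To make the problem statement above concrete, here is a minimal Python sketch of degenerate matching. It is not the IDPM algorithm: it is a naive quadratic scan with no Boyer–Moore shifts, and it assumes the common compatibility rule that two positions match when their symbol sets intersect. The bracket notation and the names `parse`, `compatible`, and `degenerate_find` are hypothetical, chosen only for illustration.

```python
def parse(s):
    """Parse a string such as 'A[CG]T' into a list of symbol sets.

    Bracket notation is an assumption of this sketch: '[CG]' denotes a
    degenerate position that may hold either 'C' or 'G'.
    """
    out, i = [], 0
    while i < len(s):
        if s[i] == "[":
            j = s.index("]", i)          # closing bracket of this position
            out.append(frozenset(s[i + 1:j]))
            i = j + 1
        else:
            out.append(frozenset(s[i]))  # ordinary single-symbol position
            i += 1
    return out


def compatible(a, b):
    """Assumed compatibility rule: the two symbol sets share a member."""
    return not a.isdisjoint(b)


def degenerate_find(text, pattern):
    """Return every start position where pattern matches text.

    Naive O(n * m) scan over all alignments; text and pattern are lists
    of symbol sets, and either may contain degenerate positions.
    """
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if all(compatible(text[i + j], pattern[j]) for j in range(m))
    ]
```

For example, `degenerate_find(parse("ACGTAGT"), parse("A[CG]T"))` reports the single occurrence starting at position 4, and a degenerate text works the same way: `degenerate_find(parse("A[CT]GT"), parse("ACG"))` reports position 0.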

Linear Time Distances Between Fuzzy Sets With Applications to Pattern Matching and Classification

IEEE Transactions on Image Processing ◽

10.1109/tip.2013.2286904 ◽

2014 ◽

Vol 23 (1) ◽

pp. 126-136 ◽

Cited By ~ 17

Author(s):

Joakim Lindblad ◽

Natasa Sladoje

Keyword(s):

Fuzzy Sets ◽

Pattern Matching ◽

Linear Time

Permutation Pattern matching in (213, 231)-avoiding permutations

Discrete Mathematics & Theoretical Computer Science ◽

10.46298/dmtcs.1329 ◽

2017 ◽

Vol. 18 no. 2, Permutation... (Permutation Patterns) ◽

Author(s):

Both Neou ◽

Romeo Rizzi ◽

Stéphane Vialette

Keyword(s):

Pattern Matching ◽

MONI: A Pangenomics Index for Finding MEMs

10.1101/2021.07.06.451246 ◽

2021 ◽

Author(s):

Massimiliano Rossi ◽

Marco Oliva ◽

Ben Langmead ◽

Travis Gagie ◽

Christina Boucher

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Repetitive Sequences ◽

Major Advance ◽

Human Chromosomes ◽

Time And Space ◽

Index Construction ◽

Approximate Pattern Matching ◽

Human Genomes ◽

Novel Algorithm

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared to other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references. Availability: MONI is publicly available at https://github.com/maxrossi91/moni.

On the relationship between histogram indexing and block-mass indexing

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2013.0132 ◽

2014 ◽

Vol 372 (2016) ◽

pp. 20130132 ◽

Cited By ~ 3

Author(s):

Amihood Amir ◽

Ayelet Butman ◽

Ely Porat

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Matching Problem ◽

Open Problems ◽

Time Solution ◽

Text Length ◽

Active Research ◽

Histogram Indexing ◽

The Relationship ◽

Mass Pattern

Histogram indexing, also known as jumbled pattern indexing and permutation indexing, is one of the important current open problems in pattern matching. It was introduced about 6 years ago and has seen active research since. Yet, to date there is no algorithm that can preprocess a text T in time o(|T|^2 / polylog |T|) and achieve histogram indexing, even over a binary alphabet, in time independent of the text length. The pattern matching version of this problem has a simple linear-time solution. The block-mass pattern matching problem is a recently introduced problem, motivated by issues in mass spectrometry. It is also an example of a pattern matching problem that has an efficient, almost linear-time solution but whose indexing version is daunting. However, for fixed finite alphabets, some progress has been made. In this paper, a strong connection between the histogram indexing problem and the block-mass pattern indexing problem is shown. The reduction we show between the two problems is amazingly simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than in the complexity of the reduction. In addition, we show that for both these problems, even over unbounded alphabets, there are algorithms that preprocess a text T in time o(|T|^2 / polylog |T|) and enable answering indexing queries in time polynomial in the query length. The contributions of this paper are twofold: (i) we introduce the idea of allowing a trade-off between the preprocessing time and query time of various indexing problems that have been stumbling blocks in the literature; (ii) we take the first step in introducing a class of indexing problems that, we believe, cannot be preprocessed in time o(|T|^2 / polylog |T|) and still enable linear-time query processing.
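
The "simple linear-time solution" for the pattern matching (non-indexing) version is usually a sliding window over symbol counts. The sketch below is one such formulation in Python, not code from the paper; the function name `jumbled_find` is hypothetical, and the linear bound assumes constant expected time per hash-table update.

```python
from collections import Counter


def jumbled_find(text, pattern):
    """Return start positions of windows of text that are permutations
    of pattern, i.e. windows with the same symbol histogram."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # diff[c] = (count of c in current window) - (count of c in pattern)
    diff = Counter(text[:m])
    diff.subtract(pattern)
    # Number of symbols whose counts currently disagree.
    nonzero = sum(1 for v in diff.values() if v != 0)

    hits = [0] if nonzero == 0 else []
    for i in range(m, n):
        # Slide the window: text[i] enters, text[i - m] leaves.
        for ch, delta in ((text[i], 1), (text[i - m], -1)):
            before = diff[ch]
            diff[ch] = before + delta
            if before == 0:
                nonzero += 1      # this symbol just fell out of balance
            elif diff[ch] == 0:
                nonzero -= 1      # this symbol just came back into balance
        if nonzero == 0:
            hits.append(i - m + 1)
    return hits
```

For instance, `jumbled_find("abcabba", "ab")` returns `[0, 3, 5]`: the windows "ab", "ab", and "ba". Each step touches only the entering and leaving symbols, which is why no per-window histogram comparison is needed.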

SHOCK: A Worst-Case Ensured Sub-Linear Time Pattern Matching Algorithm for Inline Anti-Virus Scanning

2010 IEEE International Conference on Communications ◽

10.1109/icc.2010.5501986 ◽

2010 ◽

Cited By ~ 1

Author(s):

N.-F. Huang ◽

W.-Y. Tsai

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Worst Case ◽

Time Pattern ◽

Matching Algorithm ◽

Pattern Matching Algorithm

Fast Partial Evaluation of Pattern Matching in Strings

BRICS Report Series ◽

10.7146/brics.v10i20.21790 ◽

2003 ◽

Vol 10 (20) ◽

Author(s):

Mads Sig Ager ◽

Olivier Danvy ◽

Henning Korsholm Rohde

Keyword(s):

Pattern Matching ◽

Open Problem ◽

Linear Time ◽

Partial Evaluation

We show how to obtain all of Knuth, Morris, and Pratt's linear-time string matcher by partial evaluation of a quadratic-time string matcher with respect to a pattern string. Although it has been known for 15 years how to obtain this linear matcher by partial evaluation of a quadratic one, how to obtain it in linear time has remained an open problem. Obtaining a linear matcher by partial evaluation of a quadratic one is achieved by performing its backtracking at specialization time and memoizing its results. We show (1) how to rewrite the source matcher such that its static intermediate computations can be shared at specialization time and (2) how to extend the memoization capabilities of a partial evaluator to static functions. Such an extended partial evaluator, if its memoization is implemented efficiently, specializes the rewritten source matcher in linear time. Supersedes BRICS-RS-03-11 and is superseded by BRICS-RS-04-40.
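
For reference, the linear-time matcher that the abstract takes as its target can be written by hand as follows. This is a sketch of the standard Knuth-Morris-Pratt algorithm in Python, not the output of a partial evaluator and not the paper's specialization technique; the names `failure_table` and `kmp_find` are hypothetical. The failure table plays the role that the memoized backtracking results play in the specialized program.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find(text, pattern):
    """Return all start positions of pattern in text in O(n + m) time."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = failure_table(pattern)
    hits, k = [], 0                  # k = number of pattern symbols matched
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]          # reuse earlier work instead of rescanning text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]          # keep going to find overlapping matches
    return hits
```

For example, `kmp_find("abababa", "aba")` returns `[0, 2, 4]`, including the overlapping occurrences. The text index never moves backwards, which is the property a quadratic backtracking matcher lacks.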

A Note on Linear Time Simulation of Deterministic Two-Way Pushdown Automata

DAIMI Report Series ◽

10.7146/dpb.v6i75.6492 ◽

1977 ◽

Vol 6 (75) ◽

Author(s):

Neil D. Jones

Keyword(s):

Pattern Matching ◽

Data Structures ◽

Linear Time ◽

Random Access ◽

Simulation Algorithm ◽

Pushdown Automaton ◽

Matching Problems ◽

Time Result ◽

Time Simulation ◽

Pushdown Automata

Cook has shown that any deterministic two-way pushdown automaton can be simulated by a uniform-cost random access machine in time O(n) for inputs of length n. The result was of interest because such a machine is a natural model for a variety of backtracking algorithms, particularly as used in pattern matching problems. The linear time result was surprising because such machines may run as many as 2^n steps before halting; similar problems with 'combinatorial explosions' are well known to occur in applications of backtracking. Cook's result inspired the development of a number of efficient pattern matching algorithms. However, it is impractical to use Cook's algorithm directly to do pattern matching, since it involves a large constant time factor and much storage. The purpose of this note is to present an alternate, simpler simulation algorithm which involves consideration only of the configurations actually reached by the automaton. It can be expected to run faster and use less storage (depending on the data structures used), thus bringing Cook's result a step closer to practical utility.
