Linear Approximate Pattern Matching Algorithm

Mapping Intimacies ◽

10.21203/rs.3.rs-1021063/v1 ◽

2021 ◽

Author(s):

Anas Al-okaily ◽

Abdelghani Tbakhi

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Search Costs ◽

Exact Matching ◽

Time And Space ◽

Matching Problem ◽

Approximate Matching ◽

Large Length ◽

Reference Stream ◽

Inexact Matching

Abstract: Pattern matching is a fundamental process in almost every scientific domain. The problem involves finding the positions of a given pattern (usually of short length) in a reference stream of data (usually of large length). Matching can be exact or approximate (inexact). Exact matching searches for the pattern without allowing mismatches (or insertions and deletions) of one or more characters in the pattern, while approximate matching tolerates them. For exact matching, several data structures that can be built in linear time and space are in practical use today. For approximate matching, the proposed solutions are non-linear and currently impractical. In this paper, we designed and implemented a structure that can be built in linear time and space and that solves the approximate matching problem in $O(m + \frac{\log_\Sigma^k n}{k!} + occ)$ search cost, where m is the length of the pattern, n is the length of the reference, and k is the number of tolerated mismatches (and insertions and deletions).
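To make the difference between exact and approximate matching concrete, here is a minimal brute-force sketch that reports every position where the pattern matches with at most k substitutions. It is not the linear-space structure described in the abstract, and it ignores insertions and deletions; the function name `k_mismatch_positions` is hypothetical.

```python
def k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Return start positions where pattern matches text with at most k substitutions.

    Brute force, O(n * m) time; k = 0 gives exact matching.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for offset in range(m):
            if text[start + offset] != pattern[offset]:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches, give up on this alignment
        if mismatches <= k:
            hits.append(start)
    return hits
```

With k = 0 the function behaves as an exact matcher; raising k admits alignments that differ from the pattern in up to k characters.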


IDPM: An Improved Degenerate Pattern Matching Algorithm for Biological Sequences

International Journal of Foundations of Computer Science ◽

10.1142/s0129054117500307 ◽

2017 ◽

Vol 28 (07) ◽

pp. 889-914

Author(s):

Jie Lin ◽

Yue Jiang ◽

E. James Harner ◽

Bing-Hua Jiang ◽

Don Adjeroh

Keyword(s):

Performance Improvement ◽

Pattern Matching ◽

Linear Time ◽

Computational Cost ◽

Large Data ◽

Biological Sequences ◽

Matching Problem ◽

Practical Utilization ◽

Matching Algorithm ◽

Pattern Matching Algorithm

Let [Formula: see text] be a string, with symbols from an alphabet. [Formula: see text] is said to be degenerate if for some positions, say [Formula: see text], [Formula: see text] can contain a subset of symbols from the symbol alphabet, rather than just one symbol. Given a text string [Formula: see text] and a pattern [Formula: see text], both with symbols from an alphabet [Formula: see text], the degenerate string matching problem is to find positions in [Formula: see text] where [Formula: see text] occurred, such that [Formula: see text], [Formula: see text], or both are allowed to be degenerate. Though some algorithms have been proposed, their huge computational cost poses a significant challenge to their practical utilization. In this work, we propose IDPM, an improved degenerate pattern matching algorithm based on an extension of the Boyer–Moore algorithm. In the preprocessing phase, the algorithm defines an alphabet-independent compatibility rule and computes the shift arrays using respective variants of the bad character and good suffix heuristics. In the search phase, IDPM improves the matching speed by using the compatibility rule. On average, the proposed IDPM algorithm has linear time complexity with respect to the text size and to the overall size of the pattern. IDPM demonstrates significant performance improvement over state-of-the-art approaches. It can be used in fast practical degenerate pattern matching with large data sizes, with important applications in flexible and scalable searching of huge biological sequences.
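The degenerate matching problem can be illustrated with a naive scan. The sketch below assumes a bracket notation such as `A[CG]T` for degenerate positions and assumes that two positions are compatible when their symbol sets intersect; both are illustrative choices, not IDPM's actual compatibility rule, and no Boyer–Moore shifts are used. All function names are hypothetical.

```python
def to_degenerate(s: str) -> list[frozenset[str]]:
    """Parse 'A[CG]T' into [{'A'}, {'C', 'G'}, {'T'}] (bracket syntax is an assumption)."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "[":
            j = s.index("]", i)  # closing bracket of this degenerate position
            out.append(frozenset(s[i + 1 : j]))
            i = j + 1
        else:
            out.append(frozenset(s[i]))
            i += 1
    return out


def compatible(a: frozenset[str], b: frozenset[str]) -> bool:
    """Assumed rule: two positions match if they share at least one symbol."""
    return not a.isdisjoint(b)


def degenerate_find(text: str, pattern: str) -> list[int]:
    """Naive scan over all alignments; text, pattern, or both may be degenerate."""
    t, p = to_degenerate(text), to_degenerate(pattern)
    m = len(p)
    return [
        i
        for i in range(len(t) - m + 1)
        if all(compatible(t[i + k], p[k]) for k in range(m))
    ]
```

For example, the pattern `A[CG]G` matches both `ACG` and `AGG` in a plain text, and a degenerate text position such as `[CT]` is compatible with a plain `T` in the pattern.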


Permutation Pattern matching in (213, 231)-avoiding permutations

Discrete Mathematics & Theoretical Computer Science ◽

10.46298/dmtcs.1329 ◽

2017 ◽

Vol. 18 no. 2, Permutation... (Permutation Patterns) ◽

Author(s):

Both Neou ◽

Romeo Rizzi ◽

Stéphane Vialette

Keyword(s):

Pattern Matching ◽

MONI: A Pangenomics Index for Finding MEMs

10.1101/2021.07.06.451246 ◽

2021 ◽

Author(s):

Massimiliano Rossi ◽

Marco Oliva ◽

Ben Langmead ◽

Travis Gagie ◽

Christina Boucher

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Repetitive Sequences ◽

Major Advance ◽

Human Chromosomes ◽

Time And Space ◽

Index Construction ◽

Approximate Pattern Matching ◽

Human Genomes ◽

Novel Algorithm

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared to other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references. Availability: MONI is publicly available at https://github.com/maxrossi91/moni.
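As a rough illustration of what a MEM query returns, the sketch below finds maximal exact matches between a read and a reference by brute force. It assumes the usual definition (a substring of the read that occurs in the reference and cannot be extended left or right while still occurring there). It uses plain substring search rather than the r-index, thresholds, or PFP, so it only shows the output, not how MONI computes it. The function names are hypothetical.

```python
def matching_lengths(read: str, ref: str) -> list[int]:
    """lengths[i] = length of the longest prefix of read[i:] that occurs in ref."""
    lengths = []
    for i in range(len(read)):
        length = 0
        # Extend to the right while the extended substring still occurs in ref.
        while i + length < len(read) and read[i : i + length + 1] in ref:
            length += 1
        lengths.append(length)
    return lengths


def find_mems(read: str, ref: str, min_len: int = 1) -> list[tuple[int, int]]:
    """Return (start, length) pairs of maximal exact matches of read against ref."""
    lengths = matching_lengths(read, ref)
    mems = []
    for i, length in enumerate(lengths):
        # Right-maximal by construction; left-maximal if the match starting one
        # position earlier does not already cover this one.
        left_maximal = i == 0 or lengths[i - 1] <= length
        if length >= min_len and left_maximal:
            mems.append((i, length))
    return mems
```

For the read `TTACATTTC` and reference `GATTACAGATTTCA`, the two MEMs are `TTACA` (read offset 0) and `ATTTC` (read offset 4).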


On the relationship between histogram indexing and block-mass indexing

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2013.0132 ◽

2014 ◽

Vol 372 (2016) ◽

pp. 20130132 ◽

Cited By ~ 3

Author(s):

Amihood Amir ◽

Ayelet Butman ◽

Ely Porat

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Matching Problem ◽

Open Problems ◽

Time Solution ◽

Text Length ◽

Active Research ◽

Histogram Indexing ◽

The Relationship ◽

Mass Pattern

Histogram indexing, also known as jumbled pattern indexing and permutation indexing, is one of the important current open problems in pattern matching. It was introduced about 6 years ago and has seen active research since. Yet, to date there is no algorithm that can preprocess a text T in time $o(|T|^2/\mathrm{polylog}|T|)$ and achieve histogram indexing, even over a binary alphabet, in time independent of the text length. The pattern matching version of this problem has a simple linear-time solution. The block-mass pattern matching problem is a recently introduced problem, motivated by issues in mass spectrometry. It is also an example of a pattern matching problem that has an efficient, almost linear-time solution but whose indexing version is daunting. However, for fixed finite alphabets, some progress has been made. In this paper, a strong connection between the histogram indexing problem and the block-mass pattern indexing problem is shown. The reduction we show between the two problems is amazingly simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than in the complexity of the reduction. In addition, we show that for both these problems, even over unbounded alphabets, there are algorithms that preprocess a text T in time $o(|T|^2/\mathrm{polylog}|T|)$ and enable answering indexing queries in time polynomial in the query length. The contributions of this paper are twofold: (i) we introduce the idea of allowing a trade-off between the preprocessing time and query time of various indexing problems that have been stumbling blocks in the literature; (ii) we take the first step in introducing a class of indexing problems that, we believe, cannot be preprocessed in time $o(|T|^2/\mathrm{polylog}|T|)$ and enable linear-time query processing.
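The simple linear-time solution for the pattern matching (non-indexing) version can be sketched with a sliding window of character counts. This is a standard technique, not code from the paper; the function name `jumbled_occurrences` is hypothetical, and the running time is linear only if comparing two histograms is treated as a constant-time step (fixed alphabet).

```python
from collections import Counter


def jumbled_occurrences(text: str, pattern: str) -> list[int]:
    """Return start positions of windows whose character histogram equals the pattern's."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    need = Counter(pattern)
    window = Counter(text[:m])
    hits = [0] if window == need else []
    for end in range(m, len(text)):
        window[text[end]] += 1      # character entering the window
        left = text[end - m]        # character leaving the window
        window[left] -= 1
        if window[left] == 0:
            del window[left]        # drop zero counts so equality stays exact
        if window == need:
            hits.append(end - m + 1)
    return hits
```

For instance, searching `cbaebabacd` for any permutation of `abc` reports the windows `cba` and `bac`.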


Two Dimensional Matching

Pattern Matching Algorithms ◽

10.1093/oso/9780195113679.003.0012 ◽

1997 ◽

Author(s):

A. Amir ◽

M. Farach

Keyword(s):

Pattern Matching ◽

String Matching ◽

Higher Dimensions ◽

Natural Generalization ◽

Theoretical Problem ◽

Two Dimensional ◽

Exact Matching ◽

Matching Problem ◽

Deterministic Algorithms ◽

Special Case

String matching is a basic theoretical problem in computer science, but has been useful in implementing various text editing tasks. The explosion of multimedia requires an appropriate generalization of string matching to higher dimensions. The first natural generalization is that of seeking the occurrences of a pattern in a text where both pattern and text are rectangles. The last few years saw tremendous activity in two dimensional pattern matching algorithms. We naturally had to limit the amount of information that entered this chapter. We chose to concentrate on serial deterministic algorithms for some of the basic issues of two dimensional matching. Throughout this chapter we define our problems in terms of squares rather than rectangles; however, all results presented easily generalize to rectangles. The Exact Two Dimensional Matching Problem is defined as follows: . . . INPUT: Text array T[n x n] and pattern array P[m x m]. OUTPUT: All locations [i,j] in T where there is an occurrence of P, i.e. T[i+k, j+l] = P[k+1, l+1] for 0 ≤ k, l ≤ m-1. . . . A natural way of solving any generalized problem is by reducing it to a special case whose solution is known. It is therefore not surprising that most solutions to the two dimensional exact matching problem use exact string matching algorithms in one way or another. In this section, we present an algorithm for two dimensional matching which relies on reducing a matrix of characters into a one dimensional array. Let P'[1..m] be a pattern which is derived from P by setting P'[i] = P[i,1]P[i,2]…P[i,m], that is, the ith character of P' is the ith row of P. Let T_i[1..n-m+1], for 1 ≤ i ≤ n, be a set of arrays such that T_i[j] = T[i,j]T[i,j+1]…T[i,j+m-1]. Clearly, P occurs at T[i,j] iff P' occurs at T_i[j], that is, iff P'[k+1] = T_{i+k}[j] for all 0 ≤ k ≤ m-1.
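The reduction to one dimension can be sketched directly: treat each pattern row as a single symbol, and for every column offset j compare the pattern rows against the column of length-m row substrings. This is a naive illustration with 0-based indices, assuming the text and pattern are given as lists of equal-length strings. It does not use the efficient string matcher a real solution would plug in, and the function name `match_2d` is hypothetical.

```python
def match_2d(text: list[str], pattern: list[str]) -> list[tuple[int, int]]:
    """Return all 0-based (row, col) positions where the m x m pattern occurs in the n x n text."""
    n, m = len(text), len(pattern)
    hits = []
    for j in range(n - m + 1):
        # column[i] plays the role of T_i[j]: row i restricted to columns j..j+m-1.
        column = [text[i][j : j + m] for i in range(n)]
        # One-dimensional matching of P' (the list of pattern rows) against the column.
        for i in range(n - m + 1):
            if column[i : i + m] == pattern:
                hits.append((i, j))
    return sorted(hits)
```

Replacing the inner loop with a linear-time exact matcher over the row symbols is where the one-dimensional algorithms mentioned above would come in.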


INDEXING GAPPED-FACTORS USING A TREE

International Journal of Foundations of Computer Science ◽

10.1142/s0129054108005541 ◽

2008 ◽

Vol 19 (01) ◽

pp. 71-87 ◽

Cited By ~ 3

Author(s):

PIERRE PETERLONGO ◽

JULIEN ALLALI ◽

MARIE-FRANCE SAGOT

Keyword(s):

Data Structure ◽

Pattern Matching ◽

Suffix Tree ◽

Linear Time ◽

Fixed Size ◽

Specific Kind ◽

Time And Space ◽

Inference Problems

We present a data structure to index a specific kind of factor (that is, of substring) called a gapped-factor. A gapped-factor is a factor containing a gap that is ignored during indexing. The data structure presented is based on the suffix tree and indexes all the gapped-factors of a text with a fixed gap size, and only those. The construction of this data structure is done online in linear time and space. Such a data structure may play an important role in various pattern matching and motif inference problems, for instance in text filtration.
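To illustrate what indexing gapped-factors with a fixed gap size means, the sketch below builds a plain dictionary keyed by the two flanking substrings while skipping the gap. It is a hash-map stand-in, not the suffix-tree-based structure of the paper, and the parameters `left_len`, `gap`, and `right_len` as well as the function name are assumptions made for the example.

```python
from collections import defaultdict


def index_gapped_factors(
    text: str, left_len: int, gap: int, right_len: int
) -> dict[tuple[str, str], list[int]]:
    """Map each (left, right) pair separated by exactly `gap` ignored characters to its start positions."""
    index: dict[tuple[str, str], list[int]] = defaultdict(list)
    span = left_len + gap + right_len
    for start in range(len(text) - span + 1):
        left = text[start : start + left_len]
        # The `gap` characters after `left` are skipped and never stored.
        right = text[start + left_len + gap : start + span]
        index[(left, right)].append(start)
    return dict(index)
```

A query for `AC`, any one character, then `T` is a single dictionary lookup on the key `("AC", "T")`.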


SIGMA: A SET-COVER-BASED INEXACT GRAPH MATCHING ALGORITHM

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001000477x ◽

2010 ◽

Vol 08 (02) ◽

pp. 199-218 ◽

Cited By ~ 47

Author(s):

MISAEL MONGIOVÌ ◽

RAFFAELE DI NATALE ◽

ROSALBA GIUGNO ◽

ALFREDO PULVIRENTI ◽

ALFREDO FERRO ◽

...

Keyword(s):

Graph Matching ◽

Isomorphism Problem ◽

Set Cover ◽

Exact Matching ◽

Matching Problem ◽

Set Cover Problem ◽

Graph Indexing ◽

Growing Domain ◽

Indexing Method ◽

Inexact Matching

Network querying is a growing domain with vast applications ranging from screening compounds against a database of known molecules to matching sub-networks across species. Graph indexing is a powerful method for searching a large database of graphs. Most graph indexing methods to date tackle the exact matching (isomorphism) problem, limiting their applicability to specific instances in which such matches exist. Here we provide a novel graph indexing method to cope with the more general, inexact matching problem. Our method, SIGMA, builds on approximating a variant of the set-cover problem that concerns overlapping multi-sets. We extensively test our method and compare it to a baseline method and to the state-of-the-art Grafil. We show that SIGMA outperforms both, providing higher pruning power in all the tested scenarios.
