Pattern Matching Algorithms
Published by Oxford University Press
ISBN: 9780195113679, 9780197561133

Author(s):  
K. Zhang ◽  
D. Shasha

Most of this book is about stringology, the study of strings. So why this chapter on trees? Why not graphs or geometry or something else? First, trees generalize strings in a very direct sense: a string is simply a tree with a single leaf. This has the unsurprising consequence that many of our algorithms specialize to strings and the happy consequence that some of those algorithms are as efficient as the best string algorithms. From the point of view of “treeology”, there is an additional pragmatic advantage to this relationship between trees and strings: some techniques from strings carry over to trees, e.g., suffix trees, and others show promise, though we know of no work that exploits them. So, treeology provides a good example area for applications of stringologic techniques. Second, some of our friends in stringology may wonder whether there is some easy reduction that can take any tree edit problem, map it to strings, solve it in the string domain, and then map it back. We don’t believe there is, because, as you will see, tree editing seems inherently to have more data dependence than string editing. (Specifically, the dynamic programming approach to string editing is always a local operation depending on the left, upper, and upper left neighbor of a cell. In tree editing, the upper left neighbor is usually irrelevant; instead, the relevant cell depends on the tree topology.) That is a belief, not a theorem, so we would like to state right at the outset the key open problem of treeology: can all tree edit problems on ordered trees (trees where the order among the siblings matters) be reduced efficiently to string edit problems and back again? The rest of this chapter proceeds on the assumption that this question has a negative answer. In particular, we discuss the best known algorithms for tree editing and several variations having to do with subtree removal, variable-length don’t cares, and alignment. We discuss both sequential and parallel algorithms.
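To make the contrast concrete, here is a minimal sketch (ours, not the chapter's) of the string edit distance dynamic program alluded to in the parenthetical: each cell depends only on its left, upper, and upper-left neighbors, which is exactly the locality that tree editing lacks.

```python
# Minimal sketch of the string edit distance DP: each cell depends only
# on its left, upper, and upper-left neighbors.
def string_edit_distance(x: str, y: str) -> int:
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # upper: delete x[i-1]
                          d[i][j - 1] + 1,          # left: insert y[j-1]
                          d[i - 1][j - 1] + sub)    # upper-left: match/substitute
    return d[m][n]
```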


Author(s):  
R. Giancarlo ◽  
R. Grossi

We discuss the suffix tree generalization to matrices in this chapter. We extend the suffix tree notion (described in Chapter 3) from text strings to text matrices whose entries are taken from an ordered alphabet, with the aim of solving pattern-matching problems. This suffix tree generalization can be efficiently used to implement low-level routines for Computer Vision, Data Compression, Geographic Information Systems and Visual Databases. We examine the submatrices that are contiguous parts of the text and still have a matrix shape. Representing these text submatrices as “suitably formatted” strings stored in a compacted trie is the rationale behind suffix trees for matrices. The choice of the format inevitably influences suffix tree construction time and space complexity. We first deal with square matrices and show that many suffix tree families can be defined for the same input matrix according to the matrix’s string representations. We can store each suffix tree in linear space and give an efficient construction algorithm whose input is both the matrix and the chosen string representation. We then treat rectangular matrices and define their corresponding suffix trees by means of some general rules which we list formally. We show that there is a super-linear lower bound on the space required (in contrast with the linear space required by suffix trees for square matrices). We give a simple example of one of these suffix trees. The last part of the chapter illustrates some technical results regarding suffix trees for square matrices: we show how to achieve an expected linear-time suffix tree construction for a constant-size alphabet under some mild probabilistic assumptions about the input distribution. We begin by defining a wide class of string representations for square matrices. We let Σ denote an ordered alphabet of characters and introduce another alphabet of five special characters, called shapes. A shape is one of the special characters taken from the set {IN, SW, NW, SE, NE}. Shape IN encodes the 1×1 matrix generated from the empty matrix by creating a square.
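Purely as an illustration of what a string representation of a square matrix might look like, here is a hedged sketch in the spirit of the shape alphabet: it grows the matrix from its 1×1 top-left corner (shape IN) by repeatedly appending the L-shaped layer that extends the square toward the south-east (shape SE). The chapter's actual format rules are more general, so treat this function and its layer order as our assumption, not the book's definition.

```python
# Hypothetical linearization of a square matrix of characters: start from
# the 1x1 top-left submatrix (shape IN) and append south-east L-shaped
# layers (shape SE), each consisting of the new column cells followed by
# the new row cells.
def linearize_se(matrix):
    n = len(matrix)
    chunks = [("IN", matrix[0][0])]
    for k in range(1, n):
        layer = [matrix[i][k] for i in range(k)] + \
                [matrix[k][j] for j in range(k + 1)]
        chunks.append(("SE", "".join(layer)))
    return chunks

# Example: linearize_se([["a", "b"], ["c", "d"]])
# -> [("IN", "a"), ("SE", "bcd")]
```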


Author(s):  
R. Giancarlo

In this chapter we present some general algorithmic techniques that have proved useful in speeding up the computation of some families of dynamic programming recurrences with applications in sequence alignment, paragraph formation and prediction of RNA secondary structure. The material presented in this chapter is related to the computation of Levenshtein distances and approximate string matching, which have been discussed in the previous three chapters. Dynamic programming is a general technique for solving discrete optimization (minimization or maximization) problems that can be represented by decision processes and for which the principle of optimality holds. We can view a decision process as a directed graph in which nodes represent the states of the process and edges represent decisions. The optimization problem at hand is represented as a decision process by decomposing it into a set of subproblems of smaller size. Such recursive decomposition is continued until we get only trivial subproblems, which can be solved directly. Each node in the graph corresponds to a subproblem and each edge (a, b) indicates that one way to solve subproblem a optimally is to first solve subproblem b optimally. Then, an optimal solution, or policy, is typically given by a path on the graph that minimizes or maximizes some objective function. The correctness of this approach is guaranteed by the principle of optimality, which must be satisfied by the optimization problem: an optimal policy has the property that whatever the initial node (state) and initial edge (decision) are, the remaining edges (decisions) must constitute an optimal policy with regard to the node (state) resulting from the first transition. Another consequence of the principle of optimality is that we can express the optimal cost (and solution) of a subproblem in terms of optimal costs (and solutions) of problems of smaller size. That is, we can express optimal costs through a recurrence relation. This is a key component of dynamic programming, since we can compute the optimal cost of a subproblem only once, store the result in a table, and look it up when needed.
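The decision-process view above amounts to a cheapest-path computation over a DAG of states. A minimal sketch (ours, for illustration) of that generic step, with states supplied in topological order:

```python
# States are DAG nodes, decisions are edges, and an optimal policy is a
# cheapest path. Processing states in topological order solves each
# subproblem exactly once, as the principle of optimality allows.
def cheapest_policy(nodes, edges, source):
    # nodes: list of states in topological order
    # edges: {u: [(v, cost), ...]} -- decisions available at state u
    best = {u: float("inf") for u in nodes}
    best[source] = 0
    parent = {}
    for u in nodes:                      # each subproblem handled once
        for v, c in edges.get(u, []):
            if best[u] + c < best[v]:    # relax: principle of optimality
                best[v] = best[u] + c
                parent[v] = u
    return best, parent                  # costs and an optimal-policy tree
```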


Author(s):  
A. Amir ◽  
M. Farach

String matching is a basic theoretical problem in computer science, but it has also been useful in implementing various text editing tasks. The explosion of multimedia requires an appropriate generalization of string matching to higher dimensions. The first natural generalization is that of seeking the occurrences of a pattern in a text where both pattern and text are rectangles. The last few years have seen tremendous activity in two dimensional pattern matching algorithms. We naturally had to limit the amount of information that entered this chapter. We chose to concentrate on serial deterministic algorithms for some of the basic issues of two dimensional matching. Throughout this chapter we define our problems in terms of squares rather than rectangles; however, all results presented easily generalize to rectangles. The Exact Two Dimensional Matching Problem is defined as follows. INPUT: a text array T[n × n] and a pattern array P[m × m]. OUTPUT: all locations [i,j] in T where there is an occurrence of P, i.e., T[i+k, j+l] = P[k+1, l+1] for 0 ≤ k, l ≤ m-1. A natural way of solving any generalized problem is by reducing it to a special case whose solution is known. It is therefore not surprising that most solutions to the two dimensional exact matching problem use exact string matching algorithms in one way or another. In this section, we present an algorithm for two dimensional matching which relies on reducing a matrix of characters to a one dimensional array. Let P′[1..m] be a pattern derived from P by setting P′[i] = P[i,1]P[i,2]...P[i,m]; that is, the ith character of P′ is the ith row of P. Let Ti[1..n-m+1], for 1 ≤ i ≤ n, be a set of arrays such that Ti[j] = T[i,j]T[i,j+1]...T[i,j+m-1]. Clearly, P occurs at T[i,j] iff P′ occurs at Ti[j].
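A sketch of this reduction, under the assumption that a plain dictionary is used to name rows (linear-time versions of this idea, such as Bird's and Baker's, use an Aho-Corasick automaton for that step): each distinct pattern row gets a symbol, each length-m text window gets the matching symbol, and a one-dimensional matcher (KMP here) searches for P′ down each column.

```python
# Two dimensional exact matching by reduction to one dimension:
# pattern rows become symbols, and KMP runs vertically per column.
def kmp_failure(p):
    f = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f

def match_2d(T, P):
    n, m = len(T), len(P)
    names = {row: i for i, row in enumerate(map(tuple, P))}
    prow = [names[tuple(r)] for r in P]      # P': pattern rows as symbols
    f = kmp_failure(prow)
    out = []
    for j in range(n - m + 1):               # one KMP scan per column
        k = 0
        for i in range(n):
            c = names.get(tuple(T[i][j:j + m]), -1)  # name of text window
            while k and c != prow[k]:
                k = f[k - 1]
            if c == prow[k]:
                k += 1
            if k == m:                       # P occurs with top-left (i-m+1, j)
                out.append((i - m + 1, j))
                k = f[k - 1]
    return out
```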


Author(s):  
M. Li ◽  
T. Jiang

Given a finite set of strings S = {s1, ..., sm}, the shortest common superstring of S is the shortest string s such that each si appears as a substring (a consecutive block) of s. Example: assume we want to find the shortest common superstring of all words in the sentence “alf ate half lethal alpha alfalfa.” Our set of strings is S = {alf, ate, half, lethal, alpha, alfalfa}. A trivial superstring of S is “alfatehalflethalalphaalfalfa”, of length 28. A shortest common superstring is “lethalphalfalfate”, of length 17, saving 11 characters. The above example shows an application of the shortest common superstring problem in data compression. In many programming languages, a character string may be represented by a pointer to that string. The problem for the compiler is to arrange strings so that they may be “overlapped” as much as possible in order to save space. For more data compression related issues, see the next chapter. Other than compressing a sentence about Alf, the shortest common superstring problem has more important applications in DNA sequencing. A DNA sequence may be considered as a long character string over the alphabet of nucleotides {A, C, G, T}. Such a character string ranges from a few thousand symbols long for a simple virus, to 2 × 10^8 symbols for a fly and 3 × 10^9 symbols for a human being. Determining this string for different molecules, or sequencing the molecules, is a crucial step towards understanding the biological functions of the molecules. In fact, today, no problem in biochemistry can be studied in isolation from its genetic background. However, with current laboratory methods, such as Sanger’s procedure, it is quite impossible to sequence a long molecule directly as a whole. Each time, a randomly chosen fragment of less than 500 base pairs can be sequenced. In general, biochemists “cut”, using different restriction enzymes, millions of such (identical) molecules into pieces, each typically containing about 200-500 nucleotides (characters). A biochemist “samples” the fragments, and Sanger’s procedure is applied to sequence the sampled fragment.
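The problem is NP-hard in general; a standard heuristic, not claimed by this abstract, is the GREEDY algorithm that repeatedly merges the two strings with the largest overlap. A minimal sketch, which on the Alf example can recover the 17-character superstring:

```python
# GREEDY heuristic for the shortest common superstring: repeatedly merge
# the pair with the largest suffix/prefix overlap.
def overlap(a, b):
    # largest k such that a's suffix of length k equals b's prefix
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(strings):
    # drop strings already contained in others (e.g., "alf" in "half")
    s = [x for x in strings
         if not any(x != y and x in y for y in strings)]
    while len(s) > 1:
        k, a, b = max(((overlap(a, b), a, b)
                       for a in s for b in s if a is not b),
                      key=lambda t: t[0])
        s.remove(a)
        s.remove(b)
        s.append(a + b[k:])          # merge, sharing the overlap
    return s[0]

# greedy_scs(["alf", "ate", "half", "lethal", "alpha", "alfalfa"])
# can return "lethalphalfalfate" (length 17), matching the example.
```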


Author(s):  
D.S. Hirschberg

In the previous chapters, we discussed problems involving an exact match of string patterns. We now turn to problems involving similar but not necessarily exact pattern matches. There are a number of similarity or distance measures, and many of them are special cases or generalizations of the Levenshtein metric. The problem of evaluating the measure of string similarity has numerous applications, including one arising in the study of the evolution of long molecules such as proteins. In this chapter, we focus on the problem of evaluating a longest common subsequence, which is expressively equivalent to the simple form of the Levenshtein distance. The Levenshtein distance is a metric that measures the similarity of two strings. In its simple form, the Levenshtein distance, D(x, y), between strings x and y is the minimum number of character insertions and/or deletions (indels) required to transform string x into string y. A commonly used generalization of the Levenshtein distance is the minimum cost of transforming x into y when the allowable operations are character insertion, deletion, and substitution, with costs δ(λ, σ), δ(σ, λ), and δ(σ1, σ2) that are functions of the involved character(s). There are direct correspondences between the Levenshtein distance of two strings, the length of the shortest edit sequence from one string to the other, and the length of the longest common subsequence (LCS) of those strings. If D is the simple Levenshtein distance between two strings having lengths m and n, SES is the length of the shortest edit sequence between the strings, and L is the length of an LCS of the strings, then SES = D and L = (m + n - D)/2. We will focus on the problem of determining the length of an LCS and also on the related problem of recovering an LCS. Another related problem, which will be discussed in Chapter 6, is that of approximate string matching, in which it is desired to locate all positions within string y which begin an approximation to string x containing at most D errors (insertions or deletions).
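A minimal sketch of the identities quoted above: compute L with the standard LCS dynamic program; the simple (indel-only) Levenshtein distance then follows as D = m + n - 2L, which is just the relation L = (m + n - D)/2 rearranged.

```python
# Standard LCS dynamic program, plus the simple Levenshtein distance
# recovered from it via D = m + n - 2L.
def lcs_length(x: str, y: str) -> int:
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1    # extend a common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

def simple_levenshtein(x: str, y: str) -> int:
    # indels only, no substitutions
    return len(x) + len(y) - 2 * lcs_length(x, y)
```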


Author(s):  
G.M. Landau ◽  
U. Vishkin

Consider the string searching problem where differences between characters of the pattern and characters of the text are allowed. Each difference is due to either a mismatch between a character of the text and a character of the pattern, or a superfluous character in the text, or a superfluous character in the pattern. Given a text of length n, a pattern of length m and an integer k, serial and parallel algorithms for finding all occurrences of the pattern in the text with at most k differences are presented. For completeness we also describe an efficient algorithm for preprocessing a rooted tree, so that queries requesting the lowest common ancestor of every pair of vertices in the tree can be processed quickly.

Input form. Two arrays: A = a1, ..., am (the pattern), T = t1, ..., tn (the text), and an integer k (≥ 1). In the present chapter we will be interested in finding all occurrences of the pattern string in the text string with at most k differences. Three types of differences are distinguished: (a) a character of the pattern corresponds to a different character of the text, i.e., a mismatch between the two characters (item 2 in Example 1, below); (b) a character of the pattern corresponds to “no character” in the text (item 4); (c) a character of the text corresponds to “no character” in the pattern (item 6).

Example 1. Let the text be abcdefghi, the pattern bxdyegh, and k = 3. Let us see whether there is an occurrence with ≤ k differences that ends at the eighth location of the text. For this, the following correspondence between bcdefgh and bxdyegh is proposed:
1. b (of the text) corresponds to b (of the pattern).
2. c to x.
3. d to d.
4. Nothing to y.
5. e to e.
6. f to nothing.
7. g to g.
8. h to h.
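For orientation, here is the simple O(nm) dynamic program for the k-differences problem (in the style of Sellers); it is not the faster algorithm this chapter develops, but it makes the three difference types concrete as the three terms of the recurrence.

```python
# Simple O(nm) DP for the k-differences problem. D[i][j] is the fewest
# differences between the first i pattern characters and some text
# substring ending at position j.
def k_differences(text: str, pattern: str, k: int):
    n, m = len(text), len(pattern)
    prev = [0] * (n + 1)            # D[0][j] = 0: a match may start anywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n         # D[i][0] = i: i unmatched pattern characters
        for j in range(1, n + 1):
            diff = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + diff,  # (a) match or mismatch
                         prev[j] + 1,         # (b) superfluous pattern character
                         cur[j - 1] + 1)      # (c) superfluous text character
        prev = cur
    # text positions where an occurrence with <= k differences ends
    return [j for j in range(1, n + 1) if prev[j] <= k]

# Example 1 above: an occurrence of "bxdyegh" ends at position 8 of
# "abcdefghi" with 3 differences, so 8 appears in
# k_differences("abcdefghi", "bxdyegh", 3).
```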


Author(s):  
A. Apostolico ◽  
M.J. Atallah

This chapter discusses parallel solutions for the string editing problem introduced in Chapter 5. The model of computation used is the synchronous, shared-memory machine referred to as PRAM, discussed earlier in this book. The algorithms of this chapter are based on the CREW and CRCW variants of the PRAM. In the CREW-PRAM model of parallel computation, concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location (even if they are trying to write the same thing). The CRCW-PRAM differs from the CREW-PRAM in that it allows many processors to attempt simultaneous writes in the same memory location: in any such common-write contest, only one processor succeeds, but it is not known in advance which one. The primary objective of PRAM algorithmic design is to devise algorithms that are both fast and efficient for problems in a particular class called NC. Problems in NC are solvable in O(log^O(1) n) parallel time by a PRAM using a polynomial number of processors. In order for an algorithm to be both fast and efficient, the product of its time and processor complexities must fall within a polylog factor of the time complexity of the best sequential algorithm for the problem it solves. This goal has been elusive for many simple problems, such as topological sorting of a directed acyclic graph and finding a breadth-first search tree of a graph, which are trivially in NC. For some other problems in NC, it seems counter-intuitive at first that any fast and efficient algorithm may exist, due to the overwhelming number of simultaneous subproblems that arise at some point of the computation. Such is the case of the string-editing problem. This chapter will show that string editing can be solved in O((log n)^2) time with O(n^2/log n) processors on the CREW-PRAM, and in O(log n log log n) time with O(n^2/log log n) processors on the CRCW-PRAM. Throughout, it will be convenient to analyze our algorithms using the time and work (i.e., number of operations) complexities.
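One ingredient such algorithms typically rely on (our illustration; the abstract itself does not spell it out) is viewing the editing DP as shortest paths in a grid DAG, cutting the grid into blocks, and combining per-block boundary-to-boundary distance matrices with a (min,+) product. The entries of that product are independent of one another, which is what makes the combining step easy to spread across PRAM processors; a sequential sketch:

```python
# (min,+) matrix product: C[i][j] = min over t of A[i][t] + B[t][j].
# On a PRAM each (i, j) pair can be computed concurrently.
def min_plus(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[float("inf")] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            C[i][j] = min(A[i][t] + B[t][j] for t in range(k))
    return C
```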


Author(s):  
A. Apostolico

In the previous two chapters, we have examined various serial and parallel methods to perform exact string searching in a number of operations proportional to the total length of the input. Even though such a performance is optimal, our treatment of exact searches cannot be considered complete yet: in many applications, searches for different, a priori unknown patterns are performed on the same text or group of texts. It seems natural to ask whether these cases can be handled better than by plain reiteration of the procedures studied so far. As an analogy, consider the classical problem of searching for a given item in a table with n entries. In general, n comparisons are both necessary and sufficient for this task. If we wanted to perform k such searches, however, it is no longer clear that we need kn comparisons. Our table can be sorted once and for all at a cost of O(n log n) comparisons, after which binary search can be used. For sufficiently large k, this approach outperforms that of the k independent searches. In this chapter, we shall see that the philosophy underlying binary search can be fruitfully applied to string searching. Specifically, the text can be preprocessed once and for all in such a way that any query concerning whether or not a pattern occurs in the text can be answered in time proportional to the length of the pattern. It will also be possible to locate all the occurrences of the pattern in the text at an additional cost proportional to the total number of such occurrences. We call this type of search on-line, to refer to the fact that as soon as we finish reading the pattern we can decide whether or not it occurs in our text. As it turns out, the auxiliary structures used to achieve this goal are well suited to a host of other applications. There are several, essentially equivalent digital structures supporting efficient on-line string searching. Here, we base our discussion on a variant known as the suffix tree. It is instructive to discuss first a simplified version of suffix trees, which we call the expanded suffix tree.
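A minimal sketch of the expanded suffix tree as a plain suffix trie, built by the naive quadratic method (the chapter's compacted suffix tree improves on both its size and its construction time). The query loop exhibits the on-line property: the answer is known as soon as the pattern has been read.

```python
# Expanded suffix tree as a suffix trie of nested dicts: a pattern occurs
# in the text iff it spells a path from the root.
def build_expanded_suffix_tree(text: str):
    root = {}
    s = text + "$"                 # endmarker: no suffix is a proper prefix
    for i in range(len(s)):        # insert every suffix; O(n^2) overall
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root

def occurs(root, pattern: str) -> bool:
    node = root
    for c in pattern:              # time proportional to the pattern length
        if c not in node:
            return False
        node = node[c]
    return True
```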


Author(s):  
Z. Galil ◽  
I. Yudkiewicz

The string matching problem is defined as follows: given a string P0...Pm-1, called the pattern, and a string T0...Tn-1, called the text, find all occurrences of the pattern in the text. The output of a string matching algorithm is a boolean array MATCH[0..n-1] which contains a true value at each position where an occurrence of the pattern starts. Many sequential algorithms are known that solve this problem optimally, i.e., in a linear O(n) number of operations, most notable of which are the algorithms by Knuth, Morris and Pratt and by Boyer and Moore. In this chapter we limit ourselves to parallel algorithms. All algorithms considered in this chapter are for the parallel random access machine (PRAM) computation model. In the design of parallel algorithms for the various PRAM models, one tries to optimize two factors simultaneously: the number of processors used and the time required by the algorithm. The total number of operations performed, which is the time-processor product, is the measure of optimality. A parallel algorithm is called optimal if it needs the same number of operations as the fastest sequential algorithm. Hence, in the string matching problem, an algorithm is optimal if its time-processor product is linear in the length of the input strings. Apart from having an optimal algorithm, the designer wishes the algorithm to be the fastest possible, where the only limit on the number of processors is the one imposed by the time-processor product. The following fundamental lemma, given by Brent, is essential for understanding the tradeoff between time and processors: any PRAM algorithm of time t that consists of x elementary operations can be implemented on p processors in O(x/p + t) time. Using Brent’s lemma, any algorithm that uses a large number x of processors to run very fast can be implemented on p < x processors with the same total work, though with an increase in time as described. A basic problem in the study of parallel algorithms for strings and arrays is finding the maximal/minimal position in an array that holds a certain value.
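For reference, a sketch of the sequential baseline named above: the Knuth-Morris-Pratt algorithm filling the boolean array MATCH[0..n-1] in O(n + m) operations.

```python
# Knuth-Morris-Pratt: linear-time string matching producing MATCH[0..n-1].
def kmp_match(text: str, pattern: str):
    n, m = len(text), len(pattern)
    assert m >= 1, "sketch assumes a non-empty pattern"
    fail = [0] * m
    k = 0
    for i in range(1, m):                 # failure function of the pattern
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    MATCH = [False] * n
    k = 0
    for i in range(n):                    # single left-to-right text scan
        while k and text[i] != pattern[k]:
            k = fail[k - 1]
        if text[i] == pattern[k]:
            k += 1
        if k == m:
            MATCH[i - m + 1] = True       # occurrence starts here
            k = fail[k - 1]
    return MATCH
```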

