Serial Computations of Levenshtein Distances
In the previous chapters, we discussed problems involving exact matches of string patterns. We now turn to problems involving similar, but not necessarily exact, pattern matches. There are a number of similarity or distance measures, and many of them are special cases or generalizations of the Levenshtein metric. The problem of evaluating string similarity has numerous applications, including one arising in the study of the evolution of long molecules such as proteins. In this chapter, we focus on the problem of evaluating a longest common subsequence, which is effectively equivalent to computing the simple form of the Levenshtein distance.
The Levenshtein distance is a metric that measures the similarity of two strings. In its simple form, the Levenshtein distance D(x, y) between strings x and y is the minimum number of character insertions and/or deletions (indels) required to transform string x into string y. A commonly used generalization is the minimum cost of transforming x into y when the allowable operations are character insertion, deletion, and substitution, with costs δ(λ, σ), δ(σ, λ), and δ(σ1, σ2), respectively, that are functions of the character(s) involved (here λ denotes the empty string).
There are direct correspondences between the Levenshtein distance of two strings, the length of the shortest edit sequence from one string to the other, and the length of the longest common subsequence (LCS) of those strings. If D is the simple Levenshtein distance between two strings of lengths m and n, SES is the length of the shortest edit sequence between the strings, and L is the length of an LCS of the strings, then SES = D and L = (m + n - D)/2. We will focus on the problem of determining the length of an LCS and also on the related problem of recovering an LCS.
Another related problem, which will be discussed in Chapter 6, is that of approximate string matching, in which the goal is to locate all positions within string y that begin an approximation to string x containing at most D errors (insertions or deletions).
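The relation L = (m + n - D)/2 between the simple Levenshtein distance and the LCS length can be checked numerically. The following is a minimal Python sketch, not the algorithms developed in this chapter: it uses the textbook quadratic-time dynamic programs, and the function names lcs_length and indel_distance are hypothetical names chosen for illustration.

```python
def lcs_length(x: str, y: str) -> int:
    """Length L of a longest common subsequence of x and y.

    Textbook dynamic program: O(m*n) time, O(n) space (assumed here
    purely for illustration).
    """
    prev = [0] * (len(y) + 1)          # row for the empty prefix of x
    for cx in x:
        curr = [0]                     # LCS with the empty prefix of y is 0
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Simple Levenshtein distance D(x, y): insertions and deletions only."""
    prev = list(range(len(y) + 1))     # "" -> y[:j] needs j insertions
    for i, cx in enumerate(x, start=1):
        curr = [i]                     # x[:i] -> "" needs i deletions
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1])            # matching characters cost nothing
            else:
                curr.append(1 + min(prev[j],        # delete cx
                                    curr[j - 1]))   # insert cy
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    x, y = "abcabba", "cbabac"
    m, n = len(x), len(y)
    D = indel_distance(x, y)
    L = lcs_length(x, y)
    # The correspondence stated in the text: L = (m + n - D) / 2.
    assert 2 * L == m + n - D
    print(f"m={m} n={n} D={D} L={L}")
```

Because each character outside a common subsequence must be either deleted from x or inserted to form y, the two quantities determine one another, so either function alone would suffice. Both are shown here only to make the correspondence explicit.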