Application of the Cycles Merging Algorithm to the Shortest Common Superstring Problem

Author(s): Yuliya F. Leonova, Anatoly V. Panyukov
Author(s): M. Li, T. Jiang

Given a finite set of strings S = {s1, ..., sm}, the shortest common superstring of S is the shortest string s such that each si appears as a substring (a consecutive block) of s. ...

Example. ... Assume we want to find the shortest common superstring of all words in the sentence “alf ate half lethal alpha alfalfa.” Our set of strings is S = {alf, ate, half, lethal, alpha, alfalfa}. A trivial superstring of S is the concatenation “alfatehalflethalalphaalfalfa”, of length 28. A shortest common superstring is “lethalphalfalfate”, of length 17, saving 11 characters.

The above example shows an application of the shortest common superstring problem in data compression. In many programming languages, a character string may be represented by a pointer to that string. The problem for the compiler is to arrange the strings so that they “overlap” as much as possible in order to save space. For more issues related to data compression, see the next chapter.

Other than compressing a sentence about Alf, the shortest common superstring problem has more important applications in DNA sequencing. A DNA sequence may be considered as a long character string over the alphabet of nucleotides {A, C, G, T}. Such a string ranges from a few thousand symbols for a simple virus to 2 × 10^8 symbols for a fly and 3 × 10^9 symbols for a human being. Determining this string for different molecules, or sequencing the molecules, is a crucial step towards understanding the biological functions of the molecules. In fact, today, no problem in biochemistry can be studied in isolation from its genetic background. However, with current laboratory methods, such as Sanger’s procedure, it is not possible to sequence a long molecule directly as a whole. Each time, only a randomly chosen fragment of fewer than 500 base pairs can be sequenced. In general, biochemists use different restriction enzymes to “cut” millions of such (identical) molecules into pieces, each typically containing about 200-500 nucleotides (characters). A biochemist “samples” the fragments, and Sanger’s procedure is applied to sequence each sampled fragment. ...
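The definition and the “alf” example above can be checked with a short Python sketch. It is illustrative only: it uses the well-known greedy pairwise-merge heuristic, which is not the cycles merging algorithm named in the title and is not guaranteed to return a shortest superstring. The function names overlap, is_superstring, and greedy_superstring are hypothetical and chosen for this sketch.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def is_superstring(s: str, strings) -> bool:
    """True if every string in `strings` occurs as a consecutive block of s."""
    return all(t in s for t in strings)


def greedy_superstring(strings) -> str:
    """Greedy heuristic (an assumption of this sketch, not the paper's method):
    repeatedly merge the pair of strings with the largest overlap."""
    unique = sorted(set(strings))
    # Strings contained in another string (e.g. "alf" in "half") need no
    # separate treatment: any superstring of the rest already contains them.
    items = [t for t in unique if not any(t != u and t in u for u in unique)]
    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = items[best_i] + items[best_j][best_k:]
        items = [t for idx, t in enumerate(items)
                 if idx not in (best_i, best_j)] + [merged]
    return items[0] if items else ""


S = ["alf", "ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(S)
print(result, len(result))
print(is_superstring("lethalphalfalfate", S))
```

The result is always a valid common superstring and is never longer than the plain concatenation of the inputs, but on other inputs the greedy choice can produce a string longer than the optimum.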
