Similarity analysis of DNA sequences based on a compact representation

Author(s):  
Zhujin Zhang ◽  
Shuo Wang ◽  
Xingyi Zhang ◽  
Zheng Zhang
2010 ◽  
Vol 32 (4) ◽  
pp. 675-680 ◽  
Author(s):  
Chun Li ◽  
Hong Ma ◽  
Yang Zhou ◽  
Xiaolei Wang ◽  
Xiaoqi Zheng

2014 ◽  
Vol 53 ◽  
Author(s):  
Loek Cleophas ◽  
Derrick G. Kourie ◽  
Bruce W. Watson

In indexing of, and pattern matching on, DNA and text sequences, it is often important to represent all factors of a sequence. One efficient, compact representation is the factor oracle (FO). At the same time, any classical deterministic finite automaton (DFA) can be transformed into a so-called failure DFA (FDFA), which may use failure transitions to replace multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The algorithm is suitable for both short and long sequences. We empirically compared the resulting FFOs and FOs on the number of transitions for many DNA sequences of lengths 4 to 512, showing gains of up to 10% in the total number of transitions, with failure transitions also taking up less space than symbol transitions. The resulting FFOs can be used for indexing, as well as in a variant of the FO-using backward oracle matching algorithm. We discuss and classify this pattern matching algorithm in terms of the keyword pattern matching taxonomies of Watson, Cleophas and Zwaan. We also empirically compared the use of FOs and FFOs in such backward reading pattern matching algorithms, using both DNA and natural language (English) data sets. The results indicate that the decrease in pattern matching performance of an algorithm using an FFO instead of an FO may outweigh the gain in representation space by using an FFO instead of an FO.
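
For readers unfamiliar with the plain factor oracle that the abstract builds on, the following is a minimal Python sketch of the classical online FO construction. It is not the authors' FFO algorithm, which the abstract does not specify. The function names `build_factor_oracle` and `accepts` and the `supply` array name are illustrative choices, not taken from the paper.

```python
def build_factor_oracle(seq):
    """Classical online factor oracle construction (illustrative sketch).

    Returns (trans, supply): trans[s] maps a symbol to a target state,
    and supply[s] is the supply link used only during construction.
    States are numbered 0..len(seq); every state is accepting.
    """
    m = len(seq)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = seq[i - 1]
        trans[i - 1][c] = i              # internal transition along the sequence
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i              # external (shortcut) transition
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans, supply


def accepts(trans, word):
    """Follow symbol transitions from state 0; fail on a missing transition."""
    state = 0
    for c in word:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


def count_transitions(trans):
    """Total number of symbol transitions in the oracle."""
    return sum(len(t) for t in trans)


# Example on a short DNA string (input chosen for illustration only).
dna_trans, _ = build_factor_oracle("ACGTACGA")
print(count_transitions(dna_trans), accepts(dna_trans, "GTAC"))
```

An oracle built this way accepts every factor of the input, but it may also accept some strings that are not factors. For the input `abbbaab`, for example, it accepts `abba`. This is why oracle-based matchers verify candidate matches against the text.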


Author(s):  
Dan Wei ◽  
Qingshan Jiang ◽  
Sheng Li

Similarity analysis of DNA sequences is a fundamental research area in bioinformatics. The characteristic distribution of L-tuples (subsequences of length L) reflects valuable information contained in a biological sequence and thus may be used in DNA sequence similarity analysis. However, similarity analysis based on the characteristic distribution of L-tuples is not effective for comparing highly conserved sequences. In this paper, a new similarity measurement approach based on Triplets of Nucleic Acid Bases (TNAB) is introduced for DNA sequence similarity analysis. The new approach characterizes both the content feature and the position feature of a DNA sequence using the frequency and position of occurrence of each TNAB in the sequence. The experimental results show that the approach based on TNAB is effective for analysing DNA sequence similarity.
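
The abstract does not give the TNAB formulas, so the sketch below only illustrates the general idea of combining a content feature (triplet frequency) with a position feature (where each triplet occurs). The specific choices here are assumptions made for illustration and are not the authors' definitions: overlapping triplets, the mean normalized position as the position feature, and Euclidean distance as the comparison. The names `triplet_features` and `triplet_distance` are hypothetical.

```python
from itertools import product
from math import sqrt

# All 64 triplets over the DNA alphabet, in a fixed order.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]


def triplet_features(seq):
    """Return a 128-value vector: (frequency, mean normalized position) per triplet.

    Assumption: overlapping triplets are counted, and the position feature
    is the mean 1-based start position divided by the number of windows.
    """
    seq = seq.upper()
    n = len(seq) - 2                      # number of overlapping windows
    if n <= 0:
        raise ValueError("sequence must contain at least 3 bases")
    counts = dict.fromkeys(TRIPLETS, 0)
    pos_sums = dict.fromkeys(TRIPLETS, 0)
    for i in range(n):
        t = seq[i:i + 3]
        if t in counts:                   # skip windows with ambiguous bases such as N
            counts[t] += 1
            pos_sums[t] += i + 1
    features = []
    for t in TRIPLETS:
        c = counts[t]
        freq = c / n
        mean_pos = (pos_sums[t] / c) / n if c else 0.0
        features.extend((freq, mean_pos))
    return features


def triplet_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    fa, fb = triplet_features(a), triplet_features(b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


# Two sequences with identical triplet counts can still differ in position.
print(triplet_distance("ACGTTT", "TTTACG") > 0)
```

Including a position term lets sequences with the same triplet composition but a different arrangement be told apart, which a frequency-only comparison cannot do.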


2011 ◽  
Vol 24 (12) ◽  
pp. 2052-2058 ◽  
Author(s):  
Jihong Zhang ◽  
Renhong Wang ◽  
Fenglan Bai ◽  
Junsheng Zheng

Bioinformatics, now a well-known field of study, originated in the context of biological sequence analysis. Recently, graphical representations have been adopted in research on DNA sequences. Research on biological sequences mainly concerns their function and structure. Bioinformatics finds a wide range of applications, particularly in molecular biology, which focuses on the analysis of molecules such as DNA, RNA and proteins. In this review, we mainly deal with similarity analysis between sequences and the graphical representation of DNA sequences.


2018 ◽  
Vol 66 (2) ◽  
pp. 113-133 ◽  
Author(s):  
Guo-Sen Xie ◽  
Xiao-Bo Jin ◽  
Chunlei Yang ◽  
Jiexin Pu ◽  
Zhongxi Mo
