Evolving models of biological sequence similarity

Similarity analysis of DNA sequences is a fundamental research area in bioinformatics. The characteristic distribution of L-tuples (tuples of length L) reflects valuable information contained in a biological sequence and can therefore be used in DNA sequence similarity analysis. However, similarity analysis based on the characteristic distribution of L-tuples is not effective for comparing highly conserved sequences. In this paper, a new similarity measure based on Triplets of Nucleic Acid Bases (TNAB) is introduced for DNA sequence similarity analysis. The new approach characterizes both the content and the position features of a DNA sequence using the frequency and position of occurrence of each TNAB in the sequence. Experimental results show that the TNAB-based approach is effective for analysing DNA sequence similarity.
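The idea of combining triplet frequency with triplet position can be illustrated with a small sketch. The abstract does not give the exact TNAB formulas, so everything here is an assumption made for illustration: the function names `triplet_features` and `feature_distance`, the use of overlapping triplets, the mean relative position as the position feature, and the Euclidean distance between feature vectors.

```python
from collections import defaultdict
from math import sqrt


def triplet_features(seq):
    """Map each overlapping triplet to (frequency, mean relative position).

    Both features are illustrative choices, not the paper's definitions.
    """
    n = len(seq) - 2  # number of overlapping triplets
    if n <= 0:
        raise ValueError("sequence must contain at least three bases")
    positions = defaultdict(list)
    for i in range(n):
        positions[seq[i:i + 3]].append(i)
    features = {}
    for triplet, pos in positions.items():
        frequency = len(pos) / n
        mean_position = sum(pos) / len(pos) / n
        features[triplet] = (frequency, mean_position)
    return features


def feature_distance(a, b):
    """Euclidean distance between the triplet feature vectors of a and b."""
    fa, fb = triplet_features(a), triplet_features(b)
    total = 0.0
    for triplet in set(fa) | set(fb):
        freq_a, pos_a = fa.get(triplet, (0.0, 0.0))
        freq_b, pos_b = fb.get(triplet, (0.0, 0.0))
        total += (freq_a - freq_b) ** 2 + (pos_a - pos_b) ** 2
    return sqrt(total)
```

Under these assumptions, two sequences with the same triplet composition but different triplet ordering still receive a non-zero distance, which is the kind of distinction a frequency-only measure would miss.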


EnTrance: Exploration of Entropy Scaling Ball Cover Search in Protein Sequences

10.1101/2021.05.31.446458 ◽ 2021

Author(s): Yoonjin Kim ◽ Zhen Guo ◽ Jeffrey A. Robertson ◽ Benjamin Reidys ◽ Ziyan Zhang ◽ ...

Keyword(s): Search Algorithm ◽ Sequence Similarity ◽ Protein Sequences ◽ Biological Data ◽ Biological Databases ◽ Biological Sequence ◽ Absolute Size ◽ Sequence Identification ◽ Build Time ◽ Search Approach

Computational biological sequence alignment has received increasing attention as technology develops. It is important to predict whether a novel DNA sequence is potentially dangerous by determining its taxonomic identity and functional characteristics through sequence identification. This task can be facilitated by the rapidly increasing amounts of biological data in DNA and protein databases, made possible by the corresponding decrease in computational and storage costs. Unfortunately, the growth of biological databases has made this information harder to exploit. EnTrance presents an approach that can expedite the analysis of these large databases by employing entropy scaling, so that search cost scales with the amount of entropy in the database rather than with its absolute size. Since DNA and protein sequences are biologically meaningful, the space of biological sequences exhibits the structure that entropy scaling exploits. As biological sequence databases grow, taking advantage of this structure can be extremely beneficial for reducing query times. EnTrance, the entropy scaling search algorithm introduced here, accelerates the kind of biological sequence search exemplified by tools such as BLAST. It does so with a two-step search approach: EnTrance first quickly reduces the number of potential matches, then searches the remaining sequences more exhaustively. Tests of EnTrance show that this approach can lead to improved query times. However, constructing the required entropy scaling indices beforehand can be challenging. To improve performance, EnTrance investigates several ideas for accelerating the build time of the index that supports entropy scaling searches. In particular, EnTrance makes full use of the concurrency features of the Go language, greatly reducing the index build time. Our results identify key tradeoffs and demonstrate that there is potential in using these techniques for sequence similarity searches. Finally, EnTrance returns more matches and higher percentage identity matches when compared with existing tools.
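The two-step search described above can be sketched as a coarse pass over cluster centers followed by a fine pass inside the surviving clusters. This is a minimal illustration, not EnTrance's implementation (which the abstract says is written in Go): the names `build_index` and `search`, the greedy clustering, and the use of Hamming distance on equal-length strings are all assumptions chosen to keep the example short.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_index(sequences, cluster_radius):
    """Greedy ball cover (assumed scheme): each sequence joins the first
    center within cluster_radius, otherwise it becomes a new center."""
    clusters = {}  # center -> list of member sequences
    for seq in sequences:
        for center in clusters:
            if hamming(seq, center) <= cluster_radius:
                clusters[center].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


def search(clusters, cluster_radius, query, radius):
    """Return every indexed sequence within `radius` of `query`."""
    hits = []
    for center, members in clusters.items():
        # Coarse step: by the triangle inequality, a cluster can contain a
        # hit only if its center is within radius + cluster_radius.
        if hamming(query, center) <= radius + cluster_radius:
            # Fine step: check each member of the surviving cluster.
            hits.extend(m for m in members if hamming(query, m) <= radius)
    return hits
```

Because the coarse filter relies only on the triangle inequality, it discards no true matches; the saving comes from skipping whole clusters whose centers are too far from the query.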


Evolving Fisher Kernels for Biological Sequence Classification

Evolutionary Computation ◽ 10.1162/evco_a_00065 ◽ 2013 ◽ Vol 21 (1) ◽ pp. 83-105 ◽ Cited By ~ 2

Author(s): K.-J. Won ◽ C. Saunders ◽ A. Prügel-Bennett

Keyword(s): Sequence Similarity ◽ Generative Models ◽ Complex Model ◽ Generative Model ◽ Support Vector ◽ Homologous Sequence ◽ Sequence Information ◽ Biological Sequence ◽ Domain Specific ◽ Fisher Kernel

Fisher kernels have been successfully applied to many problems in bioinformatics. However, their success depends on the quality of the generative model upon which they are built. For Fisher kernel techniques to be used on novel problems, a mechanism for creating accurate generative models is required. A novel framework is presented for automatically creating domain-specific generative models that can be used to produce Fisher kernels for support vector machines (SVMs) and other kernel methods. The framework enables the capture of prior knowledge and addresses the issue of domain-specific kernels, both of which are lacking in many current kernel-based methods. To obtain the generative model, genetic algorithms are used to evolve the structure of hidden Markov models (HMMs). A Fisher kernel is subsequently created from the HMM and used in conjunction with an SVM to improve the discriminative power. This paper investigates the effectiveness of the proposed method, named GA-SVM. We show that its performance is comparable to, if not better than, that of other state-of-the-art methods in classifying secretory protein sequences of malaria. More interestingly, in protein enzyme family classification it showed better results than the sequence-similarity-based approach, without the need for additional homologous sequence information. The experiments clearly demonstrate that GA-SVM is a novel way to find features with good performance from biological sequences, one that does not require extensive tuning of a complex model.
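To make the link between a generative model and a Fisher kernel concrete, the sketch below swaps the paper's evolved HMM for a much simpler model: a zero-order model with one probability per symbol. The Fisher score is the gradient of the sequence log-likelihood with respect to those probabilities, and the kernel is the dot product of two scores. Several simplifications are assumed and are not taken from the paper: the four-letter DNA alphabet, add-one smoothing, ignoring the constraint that the probabilities sum to one, and using the identity matrix in place of the inverse Fisher information. The function names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"  # illustrative alphabet; the paper works with proteins


def fit_background(sequences):
    """Estimate per-symbol probabilities with add-one smoothing."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[a] for a in ALPHABET) + len(ALPHABET)
    return {a: (counts[a] + 1) / total for a in ALPHABET}


def fisher_score(seq, theta):
    """Gradient of log P(seq | theta) with respect to each theta[a].

    For log P = sum_a n_a * log(theta_a), the partial derivative is
    n_a / theta_a, where n_a is the count of symbol a in seq.
    """
    counts = Counter(seq)
    return [counts[a] / theta[a] for a in ALPHABET]


def fisher_kernel(x, y, theta):
    """Dot product of Fisher scores (identity Fisher information assumed)."""
    score_x = fisher_score(x, theta)
    score_y = fisher_score(y, theta)
    return sum(u * v for u, v in zip(score_x, score_y))
```

A kernel matrix built this way could be passed to any SVM implementation that accepts precomputed kernels; with an HMM as the generative model, the score would instead be taken with respect to the transition and emission parameters.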


Genetic Similarity Analysis Based on Positive and Negative Sequence Patterns of DNA

Symmetry ◽ 10.3390/sym12122090 ◽ 2020 ◽ Vol 12 (12) ◽ pp. 2090

Author(s): Yue Lu ◽ Long Zhao ◽ Zhao Li ◽ Xiangjun Dong

Keyword(s): DNA Sequences ◽ Sequence Similarity ◽ Sequential Patterns ◽ Similarity Analysis ◽ Frequent Patterns ◽ Biological Sequences ◽ Biological Sequence ◽ Genetic Characteristics ◽ Missing Gene ◽ Sequence Similarity Analysis

Similarity analysis of DNA sequences can clarify the homology between sequences and predict their structure and the relationships between them. At the same time, the frequent patterns of biological sequences not only explain the genetic characteristics of the organism but also serve as relevant markers for certain events in biological sequences. However, most existing biological sequence similarity analysis methods target the entire sequential pattern and therefore ignore missing gene fragments that may induce disease. The similarity analysis of sequences containing a missing gene item remains unexplored. Consequently, some sequences with missing bases are ignored or not effectively analyzed. This paper therefore presents a new method for DNA sequence similarity analysis. Using this method, we first mined not only positive sequential patterns but also sequential patterns that were missing some of the base terms (collectively referred to as negative sequential patterns). Subsequently, we used these frequent patterns for similarity analysis on a two-dimensional plane. Several experiments were conducted to verify the effectiveness of this algorithm. The experimental results demonstrated that the algorithm can obtain various results through the selection of frequent sequential patterns and that accuracy and time efficiency were improved.


Novel visualization method for biological sequence similarity reports

Journal of Electronic Imaging ◽ 10.1117/1.1289351 ◽ 2000 ◽ Vol 9 (4) ◽ pp. 394 ◽ Cited By ~ 1

Author(s): Ed H. Chi

Keyword(s): Sequence Similarity ◽ Visualization Method ◽ Biological Sequence


Visualization of biological sequence similarity search results

Proceedings Visualization '95 ◽ 10.1109/visual.1995.480794 ◽ 2002 ◽ Cited By ~ 6

Author(s): E.H.-H. Chi ◽ P. Barry ◽ E. Shoop ◽ J.V. Carlis ◽ E. Retzel ◽ ...

Keyword(s): Similarity Search ◽ Sequence Similarity ◽ Sequence Similarity Search ◽ Biological Sequence ◽ Search Results


Flexible information visualization of multivariate data from biological sequence similarity searches

Proceedings of Seventh Annual IEEE Visualization '96 ◽ 10.1109/visual.1996.567796 ◽ 2002 ◽ Cited By ~ 10

Author(s): E.H.-H. Chi ◽ J. Riedl ◽ E. Shoop ◽ J.V. Carlis ◽ E. Retzel ◽ ...

Keyword(s): Information Visualization ◽ Sequence Similarity ◽ Multivariate Data ◽ Biological Sequence ◽ Similarity Searches


EDUCATIONAL PEARL: Biological sequence similarity

Journal of Functional Programming ◽ 10.1017/s095679680500571x ◽ 2005 ◽ Vol 16 (1) ◽ pp. 1-12

Author(s): David Wakeling

Keyword(s): Functional Programming ◽ Sequence Similarity ◽ Functional Languages ◽ Biological Sequence ◽ What Matters

Functional languages provide an excellent framework for formulating biological algorithms in a naive form and then transforming them into an efficient form. This helps biologists understand what matters about programming and brings functional programming into the realm of the practical. In this column, we present an example from our MSc course on bioinformatics and report on our experiences teaching functional programming in this context.
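The step from a naive formulation to an efficient one can be illustrated with a global alignment score. The sketch below is written in Python rather than a functional language, and it is not taken from the column: the scoring constants and the function names `similarity_naive` and `similarity` are assumptions. The naive version transcribes the recurrence directly and takes exponential time; the second version keeps the same recurrence but memoizes it on index pairs, so each subproblem is solved once.

```python
from functools import lru_cache

# Illustrative scoring scheme (an assumption, not the column's).
MATCH, MISMATCH, GAP = 1, -1, -2


def similarity_naive(a, b):
    """Direct transcription of the alignment recurrence (exponential time)."""
    if not a:
        return GAP * len(b)
    if not b:
        return GAP * len(a)
    substitution = MATCH if a[0] == b[0] else MISMATCH
    return max(
        substitution + similarity_naive(a[1:], b[1:]),  # align a[0] with b[0]
        GAP + similarity_naive(a[1:], b),               # gap in b
        GAP + similarity_naive(a, b[1:]),               # gap in a
    )


def similarity(a, b):
    """Same recurrence, memoized on suffix indices (O(len(a) * len(b)))."""

    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(a):
            return GAP * (len(b) - j)
        if j == len(b):
            return GAP * (len(a) - i)
        substitution = MATCH if a[i] == b[j] else MISMATCH
        return max(
            substitution + go(i + 1, j + 1),
            GAP + go(i + 1, j),
            GAP + go(i, j + 1),
        )

    return go(0, 0)
```

Both functions return the same score for any pair of strings. The memoized version is still recursive, so very long sequences would need an iterative table to stay within Python's recursion limit.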
