Guided Sequence Alignment

Author(s):  
Abdullah N. Arslan

Sequence alignment is one of the most fundamental problems in computational biology. Ordinarily, the problem aims to align symbols of given sequences so as to optimize a similarity score. This score is computed using a given scoring matrix that assigns a score to every pair of symbols in an alignment. The expectation is that scoring matrices perform well for alignments of all sequences. However, it has been shown that this is not always true, although scoring matrices are derived from known similarities. Biological sequences share common sequence structures that are signatures of common functions or evolutionary relatedness. The alignment process should be guided by constraining the desired alignments to contain these structures, even though this does not always yield optimal scores. Changes in biological sequences occur over the course of millions of years, in ways and orders that we do not completely know. Sequence alignment has become a dynamic area where new knowledge is acquired, new common structures are extracted from sequences, and these yield more sophisticated alignment methods, which in turn yield more knowledge. This feedback loop is essential for this inherently difficult task. The ordinary definition of sequence alignment does not always reveal biologically accurate similarities. To overcome this, there have been attempts to redefine sequence similarity. Huang (1994) proposed an optimization problem in which close matches are rewarded more favorably than the same number of isolated matches. Zhang, Berman & Miller (1998) proposed an algorithm that finds alignments free of low-scoring regions. Arslan, Egecioglu, & Pevzner (2001) proposed length-normalized local sequence alignment, for which the objective is to find subsequences that yield the maximum length-normalized score, where the length-normalized score of a given alignment is its score divided by the sum of the lengths of the subsequences involved in the alignment.
This can be considered a context-dependent sequence alignment in which a high degree of local similarity defines a context. Arslan, Egecioglu, & Pevzner (2001) presented a fractional programming algorithm for the resulting problem. Although these attempts are important, some biologically meaningful alignments contain motifs whose inclusion is not guaranteed in the alignments returned by these methods. Our emphasis in this chapter is on methods that guide sequence alignment by requiring desired alignments to contain given common structures identified in sequences (motifs).

2006 ◽  
Vol 17 (06) ◽  
pp. 1325-1344 ◽  
Author(s):  
HEIKKI HYYRÖ ◽  
GONZALO NAVARRO

Local similarity computation between two sequences permits detecting all the relevant alignments present between subsequences thereof. A well-known dynamic programming algorithm works in time O(mn), m and n being the lengths of the subsequences. The algorithm is rather slow when applied over many sequence pairs. In this paper we present the first bit-parallel computation of the score matrix, for a simplified choice of scores. If the computer word has w bits, then the resulting algorithm works in O(mn log min (m, n, w)/w) time, achieving up to 8-fold speedups in practice. Some DNA comparison applications use precisely the simplified scores we handle, and thus our algorithm is directly applicable. In others, our method could be used as a raw filter to discard most of the strings, so the classical algorithm can be focused only on the substring pairs that can yield relevant results.
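For orientation, the classical O(mn) dynamic programming computation that the abstract refers to can be sketched as follows. This is only the textbook local-alignment recurrence, not the bit-parallel algorithm of the paper; the function name and the score values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (classical O(mn) recurrence).

    The scoring values are illustrative assumptions, not the simplified
    scores used in the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTT", "GGACGGG"))
```

Only two rows are kept in memory because each cell depends on its left, upper and upper-left neighbours alone.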


2022 ◽  
Vol 2161 (1) ◽  
pp. 012028
Author(s):  
Karamjeet Kaur ◽  
Sudeshna Chakraborty ◽  
Manoj Kumar Gupta

Abstract: In bioinformatics, sequence alignment is an important task for comparing biological sequences and finding similarity between them. The Smith-Waterman algorithm is the most widely used for the alignment process, but it has quadratic time complexity. The algorithm uses a sequential approach, so as the number of biological sequences increases, aligning them takes too much time. In this paper, a parallel approach to the Smith-Waterman algorithm is proposed and implemented according to the architecture of the graphics processing unit (GPU) using CUDA. The features of the GPU are combined with the CPU in such a way that the alignment process is three times faster than a sequential implementation of the Smith-Waterman algorithm, which helps accelerate sequence alignment on the GPU. This paper describes the parallel implementation of sequence alignment on the GPU, and this intra-task parallelization strategy reduces the execution time. The results show significant runtime savings on the GPU.
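The abstract does not describe the CUDA kernel itself. One common way to obtain intra-task parallelism in Smith-Waterman is to process the score matrix one anti-diagonal at a time, because all cells on the same anti-diagonal are independent of one another. The sketch below illustrates that ordering in plain Python under that assumption; it runs sequentially, and the function name and score values are hypothetical.

```python
def sw_score_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman best score, filling the matrix by anti-diagonals.

    Assumption: anti-diagonal (wavefront) ordering stands in for the
    intra-task strategy; the paper's actual kernel is not shown here.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    # Cells with i + j == d depend only on anti-diagonals d - 1 and d - 2,
    # so every cell of one anti-diagonal could be computed concurrently.
    for d in range(2, m + n + 1):
        lo = max(1, d - n)
        hi = min(m, d - 1)
        for i in range(lo, hi + 1):  # on a GPU: one thread per value of i
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(sw_score_wavefront("TTACGTT", "GGACGGG"))
```

The inner loop is the part a parallel implementation would distribute across threads, with a synchronization point between consecutive anti-diagonals.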


Author(s):  
Souvik Das

Abstract: The word ‘life’ is a mysterious word with a chart of attributes that has neither been completed nor agreed upon by the human race. A proper definition of life is probably impossible for humans to identify (the proof for this claim is given later), but the handbook to the secret shall be updated till the end, thanks to the inquisitive attitude of humans. For this piece, we shall adopt the description from the professional medical community of today. Though this topic falls midway between science and philosophy, this project is strictly technical. To quote dictionary.com, life is the condition that distinguishes organisms from inorganic objects and dead organisms, being manifested by growth through metabolism, reproduction and the power of adaptation to environment through changes originating internally; cambridge.com teaches that life is the period between birth and death, or the experience or state of being alive; medicaldictionary.thefreedictionary.com states that life is the property or quality that distinguishes living organisms from dead organisms and inanimate matter, manifested in functions such as metabolism, growth, reproduction and response to stimuli or adaptation to the environment originating from within the organisms. There are several other definitions, but to summarize, we can safely state that though the concept is somewhat vague, we can indeed point out some common principles. In this project, we shall try to replicate these characteristics so as to attain life in medical terms. (The order of the list does not reflect importance, since all of the listed characteristics are absolutely essential and cannot be categorized as more or less important.)
1) Metabolism
2) Growth
3) Adaptability
4) Birth
5) Death
6) Self-stimulated response to environment
7) Reproduction
8) Can sustain self without foreign intervention

Keywords: artificial, life, intelligence, computer, programming, algorithm

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.


1994 ◽  
Vol 304 (1) ◽  
pp. 95-99 ◽  
Author(s):  
G Labesse ◽  
A Vidal-Cros ◽  
J Chomilier ◽  
M Gaudry ◽  
J P Mornon

Using both primary- and tertiary-structure comparisons, we have established new structural similarities shared by reductases, epimerases and dehydrogenases not previously known to be related. Despite the low sequence identity (down to 10%), short consensus segments are identified. We show that the sequence, the active site and the supersecondary structure are well conserved in these proteins. New homologues (the protochlorophyllide reductases) are detected, and we define a new superfamily composed of single-domain dinucleotide-binding enzymes. Rules for the cofactor-binding specificity are deduced from our sequence alignment. The involvement of some amino acids in catalysis is discussed. Comparison with two-domain dehydrogenases allows us to distinguish two general mechanisms of divergent evolution.


2014 ◽  
Author(s):  
Richard Wilton ◽  
Tamas Budavari ◽  
Ben Langmead ◽  
Sarah J Wheelan ◽  
Steven Salzberg ◽  
...  

Motivation: In computing pairwise alignments of biological sequences, software implementations employ a variety of heuristics that decrease the computational effort involved in computing potential alignments. A key element in achieving high processing throughput is to identify and prioritize potential alignments where high-scoring mappings can be expected. These tasks involve list-processing operations that can be efficiently performed on GPU hardware. Results: We implemented a read aligner called A21 that exploits GPU-based parallel sort and reduction techniques to restrict the number of locations where potential alignments may be found. When compared with other high-throughput aligners, this approach finds more high-scoring mappings without sacrificing speed or accuracy. A21 running on a single GPU is about 10 times faster than comparable CPU-based tools; it is also faster and more sensitive in comparison with other recent GPU-based aligners.


Author(s):  
David J. States ◽  
Mark S. Boguski

Properly approached, molecular sequence data is a rich source of knowledge capable of teaching us much about the structure, function, and evolution of biological macromolecules. To effectively realize this potential, however, some understanding of the process of and theoretical basis for sequence comparison is needed as well as a variety of practical tools to access and manipulate the data. The volume of molecular sequence data has long since surpassed human information processing capacity for even simple tasks such as searching for related sequences, and with the ever increasing rate at which new sequences are being produced, the need for computer-assisted analysis becomes more and more acute. Automated tools can extend human capabilities by orders of magnitude in both speed and accuracy. The educated application of these automated tools is an essential part of modern molecular biology research. This chapter considers the theory and practice of analyzing sequence similarity as it applies to database searching and sequence alignment. Five major areas will be examined. First, we describe the use of dot matrix plots to elucidate the structures and features relating a sequence pair. Secondly, we discuss optimal pairwise alignment of sequences using dynamic programming algorithms. Thirdly, we examine fast, approximate techniques for detecting local similarities. Fourthly, the uses of and techniques for multiple sequence alignment are described. Finally, the statistical significance of sequence similarity is considered. In the analysis of molecular sequences, the terms similarity and homology are often used without a clear understanding of their distinct implications. Similarity is a descriptive term which only implies that two sequences, by some criterion, resemble each other and carries no suggestion as to their origins or ancestry. Homology refers specifically to similarity due to descent from a common ancestor (Patterson, 1988; Reeck et al., 1987).
On the basis of similarity relationships among a group of sequences, it may be possible to infer homology, but outside of an explicit laboratory model system, descent from a common ancestor remains hypothetical. There are philosophical issues in the inference of homology as well as practical ones. In classical morphology, conjunction (the occurrence of two traits in a single individual) is considered evidence that they are not homologous (Patterson, 1982).
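The first of the five areas named in this abstract, the dot matrix plot, is simple enough to sketch directly. The version below is a minimal illustration and not the chapter's own procedure: the function name, the exact-match window criterion and the text rendering with "*" and "." are all assumptions.

```python
def dot_matrix(a, b, window=1):
    """Rows of a dot plot: '*' where a window of a equals a window of b.

    Row i, column j compares a[i:i+window] with b[j:j+window]; the exact
    match criterion is an illustrative assumption.
    """
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    for line in dot_matrix("GATTACA", "GATCACA", window=2):
        print(line)
```

Runs of "*" along a diagonal mark stretches that the two sequences share, and a larger window suppresses isolated chance matches.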


2018 ◽  
Vol 10 (12) ◽  
pp. 124
Author(s):  
Ziyun Deng ◽  
Tingqin He

To obtain the target webpages from many webpages, we propose a Method for Filtering Pages by Similarity Degree based on Dynamic Programming (MFPSDDP). The method uses one of three "same" relationships proposed between two nodes, so we give the definitions of these three relationships. The biggest innovation of MFPSDDP is that it does not need to know the structures of webpages in advance. First, we address the design ideas with a queue and double threads. Then, a dynamic programming algorithm for calculating the length of the longest common subsequence and a formula for calculating similarity are proposed. Further, to obtain detailed information webpages from 200,000 webpages downloaded from the famous website “www.jd.com”, we choose the Completely Same Relationship (CSR) and set the similarity threshold to 0.2. The Recall Ratio (RR) of MFPSDDP is in the middle of the four filtering methods compared. When the number of webpages filtered is nearly 200,000, the PR of MFPSDDP is the highest among the four filtering methods compared, reaching 85.1%. The PR of MFPSDDP is 13.3 percentage points higher than the PR of a Method for Filtering Pages by Containing Strings (MFPCS).
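The longest-common-subsequence step mentioned above can be sketched with the standard dynamic programming recurrence. The abstract does not give the paper's similarity formula, so the ratio used here (LCS length divided by the longer sequence length) is an assumption, as are the function names and the example node sequences; only the 0.2 threshold comes from the abstract.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (standard DP)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(x, y):
    """Assumed similarity: LCS length over the longer sequence length."""
    longest = max(len(x), len(y))
    return lcs_length(x, y) / longest if longest else 1.0


if __name__ == "__main__":
    # Hypothetical node sequences extracted from two pages.
    page_a = ["html", "body", "div", "p"]
    page_b = ["html", "body", "p"]
    print(similarity(page_a, page_b) >= 0.2)  # 0.2 is the threshold quoted above
```

Elements are compared with plain equality here, which is a stand-in for whichever of the three relationships the method selects.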


2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Sriram P. Chockalingam ◽  
Jodh Pannu ◽  
Sahar Hooshmand ◽  
Sharma V. Thankachan ◽  
Srinivas Aluru

Abstract Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
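To make the measure concrete, here is a brute-force sketch of ACSk under the usual definition, which is assumed here: for each position i of x, take the longest prefix of x[i:] that matches a substring of y with at most k mismatches, then average those lengths over the positions of x. This is neither the exact O(n log^k n) algorithm nor the linear-time heuristic of the paper, and the function names are hypothetical.

```python
def longest_k_mismatch_prefix(x, i, y, j, k):
    """Length of the longest common prefix of x[i:] and y[j:] with at most k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length


def acs_k(x, y, k=0):
    """Brute-force ACS_k(x, y); polynomial time, for small inputs only."""
    if not x:
        return 0.0
    total = 0
    for i in range(len(x)):
        total += max(
            (longest_k_mismatch_prefix(x, i, y, j, k) for j in range(len(y))),
            default=0,
        )
    return total / len(x)


if __name__ == "__main__":
    print(acs_k("ACGTACGT", "ACGAACGT", k=1))
```

With k=0 this reduces to the plain ACS measure. Note that the value is not symmetric in x and y.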


2011 ◽  
Vol 09 (06) ◽  
pp. 681-695 ◽  
Author(s):  
MARCO A. ALVAREZ ◽  
CHANGHUI YAN

Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA) that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure of semantic similarity that is independent of external databases of functional-annotation observations.

