Hardware Designs for Local Alignment of Protein Sequences

Author(s):  
Mustafa Gök ◽  
Çağlar Yılmaz
2014 ◽  
Vol 2 (3) ◽  
pp. 224-228
Author(s):  
Jennifer Tram

Every year the FDA issues a recommendation for the composition of that year’s common influenza vaccine for influenza A and B. The FDA can consistently predict the dominance of a particular strain of influenza virus by taking into account previous years’ antigenic characterization percentages. However, the sudden disappearance of dominant antigens and the sudden emergence of drift variants can disrupt this pattern, which calls into question the effectiveness of that year’s vaccine. The Basic Local Alignment Search Tool (BLAST) was used to compare the hemagglutinin and neuraminidase protein sequences of the vaccine strains with those of the dominant viral strains. This study examined the effectiveness of vaccines from 2000 to 2012, focusing on the transitions between the B/Yamagata and B/Victoria lineages and between the A/New Caledonia and A/California lineages (H1N1). Between 2005 and 2006, dominance of the B/Yamagata lineage, represented by B/Shanghai/361/2002, disappeared almost entirely. For the 2005-2006 flu season, the CDC recommended a B/Shanghai/361/2002 vaccine, which showed 98% identity to the dominant influenza B hemagglutinin sequence and 97% identity to the dominant neuraminidase sequence. From 2007 to 2008, the A/New Caledonia virus declined to 34% of cases while the A/Solomon Islands/3/2006 virus increased to 66%. The A/New Caledonia/20/99 vaccine showed 97% identity to the hemagglutinin sequence of the A/Solomon Islands/3/2006 strain and 98% identity to its neuraminidase sequence. This study demonstrates that from 2000 to 2012, despite drift variants in influenza viruses, the CDC-recommended vaccine effectively matched the hemagglutinin and neuraminidase protein sequences of the dominant viruses.
DOI: http://dx.doi.org/10.3126/ijasbt.v2i3.10952 Int J Appl Sci Biotechnol, Vol. 2(3): 224-228
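The percent identity figures quoted above are the kind of value BLAST reports for an alignment: identical positions divided by alignment length. A minimal sketch of that arithmetic follows, assuming the two sequences are already aligned and use `-` for gaps. The function name `percent_identity` and the toy sequences are hypothetical and are not taken from the study.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumes both strings come from one pairwise alignment (equal length,
    '-' marks a gap). Gap columns count toward the alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Drop columns where both rows are gaps; they carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Toy sequences for illustration only, not real influenza data.
print(percent_identity("MKTIIALSYI", "MKTIVALSYI"))
print(percent_identity("MK-TI", "MKATI"))
```

The first call has one substitution in ten columns and the second has one gap in five columns, so a single difference weighs more heavily in the shorter alignment.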


Author(s):  
Timothy L. Bailey

We are in the midst of an explosive increase in the number of DNA and protein sequences available for study, as various genome projects come on line. This wealth of information offers important opportunities for understanding many biological processes and developing new plant and animal models, and ultimately drugs, for human diseases, in addition to other applications of modern biotechnology. Unfortunately, sequences are accumulating at a pace that strains present methods for extracting significant biological information from them. A consequence of this explosion in the sequence databases is that there is much interest and effort in developing tools that can efficiently and automatically extract the relevant biological information in sequence data and make it available for use in biology and medicine. In this chapter, we describe one such method that we have developed based on algorithms from artificial intelligence research. We call this software tool MEME (Multiple Expectation-maximization for Motif Elicitation). It has the attractive property that it is an “unsupervised” discovery tool: it can identify motifs, such as regulatory sites in DNA and functional domains in proteins, from large or small groups of unaligned sequences. As we show below, motifs are a rich source of information about a dataset; they can be used to discover other homologs in a database, to identify protein subsets that contain one or more motifs, and to provide information for mutagenesis studies to elucidate structure and function in the protein family as well as its evolution. Learning tools are used to extract higher level biological patterns from lower level DNA and protein sequence data. In contrast, search tools such as BLAST (Basic Local Alignment Search Tool) take a given higher level pattern and find all items in a database that possess the pattern. 
Searching for items that have a certain pattern is a problem intrinsically easier than discovering what the pattern is from items that possess it. The patterns considered here are motifs, which for DNA data can be subsequences that interact with transcription factors, polymerases, and other proteins.


1998 ◽  
Vol 54 (6) ◽  
pp. 1139-1146 ◽  
Author(s):  
Geoffrey J. Barton

The basic algorithms for alignment of two or more protein sequences are explained. Alternative methods for scoring substitutions and gaps (insertions and deletions) are described, as are global and local alignment methods. Multiple alignment techniques are explained, including methods for profile comparison. A summary is given of programs for the alignment and analysis of protein sequences, either from sequence alone, or from three-dimensional structure.
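As an illustration of the local alignment methods mentioned above, here is a minimal sketch of Smith-Waterman scoring by dynamic programming. It assumes a simple match/mismatch score and a linear gap penalty; the default parameter values are hypothetical, and practical protein alignment would use a substitution matrix and affine gap penalties instead.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols      # column 0 stays 0 (local alignment boundary)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap         # gap in b
            left = curr[j - 1] + gap   # gap in a
            # Clamping at 0 lets an alignment restart anywhere; this is
            # what makes the alignment local rather than global.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core "ACDE" scores 4 matches * 2 = 8; the flanks do not help.
print(smith_waterman_score("WWACDEYY", "KKACDEFF"))  # prints 8
```

Keeping only two rows reduces memory to O(len(b)), at the cost of not being able to recover the alignment itself; a full matrix with traceback would be needed for that.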


2007 ◽  
Vol 05 (03) ◽  
pp. 717-738 ◽  
Author(s):  
ELEAZAR ESKIN ◽  
SAGI SNIR

Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.
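To make the idea of embedding sequences into a Euclidean space concrete, the sketch below uses a plain k-mer count (spectrum) embedding. This is a much simpler, swapped-in technique and is not the homology kernel described above; the names `spectrum_embedding` and `cosine_similarity` and the toy sequences are assumptions made for illustration.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def spectrum_embedding(seq: str, k: int = 2) -> list[float]:
    """Map a protein sequence to a vector of k-mer counts (length 20**k)."""
    index = {"".join(p): i
             for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = [0.0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:      # skip k-mers with non-standard residues
            vec[index[kmer]] += 1.0
    return vec


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedded points."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy sequences for illustration only.
x = spectrum_embedding("ACDEACDE")
y = spectrum_embedding("ACDEGHIK")
print(len(x), cosine_similarity(x, y))
```

Once sequences are fixed-length vectors like these, standard vector-space learners can be applied to them directly, which is the general motivation the abstract gives for sequence embeddings.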


2017 ◽  
Author(s):  
Christophe Menichelli ◽  
Olivier Gascuel ◽  
Laurent Bréhélin

Abstract
Motivation: Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is then to integrate additional contextual information into the procedure.
Results: Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed a 16% increase in the number of significant BLAST hits and a 28% increase in the proteome area that can be covered with a domain. Our method identified 2473 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to standard Pfam domains.
Availability: Software implementing the proposed approach and the Supplementary Data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence


2014 ◽  
Vol 16 (3) ◽  
pp. 56-61
Author(s):  
Manhal Elfadil Eltayeeb Elnour ◽  
Muhammad Shafie Abd Latif ◽  
Ismail Fauzi Isnin

2021 ◽  
Author(s):  
Sajithra Nakshathram ◽  
Ramyachitra Duraisamy ◽  
Manikandan Pandurangan

Abstract
Background: Protein Remote Homology Detection (PRHD) is used to find homologous proteins that are similar in function and structure but share low sequence identity. In general, the Sequence-Order Frequency Matrix (SOFM) was used for protein remote homology detection. In the SOFM Top-n-gram (SOFM-Top) algorithm, the probability of substrings was calculated based on the highest probability value of substrings. Moreover, the SOFM-Smith Waterman (SOFM-SW) algorithm combines the SOFM with local alignment for protein remote homology detection. However, the computational complexity of SOFM-based PRHD is high since it processes all protein sequences in the SOFM.
Objective: The Sequence-Order Frequency Matrix - Sampling and Machine learning with Smith-Waterman (SOFM-SMSW) algorithm is proposed for predicting protein remote homology. The SOFM-SMSW algorithm uses the PVS method to select the optimum target sequences based on the uniform distribution measure.
Method: This research work considers the most important sequences for PRHD by introducing Proportional Volume Sampling (PVS). After sampling the protein sequences, a feature vector is constructed and labeling is performed based on the concatenation between two protein sequences. Then, a substitution score which represents the structural alignment is learned using k-Nearest Neighbor (k-NN). Based on the learned substitution score and alignment score, the protein homology is detected using the Smith-Waterman algorithm and a Support Vector Machine (SVM). By selecting the most important sequences, the accuracy of PRHD is improved, and the computational complexity of PRHD is reduced by using structural alignment along with the local alignment.
Results: The performance of the proposed SOFM-SMSW algorithm is tested with the SCOP database and compared with various existing algorithms such as SVM Top-N-gram, SVM pairwise, GPkernal, Long Short-Term Memory (LSTM), SOFM Top-N-gram and SOFM-SW.
Conclusion: The experimental results illustrate that the proposed SOFM-SMSW algorithm has better accuracy, precision, recall, ROC and ROC 50 for PRHD than the other existing algorithms.


Author(s):  
M. Vidyasagar

This book explores important aspects of Markov and hidden Markov processes and the applications of these ideas to various problems in computational biology. It starts from first principles, so that no previous knowledge of probability is necessary. However, the work is rigorous and mathematical, making it useful to engineers and mathematicians, even those not interested in biological applications. A range of exercises is provided, including drills to familiarize the reader with concepts and more advanced problems that require deep thinking about the theory. Biological applications are taken from post-genomic biology, especially genomics and proteomics. The topics examined include standard material such as the Perron–Frobenius theorem, transient and recurrent states, hitting probabilities and hitting times, maximum likelihood estimation, the Viterbi algorithm, and the Baum–Welch algorithm. The book contains discussions of extremely useful topics not usually seen at the basic level, such as ergodicity of Markov processes, Markov Chain Monte Carlo (MCMC), information theory, and large deviation theory for both i.i.d. and Markov processes. It also presents state-of-the-art realization theory for hidden Markov models. Among biological applications, it offers an in-depth look at the BLAST (Basic Local Alignment Search Tool) algorithm, including a comprehensive explanation of the underlying theory. Other applications such as profile hidden Markov models are also explored.

