Fast Algorithms for Local Similarity Queries in Two Sequences

In sequence comparison, finding local similarities between given strings is an important and well-known problem. In this work we introduce two local sequence similarity query problems and present algorithms for them. Our algorithms use a data structure that supports constant-time longest common extension queries. This data structure is built only once, in time linear in the size of the input strings. After this preprocessing step, all subsequent local similarity queries can be answered very fast. Existing algorithms take significantly more time to answer these queries.
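
A longest common extension (LCE) query asks how many characters match when reading forward from position i of one string and position j of the other. The sketch below shows one textbook way to answer such queries in constant time after preprocessing, using a suffix array, an LCP array (Kasai's algorithm) and a sparse table for range minima. It is not taken from the paper: the name `build_lce` and the separator character are illustrative assumptions. For brevity the suffix array is built by plain sorting, so preprocessing here is not linear-time, unlike the structure the abstract describes.

```python
def build_lce(s, t, sep="\x00"):
    """Preprocess s and t; return lce(i, j), the length of the longest
    common prefix of s[i:] and t[j:] (0 <= i < len(s), 0 <= j < len(t))."""
    assert sep not in s and sep not in t  # assumption: sep occurs in neither string
    text = s + sep + t
    n = len(text)

    # Suffix array by plain sorting (simple, but not linear time).
    sa = sorted(range(n), key=lambda k: text[k:])
    rank = [0] * n
    for r, k in enumerate(sa):
        rank[k] = r

    # Kasai's algorithm: lcp[r] is the common prefix length of the
    # suffixes at sa[r - 1] and sa[r]; lcp[0] is unused and stays 0.
    lcp = [0] * n
    h = 0
    for k in range(n):
        if rank[k] > 0:
            prev = sa[rank[k] - 1]
            while k + h < n and prev + h < n and text[k + h] == text[prev + h]:
                h += 1
            lcp[rank[k]] = h
            if h > 0:
                h -= 1
        else:
            h = 0

    # Sparse table: table[p][x] is the minimum of lcp[x : x + 2**p].
    table = [lcp]
    length = 1
    while 2 * length <= n:
        prev_row = table[-1]
        table.append([min(prev_row[x], prev_row[x + length])
                      for x in range(n - 2 * length + 1)])
        length *= 2

    def lce(i, j):
        a, b = rank[i], rank[len(s) + 1 + j]
        lo, hi = min(a, b) + 1, max(a, b)        # lcp entries lo..hi inclusive
        level = (hi - lo + 1).bit_length() - 1   # floor(log2(range size))
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lce


lce = build_lce("GATTACA", "ATTACK")
print(lce(1, 0))  # 5: "ATTACA" and "ATTACK" share the prefix "ATTAC"
```

Each query touches two table entries regardless of string length. The sparse table costs O(n log n) space, which a linear-size range-minimum structure would avoid.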

DNA sequence comparison based on Tabular Representation

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v4i1c.3121 ◽

2013 ◽

Vol 4 (1) ◽

pp. 172-175

Author(s):

Archana Verma ◽

Mr. R.K.Bharti ◽

Prof. R.K. Singh

Keyword(s):

DNA Sequence ◽

DNA Sequences ◽

Sequence Comparison ◽

Sequence Similarity ◽

Information Matrix ◽

Similarity Score ◽

Sequence Composition ◽

Local Sequence ◽

DNA Sequence Comparison ◽

Similarity Scores

DNA sequence comparison remains one of the critical steps in the analysis of phylogenetic relationships between species. To obtain a quantitative comparison, we devise an algorithm that uses the tabular representation of DNA sequences. The tabular representation captures the essence of the base composition and distribution of the sequence. In this contribution, we take the tabular notation for DNA sequences and compare the resulting tables to obtain a similarity/dissimilarity measure of the sequences. We have developed algorithms for comparing DNA sequences. These programs help us to search for similar segments of sequences, calculate similarity scores and identify repetitions based on local sequence similarity. There are two approaches: one finds exact similarity, and the other computes a measure of similarity. The first approach is more sensitive, but it can be used to search for DNA sequence similarities only if complete matches occur, and it can compare only exactly similar sequences. It fails as soon as a single base character mismatches, so it is not a general solution. To find the mismatches along with the matches, we suggest a second approach, which compiles an information matrix based on matches and mismatches. This approach is quite general for sequences that share a large common fragment with a small number of dissimilar base characters. It includes an additional step in the calculation of the similarity score that denotes multiple regions of similarity between sequences. For both approaches, computer programs were prepared and tested on data sets. These programs can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. In addition, these programs have been generalized to allow comparison of DNA sequences based on a variety of alternative scoring matrices.
We have been developing tools for the analysis of protein ... The method is very simple and fast, and it can be used to analyze both short and long DNA sequences. The utility of this method was tested on sequences from several species, and the results are consistent with those reported.
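
The abstract does not specify how its tables or information matrix are built, so the following is only a minimal sketch of the contrast it draws between exact matching and mismatch-tolerant scoring. It assumes a plain 0/1 match matrix and illustrative scores of +1 per match and -1 per mismatch; the function names are hypothetical.

```python
def match_matrix(a, b):
    """m[i][j] is 1 where a[i] == b[j], else 0."""
    return [[1 if x == y else 0 for y in b] for x in a]


def longest_exact_run(m):
    """Longest diagonal run of matches, as (length, end_i, end_j).

    This corresponds to the exact approach: a single mismatch ends the run.
    """
    cols = len(m[0]) if m else 0
    best = (0, -1, -1)
    prev = [0] * (cols + 1)
    for i, row in enumerate(m):
        cur = [0] * (cols + 1)
        for j, v in enumerate(row):
            if v:
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best[0]:
                    best = (cur[j + 1], i, j)
        prev = cur
    return best


def best_diagonal_score(m, match=1, mismatch=-1):
    """Best score of any ungapped diagonal segment, tolerating mismatches."""
    cols = len(m[0]) if m else 0
    best = 0
    prev = [0] * (cols + 1)
    for row in m:
        cur = [0] * (cols + 1)
        for j, v in enumerate(row):
            # Extend the diagonal segment, or restart at 0 if it went negative.
            cur[j + 1] = max(0, prev[j] + (match if v else mismatch))
            best = max(best, cur[j + 1])
        prev = cur
    return best


m = match_matrix("ACGTACGT", "ACGAACGT")
print(longest_exact_run(m)[0])   # 4: the single mismatch splits the exact match
print(best_diagonal_score(m))    # 6: seven matches minus one mismatch
```

On this pair the exact approach reports only a four-base segment, while the tolerant score recovers the full-length similarity across the mismatch.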

IGLOSS: iterative gapless local similarity search

Bioinformatics ◽

10.1093/bioinformatics/btz086 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3491-3492

Author(s):

Braslav Rabar ◽

Maja Zagorščak ◽

Strahil Ristov ◽

Martin Rosenzweig ◽

Pavle Goldstein

Keyword(s):

Parameter Estimation ◽

Similarity Search ◽

Sequence Similarity ◽

Web Server ◽

Supplementary Information ◽

Local Similarity ◽

Supplementary Data ◽

Matching Algorithm ◽

Local Sequence ◽

Sequence Patterns

Summary: Searching for local sequence patterns is one of the basic tasks in bioinformatics. Sequence patterns might have structural, functional or some other relevance, and numerous methods have been developed to detect and analyze them. These methods often depend on the wealth of information already collected. The explosion in the number of newly available sequences calls for novel methods to explore local sequence similarity. We have developed a new method for iterative motif scanning that will look for ungapped sequence patterns similar to a submitted query. Using careful parameter estimation and an adaptation of a fast string-matching algorithm, the method performs significantly better in this context than the existing software. Availability and implementation: The IGLOSS web server is available at http://compbioserv.math.hr/igloss/. Supplementary information: Supplementary data are available at Bioinformatics online.
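
To make "iterative ungapped motif scanning" concrete, here is a minimal sketch and not IGLOSS itself: its parameter estimation and fast string-matching adaptation are not described in the abstract and are not reproduced. The sketch assumes a DNA alphabet, a uniform background, a pseudocount of 1 and an arbitrary score threshold; all names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: sequences contain only these characters


def build_profile(motifs, pseudocount=1.0):
    """Column-wise log-odds scores against a uniform background."""
    width = len(motifs[0])
    background = 1.0 / len(ALPHABET)
    total = len(motifs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(width):
        counts = Counter(m[col] for m in motifs)
        profile.append({
            ch: math.log2((counts[ch] + pseudocount) / total / background)
            for ch in ALPHABET
        })
    return profile


def scan(profile, seq, threshold):
    """Start positions of ungapped windows scoring at least `threshold`."""
    width = len(profile)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(col[ch] for col, ch in zip(profile, window))
        if score >= threshold:
            hits.append(start)
    return hits


def iterative_scan(query, seq, threshold, max_rounds=10):
    """Rebuild the profile from the current hits and rescan until stable."""
    hits = scan(build_profile([query]), seq, threshold)
    for _ in range(max_rounds):
        if not hits:
            break
        motifs = [seq[h:h + len(query)] for h in hits]
        new_hits = scan(build_profile(motifs), seq, threshold)
        if new_hits == hits:
            break
        hits = new_hits
    return hits


print(iterative_scan("ACGT", "TTACGTTTACGATT", threshold=1.5))  # [2, 8]
```

The first pass finds the exact occurrence at position 2 and the one-mismatch window at position 8. Rebuilding the profile from those two windows leaves the hit set unchanged, so the loop stops.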

Suffix Tree Constructing Algorithm for Datasets with Discrete Contents

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.733.867 ◽

2015 ◽

Vol 733 ◽

pp. 867-870

Author(s):

Zhen Zhong Jin ◽

Zheng Huang ◽

Hua Zhang

Keyword(s):

Data Structure ◽

Sensor Network ◽

Association Analysis ◽

Data Structures ◽

Suffix Tree ◽

Analysis Data ◽

Large Datasets ◽

Intermediate Data ◽

Input Strings

The suffix tree is a useful data structure for indexing strings. However, on large datasets of discrete contents, most existing construction algorithms become very inefficient. Discrete datasets need to be indexed in many fields, such as record analysis, data analysis in sensor networks, and association analysis. This paper presents an algorithm, STD (Suffix Tree for Discrete contents), that performs very efficiently on discrete input datasets. It introduces several intermediate data structures for discrete strings, and it also handles the case where the discrete input strings have similar characteristics. Moreover, STD keeps the advantages of existing implementations designed for successive input strings. Experiments were conducted to evaluate its performance, and they show that the method works well.
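
For readers unfamiliar with suffix-based indexing, the sketch below builds a naive, uncompressed suffix trie over a sequence of discrete tokens and uses it for substring lookup. It is not the STD algorithm, whose intermediate data structures the abstract does not describe. It takes quadratic time and space and is meant only to illustrate what such an index answers. The function names are hypothetical.

```python
def build_suffix_trie(tokens):
    """Nested-dict trie containing every suffix of `tokens`."""
    root = {}
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node.setdefault(tok, {})
    return root


def contains(trie, pattern):
    """True if `pattern` occurs as a contiguous run in the indexed sequence."""
    node = trie
    for tok in pattern:
        if tok not in node:
            return False
        node = node[tok]
    return True


# Tokens can be any hashable values, e.g. sensor readings or record IDs.
trie = build_suffix_trie([3, 1, 4, 1, 5])
print(contains(trie, [1, 4, 1]))  # True
print(contains(trie, [4, 5]))     # False
```

A lookup walks at most one trie edge per pattern token, independent of the length of the indexed sequence. A real suffix tree obtains the same query behaviour in linear space by compressing unary paths.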

Prediction of Protein–Ligand Interaction Based on the Positional Similarity Scores Derived from Amino Acid Sequences

International Journal of Molecular Sciences ◽

10.3390/ijms21010024 ◽

2019 ◽

Vol 21 (1) ◽

pp. 24 ◽

Cited By ~ 3

Author(s):

Dmitry Karasev ◽

Boris Sobolev ◽

Alexey Lagunin ◽

Dmitry Filimonov ◽

Vladimir Poroikov

Keyword(s):

Prediction Accuracy ◽

Sequence Similarity ◽

Amino Acid Sequences ◽

Protein Families ◽

Pairwise Sequence Alignment ◽

Ligand Interaction ◽

Protein Ligand Interactions ◽

General Chemical ◽

Local Sequence ◽

Ligand Interactions

The affinity of different drug-like ligands to multiple protein targets reflects general chemical–biological interactions. Computational methods estimating such interactions analyze the available information about the structure of the targets, ligands, or both. Prediction of protein–ligand interactions based on pairwise sequence alignment provides reasonable accuracy if the ligands’ specificity coincides well with the phylogenetic taxonomy of the proteins. Methods using multiple alignment require an accurate match of functionally significant residues. Such conditions may not be met in the case of diverged protein families. To overcome these limitations, we propose an approach based on the analysis of local sequence similarity within the set of analyzed proteins. The positional scores, calculated by sequence fragment comparisons, are used as input data for the Bayesian classifier. Our approach provides a prediction accuracy comparable to or exceeding that of other methods. This was demonstrated on the popular Gold Standard test sets, which differ in sequence heterogeneity and range from a group including different protein families to more specific groups. A reasonable prediction accuracy was also found for protein kinases, which display weak relationships between sequence phylogeny and inhibitor specificity. Thus, our method can be applied to the broad area of protein–ligand interactions.

Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores

Bulletin of Mathematical Biology ◽

10.1016/s0092-8240(05)80176-4 ◽

1992 ◽

Vol 54 (1) ◽

pp. 59-75 ◽

Cited By ~ 9

Author(s):

R. Mott

Keyword(s):

Maximum Likelihood ◽

Maximum Likelihood Estimation ◽

Sequence Similarity ◽

Likelihood Estimation ◽

Statistical Distribution ◽

Local Sequence ◽

Similarity Scores

Partial recN gene sequencing: a new tool for identification and phylogeny within the genus Streptococcus

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.018176-0 ◽

2010 ◽

Vol 60 (9) ◽

pp. 2140-2148 ◽

Cited By ~ 43

Author(s):

Olga O. Glazunova ◽

Didier Raoult ◽

Véronique Roux

Keyword(s):

Genetic Diversity ◽

Sequence Comparison ◽

Gene Sequence ◽

Sequence Similarity ◽

rRNA Gene ◽

Sequence Comparisons ◽

High Genetic Diversity ◽

Repair Protein ◽

The Mean ◽

The 16S rRNA Gene

Partial sequences of the recN gene (1249 bp), which encodes a recombination and repair protein, were analysed for the phylogenetic analysis and identification of streptococci. The partial sequences presented interspecies nucleotide similarity of 56.4–98.2 % and intersubspecies similarity of 89.8–98 %. The mean DNA sequence similarity of recN gene sequences (66.6 %) was found to be lower than that of the 16S rRNA gene (94.1 %), rpoB (84.6 %), sodA (74.8 %), groEL (78.1 %) and gyrB (73.2 %). Phylogenetically derived trees revealed six statistically supported groups: Streptococcus salivarius, S. equinus, S. hyovaginalis/S. pluranimalium/S. thoraltensis, S. pyogenes, S. mutans and S. suis. The ‘mitis’ group was not supported by a significant bootstrap value, but three statistically supported subgroups were noted: Streptococcus sanguinis/S. cristatus/S. sinensis, S. anginosus/S. intermedius/S. constellatus (the ‘anginosus’ subgroup) and S. mitis/S. infantis/S. peroris/S. oralis/S. oligofermentans/S. pneumoniae/S. pseudopneumoniae. The partial recN gene sequence comparison highlighted a high percentage of divergence between Streptococcus dysgalactiae subsp. dysgalactiae and S. dysgalactiae subsp. equisimilis. This observation is confirmed by other gene sequence comparisons (groEL, gyrB, rpoB and sodA). A high percentage of similarity was found between S. intermedius and S. constellatus after sequence comparison of the recN gene. To study the genetic diversity among the ‘anginosus’ subgroup, recN, groEL, sodA, gyrB and rpoB sequences were determined for 36 clinical isolates. The results that were obtained confirmed the high genetic diversity within this group of streptococci.
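
The abstract does not state how its similarity percentages were computed. As a minimal sketch of one common convention, the code below computes percent identity between two pre-aligned sequences, skipping any column that contains a gap, and averages it over all pairs. The gap handling and the function names are assumptions; other conventions count gaps in the denominator and give different values.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percent of identical positions among gap-free aligned columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # assumption: columns with a gap are ignored
        compared += 1
        matches += (x == y)
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


def mean_pairwise_identity(seqs):
    """Mean percent identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


print(percent_identity("ACGTACGT", "ACGAACGT"))  # 87.5
print(percent_identity("AC-GT", "ACTGA"))        # 75.0
```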

Molecular identification of Austrobilharzia species parasitizing Cerithidea cingulata (Gastropoda: Potamididae) from Kuwait Bay

Journal of Helminthology ◽

10.1017/s0022149x11000733 ◽

2011 ◽

Vol 86 (4) ◽

pp. 470-478 ◽

Cited By ~ 5

Author(s):

W.Y. Al-Kandari ◽

S.A. Al-Bustan ◽

A.M. Isaac ◽

B.A. George ◽

B.S. Chandy

Keyword(s):

DNA Sequences ◽

Sequence Comparison ◽

Sequence Similarity ◽

Morphological Identification ◽

High Sequence Similarity ◽

Data Set ◽

Kuwait Bay ◽

Causative Agents ◽

Avian Schistosomes ◽

Combined Data

Avian schistosomes belonging to the genus Austrobilharzia (Digenea: Schistosomatidae) are among the causative agents of cercarial dermatitis in humans. In this paper, ribosomal and mitochondrial DNA sequences were used to study schistosome cercariae from Kuwait Bay that have been identified morphologically as Austrobilharzia sp. Sequence comparison of the ribosomal DNA (rDNA) 28S and 18S regions of the collected schistosome cercariae with corresponding sequences of other schistosomes in GenBank revealed high sequence similarity. This confirmed the morphological identification of schistosome cercariae from Kuwait Bay as belonging to the genus Austrobilharzia. The finding was further supported by the phylogenetic tree that was constructed based on the combined data set 18S-28S-mitochondrial cytochrome oxidase I (mtCO1) sequences in which Austrobilharzia sp. clustered with A. terrigalensis and A. variglandis. Sequence comparison of the Austrobilharzia sp. from Kuwait Bay with A. variglandis and A. terrigalensis based on mtCO1 showed a variation of 10% and 11%, respectively. Since the sequence variation in the mtCO1 was within the interspecific range among trematodes, it seems that the Austrobilharzia species from Kuwait Bay is different from the two species reported in GenBank, A. terrigalensis and A. variglandis.
