scholarly journals An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Sriram P. Chockalingam ◽  
Jodh Pannu ◽  
Sahar Hooshmand ◽  
Sharma V. Thankachan ◽  
Srinivas Aluru

Abstract Background Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n logkn) time and hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. Results In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy, runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.

2018 ◽  
Author(s):  
Chris-Andre Leimeister ◽  
Jendrik Schellhorn ◽  
Svenja Schöbel ◽  
Michael Gerth ◽  
Christoph Bleidorn ◽  
...  

AbstractWord-based or ‘alignment-free’ sequence comparison has become an active area of research in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Herein, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees from whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through Github:https://github.com/jschellh/ProtSpaM


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Camilla Reginatto De Pierri ◽  
Ricardo Voyceik ◽  
Letícia Graziela Costa Santos de Mattos ◽  
Mariane Gonçalves Kulik ◽  
Josué Oliveira Camargo ◽  
...  

AbstractVectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal base. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods and Sweep was 10 to 100 times quicker than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.


2021 ◽  
Author(s):  
Michael Heinzinger ◽  
Maria Littmann ◽  
Ian Sillitoe ◽  
Nicola Bordin ◽  
Christine Orengo ◽  
...  

Thanks to the recent advances in protein three-dimensional (3D) structure prediction, in particular through AlphaFold 2 and RoseTTAFold, the abundance of protein 3D information will explode over the next year(s). Expert resources based on 3D structures such as SCOP and CATH have been organizing the complex sequence-structure-function relations into a hierarchical classification schema. Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI) transferring annotations from a protein with experimentally known annotation to a query without annotation. Here, we presented a novel approach that expands the concept of HBI from a low-dimensional sequence-distance lookup to the level of a high-dimensional embedding-based annotation transfer (EAT). Secondly, we introduced a novel solution using single protein sequence representations from protein Language Models (pLMs), so called embeddings (Prose, ESM-1b, ProtBERT, and ProtT5), as input to contrastive learning, by which a new set of embeddings was created that optimized constraints captured by hierarchical classifications of protein 3D structures. These new embeddings (dubbed ProtTucker) clearly improved what was historically referred to as threading or fold recognition. Thereby, the new embeddings enabled the intrusion into the midnight zone of protein comparisons, i.e., the region in which the level of pairwise sequence similarity is akin of random relations and therefore is hard to navigate by HBI methods. Cautious benchmarking showed that ProtTucker reached much further than advanced sequence comparisons without the need to compute alignments allowing it to be orders of magnitude faster. Code is available at https://github.com/Rostlab/EAT .


2019 ◽  
Vol 5 ◽  
Author(s):  
Alexis Criscuolo

This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.


Author(s):  
Giacomo Janson ◽  
Alessandro Paiardini

Abstract Summary The PyMod project is designed to act as a fully integrated interface between the popular molecular graphics viewer PyMOL, and some of the most frequently used tools for structural bioinformatics, e.g. BLAST, HMMER, Clustal, MUSCLE, PSIPRED, DOPE and MODELLER. Here we report its latest release, PyMod 3, which has been completely renewed with a graphical interface written in PyQt, to make it compatible with the most recent PyMOL versions, and has been extended with a large set of new functionalities compared to its predecessor, i.e. PyMod 2. Starting from the amino acid sequence of a target protein, users can take advantage of PyMod 3 to carry out all the steps of the homology modeling process (i.e. template searching, target–template sequence alignment, model building and quality assessment). Additionally, the integrated tools in PyMod 3 may also be used alone, in order to extend PyMOL with a wide range of capabilities. Sequence similarity searches, multiple sequence/structure alignment building, phylogenetic trees and evolutionary conservation analyses, domain parsing, single/multiple chains and loop modeling can be performed in the PyMod 3/PyMOL environment. Availability and implementation A cross-platform PyMod 3 installer package for Windows, Linux and Mac OS X and a complete user guide with tutorials, are available at https://github.com/pymodproject/pymod Supplementary information Supplementary data are available at Bioinformatics online.


Genes ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 73 ◽  
Author(s):  
Yongyong Kang ◽  
Xiaofei Yang ◽  
Jiadong Lin ◽  
Kai Ye

Phylogenetic tree is essential to understand evolution and it is usually constructed through multiple sequence alignment, which suffers from heavy computational burdens and requires sophisticated parameter tuning. Recently, alignment free methods based on k-mer profiles or common substrings provide alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., frequent patterns overall datasets. To make further improvement, we propose an alignment free algorithm based on sequential pattern mining, where each sequence is converted into a binary representation of sequential patterns among sequences. The phylogenetic tree is further constructed via clustering distance matrix which is calculated from pattern vectors. To increase accuracy for highly divergent sequences, we consider pattern weight and filtering redundancy sub-patterns. Both simulated and real data demonstrates our method outperform other alignment free methods, especially for large sequence set with low similarity.


2018 ◽  
Author(s):  
Edgar Garriga ◽  
Paolo Di Tommaso ◽  
Cedrik Magis ◽  
Ionas Erb ◽  
Hafid Laayouni ◽  
...  

AbstractInferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem imposes the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the most similar sequences, incorporating the remaining ones following the order imposed by a guide-tree. We developed and validated on protein sequences a regressive algorithm that works the other way around, aligning first the most dissimilar sequences. Our algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. By design, it can run any existing alignment method in linear time thus allowing the scale-up required for extremely large genomic analyses.One Sentence SummaryInitiating alignments with the most dissimilar sequences allows slow and accurate methods to be used on large datasets


Sign in / Sign up

Export Citation Format

Share Document