Sequoya: multiobjective multiple sequence alignment in Python

Antonio Benítez-Hidalgo; Antonio J Nebro; José F Aldana-Montes

doi:10.1093/bioinformatics/btaa257

Bioinformatics ◽

10.1093/bioinformatics/btaa257 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3892-3893

Author(s):

Antonio Benítez-Hidalgo ◽

Antonio J Nebro ◽

José F Aldana-Montes

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Software Tool ◽

Computing System ◽

Supplementary Information ◽

Optimization Approach ◽

Multiple Sequence ◽

Graphical Tool ◽

Optimal Alignments ◽

Python Programming

Abstract

Motivation: Multiple sequence alignment (MSA) consists of finding the optimal alignment of three or more biological sequences to identify highly conserved regions that may be the result of similarities and relationships between the sequences. MSA is an optimization problem with NP-hard complexity (non-deterministic polynomial-time hardness), because the time needed to find optimal alignments rises exponentially with the number of sequences and their length. Furthermore, the problem becomes multiobjective when more than one score is used to assess the quality of an alignment, such as maximizing the percentage of totally conserved columns and minimizing the number of gaps. Our motivation is to provide a Python tool for solving MSA problems using evolutionary algorithms, a nonexact stochastic optimization approach that has proven effective at solving multiobjective problems.

Results: The software tool we have developed, called Sequoya, is written in the Python programming language, which offers a broad set of libraries for data analysis, visualization and parallelism. Thus, Sequoya offers a graphical tool to visualize the progress of the optimization in real time, the ability to guide the search toward a preferred region at run time, parallel support to distribute the computation among the nodes of a distributed computing system, and a graphical component to assist in the analysis of the solutions found at the end of the optimization.

Availability and implementation: Sequoya can be freely obtained from the Python Package Index (PyPI) using pip or, alternatively, it can be downloaded from GitHub at https://github.com/benhid/Sequoya.

Supplementary information: Supplementary data are available at Bioinformatics online.
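The two example scores named in the abstract above can be made concrete with a small sketch. This is not Sequoya's API: the function names (`totally_conserved_percentage`, `gap_count`, `dominates`, `non_dominated`) and the representation of an alignment as a list of equal-length strings with `-` as the gap symbol are assumptions made for illustration only. The sketch scores candidate alignments on both objectives and keeps the candidates that no other candidate beats on both at once (Pareto dominance), which is the comparison a multiobjective optimizer typically relies on.

```python
from typing import List, Sequence, Tuple

GAP = "-"  # assumed gap symbol

# (percentage of totally conserved columns, number of gaps)
Objectives = Tuple[float, int]


def check_rectangular(alignment: Sequence[str]) -> int:
    """Return the alignment length, or raise if the rows differ in length."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    return length


def totally_conserved_percentage(alignment: Sequence[str]) -> float:
    """Percentage of columns where every row holds the same non-gap symbol."""
    length = check_rectangular(alignment)
    if length == 0:
        return 0.0
    conserved = sum(
        1
        for column in zip(*alignment)
        if column[0] != GAP and len(set(column)) == 1
    )
    return 100.0 * conserved / length


def gap_count(alignment: Sequence[str]) -> int:
    """Total number of gap symbols in the alignment."""
    check_rectangular(alignment)
    return sum(row.count(GAP) for row in alignment)


def evaluate(alignment: Sequence[str]) -> Objectives:
    """Score an alignment on both objectives."""
    return totally_conserved_percentage(alignment), gap_count(alignment)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on both objectives and better on one.

    The first objective is maximized, the second is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def non_dominated(candidates: Sequence[Sequence[str]]) -> List[Sequence[str]]:
    """Keep the candidate alignments that no other candidate dominates."""
    scored = [(evaluate(candidate), candidate) for candidate in candidates]
    return [
        candidate
        for score, candidate in scored
        if not any(dominates(other, score) for other, _ in scored)
    ]
```

For instance, `["ACG-T", "ACGAT", "ACG-T"]` and `["ACGT-", "ACGAT", "ACGT-"]` both contain two gaps, but the first has more totally conserved columns, so `non_dominated` keeps only the first.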

An optimization approach to multiple sequence alignment

Applied Mathematics Letters ◽

10.1016/s0893-9659(03)00083-1 ◽

2003 ◽

Vol 16 (5) ◽

pp. 785-790 ◽

Cited By ~ 2

Author(s):

F.Y. Hunt ◽

A.J. Kearsley ◽

Honghui Wan

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Optimization Approach ◽

Multiple Sequence

Protein multiple alignments: sequence-based versus structure-based programs

Bioinformatics ◽

10.1093/bioinformatics/btz236 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3970-3980 ◽

Cited By ~ 6

Author(s):

Mathilde Carpentier ◽

Jacques Chomilier

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Added Value ◽

Supplementary Information ◽

Supplementary Data ◽

Sequence Structure ◽

Multiple Sequence ◽

Sequence Identity ◽

Multiple Alignments ◽

Low Levels

Abstract

Motivation: Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature, but alignment programs based on structure, or on both sequence and structure, have not. In the present article we evaluate the added value provided by considering structures.

Results: We compared the multiple alignments resulting from 25 programs, based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs, regardless of the database. Two programs lead, MAMMOTH and MATRAS; nevertheless, the performances of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP are better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, or for residues in regular secondary structures or buried ones. Concerning gap management, sequence-based programs introduce fewer gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs.

Availability and implementation: All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/.

Supplementary information: Supplementary data are available at Bioinformatics online.

An Optimization Approach for Multiple Sequence Alignment using Divide-Conquer and Genetic Algorithm

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2021.0120458 ◽

2021 ◽

Vol 12 (4) ◽

Author(s):

Arunima Mishra ◽

Sudhir Singh ◽

Bipin Kumar

Keyword(s):

Genetic Algorithm ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Optimization Approach ◽

Multiple Sequence

Code optimization of multiple sequence alignment software tool MSA_BG on GPU-accelerated computing infrastructures

10.1063/1.5133488 ◽

2019 ◽

Author(s):

Plamenka Borovska ◽

Maria Marinova ◽

Vasil Tsanov

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Software Tool ◽

Code Optimization ◽

Multiple Sequence

DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins

Bioinformatics ◽

10.1093/bioinformatics/btz863 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2105-2112 ◽

Cited By ~ 14

Author(s):

Chengxin Zhang ◽

Wei Zheng ◽

S M Mortuza ◽

Yang Li ◽

Yang Zhang

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Secondary Structure Prediction ◽

Supplementary Information ◽

Structure Identification ◽

Whole Genome ◽

Multiple Sequence ◽

Contact Prediction ◽

Homologous Sequences

Abstract

Motivation: The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved.

Results: We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts of up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. Notably, all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library.

Availability and implementation: https://zhanglab.ccmb.med.umich.edu/DeepMSA/.

Supplementary information: Supplementary data are available at Bioinformatics online.

A novel two-level particle swarm optimization approach for efficient multiple sequence alignment

Memetic Computing ◽

10.1007/s12293-015-0157-y ◽

2015 ◽

Vol 7 (2) ◽

pp. 119-133 ◽

Cited By ~ 12

Author(s):

Soniya Lalwani ◽

Rajesh Kumar ◽

Nilama Gupta

Keyword(s):

Particle Swarm Optimization ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Particle Swarm ◽

Optimization Approach ◽

Multiple Sequence ◽

Swarm Optimization

ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa743 ◽

2020 ◽

Author(s):

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Genomic Sequence ◽

Sequence Data ◽

Supplementary Information ◽

Software Project ◽

Multiple Sequence ◽

Viral Genomes ◽

Algorithmic Techniques ◽

User Friendly

Abstract

Motivation: In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences.

Results: ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome.

Availability and implementation: ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project.

Supplementary information: Supplementary data are available at Bioinformatics online.
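The reference-guided idea described above can be illustrated with a short sketch. It is not ViralMSA's code: it assumes that each genome has already been mapped to the reference by some read mapper and that the mapping is available as a SAM-style CIGAR string plus a 0-based start position on the reference. The function names `project_onto_reference` and `reference_guided_msa` are hypothetical. Each query is written into a row as long as the reference, so all rows line up column by column; insertions relative to the reference are dropped, matching the limitation the abstract mentions. Because each sequence is handled independently, the work grows linearly with the number of sequences.

```python
import re
from typing import Iterable, List, Tuple

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def project_onto_reference(
    query: str, cigar: str, ref_start: int, ref_length: int
) -> str:
    """Place a mapped query into reference coordinates.

    Positions the query does not cover stay as gaps. Insertions and soft
    clips consume query bases that have no reference column, so they are
    dropped.
    """
    row = ["-"] * ref_length
    q = 0           # next unread position in the query
    r = ref_start   # next position on the reference
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":      # aligned bases: copy query into the row
            if r + n > ref_length or q + n > len(query):
                raise ValueError("CIGAR runs past the reference or the query")
            row[r:r + n] = query[q:q + n]
            q += n
            r += n
        elif op in "IS":     # insertion or soft clip: skip query bases
            q += n
        elif op in "DN":     # deletion or skipped region: leave gaps
            r += n
        # H and P consume neither the query nor the reference
    return "".join(row)


def reference_guided_msa(
    reference: str, mapped: Iterable[Tuple[str, str, int]]
) -> List[str]:
    """Build an alignment whose first row is the reference itself.

    `mapped` yields (query, cigar, ref_start) triples, one per sequence.
    """
    rows = [reference]
    for query, cigar, ref_start in mapped:
        rows.append(
            project_onto_reference(query, cigar, ref_start, len(reference))
        )
    return rows
```

With the reference `ACGTACGT`, the query `ACGTTACG` mapped as `4M1I3M` at position 0 becomes `ACGTACG-`: the inserted `T` has no reference column and is omitted.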

A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm

Soft Computing ◽

10.1007/s00500-020-04917-5 ◽

2020 ◽

Vol 24 (20) ◽

pp. 15871-15888

Author(s):

Biswanath Chowdhury ◽

Gautam Garai

Keyword(s):

Genetic Algorithm ◽

Objective Function ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Function Optimization ◽

Optimization Approach ◽

Multiple Sequence

Logomaker: beautiful sequence logos in Python

Bioinformatics ◽

10.1093/bioinformatics/btz921 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2272-2274 ◽

Cited By ~ 23

Author(s):

Ammar Tareen ◽

Justin B Kinney

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Source Code ◽

Protein Sequences ◽

Biological Properties ◽

Programming Environment ◽

Multiple Sequence ◽

Sequence Logos ◽

Python Programming ◽

Publication Quality

Abstract

Summary: Sequence logos are visually compelling ways of illustrating the biological properties of DNA, RNA and protein sequences, yet it is currently difficult to generate and customize such logos within the Python programming environment. Here we introduce Logomaker, a Python API for creating publication-quality sequence logos. Logomaker can produce both standard and highly customized logos from either a matrix-like array of numbers or a multiple-sequence alignment. Logos are rendered as native matplotlib objects that are easy to stylize and incorporate into multi-panel figures.

Availability and implementation: Logomaker can be installed using the pip package manager and is compatible with both Python 2.7 and Python 3.6. Documentation is provided at http://logomaker.readthedocs.io; source code is available at http://github.com/jbkinney/logomaker.
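To show how a multiple-sequence alignment relates to the "matrix-like array of numbers" mentioned above, the following sketch builds a position-by-symbol count matrix using only the standard library. It deliberately does not call Logomaker; the function name `count_matrix`, the default DNA alphabet, and the list-of-dicts layout are assumptions for illustration. A matrix of this kind (one row per alignment column, one entry per symbol) is the sort of numeric input from which a logo can be drawn.

```python
from collections import Counter
from typing import Dict, List, Sequence


def count_matrix(
    alignment: Sequence[str], alphabet: str = "ACGT"
) -> List[Dict[str, int]]:
    """Count how often each alphabet symbol occurs in each alignment column.

    Symbols outside the alphabet (for example the gap symbol "-") are
    ignored.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        matrix.append({symbol: counts.get(symbol, 0) for symbol in alphabet})
    return matrix
```

For the alignment `["ACGT", "ACGA", "AC-T"]`, the first column counts three `A`s, and the third column counts two `G`s because the gap is ignored.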

MAGUS: Multiple sequence Alignment using Graph clUStering

Bioinformatics ◽

10.1093/bioinformatics/btaa992 ◽

2020 ◽

Author(s):

Vladimir Smirnov ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Graph Clustering ◽

Divide And Conquer ◽

Supplementary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Full Dataset ◽

A New Technique

Abstract

Motivation: The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g. MAFFT), and then merged together into an alignment on the full dataset.

Results: We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but then merges the subset alignments using the Graph Clustering Merger, a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS produces improved accuracy and is faster than PASTA on large datasets, and matches it on smaller datasets.

Availability and implementation: MAGUS: https://github.com/vlasmirnov/MAGUS

Supplementary information: Supplementary data are available at Bioinformatics online.
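The divide-and-conquer control flow described above (split into disjoint subsets, align each subset with a base method, merge) can be sketched as a small skeleton. Everything here is hypothetical and only illustrates the structure: `split_disjoint` chunks sequences by name rather than using a guide tree, and `pad_align` and `pad_merge` are toy stand-ins that pad with gaps. They are not MAFFT and not the Graph Clustering Merger, and they produce valid but poor alignments. In practice the two callables would wrap a real base aligner and a real merging method.

```python
from typing import Callable, Dict, List

Sequences = Dict[str, str]   # name -> unaligned sequence
Alignment = Dict[str, str]   # name -> gapped row, all rows of equal length


def split_disjoint(sequences: Sequences, max_size: int) -> List[Sequences]:
    """Divide the dataset into disjoint subsets of at most max_size sequences.

    Assumption: simple chunking by sorted name, standing in for a
    tree-based decomposition.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    names = sorted(sequences)
    return [
        {name: sequences[name] for name in names[i:i + max_size]}
        for i in range(0, len(names), max_size)
    ]


def divide_and_conquer_msa(
    sequences: Sequences,
    max_size: int,
    align_subset: Callable[[Sequences], Alignment],
    merge: Callable[[List[Alignment]], Alignment],
) -> Alignment:
    """Align each disjoint subset, then merge the subset alignments."""
    if not sequences:
        return {}
    subsets = split_disjoint(sequences, max_size)
    sub_alignments = [align_subset(subset) for subset in subsets]
    return merge(sub_alignments)


def pad_align(subset: Sequences) -> Alignment:
    """Toy base aligner: right-pad every sequence with gaps."""
    width = max(len(seq) for seq in subset.values())
    return {name: seq.ljust(width, "-") for name, seq in subset.items()}


def pad_merge(sub_alignments: List[Alignment]) -> Alignment:
    """Toy merger: right-pad all rows to the widest subset alignment."""
    width = max(len(row) for sub in sub_alignments for row in sub.values())
    merged: Alignment = {}
    for sub in sub_alignments:
        for name, row in sub.items():
            merged[name] = row.ljust(width, "-")
    return merged
```

Calling `divide_and_conquer_msa(seqs, 2, pad_align, pad_merge)` on four sequences aligns two subsets of two sequences each and returns a single alignment in which every row has the same length.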
