Protein multiple alignments: sequence-based versus structure-based programs

Mathilde Carpentier; Jacques Chomilier

doi:10.1093/bioinformatics/btz236

Protein multiple alignments: sequence-based versus structure-based programs

Bioinformatics ◽

10.1093/bioinformatics/btz236 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3970-3980 ◽

Cited By ~ 6

Author(s):

Mathilde Carpentier ◽

Jacques Chomilier

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Added Value ◽

Supplementary Information ◽

Supplementary Data ◽

Sequence Structure ◽

Multiple Sequence ◽

Sequence Identity ◽

Multiple Alignments ◽

Low Levels

Abstract. Motivation: Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature, but alignment programs based on structure, or on both sequence and structure, have not. In the present article we evaluate the added value provided by considering structures. Results: We compared the multiple alignments resulting from 25 programs, based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs, whatever the database. Two programs lead, MAMMOTH and MATRAS; nevertheless, the performances of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP are better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, or for residues that are buried or in regular secondary structures. Concerning gap management, sequence-based programs set fewer gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs. Availability and implementation: All data and results presented in this study are available at http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/. Supplementary information: Supplementary data are available at Bioinformatics online.
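Comparing a computed alignment to a reference alignment can be illustrated with a sum-of-pairs recall. The abstract does not say which score the authors used, so the metric and the function names below (`aligned_pairs`, `sp_recall`) are assumptions chosen for illustration, not the study's method.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    Each residue is identified as (sequence index, residue index), so the
    result does not depend on where the gaps sit in other columns.
    """
    pairs = set()
    counters = [0] * len(alignment)  # next residue index for each sequence
    for column in zip(*alignment):
        present = []
        for seq_index, char in enumerate(column):
            if char != "-":
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_recall(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


reference = ["AC-GT", "ACGGT"]
candidate = ["ACG-T", "ACGGT"]
print(sp_recall(candidate, reference))  # 0.75: three of four reference pairs recovered
```

A score of 1.0 means the test alignment reproduces every aligned residue pair of the reference; moving a single gap, as in `candidate`, lowers the score in proportion to the pairs that are lost.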


Protein Multiple Alignments: Sequence-based vs Structure-based Programs

10.1101/413369 ◽

2018 ◽

Author(s):

Mathilde Carpentier ◽

Jacques Chomilier

Keyword(s):

Multiple Sequence Alignment ◽

Added Value ◽

Sequence Structure ◽

Multiple Sequence ◽

Sequence Identity ◽

Multiple Alignments ◽

Reference Databases ◽

Low Levels ◽

Existing Data ◽

Proteins Classification

Abstract. Facing the huge increase of information about proteins, classification has become a compulsory task, essential for assigning a function to a given sequence by comparison with existing data. Multiple sequence alignment programs have proved to be very useful and have already been evaluated. In this paper we evaluate the added value provided by taking structures into account. We compared the multiple alignments resulting from 24 programs, based on sequence, structure, or both, to reference alignments deposited in five databases. The reference databases can themselves be split into two groups: those built more automatically and those built more manually. Scores were attributed to each program. As a global rule of thumb, five groups of methods emerge, led by two of the structure-based programs. This advantage increases at low levels of sequence identity among the aligned proteins, or for residues that are buried or in regular secondary structures. Concerning gap management, sequence-based programs place fewer gaps than structure-based programs. Concerning the databases, the alignments from the manually built databases are the more challenging for the programs.


Sequoya: multiobjective multiple sequence alignment in Python

Bioinformatics ◽

10.1093/bioinformatics/btaa257 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3892-3893

Author(s):

Antonio Benítez-Hidalgo ◽

Antonio J Nebro ◽

José F Aldana-Montes

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Software Tool ◽

Computing System ◽

Supplementary Information ◽

Optimization Approach ◽

Multiple Sequence ◽

Graphical Tool ◽

Optimal Alignments ◽

Python Programming

Abstract. Motivation: Multiple sequence alignment (MSA) consists of finding the optimal alignment of three or more biological sequences to identify highly conserved regions that may be the result of similarities and relationships between the sequences. MSA is an optimization problem with NP-hard complexity (non-deterministic polynomial-time hardness), because the time needed to find optimal alignments rises exponentially with the number of sequences and their length. Furthermore, the problem becomes multiobjective when more than one score is considered to assess the quality of an alignment, such as maximizing the percentage of totally conserved columns and minimizing the number of gaps. Our motivation is to provide a Python tool for solving MSA problems using evolutionary algorithms, a nonexact stochastic optimization approach that has proven effective for solving multiobjective problems. Results: The software tool we have developed, called Sequoya, is written in the Python programming language, which offers a broad set of libraries for data analysis, visualization and parallelism. Thus, Sequoya offers a graphical tool to visualize the progress of the optimization in real time, the ability to guide the search toward a preferred region at run time, parallel support to distribute the computation among nodes in a distributed computing system, and a graphical component to assist in the analysis of the solutions found at the end of the optimization. Availability and implementation: Sequoya can be freely obtained from the Python Package Index (pip) or, alternatively, it can be downloaded from GitHub at https://github.com/benhid/Sequoya. Supplementary information: Supplementary data are available at Bioinformatics online.
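The two example objectives named in the abstract, the percentage of totally conserved columns and the number of gaps, can be sketched in a few lines. This is a minimal illustration in plain Python and does not use or reproduce the Sequoya API; the function names and the Pareto-dominance helper are assumptions introduced here.

```python
def totally_conserved_percentage(alignment):
    """Percentage of columns in which every row holds the same non-gap residue."""
    n_columns = len(alignment[0])
    conserved = sum(
        1
        for column in zip(*alignment)
        if column[0] != "-" and all(char == column[0] for char in column)
    )
    return 100.0 * conserved / n_columns


def gap_count(alignment):
    """Total number of gap characters over all rows."""
    return sum(row.count("-") for row in alignment)


def objectives(alignment):
    """Both objectives expressed as values to maximize (gaps are negated)."""
    return (totally_conserved_percentage(alignment), -gap_count(alignment))


def dominates(a, b):
    """Pareto dominance: a is no worse than b everywhere and better somewhere."""
    fa, fb = objectives(a), objectives(b)
    no_worse = all(x >= y for x, y in zip(fa, fb))
    better = any(x > y for x, y in zip(fa, fb))
    return no_worse and better


compact = ["AC-GT", "ACGGT", "AC-GT"]
padded = ["AC-G-T", "ACGG-T", "AC-G-T"]
print(objectives(compact))         # (80.0, -2)
print(dominates(compact, padded))  # True
```

A multiobjective optimizer keeps the alignments that no other candidate dominates, which is why the problem yields a set of trade-off solutions rather than a single best alignment.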


MULTIPLE SEQUENCE ALIGNMENT USING AN EXHAUSTIVE AND GREEDY ALGORITHM

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972000500103x ◽

2005 ◽

Vol 03 (02) ◽

pp. 243-255 ◽

Cited By ~ 1

Author(s):

YI WANG ◽

KUO-BIN LI

Keyword(s):

Greedy Algorithm ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Alignment ◽

Initial Alignment ◽

Progressive Alignment ◽

Multiple Sequence ◽

Java Programming ◽

Multiple Alignments ◽

Objective Score

We describe an exhaustive and greedy algorithm for improving the accuracy of multiple sequence alignment. A simple progressive alignment approach is employed to provide initial alignments. The initial alignment is then iteratively optimized against an objective function. For any working alignment, the optimization involves three operations: insertions, deletions and shuffles of gaps. The optimization is exhaustive since the algorithm applies the above operations to all eligible positions of an alignment. It is also greedy since only the operation that gives the best improvement in the objective score is accepted. The algorithms have been implemented in the EGMA (Exhaustive and Greedy Multiple Alignment) package using the Java programming language, and have been evaluated using the BAliBASE benchmark alignment database. Although EGMA is not guaranteed to produce a globally optimal alignment, the tests indicate that EGMA consistently builds high-quality alignments compared with other commonly used iterative and non-iterative alignment programs. It is also useful for refining multiple alignments obtained by other methods.
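The exhaustive-and-greedy loop described above can be sketched as follows. This is not the EGMA implementation: it is written in Python rather than Java, it applies only the gap-shuffle operation (gap insertions and deletions are omitted), and the sum-of-pairs objective with its match, mismatch and gap values is an assumption, since the abstract does not specify the objective function.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score (an assumed objective); gap-gap pairs score zero."""
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score


def gap_shuffles(alignment):
    """Yield every alignment obtained by swapping one gap with a neighboring residue."""
    for r, row in enumerate(alignment):
        for i in range(len(row) - 1):
            left, right = row[i], row[i + 1]
            if (left == "-") != (right == "-"):  # exactly one of the two is a gap
                new_row = row[:i] + right + left + row[i + 2:]
                yield alignment[:r] + [new_row] + alignment[r + 1:]


def greedy_refine(alignment):
    """Try every shuffle (exhaustive), accept only the best improving one (greedy)."""
    current = list(alignment)
    best = sp_score(current)
    while True:
        candidates = [(sp_score(c), c) for c in gap_shuffles(current)]
        if not candidates:
            return current, best
        top_score, top = max(candidates, key=lambda item: item[0])
        if top_score <= best:  # no operation improves the score: local optimum
            return current, best
        current, best = top, top_score


refined, score = greedy_refine(["AC-GT", "ACG-T"])
print(refined, score)
```

Because the loop stops as soon as no single operation improves the score, it ends at a local optimum, which matches the abstract's caveat that a globally optimal alignment is not guaranteed.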


A Simple Genetic Algorithm for Optimizing Multiple Sequence Alignment on the Spread of the SARS Epidemic

The Open Bioinformatics Journal ◽

10.2174/1875036201912010030 ◽

2019 ◽

Vol 12 (1) ◽

pp. 30-39

Author(s):

Siti Amiroch ◽

M. Syaiful Pradana ◽

M. Isa Irawan ◽

Imam Mukhlash

Keyword(s):

Genetic Algorithms ◽

Phylogenetic Tree ◽

Sequence Alignment ◽

Dna Sequences ◽

Multiple Sequence Alignment ◽

Network System ◽

Multiple Sequence ◽

Network Analyses ◽

Multiple Alignments ◽

Mutation Region

Background: Multiple sequence alignment is a method of finding genomic relationships between three or more sequences. In multiple alignments, there are three mutation network analyses, namely the topological network system, the mutation region network and the network system of mutation mode. In general, the three analyses show stable and unstable regions that map mutation regions. This area of mutation is described further in a phylogenetic tree, which simultaneously illustrates the path of the spread of an epidemic, here the Severe Acute Respiratory Syndrome (SARS) epidemic. The process of spreading the SARS viruses is described as the process of phylogenetic tree formation and, as a novelty of this research, the multiple alignments in the process are analyzed in detail and then optimized with genetic algorithms. Methods: The data used to form the phylogenetic tree for the spread of the SARS epidemic are 14 DNA sequences, which are then optimized using genetic algorithms. The phylogenetic tree is constructed using the neighbor-joining algorithm with a distance matrix in which the distance is the genetic distance obtained from sequence alignment using the Needleman-Wunsch algorithm. Results & Conclusion: The analysis identified 3649 stable areas and 19 unstable areas. The phylogenetic tree from the network system analysis indicated that the SARS epidemic spread from Guangzhou 16/12/02 to Zhongshan 27/12/02, then simultaneously to Guangzhou 18/02/03 and Guangzhou hospital. After that, the virus reached Metropole, Zhongshan, Hongkong, Singapore, Taiwan, Hong kong, and Hanoi, and then continued to Guangzhou 01/01/03 and Toronto at once. The results of the mutation region network system demonstrate decomposition of orthogonal mutations in the 1st order arc.
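The distance matrix that feeds neighbor joining can be illustrated with a small Needleman-Wunsch implementation. The scoring values and the distance definition used here (fraction of aligned columns that differ, counting gaps as differences) are assumptions for the sketch; the abstract does not state how the genetic distance was derived from the alignment, and the neighbor-joining step itself is not shown.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment of s and t; returns (aligned_s, aligned_t, score)."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), F[n][m]


def p_distance(a, b):
    """Assumed distance: fraction of aligned columns that differ (gaps count)."""
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(sequences):
    """Symmetric matrix of pairwise distances, usable as neighbor-joining input."""
    n = len(sequences)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b, _ = needleman_wunsch(sequences[i], sequences[j])
            D[i][j] = D[j][i] = p_distance(a, b)
    return D


print(needleman_wunsch("ACGT", "AGT"))  # ('ACGT', 'A-GT', 2)
print(distance_matrix(["ACGT", "AGT", "ACGT"]))
```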


DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins

Bioinformatics ◽

10.1093/bioinformatics/btz863 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2105-2112 ◽

Cited By ~ 14

Author(s):

Chengxin Zhang ◽

Wei Zheng ◽

S M Mortuza ◽

Yang Li ◽

Yang Zhang

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Secondary Structure Prediction ◽

Supplementary Information ◽

Structure Identification ◽

Whole Genome ◽

Multiple Sequence ◽

Contact Prediction ◽

Homologous Sequences

Abstract. Motivation: The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results: We developed DeepMSA, a new open-source method for sensitive MSA construction, in which homologous sequences and alignments are created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which increased the accuracy of long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. All these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. Availability and implementation: https://zhanglab.ccmb.med.umich.edu/DeepMSA/. Supplementary information: Supplementary data are available at Bioinformatics online.


ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa743 ◽

2020 ◽

Author(s):

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Genomic Sequence ◽

Sequence Data ◽

Supplementary Information ◽

Software Project ◽

Multiple Sequence ◽

Viral Genomes ◽

Algorithmic Techniques ◽

User Friendly

Abstract. Motivation: In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences. Results: ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome. Availability and implementation: ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Supplementary information: Supplementary data are available at Bioinformatics online.
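Why a reference-guided alignment scales linearly and omits insertions can be shown with a small sketch. It assumes each genome has already been aligned pairwise to the reference by some external tool; the function names are hypothetical and this is not ViralMSA's code or interface.

```python
def project_onto_reference(ref_aligned, query_aligned):
    """Keep only the columns where the reference has a residue.

    Columns where the reference shows a gap are insertions in the query
    relative to the reference, and they are dropped.
    """
    return "".join(q for r, q in zip(ref_aligned, query_aligned) if r != "-")


def reference_guided_msa(pairwise_alignments):
    """Build MSA rows from independent (reference, query) pairwise alignments.

    Every row ends up with the length of the ungapped reference, so the rows
    stack into a rectangular alignment, and each genome is processed once.
    """
    return [project_onto_reference(ref, query) for ref, query in pairwise_alignments]


pairs = [
    ("ACG-T", "A-GTT"),  # query has a deletion and an insertion vs. the reference
    ("ACGT", "ACCT"),    # query has a single substitution
]
print(reference_guided_msa(pairs))  # ['A-GT', 'ACCT']
```

Each genome is handled independently of the others, which is where the linear scaling comes from; the cost is that any sequence inserted relative to the reference never appears in the final alignment.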


MAGUS: Multiple sequence Alignment using Graph clUStering

Bioinformatics ◽

10.1093/bioinformatics/btaa992 ◽

2020 ◽

Author(s):

Vladimir Smirnov ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Graph Clustering ◽

Divide And Conquer ◽

Supplementary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Full Dataset ◽

A New Technique

Abstract. Motivation: The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g. MAFFT), and then merged together into an alignment on the full dataset. Results: We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but then merges the subset alignments using the Graph Clustering Merger, a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS produces improved accuracy and is faster than PASTA on large datasets, and matches it on smaller datasets. Availability and implementation: MAGUS: https://github.com/vlasmirnov/MAGUS. Supplementary information: Supplementary data are available at Bioinformatics online.


Multiple Sequence Alignment and Profile Analysis of Protein Family Using Hidden Markov Model

International Journal of Scientific Research ◽

10.15373/22778179/june2013/66 ◽

2012 ◽

Vol 2 (6) ◽

pp. 208-211

Author(s):

Navjot Kaur ◽

Rajbir Singh Cheema ◽

Harmandeep Singh

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Profile Analysis ◽

Hidden Markov ◽

Protein Family ◽

Multiple Sequence


Faculty Opinions recommendation of MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.731078852.793536612 ◽

2017 ◽

Author(s):

Feng Gao

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Online Service ◽

Multiple Sequence


Computational Analysis of Therapeutic Enzyme Uricase from Different Source Organisms

Current Proteomics ◽

10.2174/1570164616666190617165107 ◽

2020 ◽

Vol 17 (1) ◽

pp. 59-77

Author(s):

Anand Kumar Nelapati ◽

JagadeeshBabu PonnanEttiyappan

Keyword(s):

Uric Acid ◽

Amino Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Protein Sequences ◽

Amino Acid Sequences ◽

Amino Acid Residues ◽

Multiple Sequence ◽

Physiochemical Properties ◽

Pharmaceutical Industries

Background: Hyperuricemia and gout are conditions that result from the accumulation of uric acid in the blood and urine. Uric acid is the product of the purine metabolic pathway in humans. Uricase is a therapeutic enzyme that reduces the concentration of uric acid in serum and urine by converting it enzymatically into the more soluble allantoin. Uricases are widely available from several sources such as bacteria, fungi, yeast, plants and animals. Objective: The present study aims to elucidate the structure and physicochemical properties of uricase by in silico analysis. Methods: A total of sixty amino acid sequences of uricase from different sources were obtained from NCBI, and different analyses, namely Multiple Sequence Alignment (MSA), homology search, phylogenetic relation, motif search, domain architecture and physicochemical properties including pI, EC, Ai and Ii, were performed. Results: Multiple sequence alignment of all the selected protein sequences exhibited distinct differences between bacterial, fungal, plant and animal sources, based on the position-specific existence of conserved amino acid residues. The maximum homology of all the selected protein sequences is between 51-388. Within single categories, homology is between 16-337 for bacterial uricase, 14-339 for fungal uricase, 12-317 for plant uricase, and 37-361 for animal uricase. The phylogenetic tree constructed from the amino acid sequences disclosed clusters indicating that uricase comes from different sources. The physicochemical features revealed that the uricase sequences contain between 300 and 338 amino acid residues, with a molecular weight of 33-39 kDa and a theoretical pI ranging from 4.95 to 8.88. The amino acid composition results showed that valine has a high average frequency of 8.79 percent compared to the other amino acids in all analyzed species. Conclusion: In the field of bioinformatics, this work might be informative and a stepping-stone for other researchers to get an idea of the physicochemical features, evolutionary history and structural motifs of uricase, which can be widely used in the biotechnological and pharmaceutical industries. Therefore, the proposed in silico analysis can be considered for protein engineering work, as well as for gout therapy.
