SAliBASE: A Database of Simulated Protein Alignments

2019, Vol 15, pp. 117693431882108
Author(s): Muhammad Tariq Pervez, Hayat Ali Shah, Masroor Ellahi Babar, Nasir Naveed, Muhammad Shoaib

Simulated alignments are an alternative to manually constructed multiple sequence alignments for evaluating the performance of multiple sequence alignment tools. Simulated sequences are valuable because their true evolutionary history is known, which is very helpful for reconstructing accurate phylogenetic trees and alignments. However, generating simulated alignments requires expertise with bioinformatics tools and takes several hours to produce even a few hundred simulated sequences. This becomes tedious for an end user who needs only a few datasets covering a variety of simulated sequences. Currently, no databank is available from which researchers can download simulated sequences or alignments for their studies. The major focus of our study was to develop a database of simulated protein sequences (SAliBASE) generated under varying parameters such as insertion rate, deletion rate, sequence length, number of sequences, and indel size. Each dataset has a corresponding alignment as well. This repository is very useful for evaluating multiple alignment methods.
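As a rough illustration of why a simulated dataset comes with a known true alignment, the following minimal sketch evolves one descendant from a random ancestral protein and records the alignment as it goes. It is not the procedure used to build SAliBASE: the function names, the per-site event model, and the uniform indel-size distribution are all assumptions made for this example, and substitutions are left out entirely.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_protein(length, rng):
    """Draw a random ancestral sequence (hypothetical helper)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def evolve_pair(ancestor, ins_rate, del_rate, max_indel, rng):
    """Apply insertions and deletions along the ancestor.

    Returns the true pairwise alignment as two equal-length rows,
    with '-' marking gaps. Toy model: at each ancestral site at most
    one event starts, and indel sizes are uniform on 1..max_indel.
    """
    anc_row, desc_row = [], []
    i = 0
    while i < len(ancestor):
        r = rng.random()
        if r < ins_rate:
            # Insertion: new residues exist only in the descendant.
            for _ in range(rng.randint(1, max_indel)):
                anc_row.append("-")
                desc_row.append(rng.choice(AMINO_ACIDS))
        elif r < ins_rate + del_rate:
            # Deletion: ancestral residues are missing in the descendant.
            size = min(rng.randint(1, max_indel), len(ancestor) - i)
            for k in range(size):
                anc_row.append(ancestor[i + k])
                desc_row.append("-")
            i += size
            continue
        # Site is inherited unchanged (no substitution model here).
        anc_row.append(ancestor[i])
        desc_row.append(ancestor[i])
        i += 1
    return "".join(anc_row), "".join(desc_row)


if __name__ == "__main__":
    rng = random.Random(42)
    ancestor = random_protein(60, rng)
    top, bottom = evolve_pair(ancestor, ins_rate=0.05, del_rate=0.05,
                              max_indel=3, rng=rng)
    print(top)
    print(bottom)
```

Because every gap is placed by the simulator itself, the two returned rows are the reference alignment against which an alignment tool's output could be scored.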

2020, Vol 21 (2), pp. 513
Author(s): Marzia Tindara Venuto, Mathieu Decloquement, Joan Martorell Ribera, Maxence Noel, Alexander Rebl, ...

We identified and analyzed α2,8-sialyltransferase sequences among 71 ray-finned fish species to provide the first comprehensive view of the Teleost ST8Sia repertoire. This repertoire expanded over the course of Vertebrate evolution and was primarily shaped by the whole genome events R1 and R2, but not by the Teleost-specific R3. We showed that duplicated st8sia genes like st8sia7, st8sia8, and st8sia9 have disappeared from Tetrapods, whereas their orthologues were maintained in Teleosts. Furthermore, several fish species-specific genome duplications account for the presence of multiple poly-α2,8-sialyltransferases in the Salmonidae (ST8Sia II-r1 and ST8Sia II-r2) and in Cyprinus carpio (ST8Sia IV-r1 and ST8Sia IV-r2). Paralogy and synteny analyses provided more relevant and solid information that enabled us to reconstruct the evolutionary history of st8sia genes in fish genomes. Our data also indicated that, while the mammalian ST8Sia family comprises six subfamilies forming di-, oligo-, or polymers of α2,8-linked sialic acids, the fish ST8Sia family, amounting to a total of 10 genes in fish, appears to be much more diverse and shows a patchy distribution among fish species. A focus on Salmonidae showed that (i) the two copies of st8sia2 genes have overall contrasting tissue-specific expression, with noticeable changes when compared with the human co-orthologue, and that (ii) st8sia4 is weakly expressed. Multiple sequence alignments enabled us to detect changes in the conserved polysialyltransferase domain (PSTD) of the fish sequences that could account for variable enzymatic activities. These data provide the basis for further functional studies using recombinant enzymes.


2019
Author(s): Anton Suvorov, Joshua Hochuli, Daniel R. Schrider

Abstract Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. Here we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. While numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.
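To make the quartet setting concrete, the sketch below shows one plausible way to turn a four-sequence alignment into numeric input and to enumerate the three unrooted topologies a classifier would choose among. The abstract does not describe the network's input encoding, so the integer coding, the nucleotide-plus-gap alphabet, and both function names are assumptions made purely for illustration.

```python
ALPHABET = "ACGT-"  # assumed alphabet: four nucleotides plus a gap symbol


def encode_alignment(rows):
    """Map a 4-row alignment to a 4 x L matrix of integer codes.

    Raises ValueError if there are not exactly four rows or the rows
    differ in length; unknown characters raise KeyError.
    """
    if len(rows) != 4:
        raise ValueError("a quartet alignment needs exactly four rows")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")
    index = {symbol: code for code, symbol in enumerate(ALPHABET)}
    return [[index[ch] for ch in row.upper()] for row in rows]


def quartet_topologies(taxa):
    """List the three possible unrooted topologies for four taxa."""
    a, b, c, d = taxa
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


if __name__ == "__main__":
    matrix = encode_alignment(["ACG-T", "ACGGT", "A-GGT", "ACGCT"])
    print(matrix)
    print(quartet_topologies(["t1", "t2", "t3", "t4"]))
```

Treating the gap as its own symbol is one simple way to let the same encoding serve both gapped and ungapped data.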


Author(s): Jacob L Steenwyk, Thomas J Buida, Abigail L Labella, Yuanning Li, Xing-Xing Shen, ...

Abstract
Motivation: Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes, and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock, and collapsing bipartitions (internal branches) with low support.
Results: To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene-gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes; and (3) identifying lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining, and deriving biological meaning from increasingly large phylogenomic datasets.
Availability: PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPI (https://pypi.org/project/phykit/), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT).
Supplementary information: Supplementary data are available on figshare (doi: 10.6084/m9.figshare.13118600) and at Bioinformatics online.


Author(s): Bui Quang Minh, Cuong Cao Dang, Le Sy Vinh, Robert Lanfear

Abstract Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection step for handling model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.
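For readers unfamiliar with the term, a general time-reversible Q matrix is conventionally written as follows. The symbols are standard textbook notation rather than notation taken from the abstract: $r_{ij}$ denotes the symmetric exchangeability between amino acids $i$ and $j$, and $\pi_j$ the stationary frequency of amino acid $j$.

```latex
q_{ij} = r_{ij}\,\pi_j \quad (i \neq j), \qquad
r_{ij} = r_{ji}, \qquad
q_{ii} = -\sum_{j \neq i} q_{ij}
```

Symmetry of the exchangeabilities gives the reversibility (detailed balance) condition $\pi_i q_{ij} = \pi_j q_{ji}$. For 20 amino acids this form has $20 \cdot 19 / 2 = 190$ exchangeabilities and 19 free frequencies, typically with one overall scale fixed by normalization.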


2020
Author(s): Lalitha Guruprasad

Coronavirus disease 2019 (COVID-19) is a pandemic infectious disease caused by the novel Severe Acute Respiratory Syndrome coronavirus-2 (SARS CoV-2). SARS CoV-2 is transmitted more rapidly and readily than SARS CoV. Both SARS CoV and SARS CoV-2 recognize the human angiotensin converting enzyme-2 (ACE-2) receptor via their glycosylated spike proteins. We generated multiple sequence alignments and phylogenetic trees for representative spike proteins of CoV and CoV-2 from various host sources in order to analyze the specificity in SARS CoV-2 spike proteins required for causing infection in humans. Our results show that two sequence motifs in the N-terminal domain, "MESEFR" and "SYLTPG", are specific to human SARS CoV-2 and pangolin SARS CoV. In the receptor binding domain (RBD), three sequence loops, VGGNY (loop 1), YQAGSTPC (loop 2), and EGFNCY (loop 3), and a tethered disulfide bridge Cys480-Cys488 connecting loops 2 and 3 are structural determinants for the recognition of the human ACE-2 receptor. The complete genome analysis of representative SARS CoVs from bat, civet, pangolin, and human host sources and of human SARS CoV-2 identified the bat genome (GenBank code: MN996532.1) and the pangolin SARS CoV genomes as closest to the recent novel human SARS CoV-2 genomes. The bat CoV genomes (GenBank codes: MG772933 and MG772934) are evolutionary intermediates in the mutagenesis progression towards becoming human SARS CoV-2.


2019, Vol 5
Author(s): Alexis Criscuolo

This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
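The two-step idea described above (a pairwise dissimilarity, then a transformation into an estimated number of substitutions per site) can be sketched as follows. The abstract does not name the dissimilarity or the transformation, so this example swaps in a well-known choice, exact k-mer Jaccard similarity with the Mash-style correction $d = -\frac{1}{k}\ln\frac{2j}{1+j}$, purely as an assumption. The function names are hypothetical, and real tools work on sketches of large genomes rather than full k-mer sets.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k):
    """Jaccard similarity between the k-mer sets of two sequences."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        return 1.0  # both sequences shorter than k: treat as identical
    return len(set_a & set_b) / len(union)


def mash_like_distance(seq_a, seq_b, k):
    """Transform Jaccard similarity into an evolutionary distance estimate.

    Uses d = -(1/k) * ln(2j / (1 + j)); returns infinity when the two
    sequences share no k-mers.
    """
    j = jaccard(seq_a, seq_b, k)
    if j == 0.0:
        return math.inf
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    a = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTAGCTAGCTAG"
    b = "ACGTACGTTAGCTAGCTAGGATCAGATCGATCGTAGCTAGCTAG"
    print(mash_like_distance(a, b, k=5))
```

A matrix of such pairwise distances could then be handed to any distance-based tree-building method; which method and which support measure to use are outside the scope of this sketch.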


Author(s): Anton Suvorov, Joshua Hochuli, Daniel R Schrider

Abstract Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.

