QMaker: Fast and accurate method to estimate empirical models of protein evolution

Mapping Intimacies ◽

10.1101/2020.02.20.958819 ◽

2020 ◽

Cited By ~ 1

Author(s):

Bui Quang Minh ◽

Cuong Cao Dang ◽

Le Sy Vinh ◽

Robert Lanfear

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Search Algorithm ◽

Phylogenetic Analyses ◽

Accurate Method ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Substitution Models ◽

Protein Alignments

Abstract: Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.

QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

Systematic Biology ◽

10.1093/sysbio/syab010 ◽

2021 ◽

Author(s):

Bui Quang Minh ◽

Cuong Cao Dang ◽

Le Sy Vinh ◽

Robert Lanfear

Keyword(s):

Amino Acid ◽

Maximum Likelihood ◽

Amino Acid Substitution ◽

Search Algorithm ◽

Phylogenetic Analyses ◽

Accurate Method ◽

Sequence Alignments ◽

Multiple Sequence ◽

Data Set ◽

Substitution Models

Abstract: Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible $Q$ matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies. [Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.]
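To make the term "general time-reversible Q matrix" concrete, the sketch below builds one from symmetric exchangeabilities and stationary frequencies and checks the detailed-balance condition pi_i * q_ij = pi_j * q_ji. It is only an illustration of the standard parameterisation, not the QMaker estimation procedure; the function name `build_reversible_q` and all numeric values are hypothetical and randomly generated.

```python
import random


def build_reversible_q(exchange, freqs):
    """Build a time-reversible rate matrix with q_ij = r_ij * pi_j (i != j).

    `exchange` is a symmetric matrix of exchangeabilities r_ij and `freqs`
    holds the stationary frequencies pi (summing to 1).
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * freqs[j]
        # The diagonal is still 0 here, so this makes the row sum to zero.
        q[i][i] = -sum(q[i])
    # Rescale so that the expected number of substitutions per unit time is 1.
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


# Hypothetical inputs: random symmetric exchangeabilities and frequencies
# for 20 amino acid states (not values from any published model).
rng = random.Random(0)
n_states = 20
exchange = [[0.0] * n_states for _ in range(n_states)]
for i in range(n_states):
    for j in range(i + 1, n_states):
        exchange[i][j] = exchange[j][i] = rng.uniform(0.1, 5.0)
raw = [rng.uniform(0.5, 1.5) for _ in range(n_states)]
freqs = [x / sum(raw) for x in raw]

Q = build_reversible_q(exchange, freqs)

# Detailed balance: pi_i * q_ij should equal pi_j * q_ji for every pair.
max_imbalance = max(
    abs(freqs[i] * Q[i][j] - freqs[j] * Q[j][i])
    for i in range(n_states)
    for j in range(n_states)
)
print("largest detailed-balance violation:", max_imbalance)
```

Because the exchangeabilities are symmetric, reversibility holds by construction; a non-reversible model drops that symmetry constraint.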

nQMaker: estimating time non-reversible amino acid substitution models

10.1101/2021.10.18.464754 ◽

2021 ◽

Author(s):

Cuong Cao Dang ◽

Bui Quang Minh ◽

Hanon McShea ◽

Joanna Masel ◽

Jennifer Eleanor James ◽

...

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Phylogenetic Trees ◽

Phylogenetic Analyses ◽

Sequence Alignments ◽

Biological Reality ◽

Wide Range ◽

Substitution Models ◽

Likelihood Approach ◽

Protein Datasets

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this paper, we introduce nQMaker, a maximum likelihood approach that extends the recently published QMaker method and allows the estimation of time non-reversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the non-reversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of datasets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the dataset. Notably, for the recently published plant and bird trees, these non-reversible models correctly recovered the commonly known root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate non-reversible models and rooted phylogenies from their own protein datasets.

Morphological redescription and molecular characterization of Trichodina matsu Basson & Van As, 1994 (Ciliophora, Mobilida, Trichodinidae) infecting Tachysurus fulvidraco (Richardson, 1846) from Chongqing, China

Zootaxa ◽

10.11646/zootaxa.4995.2.6 ◽

2021 ◽

Vol 4995 (2) ◽

pp. 334-344

Author(s):

QIAN ZHOU ◽

FAHUI TANG ◽

YUANJUN ZHAO

Keyword(s):

Phylogenetic Analyses ◽

Molecular Data ◽

18S Rrna Gene ◽

Rrna Gene ◽

Similar Species ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Morphological And Molecular Data ◽

Tachysurus Fulvidraco

During a survey of parasitic ciliates in Chongqing, China, Trichodina matsu Basson & Van As, 1994 was isolated from gills of Tachysurus fulvidraco. Furthermore, the 18S rRNA gene and ITS-5.8S rRNA region of T. matsu were sequenced for the first time and applied for the species identification and comparison with similar species in the present study. Based on the morphological and molecular comparisons, the results indicate that T. matsu is an ectoparasite specific for the Siluriformes catfish. Based on the analyses of genetic distance, multiple sequence alignments, and phylogenetic analyses, no obvious differentiation within populations of T. matsu was found. In addition, the ‘Trichodina hyperparasitis’ (KX904933) in GenBank is a misidentification and appears to be conspecific with T. matsu according to the comparison of morphological and molecular data.

SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

10.1101/2020.11.24.395459 ◽

2020 ◽

Author(s):

Cory D. Dunn

Keyword(s):

Nucleic Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

Dna Barcodes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract: Phylogenetic analyses can take advantage of multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy values of alignment columns to identify outlier alignment sequences in a manner responsive to overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and COVID-19 sequences.

RELATIVE VON NEUMANN ENTROPY FOR EVALUATING AMINO ACID CONSERVATION

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001000494x ◽

2010 ◽

Vol 08 (05) ◽

pp. 809-823 ◽

Cited By ~ 13

Author(s):

FREDRIK JOHANSSON ◽

HIROYUKI TOH

Keyword(s):

Amino Acid ◽

Sequence Analysis ◽

Shannon Entropy ◽

Von Neumann Entropy ◽

Sequence Alignments ◽

Multiple Sequence ◽

Von Neumann ◽

Multiple Sequence Alignments ◽

Previous Definition ◽

Definition Of

The Shannon entropy is a common way of measuring conservation of sites in multiple sequence alignments, and has also been extended with the relative Shannon entropy to account for background frequencies. The von Neumann entropy is another extension of the Shannon entropy, adapted from quantum mechanics in order to account for amino acid similarities. However, no relative von Neumann entropy has yet been defined for sequence analysis. We introduce a new definition of the von Neumann entropy for use in sequence analysis, which we found to perform better than the previous definition. We also introduce the relative von Neumann entropy and a way of parametrizing it in order to obtain the Shannon entropy, the relative Shannon entropy and the von Neumann entropy at special parameter values. We performed an exhaustive search of this parameter space and found better predictions of catalytic sites compared to any of the previously used entropies.
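As a rough illustration of the two baseline measures mentioned above, the following sketch computes the Shannon entropy of each alignment column and a relative entropy against background frequencies. The relative entropy is written here as a Kullback-Leibler divergence, which is a common formulation but may differ in detail from the paper's definition; the toy alignment, the uniform background, and the helper names are all assumptions made for the example. The von Neumann variants are not shown.

```python
import math
from collections import Counter


def column_frequencies(column):
    """Observed residue frequencies in one alignment column, ignoring gaps."""
    counts = Counter(c for c in column if c not in "-.")
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def shannon_entropy(column):
    """Shannon entropy in bits; 0 means the column is fully conserved."""
    freqs = column_frequencies(column)
    return -sum(p * math.log2(p) for p in freqs.values())


def relative_entropy(column, background):
    """Kullback-Leibler divergence (bits) of the column from the background."""
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())


# Hypothetical toy alignment: four sequences, four columns.
alignment = ["MKVL", "MRVL", "MKIL", "MQVA"]
columns = ["".join(col) for col in zip(*alignment)]

# Assumed uniform background over the 20 standard amino acids.
background = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

entropies = [shannon_entropy(col) for col in columns]
rel_entropies = [relative_entropy(col, background) for col in columns]

for col, h, d in zip(columns, entropies, rel_entropies):
    print(f"{col}  H = {h:.3f} bits  D = {d:.3f} bits")
```

Low Shannon entropy and high relative entropy both indicate conserved columns; the relative form additionally accounts for how rare the conserved residue is under the chosen background.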

Phylogenetics of scolopendromorph centipedes: can denser taxon sampling improve an artificial classification?

Invertebrate Systematics ◽

10.1071/is13035 ◽

2013 ◽

Vol 27 (5) ◽

pp. 578 ◽

Cited By ~ 21

Author(s):

Varpu Vahtera ◽

Gregory D. Edgecombe ◽

Gonzalo Giribet

Keyword(s):

Bayesian Inference ◽

New World ◽

Phylogenetic Analyses ◽

Morphological Characters ◽

Taxon Sampling ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Artificial Classification

Previous phylogenetic analyses of the centipede order Scolopendromorpha indicated a fundamental division into blind and ocellate clades. These analyses corroborated the monophyly of most families and tribes but suggested that several species-rich, cosmopolitan genera in traditional and current classifications are polyphyletic. Denser taxon sampling is applied to a dataset of 122 morphological characters and sequences for four nuclear and mitochondrial loci. Phylogenetic analyses including 98 species and subspecies of Scolopendromorpha employ parsimony under dynamic and static homology schemes as well as maximum likelihood and Bayesian inference of multiple sequence alignments. The monotypic Australian genera Notiasemus and Kanparka nest within Cormocephalus and Scolopendra, respectively, and the New Caledonian Campylostigmus is likewise a clade within Cormocephalus. New World Scolopendra are more closely related to Hemiscolopendra and Arthrorhabdus than to Scolopendra s.s., which is instead closely allied to Asanada; the tribe Asanadini nests within Scolopendrini for molecular and combined datasets. The generic classification of Otostigmini has a poor fit to phylogenetic relationships, although nodal support within this tribe is weak. New synonymies are proposed for Ectonocryptopinae Shelley & Mercurio, 2005 (= Newportiinae Pocock, 1896), Asanadini Verhoeff, 1907 (= Scolopendrini Leach, 1814), and Kanparka Waldock & Edgecombe, 2012 (= Scolopendra Linnaeus, 1758). Scolopendrid systematics largely depicts incongruence between phylogeny and classification rather than between morphology and molecules.

Experimental Assessment of the Importance of Amino Acid Positions Identified by an Entropy-Based Correlation Analysis of Multiple-Sequence Alignments

Biochemistry ◽

10.1021/bi300747r ◽

2012 ◽

Vol 51 (28) ◽

pp. 5633-5641 ◽

Cited By ~ 11

Author(s):

Susanne Dietrich ◽

Nadine Borst ◽

Sandra Schlee ◽

Daniel Schneider ◽

Jan-Oliver Janda ◽

...

Keyword(s):

Amino Acid ◽

Correlation Analysis ◽

Sequence Alignments ◽

Multiple Sequence ◽

Experimental Assessment ◽

Multiple Sequence Alignments

Relative model selection can be sensitive to multiple sequence alignment uncertainty

10.1101/2021.08.04.455051 ◽

2021 ◽

Author(s):

Stephanie J Spielman ◽

Molly Miraglia

Keyword(s):

Model Selection ◽

Phylogenetic Analyses ◽

Phylogenetic Reconstruction ◽

Selection Procedure ◽

Evolutionary Model ◽

Ancestral State ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Nucleotide Data

Multiple sequence alignments (MSAs) represent the fundamental unit of data input to most comparative sequence analyses. In phylogenetic analyses in particular, errors in MSA construction have the potential to induce further errors in downstream analyses such as phylogenetic reconstruction itself, ancestral state reconstruction, and divergence estimation. In addition to providing phylogenetic methods with an MSA to analyze, researchers must also specify a suitable evolutionary model for the given analysis. Most commonly, researchers apply relative model selection to select a model from a candidate set and then provide both the MSA and the selected model as input to subsequent analyses. While the influence of MSA errors has been explored for most stages of phylogenetics pipelines, the potential effects of MSA uncertainty on the relative model selection procedure itself have not been explored. In this study, we assessed the consistency of relative model selection when presented with multiple perturbed versions of a given MSA. We find that while relative model selection is mostly robust to MSA uncertainty, in a substantial proportion of circumstances, relative model selection identifies distinct best-fitting models from different MSAs created from the same set of sequences. We find that this issue is more pervasive for nucleotide data compared to amino-acid data. However, we also find that it is challenging to predict whether relative model selection will be robust or sensitive to uncertainty in a given MSA. We find that MSA uncertainty can affect virtually all steps of phylogenetic analysis pipelines to a greater extent than has previously been recognized, including relative model selection.
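Relative model selection typically ranks candidate models by an information criterion computed from each model's maximised log-likelihood. The sketch below uses the standard AIC and BIC formulas to show how two versions of the same alignment could, in principle, favour different models. The log-likelihoods, parameter counts, model labels, and alignment length are invented purely for illustration and are not results from the study.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_sites):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_sites) - 2 * log_lik


def best_model(fits, n_sites, criterion="AIC"):
    """Return the name of the best-scoring model and all scores.

    `fits` maps a model name to (log-likelihood, number of free parameters).
    """
    scores = {}
    for name, (log_lik, k) in fits.items():
        if criterion == "AIC":
            scores[name] = aic(log_lik, k)
        else:
            scores[name] = bic(log_lik, k, n_sites)
    return min(scores, key=scores.get), scores


# Hypothetical fits of two candidate models to two perturbed versions
# of the same alignment (values are made up for the example).
fits_msa_a = {"simple": (-5000.0, 10), "complex": (-4985.0, 20)}
fits_msa_b = {"simple": (-5003.0, 10), "complex": (-4996.0, 20)}
n_sites = 300  # assumed alignment length

best_a, scores_a = best_model(fits_msa_a, n_sites, "AIC")
best_b, scores_b = best_model(fits_msa_b, n_sites, "AIC")
best_a_bic, _ = best_model(fits_msa_a, n_sites, "BIC")

print("MSA A, AIC:", best_a, scores_a)
print("MSA B, AIC:", best_b, scores_b)
print("MSA A, BIC:", best_a_bic)
```

In this toy case a modest change in log-likelihood between alignment versions is enough to flip the AIC ranking, and BIC's stronger penalty on parameters picks a different model again for the same alignment.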

Characterization of two novel EF-hand proteins identifies a clade of putative Ca2+-binding protein specific to the Ambulacraria

10.1101/2020.05.22.110411 ◽

2020 ◽

Cited By ~ 1

Author(s):

Arisnel Soto-Acabá ◽

Pablo A. Ortíz-Pineda ◽

José E. García-Arrarás

Keyword(s):

Calcium Binding ◽

Phylogenetic Analyses ◽

Sequence Alignments ◽

Multiple Sequence ◽

Significant Similarity ◽

Significant Differential Expression ◽

Multiple Sequence Alignments ◽

Ef Hand ◽

Protein Discovery

Abstract: In recent years, transcriptomic databases have become one of the main sources for protein discovery. In our studies of nervous system and digestive tract regeneration in echinoderms, we have identified several transcripts that have attracted our attention. One of these molecules corresponds to a previously unidentified transcript (Orpin) from the sea cucumber Holothuria glaberrima that appeared to be upregulated during intestinal regeneration. We have now identified a second highly similar sequence and analyzed the predicted proteins using bioinformatics tools. Both sequences have EF-hand motifs characteristic of calcium-binding proteins (CaBPs) and N-terminal signal peptides. Sequence comparison analyses such as multiple sequence alignments and phylogenetic analyses only showed significant similarity to sequences from other echinoderms or from hemichordates. Semi-quantitative RT-PCR analyses revealed that transcripts from these sequences are expressed in various tissues including muscle, haemal system, gonads, and mesentery. However, contrary to previous reports, there was no significant differential expression in regenerating tissues. Nonetheless, the identification of unique features in the predicted proteins suggests that these might comprise a novel subfamily of EF-hand containing proteins specific to the Ambulacraria clade.

Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments

Molecules ◽

10.3390/molecules24010104 ◽

2018 ◽

Vol 24 (1) ◽

pp. 104

Author(s):

Patrice Koehl ◽

Henri Orland ◽

Marc Delarue

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Principal Components ◽

Gaussian Model ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Substitution Matrices ◽

Multivariate Gaussian ◽

Multivariate Gaussian Model

Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple 20-dimensional encoding with each amino acid represented by a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is lower than that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
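The 20-dimensional binary encoding described above is what is usually called a one-hot encoding. A minimal sketch follows, assuming an alphabetical ordering of the one-letter amino acid codes and mapping gaps or unknown symbols to an all-zero vector; both choices are assumptions made for this example rather than details taken from the paper.

```python
# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Encode one residue as a 20-dimensional 0/1 vector.

    Gaps and unrecognised symbols become the all-zero vector (an assumption
    of this sketch; other conventions add an extra dimension for gaps).
    """
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue.upper())
    if idx >= 0:
        vec[idx] = 1
    return vec


def encode_sequence(seq):
    """Encode an aligned sequence as a flat vector of length 20 * len(seq)."""
    flat = []
    for residue in seq:
        flat.extend(one_hot(residue))
    return flat


encoded = encode_sequence("AC-")
print(len(encoded))   # 3 positions x 20 dimensions
print(encoded[:20])   # the block for 'A'
```

Stacking such vectors for every sequence in an alignment gives the numeric data matrix on which a covariance-based model can then be fitted.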
