Efficient inference, potential, and limitations of site-specific substitution models

Abstract Natural selection imposes a complex filter on which variants persist in a population, resulting in evolutionary patterns that vary greatly along the genome. Some sites evolve close to neutrally, while others are highly conserved, allow only specific states, or only change in concert with other sites. Most commonly used evolutionary models, however, ignore much of this complexity and at best account for variation in the rate at which different sites change. Here, we present an efficient algorithm to estimate more complex models that allow for site-specific preferences and explore the accuracy with which such models can be estimated from simulated data. We find that an iterative approximate maximum likelihood scheme uses information in the data efficiently and accurately estimates site-specific preferences from large data sets with moderately diverged sequences. Ignoring site-specific preferences during estimation of branch lengths of phylogenetic trees (an assumption of most phylogeny software) results in substantial underestimation comparable to the error incurred when ignoring rate variation. However, the joint estimation of branch lengths, site-specific rates, and site-specific preferences can suffer from identifiability problems and is typically unable to recover the correct branch lengths. Site-specific preferences estimated from large HIV pol alignments show qualitative concordance with intra-host estimates of fitness costs. Analysis of site-specific HIV substitution models suggests near saturation of divergence after a few hundred years. Such saturation can explain the inability to infer deep divergence times of HIV and SIVs using molecular clock approaches and time-dependent rate estimates.

Efficient inference, potential, and limitations of site-specific substitution models

Virus Evolution ◽

10.1093/ve/veaa066 ◽

2020 ◽

Vol 6 (2) ◽

Cited By ~ 1

Author(s):

Vadim Puller ◽

Pavel Sagulenko ◽

Richard A Neher

Keyword(s):

Phylogenetic Reconstruction ◽

Simulated Data ◽

Branch Length ◽

Large Data ◽

Joint Estimation ◽

Model Complexity ◽

Sequence Evolution ◽

Evolutionary Patterns ◽

Site Specific ◽

Substitution Models

Abstract Natural selection imposes a complex filter on which variants persist in a population, resulting in evolutionary patterns that vary greatly along the genome. Some sites evolve close to neutrally, while others are highly conserved, allow only specific states, or only change in concert with other sites. On the one hand, such constraints on sequence evolution can be used to infer biological function; on the other hand, they need to be accounted for in phylogenetic reconstruction. Phylogenetic models often account for this complexity by partitioning sites into a small number of discrete classes with different rates and/or state preferences. Appropriate model complexity is typically determined by model selection procedures. Here, we present an efficient algorithm to estimate more complex models that allow for different preferences at every site and explore the accuracy with which such models can be estimated from simulated data. Our iterative approximate maximum likelihood scheme uses information in the data efficiently and accurately estimates site-specific preferences from large data sets with moderately diverged sequences and known topology. However, the joint estimation of site-specific rates, site-specific preferences, and phylogenetic branch lengths can suffer from identifiability problems, while ignoring variation in preferences across sites results in branch length underestimates. Site-specific preferences estimated from large HIV pol alignments show qualitative concordance with intra-host estimates of fitness costs. Analysis of these substitution models suggests near saturation of divergence after a few hundred years. Such saturation can explain the inability to infer deep divergence times of HIV and SIVs using molecular clock approaches and time-dependent rate estimates.
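
The branch length underestimation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple F81-style model in which each site has its own equilibrium frequencies (the "preferences") and substitutions occur at rate `mu` times the target frequency; this is an illustrative simplification, not the authors' inference algorithm. The function names and the example frequency vectors are hypothetical.

```python
import math

def expected_divergence(pi, mu_t):
    """Expected fraction of differing sites between two sequences separated by
    total time t, for an F81-style model with equilibrium frequencies pi.
    Divergence saturates at 1 - sum(pi_i^2) rather than at 3/4."""
    homozygosity = sum(p * p for p in pi)
    return (1.0 - math.exp(-mu_t)) * (1.0 - homozygosity)

def true_distance(pi, mu_t):
    """Expected number of substitutions per site under the same model."""
    return mu_t * (1.0 - sum(p * p for p in pi))

def jc_distance(p_diff):
    """Jukes-Cantor distance for four states, which assumes uniform preferences."""
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)

# Hypothetical preference vectors: one unconstrained site, one constrained site.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.85, 0.05, 0.05, 0.05]
```

For the uniform vector the Jukes-Cantor correction recovers the true distance, while for the skewed vector the observed divergence saturates early and the corrected distance falls short of the true one.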

Predicting the Impact of Describing New Species on Phylogenetic Patterns

Integrative Organismal Biology ◽

10.1093/iob/obz028 ◽

2019 ◽

Vol 1 (1) ◽

Cited By ~ 1

Author(s):

D C Blackburn ◽

G Giribet ◽

D E Soltis ◽

E L Stanley

Keyword(s):

New Species ◽

Phylogenetic Trees ◽

Branch Length ◽

Length Variation ◽

Tree Shape ◽

Branch Lengths ◽

Taxonomic History ◽

Ecological Patterns ◽

The Impact ◽

Incomplete Sampling

Abstract Although our inventory of Earth’s biodiversity remains incomplete, we still require analyses using the Tree of Life to understand evolutionary and ecological patterns. Because incomplete sampling may bias our inferences, we must evaluate how future additions of newly discovered species might impact analyses performed today. We describe an approach that uses taxonomic history and phylogenetic trees to characterize the impact of past species discoveries on phylogenetic knowledge using patterns of branch-length variation, tree shape, and phylogenetic diversity. This provides a framework for assessing the relative completeness of taxonomic knowledge of lineages within a phylogeny. To demonstrate this approach, we use recent large phylogenies for amphibians, reptiles, flowering plants, and invertebrates. Well-known clades exhibit a decline in the mean and range of branch lengths that are added each year as new species are described. With increased taxonomic knowledge over time, deep lineages of well-known clades become known such that most recently described new species are added close to the tips of the tree, reflecting changing tree shape over the course of taxonomic history. The same analyses reveal other clades to be candidates for future discoveries that could dramatically impact our phylogenetic knowledge. Our work reveals that species are often added non-randomly to the phylogeny over multiyear time-scales in a predictable pattern of taxonomic maturation. Our results suggest that we can make informed predictions about how new species will be added across the phylogeny of a given clade, thus providing a framework for accommodating unsampled undescribed species in evolutionary analyses.

Polynomial Phylogenetic Analysis of Tree Shapes

10.1101/2020.02.10.942367 ◽

2020 ◽

Author(s):

Pengyu Liu ◽

Priscila Biller ◽

Matthew Gould ◽

Caroline Colijn

Keyword(s):

Phylogenetic Analysis ◽

Evolutionary Biology ◽

Phylogenetic Trees ◽

Rooted Trees ◽

Evolutionary Patterns ◽

Sequencing Technologies ◽

Branch Lengths ◽

Discrete Structures ◽

Best Fit ◽

Central Tool

Abstract Phylogenetic trees are a central tool in evolutionary biology. They demonstrate evolutionary patterns among species, genes, and with modern sequencing technologies, patterns of ancestry among sets of individuals. Phylogenetic trees usually consist of tree shapes, branch lengths and partial labels. Comparing tree shapes is a challenging aspect of comparing phylogenetic trees as there are few tools to describe tree shapes in a quantitative, accurate, comprehensive and easy-to-interpret way. Current methods to compare tree shapes are often based on scalar indices reflecting tree imbalance, and on frequencies of small subtrees. In this paper, we present tree comparisons and applications based on a polynomial that fully characterizes trees. Polynomials are important tools to describe discrete structures and have been used to study various objects including graphs and knots. There are also polynomials that describe rooted trees. We use tree-defining polynomials to compare tree shapes randomly generated by simulations and tree shapes reconstructed from data. Moreover, we show that the comparisons can be used to estimate parameters and to select the best-fit model that generates specific tree shapes.
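
As an example of the scalar imbalance indices that the abstract contrasts with its polynomial approach, the sketch below computes the Colless index of a rooted binary tree. The Colless index is a standard imbalance measure chosen here for illustration; the abstract does not say which indices the authors considered. The nested-tuple tree encoding and the function name are assumptions.

```python
def colless_index(tree):
    """Colless imbalance of a rooted binary tree given as nested 2-tuples with
    string leaves: the sum over internal nodes of |leaves left - leaves right|."""
    total = 0

    def count_leaves(node):
        nonlocal total
        if isinstance(node, str):
            return 1
        left, right = node  # binary trees only
        n_left, n_right = count_leaves(left), count_leaves(right)
        total += abs(n_left - n_right)
        return n_left + n_right

    count_leaves(tree)
    return total

# Hypothetical four-leaf shapes: fully balanced versus caterpillar.
balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
```

A single number like this cannot distinguish every pair of shapes, which is the limitation that a polynomial fully characterizing the tree is meant to address.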

A simple island biodiversity model is robust to trait dependence in diversification and colonization rates

10.1101/2022.01.01.474685 ◽

2022 ◽

Author(s):

Shu Xie ◽

Luis Valente ◽

Rampal Etienne

Keyword(s):

Simulation Model ◽

Phylogenetic Trees ◽

Size Variation ◽

Simulated Data ◽

Estimation Accuracy ◽

Rate Variation ◽

State Dependent ◽

Island Biodiversity ◽

Speciation Rates ◽

Diversity Dynamics

The application of state-dependent speciation and extinction (SSE) models to phylogenetic trees has revealed an important role for traits in diversification. However, this role remains comparatively unexplored on islands, which can include multiple independent clades resulting from different colonization events. Here, we perform a robustness study to identify how trait-dependence in rates of island colonization, extinction and speciation (CES rates) affects the estimation accuracy of a phylogenetic model that assumes no rate variation between trait states. We extend the DAISIE (Dynamic Assembly of Islands through Speciation, Immigration and Extinction) simulation model to include state-dependent rates, and evaluate the robustness of the DAISIE inference model using simulated data. Our results show that when the CES rate differences between trait states are moderate, DAISIE shows negligible error for a variety of island diversity metrics. However, for large differences in speciation rates, we find large errors when reconstructing clade size variation and non-endemic species diversity through time. We conclude that for many biologically realistic scenarios with trait-dependent speciation and colonization, island diversity dynamics can be accurately estimated without the need to explicitly model trait dynamics. Nonetheless, our new simulation model may provide a useful tool for studying patterns of trait variation.

Comparative analyses of phenotypic sequences using phylogenetic trees

10.1101/561167 ◽

2019 ◽

Cited By ~ 1

Author(s):

Daniel S. Caetano ◽

Jeremy M. Beaulieu

Keyword(s):

Phylogenetic Trees ◽

Small Sample ◽

Dental Arch ◽

Rate Variation ◽

Multiple Sources ◽

Evolutionary Patterns ◽

Sequence Organization ◽

Rates Of Evolution ◽

Small Sample Sizes ◽

Multivariate Traits

Abstract Phenotypic sequences are a type of multivariate trait organized structurally, such as teeth distributed along the dental arch, or temporally, such as the stages of an ontogenetic series. However, unlike other multivariate traits, the elements of a phenotypic sequence are arranged along a vector, which allows for distinct evolutionary patterns between neighboring and distant positions. In fact, sequence traits share many characteristics with molecular sequences. We implement an approach to estimate rates of trait evolution that explicitly incorporates the sequence organization of traits. We apply models to study the temporal pattern evolution of cricket calling songs. We test whether songs show autocorrelation of rates (i.e., neighboring positions along a phenotypic sequence have correlated rates of evolution), or if they are best described by rate variation independent of sequence position. Our results show that models perform well when used with sequence phenotypes even under small sample sizes. We also show that silent regions of the songs evolve faster than chirp regions, which suggests that macroevolutionary changes are faster when associated with axes of variation less constrained by multiple sources of selection. Our approach is flexible and can be applied to any multivariate trait with units organized in a sequence-like structure.

Moduli Spaces of Phylogenetic Trees Describing Tumor Evolutionary Patterns

Brain Informatics and Health - Lecture Notes in Computer Science ◽

10.1007/978-3-319-09891-3_48 ◽

2014 ◽

pp. 528-539 ◽

Cited By ~ 6

Author(s):

Sakellarios Zairis ◽

Hossein Khiabanian ◽

Andrew J. Blumberg ◽

Raul Rabadan

Keyword(s):

Moduli Spaces ◽

Phylogenetic Trees ◽

Evolutionary Patterns

GeneRax: A tool for species tree-aware maximum likelihood based gene family tree inference under gene duplication, transfer, and loss

10.1101/779066 ◽

2019 ◽

Cited By ~ 3

Author(s):

Benoit Morel ◽

Alexey M. Kozlov ◽

Alexandros Stamatakis ◽

Gergely J. Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.

Reliable confidence intervals for RelTime estimates of evolutionary divergence times

10.1101/677286 ◽

2019 ◽

Cited By ~ 1

Author(s):

Qiqing Tao ◽

Koichiro Tamura ◽

Beatriz Mello ◽

Sudhir Kumar

Keyword(s):

Confidence Intervals ◽

Divergence Time ◽

Simulated Data ◽

Molecular Dating ◽

Divergence Times ◽

Rate Variation ◽

Evolutionary Divergence ◽

Posterior Density ◽

Divergence Time Estimates ◽

Highest Posterior Density

Abstract Confidence intervals (CIs) depict the statistical uncertainty surrounding evolutionary divergence time estimates. They capture variance contributed by the finite number of sequences and sites used in the alignment, deviations of evolutionary rates from a strict molecular clock in a phylogeny, and uncertainty associated with clock calibrations. Reliable tests of biological hypotheses demand reliable CIs. However, current non-Bayesian methods may produce unreliable CIs because they do not incorporate rate variation among lineages and interactions among clock calibrations properly. Here, we present a new analytical method to calculate CIs of divergence times estimated using the RelTime method, along with an approach to utilize multiple calibration uncertainty densities in these analyses. Empirical data analyses showed that the new methods produce CIs that overlap with Bayesian highest posterior density (HPD) intervals. In the analysis of computer-simulated data, we found that RelTime CIs show excellent average coverage probabilities, i.e., the true time is contained within the CIs with a 95% probability. These developments will encourage broader use of the computationally efficient RelTime approach in molecular dating analyses and biological hypothesis testing.
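
The notion of coverage probability used in the abstract can be made concrete with a generic simulation. The sketch below is not the RelTime method: it assumes normally distributed estimates with a known standard deviation and simply counts how often a nominal 95% interval contains the true value. All names and parameter values are hypothetical.

```python
import random
import statistics

def ci_coverage(true_value=10.0, sigma=2.0, n_obs=25, n_trials=2000, seed=1):
    """Fraction of simulated 95% confidence intervals that contain true_value."""
    rng = random.Random(seed)
    z = 1.96  # two-sided 95% quantile of the standard normal
    half_width = z * sigma / n_obs ** 0.5  # interval for a mean with known sigma
    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_value, sigma) for _ in range(n_obs)]
        mean = statistics.fmean(sample)
        if mean - half_width <= true_value <= mean + half_width:
            hits += 1
    return hits / n_trials
```

A well-calibrated interval procedure should return a fraction close to 0.95; values well below that indicate intervals that are too narrow.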

GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss

Molecular Biology and Evolution ◽

10.1093/molbev/msaa141 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2763-2774 ◽

Cited By ~ 5

Author(s):

Benoit Morel ◽

Alexey M Kozlov ◽

Alexandros Stamatakis ◽

Gergely J Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).
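
The relative Robinson-Foulds distance used as the accuracy measure above can be sketched as follows. This is a simplified rooted variant that compares clades and normalizes by the total number of non-trivial clades in both trees; it is an assumption for illustration and not necessarily the exact definition used in the GeneRax evaluation. The nested-tuple tree encoding and function names are hypothetical.

```python
def clades(tree):
    """Set of non-trivial clades (frozensets of leaf names) of a rooted tree
    given as nested tuples with string leaves."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    root = leaves(tree)
    found.discard(root)  # the clade containing all leaves is uninformative
    return found

def relative_rf(tree_a, tree_b):
    """Clades present in exactly one tree, divided by all clades in both trees."""
    a, b = clades(tree_a), clades(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0

# Hypothetical inferred and true trees on four leaves.
inferred = ((("A", "B"), "C"), "D")
truth = ((("A", "C"), "B"), "D")
```

Here `relative_rf(inferred, truth)` counts the clades {A, B} and {A, C} as mismatches out of four clades in total, and identical trees give a distance of zero.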

In Silico Identification of Functional Protein Interfaces

Comparative and Functional Genomics ◽

10.1002/cfg.309 ◽

2003 ◽

Vol 4 (4) ◽

pp. 420-423 ◽

Cited By ~ 14

Author(s):

Rachel E. Bell ◽

Nir Ben-Tal

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Hypothetical Protein ◽

Evolutionary Information ◽

Structural Constraints ◽

Functional Protein ◽

Homologous Proteins ◽

Geometrical Properties ◽

Protein Interfaces ◽

Branch Lengths

Proteins perform many of their biological roles through protein–protein, protein–DNA or protein–ligand interfaces. The identification of the amino acids comprising these interfaces often enhances our understanding of the biological function of the proteins. Many methods for the detection of functional interfaces have been developed, and large-scale analyses have provided assessments of their accuracy. Among them are those that consider the size of the protein interface, its amino acid composition and its physicochemical and geometrical properties. Other methods to this effect use statistical potential functions of pairwise interactions, and evolutionary information. The rationale of the evolutionary approach is that functional and structural constraints impose selective pressure; hence, biologically important interfaces often evolve at a slower pace than do other external regions of the protein. Recently, an algorithm, Rate4Site, and a web-server, ConSurf (http://consurf.tau.ac.il/), for the identification of functional interfaces based on the evolutionary relations among homologous proteins as reflected in phylogenetic trees, were developed in our laboratory. The explicit use of the tree topology and branch lengths makes the method remarkably accurate and sensitive. Here we demonstrate its potency in the identification of the functional interfaces of a hypothetical protein, the structure of which was determined as part of the international structural genomics effort. Finally, we propose to combine complementary procedures, in order to enhance the overall performance of methods for the identification of functional interfaces in proteins.
