Adaptive dating and fast proposals: revisiting the phylogenetic relaxed clock model

2020 ◽  
Author(s):  
Jordan Douglas ◽  
Rong Zhang ◽  
Remco Bouckaert

Abstract Uncorrelated relaxed clock models enable estimation of molecular substitution rates across lineages and are widely used in phylogenetics for dating evolutionary divergence times. In this article we delved into the internal complexities of the relaxed clock model in order to develop efficient MCMC operators for Bayesian phylogenetic inference. We compared three substitution rate parameterisations, introduced an adaptive operator which learns the weights of other operators during MCMC, and we explored how relaxed clock model estimation can benefit from two cutting-edge proposal kernels: the AVMVN and Bactrian kernels. This work has produced an operator scheme that is up to 65 times more efficient at exploring continuous relaxed clock parameters compared with previous setups, depending on the dataset. Finally, we explored variants of the standard narrow exchange operator which are specifically designed for the relaxed clock model. In the most extreme case, this new operator traversed tree space 40% more efficiently than narrow exchange. The methodologies introduced are adaptive and highly effective on short as well as long alignments. The results are available via the open source optimised relaxed clock (ORC) package for BEAST 2 under a GNU licence (https://github.com/jordandouglas/ORC).

Author summary Biological sequences, such as DNA, accumulate mutations over generations. By comparing such sequences in a phylogenetic framework, the evolutionary tree of lifeforms can be inferred. With the overwhelming availability of biological sequence data, and the increasing affordability of collecting new data, the development of fast and efficient phylogenetic algorithms is more important than ever. In this article we focus on the relaxed clock model, which is very popular in phylogenetics. We explored how a range of optimisations can improve the statistical inference of the relaxed clock. This work has produced a phylogenetic setup which can infer parameters related to the relaxed clock up to 65 times faster than previous setups, depending on the dataset. The methods introduced adapt to the dataset during computation and are highly efficient when processing long biological sequences.

2021 ◽  
Vol 17 (2) ◽  
pp. e1008322
Author(s):  
Jordan Douglas ◽  
Rong Zhang ◽  
Remco Bouckaert

Relaxed clock models enable estimation of molecular substitution rates across lineages and are widely used in phylogenetics for dating evolutionary divergence times. Under the (uncorrelated) relaxed clock model, tree branches are associated with molecular substitution rates which are independently and identically distributed. In this article we delved into the internal complexities of the relaxed clock model in order to develop efficient MCMC operators for Bayesian phylogenetic inference. We compared three substitution rate parameterisations, introduced an adaptive operator which learns the weights of other operators during MCMC, and we explored how relaxed clock model estimation can benefit from two cutting-edge proposal kernels: the AVMVN and Bactrian kernels. This work has produced an operator scheme that is up to 65 times more efficient at exploring continuous relaxed clock parameters compared with previous setups, depending on the dataset. Finally, we explored variants of the standard narrow exchange operator which are specifically designed for the relaxed clock model. In the most extreme case, this new operator traversed tree space 40% more efficiently than narrow exchange. The methodologies introduced are adaptive and highly effective on short as well as long alignments. The results are available via the open source optimised relaxed clock (ORC) package for BEAST 2 under a GNU licence (https://github.com/jordandouglas/ORC).
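The Bactrian kernel mentioned above can be sketched in a few lines. This is a minimal illustration, not the ORC or BEAST 2 implementation: it assumes the usual definition of the Bactrian distribution as an equal mixture of two normals centred at -m and +m with standard deviation sqrt(1 - m^2), so that the mixture has mean 0 and variance 1. The function names, the default m = 0.95, and the `scale` step size are assumptions made for the example.

```python
import math
import random


def bactrian_sample(rng, m=0.95):
    """Draw from a Bactrian distribution (mean 0, variance 1).

    Assumed form: 50/50 mixture of N(-m, 1 - m^2) and N(+m, 1 - m^2).
    """
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return rng.gauss(sign * m, math.sqrt(1.0 - m * m))


def bactrian_random_walk(x, rng, scale=0.5, m=0.95):
    """Symmetric additive proposal; the log Hastings ratio is 0."""
    return x + scale * bactrian_sample(rng, m)


def bactrian_scale(x, rng, scale=0.5, m=0.95):
    """Multiplicative proposal for a positive parameter such as a branch rate.

    Returns (proposal, log Hastings ratio). The step is symmetric in log
    space, so the Hastings ratio is proposal / x.
    """
    log_factor = scale * bactrian_sample(rng, m)
    return x * math.exp(log_factor), log_factor


if __name__ == "__main__":
    rng = random.Random(42)
    rate = 1.0
    proposal, log_hr = bactrian_scale(rate, rng)
    print(proposal, log_hr)
```

Because the two modes sit away from zero, a Bactrian step rarely proposes a value very close to the current one, which is the intuition usually given for its efficiency gain over a plain normal or uniform step.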


2020 ◽  
Author(s):  
Eli N. Weinstein ◽  
Debora S. Marks

Abstract Large-scale sequencing has revealed extraordinary diversity among biological sequences, produced over the course of evolution and within the lifetime of individual organisms. Existing methods for building statistical models of sequences often pre-process the data using multiple sequence alignment, an unreliable approach for many genetic elements (antibodies, disordered proteins, etc.) that is subject to fundamental statistical pathologies. Here we introduce a structured emission distribution (the MuE distribution) that accounts for mutational variability (substitutions and indels) and use it to construct generative and predictive hierarchical Bayesian models (H-MuE models). Our framework enables the application of arbitrary continuous-space vector models (e.g. linear regression, factor models, image neural-networks) to unaligned sequence data. Theoretically, we show that the MuE generalizes classic probabilistic alignment models. Empirically, we show that H-MuE models can infer latent representations and features for immune repertoires, predict functional unobserved members of disordered protein families, and forecast the future evolution of pathogens.


Author(s):  
Ashesh Nandy

The exponential growth in the depositories of biological sequence data has generated an urgent need to store, retrieve and analyse the data efficiently and effectively, for which the standard practice of using alignment procedures is not adequate due to the high demand on computing resources and time. Graphical representation of sequences has become one of the most popular alignment-free strategies to analyse biological sequences, where each basic unit of the sequences – the bases adenine, cytosine, guanine and thymine for DNA/RNA, and the 20 amino acids for proteins – is plotted on a multi-dimensional grid. The resulting curve in 2D and 3D space and the implied graph in higher dimensions provide a perception of the underlying information of the sequences through visual inspection; numerical analyses, in geometrical or matrix terms, of the plots provide a measure of comparison between sequences and thus enable study of sequence hierarchies. The new approach has also enabled studies of comparisons of DNA sequences over many thousands of bases and provided new insights into the structure of the base compositions of DNA sequences. In this article we review in brief the origins and applications of graphical representations and highlight the future perspectives in this field.
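As a rough illustration of a 2D graphical representation, the sketch below turns a DNA string into a walk on a grid and computes one simple geometric descriptor. The axis assignment in `STEPS` (A left, G right, C up, T down) is one convention chosen here for the example, and the function names and the centroid descriptor are likewise assumptions; the review covers many other schemes.

```python
# Assumed axis assignment for illustration; other conventions exist.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def walk_2d(sequence):
    """Return the list of grid points visited, starting at the origin.

    Symbols without an assigned step (for example N) are skipped.
    """
    x, y = 0, 0
    points = [(x, y)]
    for base in sequence.upper():
        if base not in STEPS:
            continue
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def centroid(points):
    """Mean x and mean y of the visited points, a simple numerical descriptor."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


if __name__ == "__main__":
    path = walk_2d("ACGT")
    print(path)
    print(centroid(path))
```

Two sequences can then be compared without alignment, for instance by the distance between their centroids, although which descriptor is appropriate depends on the representation used.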


Paleobiology ◽  
2021 ◽  
pp. 1-13
Author(s):  
Chi Zhang

Abstract Relaxed clock models are fundamental in Bayesian clock dating, but a single distribution characterizing the clock variation is typically selected. Hence, I developed a new reversible-jump Markov chain Monte Carlo (rjMCMC) algorithm for drawing posterior samples between the independent lognormal (ILN) and independent gamma rates (IGR) clock models. The ability of the rjMCMC algorithm to infer the true model was verified through simulations. I then applied the algorithm to the Mesozoic bird data previously analyzed under the white noise (WN) clock model. In comparison, averaging over the ILN and IGR models provided more reliable estimates of the divergence times and evolutionary rates. The ILN model showed slightly better fit than the IGR model and much better fit than the autocorrelated lognormal (ALN) clock model. When the data were partitioned, different partitions showed heterogeneous model fit for ILN and IGR clocks. The implementation provides a general framework for selecting and averaging relaxed clock models in Bayesian dating analyses.


2019 ◽  
Vol 14 (4) ◽  
pp. 574-589
Author(s):  
Linyan Xue ◽  
Xiaoke Zhang ◽  
Fei Xie ◽  
Shuang Liu ◽  
Peng Lin

In bioinformatics applications, existing algorithms cannot directly and efficiently implement sequence pattern mining. Two fast and efficient biological sequence pattern mining algorithms, for single biological sequences and for multiple sequences, are proposed in this paper. The concept of the basic pattern is proposed, and on the basis of mining frequent basic patterns, frequent patterns are mined by constructing prefix trees for the frequent basic patterns. The proposed algorithms implement rapid mining of frequent patterns of biological sequences based on pattern prefix trees. In the experiments, family sequence data from the Pfam protein database are used to verify the performance of the proposed algorithms. The prediction results confirm that the proposed algorithms not only obtain mining results with effective biological significance, but also improve the running time efficiency of biological sequence pattern mining.
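The abstract does not define its "basic pattern" or give the algorithms, so the following is only a generic sketch of frequent-pattern mining with a prefix tree, not the authors' method. It counts every substring up to a maximum length in a single sequence and keeps those whose count reaches a support threshold. All names (`TrieNode`, `build_prefix_tree`, `frequent_patterns`, `max_len`, `min_support`) are hypothetical.

```python
class TrieNode:
    """One node of a pattern prefix tree; `count` is the pattern's support."""

    def __init__(self):
        self.children = {}
        self.count = 0


def build_prefix_tree(sequence, max_len):
    """Insert every substring of length <= max_len, counting occurrences."""
    root = TrieNode()
    for start in range(len(sequence)):
        node = root
        for symbol in sequence[start:start + max_len]:
            node = node.children.setdefault(symbol, TrieNode())
            node.count += 1
    return root


def frequent_patterns(root, min_support):
    """Depth-first walk returning {pattern: count} for count >= min_support."""
    found = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for symbol, child in node.children.items():
            # Support cannot grow when a pattern is extended, so an
            # infrequent prefix prunes its whole subtree.
            if child.count >= min_support:
                found[prefix + symbol] = child.count
                stack.append((prefix + symbol, child))
    return found


if __name__ == "__main__":
    tree = build_prefix_tree("ABABAB", max_len=2)
    print(frequent_patterns(tree, min_support=2))
```

The pruning step is what makes a prefix tree attractive here: once a short pattern is infrequent, none of its extensions need to be visited.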


2020 ◽  
Vol 38 (1) ◽  
pp. 307-317
Author(s):  
Xavier Didelot ◽  
Igor Siveroni ◽  
Erik M Volz

Abstract Phylogenetic dating is one of the most powerful and commonly used methods of drawing epidemiological interpretations from pathogen genomic data. Building such trees requires considering a molecular clock model which represents the rate at which substitutions accumulate on genomes. When the molecular clock rate is constant throughout the tree then the clock is said to be strict, but this is often not an acceptable assumption. Alternatively, relaxed clock models consider variations in the clock rate, often based on a distribution of rates for each branch. However, we show here that the distributions of rates across branches in commonly used relaxed clock models are incompatible with the biological expectation that the sum of the numbers of substitutions on two neighboring branches should be distributed as the substitution number on a single branch of equivalent length. We call this expectation the additivity property. We further show how assumptions of commonly used relaxed clock models can lead to estimates of evolutionary rates and dates with low precision and biased confidence intervals. We therefore propose a new additive relaxed clock model where the additivity property is satisfied. We illustrate the use of our new additive relaxed clock model on a range of simulated and real data sets, and we show that using this new model leads to more accurate estimates of mean evolutionary rates and ancestral dates.
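A small worked calculation may make the additivity property concrete. It is only an illustration under stated assumptions, not the authors' derivation: substitutions on a branch of length $l$ are assumed Poisson given the branch rate, and branch rates $r$ are assumed independent and identically distributed with mean $\mu$ and variance $\sigma^2$ (all symbols introduced here).

```latex
\begin{align}
N \mid r &\sim \mathrm{Poisson}(r\,l), \qquad
  \mathbb{E}[r] = \mu, \quad \mathrm{Var}(r) = \sigma^2 \\
\mathrm{Var}(N) &= \mathbb{E}\!\left[\mathrm{Var}(N \mid r)\right]
  + \mathrm{Var}\!\left(\mathbb{E}[N \mid r]\right)
  = \mu l + \sigma^2 l^2 \\
\text{one branch of length } 2l: \quad
  \mathrm{Var}(N) &= 2\mu l + 4\sigma^2 l^2 \\
\text{two branches of length } l: \quad
  \mathrm{Var}(N_1 + N_2) &= 2\mu l + 2\sigma^2 l^2
\end{align}
```

The two variances agree only when $\sigma^2 = 0$, which is the strict clock. Under these assumptions, splitting a branch in two changes the distribution of the total substitution count, which is the kind of incompatibility the additivity property rules out.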


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0239883
Author(s):  
Reece K. Hart ◽  
Andreas Prlić

Motivation Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility. Results Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available. It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available.
Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u as identifiers, thereby seamlessly integrating public and private sequence sets. Availability SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.
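The sha512t24u convention can be sketched with the Python standard library alone. The abstract states only that it combines SHA-512 with URL-safe Base64; the sketch assumes, following the name, that the digest is truncated to its first 24 bytes before encoding. The function below is an illustration and is not taken from the SeqRepo code base.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, URL-safe Base64 encoded.

    Assumption: truncation keeps the first 24 bytes of the digest.
    """
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 Base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


if __name__ == "__main__":
    print(sha512t24u(b"ACGT"))
```

Truncating to 24 bytes keeps 192 bits of the digest, and because 24 is a multiple of 3 the encoded identifier is a fixed 32 characters drawn from an alphabet that is safe to embed in URLs.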


2016 ◽  
Vol 371 (1699) ◽  
pp. 20150132 ◽  
Author(s):  
Nicolas Lartillot ◽  
Matthew J. Phillips ◽  
Fredrik Ronquist

Over recent years, several alternative relaxed clock models have been proposed in the context of Bayesian dating. These models fall in two distinct categories: uncorrelated and autocorrelated across branches. The choice between these two classes of relaxed clocks is still an open question. More fundamentally, the true process of rate variation may have both long-term trends and short-term fluctuations, suggesting that more sophisticated clock models unfolding over multiple time scales should ultimately be developed. Here, a mixed relaxed clock model is introduced, which can be mechanistically interpreted as a rate variation process undergoing short-term fluctuations on top of Brownian long-term trends. Statistically, this mixed clock represents an alternative solution to the problem of choosing between autocorrelated and uncorrelated relaxed clocks, by proposing instead to combine their respective merits. Fitting this model on a dataset of 105 placental mammals, using both node-dating and tip-dating approaches, suggests that the two pure clocks, Brownian and white noise, are rejected in favour of a mixed model with approximately equal contributions for its uncorrelated and autocorrelated components. The tip-dating analysis is particularly sensitive to the choice of the relaxed clock model. In this context, the classical pure Brownian relaxed clock appears to be overly rigid, leading to biases in divergence time estimation. By contrast, the use of a mixed clock leads to more recent and more reasonable estimates for the crown ages of placental orders and superorders. Altogether, the mixed clock introduced here represents a first step towards empirically more adequate models of the patterns of rate variation across phylogenetic trees. This article is part of the themed issue ‘Dating species divergences using rocks and clocks’.

