scholarly journals Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part III: Algorithm for first approximation

2015 ◽  
Author(s):  
Kiyoshi Ezawa ◽  
Dan Graur ◽  
Giddy Landan

AbstractBackgroundInsertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. In a separate paper (Ezawa, Graur and Landan 2015a), we established an ab initio perturbative formulation of a continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. And we showed that, under a certain set of conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) separated by gapless columns. Moreover, in another separate paper (Ezawa, Graur and Landan 2015b), we performed concrete perturbation analyses on all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs). The analyses indicated that even the fewest-indel terms alone can quite accurately approximate the probabilities of local alignments, as long as the segments and the branches in the tree are of modest lengths.ResultsTo examine whether or not the fewest-indel terms alone can well approximate the alignment probabilities of more general types of local MSAs as well, and as a first step toward the automatic application of our ab initio perturbative formulation, we developed an algorithm that calculates the first approximation of the probability of a given MSA under a given parameter setting including a phylogenetic tree. The algorithm first chops the MSA into gapped and gapless segments, second enumerates all parsimonious indel histories potentially responsible for each gapped segment, and finally calculates their contributions to the MSA probability. We performed validation analyses using more than ten million local MSAs. The results indicated that even the first approximation can quite accurately estimate the probability of each local MSA, as long as the gaps and tree branches are at most moderately long.ConclusionsThe newly developed algorithm, called LOLIPOG, brought our ab initio perturbation formulation at least one step closer to a practically useful method to quite accurately calculate the probability of a MSA under a given biologically realistic parameter setting.[This paper and three other papers (Ezawa, Graur and Landan 2015a,b,c) describe a series of our efforts to develop, apply, and extend the ab initio perturbative formulation of a general continuous-time Markov model of indels.]List of abbreviationsHMMhidden Markov modelindelinsertion/deletionLHSlocal history setMSAmultiple sequence alignmentPASpreserved ancestral sitePWApairwise alignment


2015 ◽  
Author(s):  
Kiyoshi Ezawa ◽  
Dan Graur ◽  
Giddy Landan

AbstractBackgroundInsertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. In a separate paper (Ezawa, Graur and Landan 2015a), we established a theoretical basis of our ab initio perturbative formulation of a genuine evolutionary model, more specifically, a continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. And we showed that, under some conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) separated by gapless columns.ResultsThis paper describes how our ab initio perturbative formulation can be concretely used to approximately calculate the probabilities of all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs). For each local alignment type, we calculated the fewest-indel contribution and the next-fewest-indel contribution to its probability, and we compared them under various conditions. We also derived a system of integral equations that can be numerically solved to give “exact solutions” for some common types of local PWAs. And we compared the obtained “exact solutions” with the fewest-indel contributions. The results indicated that even the fewest-indel terms alone can quite accurately approximate the probabilities of local alignments, as long as the segments and the branches in the tree are of modest lengths. Moreover, in the light of our formulation, we examined parameter regions where other indel models can safely approximate the correct evolutionary probabilities. The analyses also suggested some modifications necessary for these models to improve the accuracy of their probability estimations.ConclusionsAt least under modest conditions, our ab initio perturbative formulation can quite accurately calculate alignment probabilities under biologically realistic indel models. It also provides a sound reference point that other indel models can be compared to. [This paper and three other papers (Ezawa, Graur and Landan 2015a,b,c) describe a series of our efforts to develop, apply, and extend the ab initio perturbative formulation of a general continuous-time Markov model of indels.]



2015 ◽  
Author(s):  
Kiyoshi Ezawa ◽  
Dan Graur ◽  
Giddy Landan

AbstractBackgroundInsertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. Recently, such probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time-axis. Moreover, none of these models can fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions.ResultsHere, we theoretically tackle the ab initio calculation of the probability of a given sequence alignment under a genuine evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. Our model allows general indel rate parameters including length distributions but does not impose any unrealistic restrictions on indels. Using techniques of the perturbation theory in physics, we expand the probability into a series over different numbers of indels. Our derivation of this perturbation expansion elegantly bridges the gap between Gillespie’s (1977) intuitive derivation of his own stochastic simulation method, which is now widely used in evolutionary simulators, and Feller’s (1940) mathematically rigorous theorems that underpin Gillespie′s method. We find a sufficient and nearly necessary set of conditions under which the probability can be expressed as the product of an overall factor and the contributions from regions separated by gapless columns of the alignment. The indel models satisfying these conditions include those with some kind of rate variation across regions, as well as space-homogeneous models. We also prove that, though with a caveat, pairwise probabilities calculated by the method of Miklós et al. (2004) are equivalent to those calculated by our ab initio formulation, at least under a space-homogenous model.ConclusionsOur ab initio perturbative formulation provides a firm theoretical ground that other indel models can rest on.[This paper and three other papers (Ezawa, Graur and Landan 2015a,b,c) describe a series of our efforts to develop, apply, and extend the ab initio perturbative formulation of a general continuous-time Markov model of indels.]



2015 ◽  
Author(s):  
Kiyoshi Ezawa ◽  
Dan Graur ◽  
Giddy Landan

BackgroundInsertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. In a separate paper (Ezawa, Graur and Landan 2015a), we established the theoretical basis of ourab initioperturbative formulation of a continuous-time Markov model of the evolution of anentiresequence via insertions and deletions along time axis. In other separate papers (Ezawa, Graur and Landan 2015b,c), we also developed various analytical and computational methods to concretely calculate alignment probabilities via our formulation. In terms of frequencies, however, substitutions are usually more common than indels. Moreover, many experiments suggest that other mutations, such as genomic rearrangements and recombination, also play some important roles in sequence evolution.ResultsHere, we extend ourab initioperturbative formulation of agenuineevolutionary model so that it can incorporate other mutations. We give a sufficient set of conditions that the probability of evolution via both indels and substitutions is factorable into the product of an overall factor and local contributions. We also show that, under a set of conditions, the probability can be factorized into two sub-probabilities, one via indels alone and the other via substitutions alone. Moreover, we show that our formulation can be extended so that it can also incorporate genomic rearrangements, such as inversions and duplications. We also discuss how to accommodate some other types of mutations within our formulation.ConclusionsOurab initioperturbative formulation thus extended could in principle describe the stochastic evolution of anentiresequence along time axis via major types of mutations. [This paper and three other papers (Ezawa, Graur and Landan 2015a,b,c) describe a series of our efforts to develop, apply, and extend theab initioperturbative formulation of a general continuous-time Markov model of indels.]



2020 ◽  
Vol 117 (11) ◽  
pp. 5873-5882 ◽  
Author(s):  
Jose Alberto de la Paz ◽  
Charisse M. Nartey ◽  
Monisha Yuvaraj ◽  
Faruck Morcos

We introduce a model of amino acid sequence evolution that accounts for the statistical behavior of real sequences induced by epistatic interactions. We base the model dynamics on parameters derived from multiple sequence alignments analyzed by using direct coupling analysis methodology. Known statistical properties such as overdispersion, heterotachy, and gamma-distributed rate-across-sites are shown to be emergent properties of this model while being consistent with neutral evolution theory, thereby unifying observations from previously disjointed evolutionary models of sequences. The relationship between site restriction and heterotachy is characterized by tracking the effective alphabet dynamics of sites. We also observe an evolutionary Stokes shift in the fitness of sequences that have undergone evolution under our simulation. By analyzing the structural information of some proteins, we corroborate that the strongest Stokes shifts derive from sites that physically interact in networks near biochemically important regions. Perspectives on the implementation of our model in the context of the molecular clock are discussed.



2008 ◽  
Vol 363 (1512) ◽  
pp. 4041-4047 ◽  
Author(s):  
Steffen Klaere ◽  
Tanja Gesell ◽  
Arndt von Haeseler

We introduce another view of sequence evolution. Contrary to other approaches, we model the substitution process in two steps. First we assume (arbitrary) scaled branch lengths on a given phylogenetic tree. Second we allocate a Poisson distributed number of substitutions on the branches. The probability to place a mutation on a branch is proportional to its relative branch length. More importantly, the action of a single mutation on an alignment column is described by a doubly stochastic matrix, the so-called one-step mutation matrix. This matrix leads to analytical formulae for the posterior probability distribution of the number of substitutions for an alignment column.



2021 ◽  
Author(s):  
Robert M. Hubley ◽  
Travis J. Wheeler ◽  
Arian F.A. Smit

The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families where structural features and strong signals of selection may assist with alignment. Less attention has been given to the quality of sequence alignments involving neutrally evolving DNA sequences such as those resulting from TE replication. Such alignments play an important role in understanding and representing TE family history. Transposable element sequences are challenging to align due to their wide divergence ranges, fragmentation, and predominantly-neutral mutation patterns. To gain insight into the effects of these properties on MSA accuracy, we developed a simulator of TE sequence evolution, and used it to generate a benchmark with which we evaluated the MSA predictions produced by several popular aligners, along with Refiner, a method we developed in the context of our RepeatModeler software. We find that MAFFT and Refiner generally outperform other aligners for low to medium divergence simulated sequences, while Refiner is uniquely effective when tasked with aligning high-divergent and fragmented instances of a family. As a result, consensus sequences derived from Refiner-based MSAs are more similar to the true consensus.



Sign in / Sign up

Export Citation Format

Share Document