The Cumulative Indel Model: Fast and Accurate Statistical Evolutionary Alignment

2020 ◽  
Author(s):  
Nicola De Maio

Abstract Sequence alignment is essential for phylogenetic and molecular evolution inference, as well as in many other areas of bioinformatics and evolutionary biology. Inaccurate alignments can lead to severe biases in most downstream statistical analyses. Statistical alignment based on probabilistic models of sequence evolution addresses these issues by replacing heuristic score functions with evolutionary model-based probabilities. However, score-based aligners and fixed-alignment phylogenetic approaches are still more prevalent than methods based on evolutionary indel models, mostly due to computational convenience. Here, I present new techniques for improving the accuracy and speed of statistical evolutionary alignment. The “cumulative indel model” approximates realistic evolutionary indel dynamics using differential equations. “Adaptive banding” reduces the computational demand of most alignment algorithms without requiring prior knowledge of divergence levels or pseudo-optimal alignments. Using simulations, I show that these methods lead to fast and accurate pairwise alignment inference. Also, I show that it is possible, with these methods, to align and infer evolutionary parameters from a single long synteny block ($\approx$530 kbp) between the human and chimp genomes. The cumulative indel model and adaptive banding can therefore improve the performance of alignment and phylogenetic methods. [Evolutionary alignment; pairHMM; sequence evolution; statistical alignment; statistical genetics.]
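To make the idea of banding concrete, here is a minimal sketch of a fixed-width banded Needleman-Wunsch score. It is not the adaptive banding scheme of the paper, whose details the abstract does not give: the band width is a fixed argument, and the function name `banded_global_score` and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions. Only cells with |i - j| <= band are filled; rows are still allocated at full width to keep the code short.

```python
def banded_global_score(a, b, band, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only DP cells with |i - j| <= band.

    Hypothetical helper for illustration; scoring values are assumptions.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Row 0: only the first `band` columns lie inside the band.
    prev = [neg_inf] * (m + 1)
    for j in range(min(m, band) + 1):
        prev[j] = j * gap
    for i in range(1, n + 1):
        cur = [neg_inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[0] = i * gap
                continue
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band stay at -inf and never win the max.
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[m]


if __name__ == "__main__":
    print(banded_global_score("ACGT", "AGT", band=1))
```

A narrow band returns the same score as the full matrix whenever the optimal path stays inside it. The difficulty a fixed band leaves open, and which the abstract says adaptive banding addresses, is choosing that width without prior knowledge of divergence.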

2016 ◽  
Author(s):  
Kassian Kobert ◽  
Alexandros Stamatakis ◽  
Tomáš Flouri

The phylogenetic likelihood function is the major computational bottleneck in several applications of evolutionary biology such as phylogenetic inference, species delimitation, model selection and divergence times estimation. Given the alignment, a tree and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted for improving run-time and, using appropriate data structures, reducing memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory saving attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 10-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the phylogenetic likelihood function currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation.
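The redundancy described above can be illustrated with a small sketch: two alignment columns that show the same tip states below a node yield the same conditional likelihood vector at that node, so the vector need only be computed once. The code below is a plain memoised Felsenstein pruning under the JC69 model, not the authors' method or data structures; the tree encoding (nested tuples `(left, right, t_left, t_right)` with leaf names as strings) and all function names are assumptions made for this example.

```python
import math

STATES = "ACGT"


def jc69(t):
    """JC69 transition matrix for branch length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def leaves(node):
    """Leaf names below a node, left to right."""
    if isinstance(node, str):
        return [node]
    left, right, _, _ = node
    return leaves(left) + leaves(right)


def clv(node, site, aln, cache, counter):
    """Conditional likelihood vector at `node` for column `site`.

    The cache key is the subtree's leaf set plus the states those leaves
    show in this column, so repeated subtree patterns are computed once.
    """
    names = tuple(leaves(node))
    key = (names, tuple(aln[name][site] for name in names))
    if key in cache:
        return cache[key]
    counter[0] += 1  # count vectors actually computed
    if isinstance(node, str):
        vec = [1.0 if s == aln[node][site] else 0.0 for s in STATES]
    else:
        left, right, t_left, t_right = node
        lv = clv(left, site, aln, cache, counter)
        rv = clv(right, site, aln, cache, counter)
        pl, pr = jc69(t_left), jc69(t_right)
        vec = [
            sum(pl[x][y] * lv[y] for y in range(4))
            * sum(pr[x][y] * rv[y] for y in range(4))
            for x in range(4)
        ]
    cache[key] = vec
    return vec


def log_likelihood(tree, aln, use_cache=True):
    """Return (log-likelihood, number of vectors computed)."""
    n_sites = len(next(iter(aln.values())))
    cache, counter, total = {}, [0], 0.0
    for site in range(n_sites):
        if not use_cache:
            cache = {}  # forget everything between columns
        root = clv(tree, site, aln, cache, counter)
        total += math.log(sum(0.25 * v for v in root))
    return total, counter[0]


if __name__ == "__main__":
    tree = (("a", "b", 0.1, 0.2), "c", 0.05, 0.3)
    aln = {"a": "AACA", "b": "AACC", "c": "GGTG"}
    print(log_likelihood(tree, aln, use_cache=False))
    print(log_likelihood(tree, aln, use_cache=True))
```

Both calls return the same log-likelihood, while the cached run computes fewer vectors. Recomputing the leaf set at every call keeps the sketch short but is itself wasteful; a tuned implementation would identify repeats with precomputed per-node identifiers instead.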


2018 ◽  
Vol 68 (3) ◽  
pp. 227-246
Author(s):  
Nico M. van Straalen

Abstract Evolution acts through a combination of four different drivers: (1) mutation, (2) selection, (3) genetic drift, and (4) developmental constraints. There is a tendency among some biologists to frame evolution as the sole result of natural selection, and this tendency is reinforced by many popular texts. “The Naked Ape” by Desmond Morris, published 50 years ago, is no exception. In this paper I argue that evolutionary biology is much richer than natural selection alone. I illustrate this by reconstructing the evolutionary history of five different organs of the human body: foot, pelvis, scrotum, hand and brain. Factors like developmental tinkering, by-product evolution, exaptation and heterochrony are powerful forces for body-plan innovations and the appearance of such innovations in human ancestors does not always require an adaptive explanation. While Morris explained the lack of body hair in the human species by sexual selection, I argue that molecular tinkering of regulatory genes expressed in the brain, followed by positive selection for neotenic features, may have been the driving factor, with loss of body hair as a secondary consequence.


Plants ◽  
2020 ◽  
Vol 9 (3) ◽  
pp. 358
Author(s):  
Joan Pedrola-Monfort ◽  
David Lázaro-Gimeno ◽  
Carlos G. Boluda ◽  
Laia Pedrola ◽  
Alfonso Garmendia ◽  
...  

Among the most intriguing mysteries in the evolutionary biology of photosynthetic organisms are the genesis and consequences of the dramatic increase in the mitochondrial and nuclear genome sizes, together with the concomitant evolution of the three genetic compartments, particularly during the transition from water to land. To clarify the evolutionary trends in the mitochondrial genome of Archaeplastida, we analyzed the sequences from 37 complete genomes. Therefore, we utilized mitochondrial, plastidial and nuclear ribosomal DNA molecular markers on 100 species of Streptophyta for each subunit. Hierarchical models of sequence evolution were fitted to test the heterogeneity in the base composition. The best resulting phylogenies were used for reconstructing the ancestral Guanine-Cytosine (GC) content and equilibrium GC frequency (GC*) using non-homogeneous and non-stationary models fitted with a maximum likelihood approach. The mitochondrial genome length was strongly related to repetitive sequences across Archaeplastida evolution; however, the length seemed not to be linked to the other studied variables, as different lineages showed diverse evolutionary patterns. In contrast, Streptophyta exhibited a powerful positive relationship between the GC content, non-coding DNA, and repetitive sequences, while the evolution of Chlorophyta reflected a strong positive linear relationship between the genome length and the number of genes.


GigaScience ◽  
2020 ◽  
Vol 9 (6) ◽  
Author(s):  
Ksenia Krasheninnikova ◽  
Mark Diekhans ◽  
Joel Armstrong ◽  
Aleksei Dievskii ◽  
Benedict Paten ◽  
...  

Abstract Background: Large-scale sequencing projects provide high-quality full-genome data that can be used for reconstruction of chromosomal exchanges and rearrangements that disrupt conserved syntenic blocks. The highest resolution of cross-species homology can be obtained on the basis of whole-genome, reference-free alignments. Very large multiple alignments of full-genome sequence stored in a binary format demand an accurate and efficient computational approach for synteny block production. Findings: halSynteny performs efficient processing of pairwise alignment blocks for any pair of genomes in the alignment. The tool is part of the HAL comparative genomics suite and is targeted to build synteny blocks for multi-hundred–way, reference-free vertebrate alignments built with the Cactus system. Conclusions: halSynteny enables an accurate and rapid identification of synteny in multiple full-genome alignments. The method is implemented in C++11 as a component of the halTools software and released under MIT license. The package is available at https://github.com/ComparativeGenomicsToolkit/hal/.


2014 ◽  
Author(s):  
Jesse D Bloom

Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are not known a priori, making it challenging to create evolutionary models that adequately capture the heterogeneity of selection at different sites. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than commonly used existing models. Here I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg et al., 2014) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much better than most common existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.


2020 ◽  
Author(s):  
Arnaud N’Guessan ◽  
Ilana Lauren Brito ◽  
Adrian W.R. Serohijos ◽  
B. Jesse Shapiro

Abstract Pangenomes – the cumulative set of genes encoded by a species – arise from evolutionary forces including horizontal gene transfer (HGT), drift, and selection. The relative importance of drift and selection in shaping pangenome structure has been recently debated, and the role of sequence evolution (point mutations) within mobile genes has been largely ignored, with studies focusing mainly on patterns of gene presence or absence. The effects of drift, selection, and HGT on pangenome evolution likely depend on the time scale being studied, ranging from ancient (e.g., between distantly related species) to recent (e.g., within a single animal host), and the unit of selection being considered (e.g., the gene, whole genome, microbial species, or human host). To shed light on pangenome evolution within microbiomes on relatively recent time scales, we investigate the selective pressures acting on mobile genes using a dataset that previously identified such genes in the gut metagenomes of 176 Fiji islanders. We mapped the metagenomic reads to mobile genes to call single nucleotide variants (SNVs) and calculate population genetic metrics that allowed us to infer deviations from a neutral evolutionary model. We found that mobile gene sequence evolution varied more by gene family than by human social attributes, such as household or village membership, suggesting that selection at the level of gene function is most relevant on these short time scales. Patterns of mobile gene sequence evolution could be qualitatively recapitulated with a simple evolutionary simulation, without the need to invoke an adaptive advantage of mobile genes to their bacterial host genome. This suggests that, at least on short time scales, a majority of the pangenome need not be adaptive.
On the other hand, a subset of gene functions including defense mechanisms and secondary metabolism showed an aberrant pattern of molecular evolution, consistent with species-specific selective pressures or negative frequency-dependent selection not seen in prophages, transposons, or other gene categories. That mobile genes of different functions behave so differently suggests stronger selection at the gene level, rather than at the genome level. While pangenomes may be largely adaptive to their bacterial hosts on longer evolution time scales, here we show that, on shorter “human” time scales, drift and gene-specific selection predominate.


2018 ◽  
Author(s):  
Sayyed Auwn Muhammad ◽  
Bengt Sennblad ◽  
Jens Lagergren

Abstract Most genes are composed of multiple domains, with a common evolutionary history, that typically perform a specific function in the resulting protein. As witnessed by many studies of key gene families, it is important to understand how domains have been duplicated, lost, transferred between genes, and rearranged. Analogously to the case of evolutionary events affecting entire genes, these domain events have large consequences for phylogenetic reconstruction and, in addition, they create considerable obstacles for gene sequence alignment algorithms, a prerequisite for phylogenetic reconstruction. We introduce the DomainDLRS model, a hierarchical, generative probabilistic model containing three levels corresponding to species, genes, and domains, respectively. From a dated species tree, a gene tree is generated according to the DL model, which is a birth-death model generalized to occur in a dated tree. Then, from the dated gene tree, a pre-specified number of dated domain trees are generated using the DL model and the molecular clock is relaxed, effectively converting edge times to edge lengths. Finally, for each domain tree and its lengths, domain sequences are generated for the leaves based on a selected model of sequence evolution. For this model, we present an MCMC-based inference framework called DomainDLRS that takes a dated species tree together with a multiple sequence alignment for each domain family as input and outputs an estimated posterior distribution over reconciled gene and domain trees. By requiring aligned domains rather than genes, our framework evades the problem of aligning full-length genes that have been exposed to domain duplications, in particular non-tandem domain duplications. We show that DomainDLRS performs better than MrBayes on synthetic data and that it outperforms MrBayes on biological data.
We analyse several zinc finger genes and show that most domain duplications have been tandem duplications, some involving two or more domains, but non-tandem duplications have also been common.


2000 ◽  
Vol 32 (2) ◽  
pp. 499-517 ◽  
Author(s):  
Jens Ledet Jensen ◽  
Anne-Mette Krabbe Pedersen

We consider Markov processes of DNA sequence evolution in which the instantaneous rates of substitution at a site are allowed to depend upon the states at the sites in a neighbourhood of the site at the instant of the substitution. We characterize the class of Markov process models of DNA sequence evolution for which the stationary distribution is a Gibbs measure, and give a procedure for calculating the normalizing constant of the measure. We develop an MCMC method for estimating the transition probability between sequences under models of this type. Finally, we analyse an alignment of two HIV-1 gene sequences using the developed theory and methodology.


2017 ◽  
Vol 114 (23) ◽  
pp. 5784-5791 ◽  
Author(s):  
Carrie A. Whittle ◽  
Cassandra G. Extavour

In animals, primordial germ cells (PGCs) give rise to the germ lines, the cell lineages that produce sperm and eggs. PGCs form in embryogenesis, typically by one of two modes: a likely ancestral mode wherein germ cells are induced during embryogenesis by cell–cell signaling (induction) or a derived mechanism whereby germ cells are specified by using germ plasm—that is, maternally specified germ-line determinants (inheritance). The causes of the shift to germ plasm for PGC specification in some animal clades remain largely unknown, but its repeated convergent evolution raises the question of whether it may result from or confer an innate selective advantage. It has been hypothesized that the acquisition of germ plasm confers enhanced evolvability, resulting from the release of selective constraint on somatic gene networks in embryogenesis, thus leading to acceleration of an organism’s protein-sequence evolution, particularly for genes expressed at early developmental stages, and resulting in high speciation rates in germ plasm-containing lineages (denoted herein as the “PGC-specification hypothesis”). Although that hypothesis, if supported, could have major implications for animal evolution, our recent large-scale coding-sequence analyses from vertebrates and invertebrates provided important examples of genera that do not support the hypothesis of liberated constraint under germ plasm. Here, we consider reasons why germ plasm might be neither a direct target of selection nor causally linked to accelerated animal evolution. We explore alternate scenarios that could explain the repeated evolution of germ plasm and propose potential consequences of the inheritance and induction modes to animal evolutionary biology.

