Efficient inference, potential, and limitations of site-specific substitution models

2020 ◽  
Vol 6 (2) ◽  
Author(s):  
Vadim Puller ◽  
Pavel Sagulenko ◽  
Richard A Neher

Abstract Natural selection imposes a complex filter on which variants persist in a population, resulting in evolutionary patterns that vary greatly along the genome. Some sites evolve close to neutrally, while others are highly conserved, allow only specific states, or only change in concert with other sites. On one hand, such constraints on sequence evolution can be used to infer biological function; on the other hand, they need to be accounted for in phylogenetic reconstruction. Phylogenetic models often account for this complexity by partitioning sites into a small number of discrete classes with different rates and/or state preferences. Appropriate model complexity is typically determined by model selection procedures. Here, we present an efficient algorithm to estimate more complex models that allow for different preferences at every site and explore the accuracy at which such models can be estimated from simulated data. Our iterative approximate maximum likelihood scheme uses information in the data efficiently and accurately estimates site-specific preferences from large data sets with moderately diverged sequences and known topology. However, the joint estimation of site-specific rates, site-specific preferences, and phylogenetic branch lengths can suffer from identifiability problems, while ignoring variation in preferences across sites results in branch length underestimates. Site-specific preferences estimated from large HIV pol alignments show qualitative concordance with intra-host estimates of fitness costs. Analysis of these substitution models suggests near saturation of divergence after a few hundred years. Such saturation can explain the inability to infer deep divergence times of HIV and SIVs using molecular clock approaches and time-dependent rate estimates.

Author(s):  
Vadim Puller ◽  
Pavel Sagulenko ◽  
Richard A. Neher

Abstract Natural selection imposes a complex filter on which variants persist in a population, resulting in evolutionary patterns that vary greatly along the genome. Some sites evolve close to neutrally, while others are highly conserved, allow only specific states or only change in concert with other sites. Most commonly used evolutionary models, however, ignore much of this complexity and at best account for variation in the rate at which different sites change. Here, we present an efficient algorithm to estimate more complex models that allow for site-specific preferences and explore the accuracy at which such models can be estimated from simulated data. We find that an iterative approximate maximum likelihood scheme uses information in the data efficiently and accurately estimates site-specific preferences from large data sets with moderately diverged sequences. Ignoring site-specific preferences during estimation of branch length of phylogenetic trees – an assumption of most phylogeny software – results in substantial underestimation comparable to the error incurred when ignoring rate variation. However, the joint estimation of branch lengths, site-specific rates, and site-specific preferences can suffer from identifiability problems and is typically unable to recover the correct branch lengths. Site-specific preferences estimated from large HIV pol alignments show qualitative concordance with intra-host estimates of fitness costs. Analysis of site-specific HIV substitution models suggests near saturation of divergence after a few hundred years. Such saturation can explain the inability to infer deep divergence times of HIV and SIVs using molecular clock approaches and time-dependent rate estimates.


Author(s):  
Giovanni Piccinini ◽  
Mariangela Iannello ◽  
Guglielmo Puccio ◽  
Federico Plazzi ◽  
Justin C Havird ◽  
...  

Abstract In Metazoa, 4 out of 5 complexes involved in oxidative phosphorylation (OXPHOS) are formed by subunits encoded by both the mitochondrial (mtDNA) and nuclear (nuDNA) genomes, leading to the expectation of mito-nuclear coevolution. Previous studies have supported co-adaptation of mitochondria-encoded (mtOXPHOS) and nuclear-encoded OXPHOS (nuOXPHOS) subunits, often specifically interpreted with regard to the “nuclear compensation hypothesis”, a specific form of mitonuclear coevolution where nuclear genes compensate for deleterious mitochondrial mutations owing to less efficient mitochondrial selection. In this study we analysed patterns of sequence evolution of 79 OXPHOS subunits in 31 bivalve species, a taxon showing extraordinary mtDNA variability and including species with “doubly uniparental” mtDNA inheritance. Our data showed strong and clear signals of mitonuclear coevolution. NuOXPHOS subunits had concordant topologies with mtOXPHOS subunits, contrary to previous phylogenies based on nuclear genes lacking mt interactions. Evolutionary rates between mt and nuOXPHOS subunits were also highly correlated compared to non-OXPHOS-interacting nuclear genes. Nuclear subunits of chimeric OXPHOS complexes (I, III, IV, and V) also had higher dN/dS ratios than Complex II, which is formed exclusively by nuDNA-encoded subunits. However, we did not find evidence of nuclear compensation: mitochondria-encoded subunits showed similar dN/dS ratios compared to nuclear-encoded subunits, contrary to most previously studied bilaterian animals. Moreover, no site-specific signals of compensatory positive selection were detected in nuOXPHOS genes. Our analyses extend the evidence for mitonuclear coevolution to a new taxonomic group, but we propose a reconsideration of the nuclear compensation hypothesis.


2018 ◽  
Vol 19 (12) ◽  
pp. 4039 ◽  
Author(s):  
Mi-Li Liu ◽  
Wei-Bing Fan ◽  
Ning Wang ◽  
Peng-Bin Dong ◽  
Ting-Ting Zhang ◽  
...  

Plant plastomes play crucial roles in species evolution and phylogenetic reconstruction studies due to being maternally inherited and due to the moderate evolutionary rate of genomes. However, patterns of sequence divergence and molecular evolution of the plastid genomes in the horticulturally- and economically-important Lonicera L. species are poorly understood. In this study, we collected the complete plastomes of seven Lonicera species and determined the various repeat sequence variations and protein sequence evolution by comparative genomic analysis. A total of 498 repeats were identified in plastid genomes, which included tandem (130), dispersed (277), and palindromic (91) types of repeat variations. Simple sequence repeat (SSR) elements analysis indicated the enriched SSRs in seven genomes to be mononucleotides, followed by tetra-nucleotides, dinucleotides, tri-nucleotides, hex-nucleotides, and penta-nucleotides. We identified 18 divergence hotspot regions (rps15, rps16, rps18, rpl23, psaJ, infA, ycf1, trnN-GUU-ndhF, rpoC2-rpoC1, rbcL-psaI, trnI-CAU-ycf2, psbZ-trnG-UCC, trnK-UUU-rps16, infA-rps8, rpl14-rpl16, trnV-GAC-rrn16, trnL-UAA intron, and rps12-clpP) that could be used as the potential molecular genetic markers for the further study of population genetics and phylogenetic evolution of Lonicera species. We found that a large number of repeat sequences were distributed in the divergence hotspots of plastid genomes. Interestingly, 16 genes were determined under positive selection, which included four genes for the subunits of ribosome proteins (rps7, rpl2, rpl16, and rpl22), three genes for the subunits of photosystem proteins (psaJ, psbC, and ycf4), three NADH oxidoreductase genes (ndhB, ndhH, and ndhK), two subunits of ATP genes (atpA and atpB), and four other genes (infA, rbcL, ycf1, and ycf2). Phylogenetic analysis based on the whole plastome demonstrated that the seven Lonicera species form a highly-supported monophyletic clade. 
The availability of these plastid genomes provides important genetic information for further species identification and biological research on Lonicera.


2019 ◽  
Vol 37 (5) ◽  
pp. 1495-1507 ◽  
Author(s):  
Zhengting Zou ◽  
Hongjiu Zhang ◽  
Yuanfang Guan ◽  
Jianzhi Zhang

Abstract Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification or insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl; last accessed January 3, 2020).


Plants ◽  
2020 ◽  
Vol 9 (3) ◽  
pp. 358
Author(s):  
Joan Pedrola-Monfort ◽  
David Lázaro-Gimeno ◽  
Carlos G. Boluda ◽  
Laia Pedrola ◽  
Alfonso Garmendia ◽  
...  

Among the most intriguing mysteries in the evolutionary biology of photosynthetic organisms are the genesis and consequences of the dramatic increase in the mitochondrial and nuclear genome sizes, together with the concomitant evolution of the three genetic compartments, particularly during the transition from water to land. To clarify the evolutionary trends in the mitochondrial genome of Archaeplastida, we analyzed the sequences from 37 complete genomes. Therefore, we utilized mitochondrial, plastidial and nuclear ribosomal DNA molecular markers on 100 species of Streptophyta for each subunit. Hierarchical models of sequence evolution were fitted to test the heterogeneity in the base composition. The best resulting phylogenies were used for reconstructing the ancestral Guanine-Cytosine (GC) content and equilibrium GC frequency (GC*) using non-homogeneous and non-stationary models fitted with a maximum likelihood approach. The mitochondrial genome length was strongly related to repetitive sequences across Archaeplastida evolution; however, the length seemed not to be linked to the other studied variables, as different lineages showed diverse evolutionary patterns. In contrast, Streptophyta exhibited a powerful positive relationship between the GC content, non-coding DNA, and repetitive sequences, while the evolution of Chlorophyta reflected a strong positive linear relationship between the genome length and the number of genes.


2020 ◽  
Vol 37 (9) ◽  
pp. 2747-2762 ◽  
Author(s):  
Guénola Drillon ◽  
Raphaël Champeimont ◽  
Francesco Oteri ◽  
Gilles Fischer ◽  
Alessandra Carbone

Abstract Gene order can be used as an informative character to reconstruct phylogenetic relationships between species independently from the local information present in gene/protein sequences. PhyChro is a reconstruction method based on chromosomal rearrangements, applicable to a wide range of eukaryotic genomes with different gene contents and levels of synteny conservation. For each synteny breakpoint issued from pairwise genome comparisons, the algorithm defines two disjoint sets of genomes, named partial splits, respectively, supporting the two block adjacencies defining the breakpoint. Considering all partial splits issued from all pairwise comparisons, a distance between two genomes is computed from the number of partial splits separating them. Tree reconstruction is achieved through a bottom-up approach by iteratively grouping sister genomes minimizing genome distances. PhyChro estimates branch lengths based on the number of synteny breakpoints and provides confidence scores for the branches. PhyChro performance is evaluated on two data sets of 13 vertebrates and 21 yeast genomes by using up to 130,000 and 179,000 breakpoints, respectively, a scale of genomic markers that has been out of reach until now. PhyChro reconstructs very accurate tree topologies even at known problematic branching positions. Its robustness has been benchmarked for different synteny block reconstruction methods. On simulated data PhyChro reconstructs phylogenies perfectly in almost all cases, and shows the highest accuracy compared with other existing tools. PhyChro is very fast, reconstructing the vertebrate and yeast phylogenies in <15 min.


2017 ◽  
Vol 66 (6) ◽  
pp. 917-933 ◽  
Author(s):  
Eli Levy Karin ◽  
Susann Wicke ◽  
Tal Pupko ◽  
Itay Mayrose

2019 ◽  
Vol 76 (6) ◽  
pp. 856-870 ◽  
Author(s):  
Skip McKinnell

Pulses of abundance in salmon migrations can arise from single populations arriving at different times, from multiple populations with different timing characteristics, or as a combination of these. Daily observations typically record an aggregate measure of abundance passing some location rather than the abundances of the individual components. An objective method is described that partitions a compound migration into its component parts by exploiting differences in the characteristics of each pulse. Simulated data were used to demonstrate when greater model complexity may be desirable. Three case studies of increasing complexity (Chilko Lake sockeye salmon smolts (Oncorhynchus nerka), large adult Columbia River Chinook salmon (Oncorhynchus tshawytscha), Fraser River salmon test fishery) demonstrate how the model can be applied in practice. Results indicated that Chilko Lake smolts rarely emigrate to sea as a single pulse, that the dates used to distinguish the spring run of Chinook salmon in the Columbia River may be overestimating its abundance, and that pulses of sockeye salmon abundance in a Fraser River ocean test fishery in 2014 may have arisen from some factor other than population composition.


2021 ◽  
Author(s):  
Mei Lin Tay

Phylogenetic analyses using molecular data were used to investigate biogeographic and evolutionary patterns of Australasian Plantago. The Internal Transcribed Spacers (ITS) from nuclear DNA, ndhF-rpl32 from chloroplast DNA and cox1 from mitochondrial DNA were selected from a primer assay of 24 primer pairs for further phylogenetic analyses. Phylogenetic reconstruction and molecular dating of a dataset concatenated from these regions comprising 20 Australasian Plantago species rejected a hypothesis of Gondwanan vicariance for the Australasian group. The phylogeny revealed three independent dispersal events from Australia to New Zealand that match the direction expected from West Wind Drift and ocean currents. Following this study, a dataset with 150 new ITS sequences from Australasian Plantago, combined with 89 Plantago sequences from previous studies, revealed that the New Zealand species appear to have a recent origin from Australia, not long after the formation of suitable habitats by the uplift of the Southern Alps (about 5 mya), followed by radiation. The ITS phylogeny also suggests that a single migration event of alpine species to lowland habitats has occurred and that recurrent polyploidy appears to be an important speciation mechanism in the genus. Species boundaries between New Zealand Plantago were unclear using both morphological and molecular data, which was a result of low genetic divergences and plastic morphology. The taxonomy of several New Zealand Plantago species needs revision based on the ITS phylogeny.

