scholarly journals Using a GTR+Γ substitution model for dating sequence divergence when stationarity and time-reversibility assumptions are violated

2020 ◽  
Author(s):  
Jose Barba-Montoya ◽  
Qiqing Tao ◽  
Sudhir Kumar

AbstractMotivationAs the number and diversity of species and genes grow in contemporary datasets, two common assumptions made in all molecular dating methods, namely the time-reversibility and stationarity of the substitution process, become untenable. No software tools for molecular dating allow researchers to relax these two assumptions in their data analyses. Frequently the same General Time Reversible (GTR) model across lineages along with a gamma (+Γ) distributed rates across sites is used in relaxed clock analyses, which assumes time-reversibility and stationarity of the substitution process. Many reports have quantified the impact of violations of these underlying assumptions on molecular phylogeny, but none have systematically analyzed their impact on divergence time estimates.ResultsWe quantified the bias on time estimates that resulted from using the GTR+Γ model for the analysis of computer-simulated nucleotide sequence alignments that were evolved with non-stationary (NS) and non-reversible (NR) substitution models. We tested Bayesian and RelTime approaches that do not require a molecular clock for estimating divergence times. Divergence times obtained using a GTR+Γ model differed only slightly (∼3% on average) from the expected times for NR datasets, but the difference was larger for NS datasets (∼10% on average). The use of only a few calibrations reduced these biases considerably (∼5%). Confidence and credibility intervals from GTR+Γ analysis usually contained correct times. Therefore, the bias introduced by the use of the GTR+Γ model to analyze datasets, in which the time-reversibility and stationarity assumptions are violated, is likely not large and can be reduced by applying multiple calibrations.AvailabilityAll datasets are deposited in Figshare: https://doi.org/10.6084/[email protected]

2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i884-i894
Author(s):  
Jose Barba-Montoya ◽  
Qiqing Tao ◽  
Sudhir Kumar

Abstract Motivation As the number and diversity of species and genes grow in contemporary datasets, two common assumptions made in all molecular dating methods, namely the time-reversibility and stationarity of the substitution process, become untenable. No software tools for molecular dating allow researchers to relax these two assumptions in their data analyses. Frequently the same General Time Reversible (GTR) model across lineages along with a gamma (+Γ) distributed rates across sites is used in relaxed clock analyses, which assumes time-reversibility and stationarity of the substitution process. Many reports have quantified the impact of violations of these underlying assumptions on molecular phylogeny, but none have systematically analyzed their impact on divergence time estimates. Results We quantified the bias on time estimates that resulted from using the GTR + Γ model for the analysis of computer-simulated nucleotide sequence alignments that were evolved with non-stationary (NS) and non-reversible (NR) substitution models. We tested Bayesian and RelTime approaches that do not require a molecular clock for estimating divergence times. Divergence times obtained using a GTR + Γ model differed only slightly (∼3% on average) from the expected times for NR datasets, but the difference was larger for NS datasets (∼10% on average). The use of only a few calibrations reduced these biases considerably (∼5%). Confidence and credibility intervals from GTR + Γ analysis usually contained correct times. Therefore, the bias introduced by the use of the GTR + Γ model to analyze datasets, in which the time-reversibility and stationarity assumptions are violated, is likely not large and can be reduced by applying multiple calibrations. Availability and implementation All datasets are deposited in Figshare: https://doi.org/10.6084/m9.figshare.12594638.


2020 ◽  
Author(s):  
Qiqing Tao ◽  
Jose Barba-Montoya ◽  
Louise A. Huuki ◽  
Mary Kathleen Durnan ◽  
Sudhir Kumar

AbstractThe conventional wisdom in molecular evolution is to apply parameter-rich models of nucleotide and amino acid substitutions for estimating divergence times. However, the actual extent of the difference between time estimates produced by highly complex models compared to those from simple models is yet to be quantified for contemporary datasets that frequently contain sequences from many species and genes. In a reanalysis of many large multispecies alignments from diverse groups of taxa using the same tree topologies and calibrations, we found that the use of the simplest models can produce divergence time estimates and credibility intervals similar to those obtained from the complex models applied in the original studies. This result is surprising because the use of simple models underestimates sequence divergence for all the datasets analyzed. We find three fundamental reasons for the observed robustness of time estimates to model complexity in many practical datasets. First, the estimates of branch lengths and node-to-tip distances under the simplest model show an approximately linear relationship with those produced by using the most complex models applied, especially for datasets with many sequences. Second, relaxed clock methods automatically adjust rates on branches that experience considerable underestimation of sequence divergences, resulting in time estimates that are similar to those from complex models. And, third, the inclusion of even a few good calibrations in an analysis can reduce the difference in time estimates from simple and complex models. The robustness of time estimates to models complexity in these empirical data analyses is encouraging, because all phylogenomics studies use statistical models that are oversimplified descriptions of actual evolutionary substitution processes.


2020 ◽  
Vol 37 (6) ◽  
pp. 1819-1831
Author(s):  
Qiqing Tao ◽  
Jose Barba-Montoya ◽  
Louise A Huuki ◽  
Mary Kathleen Durnan ◽  
Sudhir Kumar

Abstract The conventional wisdom in molecular evolution is to apply parameter-rich models of nucleotide and amino acid substitutions for estimating divergence times. However, the actual extent of the difference between time estimates produced by highly complex models compared with those from simple models is yet to be quantified for contemporary data sets that frequently contain sequences from many species and genes. In a reanalysis of many large multispecies alignments from diverse groups of taxa, we found that the use of the simplest models can produce divergence time estimates and credibility intervals similar to those obtained from the complex models applied in the original studies. This result is surprising because the use of simple models underestimates sequence divergence for all the data sets analyzed. We found three fundamental reasons for the observed robustness of time estimates to model complexity in many practical data sets. First, the estimates of branch lengths and node-to-tip distances under the simplest model show an approximately linear relationship with those produced by using the most complex models applied on data sets with many sequences. Second, relaxed clock methods automatically adjust rates on branches that experience considerable underestimation of sequence divergences, resulting in time estimates that are similar to those from complex models. And, third, the inclusion of even a few good calibrations in an analysis can reduce the difference in time estimates from simple and complex models. The robustness of time estimates to model complexity in these empirical data analyses is encouraging, because all phylogenomics studies use statistical models that are oversimplified descriptions of actual evolutionary substitution processes.


2019 ◽  
Author(s):  
Qiqing Tao ◽  
Koichiro Tamura ◽  
Beatriz Mello ◽  
Sudhir Kumar

AbstractConfidence intervals (CIs) depict the statistical uncertainty surrounding evolutionary divergence time estimates. They capture variance contributed by the finite number of sequences and sites used in the alignment, deviations of evolutionary rates from a strict molecular clock in a phylogeny, and uncertainty associated with clock calibrations. Reliable tests of biological hypotheses demand reliable CIs. However, current non-Bayesian methods may produce unreliable CIs because they do not incorporate rate variation among lineages and interactions among clock calibrations properly. Here, we present a new analytical method to calculate CIs of divergence times estimated using the RelTime method, along with an approach to utilize multiple calibration uncertainty densities in these analyses. Empirical data analyses showed that the new methods produce CIs that overlap with Bayesian highest posterior density (HPD) intervals. In the analysis of computer-simulated data, we found that RelTime CIs show excellent average coverage probabilities, i.e., the true time is contained within the CIs with a 95% probability. These developments will encourage broader use of computationally-efficient RelTime approach in molecular dating analyses and biological hypothesis testing.


2013 ◽  
Vol 299 (3) ◽  
pp. 585-601 ◽  
Author(s):  
Kathrin Feldberg ◽  
Jochen Heinrichs ◽  
Alexander R. Schmidt ◽  
Jiří Váňa ◽  
Harald Schneider

2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
James A. Schulte

Methods for estimating divergence times from molecular data have improved dramatically over the past decade, yet there are few studies examining alternative taxon sampling effects on node age estimates. Here, I investigate the effect of undersampling species diversity on node ages of the South American lizard clade Liolaemini using several alternative subsampling strategies for both time calibrations and taxa numbers. Penalized likelihood (PL) and Bayesian molecular dating analyses were conducted on a densely sampled (202 taxa) mtDNA-based phylogenetic hypothesis of Iguanidae, including 92 Liolaemini species. Using all calibrations and penalized likelihood, clades with very low taxon sampling had node age estimates younger than clades with more complete taxon sampling. The effect of Bayesian and PL methods differed when either one or two calibrations only were used with dense taxon sampling. Bayesian node ages were always older when fewer calibrations were used, whereas PL node ages were always younger. This work reinforces two important points: (1) whenever possible, authors should strongly consider adding as many taxa as possible, including numerous outgroups, prior to node age estimation to avoid considerable node age underestimation and (2) using more, critically assessed, and accurate fossil calibrations should yield improved divergence time estimates.


2019 ◽  
Vol 69 (4) ◽  
pp. 660-670 ◽  
Author(s):  
Tom Carruthers ◽  
Michael J Sanderson ◽  
Robert W Scotland

Abstract Rate variation adds considerable complexity to divergence time estimation in molecular phylogenies. Here, we evaluate the impact of lineage-specific rates—which we define as among-branch-rate-variation that acts consistently across the entire genome. We compare its impact to residual rates—defined as among-branch-rate-variation that shows a different pattern of rate variation at each sampled locus, and gene-specific rates—defined as variation in the average rate across all branches at each sampled locus. We show that lineage-specific rates lead to erroneous divergence time estimates, regardless of how many loci are sampled. Further, we show that stronger lineage-specific rates lead to increasing error. This contrasts to residual rates and gene-specific rates, where sampling more loci significantly reduces error. If divergence times are inferred in a Bayesian framework, we highlight that error caused by lineage-specific rates significantly reduces the probability that the 95% highest posterior density includes the correct value, and leads to sensitivity to the prior. Use of a more complex rate prior—which has recently been proposed to model rate variation more accurately—does not affect these conclusions. Finally, we show that the scale of lineage-specific rates used in our simulation experiments is comparable to that of an empirical data set for the angiosperm genus Ipomoea. Taken together, our findings demonstrate that lineage-specific rates cause error in divergence time estimates, and that this error is not overcome by analyzing genomic scale multilocus data sets. [Divergence time estimation; error; rate variation.]


2019 ◽  
Vol 99 (1) ◽  
pp. 105-367 ◽  
Author(s):  
Mao-Qiang He ◽  
Rui-Lin Zhao ◽  
Kevin D. Hyde ◽  
Dominik Begerow ◽  
Martin Kemler ◽  
...  

AbstractThe Basidiomycota constitutes a major phylum of the kingdom Fungi and is second in species numbers to the Ascomycota. The present work provides an overview of all validly published, currently used basidiomycete genera to date in a single document. An outline of all genera of Basidiomycota is provided, which includes 1928 currently used genera names, with 1263 synonyms, which are distributed in 241 families, 68 orders, 18 classes and four subphyla. We provide brief notes for each accepted genus including information on classification, number of accepted species, type species, life mode, habitat, distribution, and sequence information. Furthermore, three phylogenetic analyses with combined LSU, SSU, 5.8s, rpb1, rpb2, and ef1 datasets for the subphyla Agaricomycotina, Pucciniomycotina and Ustilaginomycotina are conducted, respectively. Divergence time estimates are provided to the family level with 632 species from 62 orders, 168 families and 605 genera. Our study indicates that the divergence times of the subphyla in Basidiomycota are 406–430 Mya, classes are 211–383 Mya, and orders are 99–323 Mya, which are largely consistent with previous studies. In this study, all phylogenetically supported families were dated, with the families of Agaricomycotina diverging from 27–178 Mya, Pucciniomycotina from 85–222 Mya, and Ustilaginomycotina from 79–177 Mya. Divergence times as additional criterion in ranking provide additional evidence to resolve taxonomic problems in the Basidiomycota taxonomic system, and also provide a better understanding of their phylogeny and evolution.


2019 ◽  
Vol 69 (1) ◽  
pp. 1-16 ◽  
Author(s):  
Yuan Nie ◽  
Charles S P Foster ◽  
Tianqi Zhu ◽  
Ru Yao ◽  
David A Duchêne ◽  
...  

Abstract Establishing an accurate evolutionary timescale for green plants (Viridiplantae) is essential to understanding their interaction and coevolution with the Earth’s climate and the many organisms that rely on green plants. Despite being the focus of numerous studies, the timing of the origin of green plants and the divergence of major clades within this group remain highly controversial. Here, we infer the evolutionary timescale of green plants by analyzing 81 protein-coding genes from 99 chloroplast genomes, using a core set of 21 fossil calibrations. We test the sensitivity of our divergence-time estimates to various components of Bayesian molecular dating, including the tree topology, clock models, clock-partitioning schemes, rate priors, and fossil calibrations. We find that the choice of clock model affects date estimation and that the independent-rates model provides a better fit to the data than the autocorrelated-rates model. Varying the rate prior and tree topology had little impact on age estimates, with far greater differences observed among calibration choices and clock-partitioning schemes. Our analyses yield date estimates ranging from the Paleoproterozoic to Mesoproterozoic for crown-group green plants, and from the Ediacaran to Middle Ordovician for crown-group land plants. We present divergence-time estimates of the major groups of green plants that take into account various sources of uncertainty. Our proposed timeline lays the foundation for further investigations into how green plants shaped the global climate and ecosystems, and how embryophytes became dominant in terrestrial environments.


2019 ◽  
Vol 37 (1) ◽  
pp. 280-290 ◽  
Author(s):  
Qiqing Tao ◽  
Koichiro Tamura ◽  
Beatriz Mello ◽  
Sudhir Kumar

Abstract Confidence intervals (CIs) depict the statistical uncertainty surrounding evolutionary divergence time estimates. They capture variance contributed by the finite number of sequences and sites used in the alignment, deviations of evolutionary rates from a strict molecular clock in a phylogeny, and uncertainty associated with clock calibrations. Reliable tests of biological hypotheses demand reliable CIs. However, current non-Bayesian methods may produce unreliable CIs because they do not incorporate rate variation among lineages and interactions among clock calibrations properly. Here, we present a new analytical method to calculate CIs of divergence times estimated using the RelTime method, along with an approach to utilize multiple calibration uncertainty densities in dating analyses. Empirical data analyses showed that the new methods produce CIs that overlap with Bayesian highest posterior density intervals. In the analysis of computer-simulated data, we found that RelTime CIs show excellent average coverage probabilities, that is, the actual time is contained within the CIs with a 94% probability. These developments will encourage broader use of computationally efficient RelTime approaches in molecular dating analyses and biological hypothesis testing.


Sign in / Sign up

Export Citation Format

Share Document