Consistency and identifiability of the polymorphism-aware phylogenetic models

10.1101/718320 ◽

2019 ◽

Author(s):

Rui Borges ◽

Carolin Kosiol

Keyword(s):

Incomplete Lineage Sorting ◽

Simulated Data ◽

Population Level ◽

Species Tree ◽

Population Variation ◽

Necessary Condition ◽

Lineage Sorting ◽

Data Set ◽

Genome Wide ◽

Phylogenetic Models

Abstract Polymorphism-aware phylogenetic models (PoMo) constitute an alternative approach for species tree estimation from genome-wide data. PoMo builds on the standard substitution models of DNA evolution but expands the classic alphabet of the four nucleotide bases to include polymorphic states. By doing so, PoMo accounts for ancestral and current intra-population variation, while also accommodating population-level processes ruling the substitution process (e.g., genetic drift, mutations, allelic selection). PoMo has proven to be a valuable tool in several phylogenetic applications, but a proof of statistical consistency (and identifiability, a necessary condition for consistency) is lacking. Here, we prove that PoMo is identifiable and, using this result, we further show that the maximum a posteriori (MAP) tree estimator of PoMo is a consistent estimator of the species tree. We complement our theoretical results with a simulated data set mimicking the diversity observed in natural populations exhibiting incomplete lineage sorting. We implemented PoMo in a Bayesian framework and show that the MAP tree easily recovers the true tree for typical numbers of sites that are sampled in genome-wide analyses.
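The expanded alphabet mentioned in the abstract can be illustrated with a minimal sketch. It assumes a virtual population of size `n` and at most two alleles segregating per site, which is how PoMo state spaces are usually described; the function name `pomo_states` and the dictionary representation are hypothetical choices made here for illustration, not the authors' implementation.

```python
from itertools import combinations

NUCLEOTIDES = ("A", "C", "G", "T")


def pomo_states(n):
    """Enumerate PoMo-style states for an assumed virtual population of size n.

    Each state is a dict mapping an allele to its count in the population.
    Fixed states carry one allele at count n; polymorphic states carry two
    alleles whose counts sum to n (biallelic assumption).
    """
    if n < 2:
        raise ValueError("n must be at least 2 to allow polymorphic states")
    # The four classic nucleotide states, read as fixed alleles.
    states = [{base: n} for base in NUCLEOTIDES]
    # For each of the 6 unordered allele pairs, n - 1 polymorphic states.
    for a, b in combinations(NUCLEOTIDES, 2):
        for i in range(1, n):
            states.append({a: i, b: n - i})
    return states


states = pomo_states(10)
print(len(states))  # 4 + 6 * (10 - 1) = 58
```

Under these assumptions the alphabet grows linearly with `n`, from the 4 classic bases to 4 + 6(n - 1) states.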

Divergence estimation in the presence of incomplete lineage sorting and migration

10.1101/174342 ◽

2017 ◽

Author(s):

Graham Jones

Keyword(s):

Incomplete Lineage Sorting ◽

Simulated Data ◽

Species Tree ◽

Lineage Sorting ◽

Data Set ◽

Migration Rates ◽

Isolation With Migration ◽

Multispecies Coalescent ◽

Tree Inference ◽

And Migration

Abstract This paper focuses on the problem of estimating a species tree from multilocus data in the presence of incomplete lineage sorting and migration. We develop a mathematical model similar to IMa2 (Hey 2010) for the relevant evolutionary processes which allows both the population size parameters and the migration rates between pairs of species tree branches to be integrated out. We then describe a BEAST2 package, DENIM, which is based on this model and which uses an approximation to sample from the posterior. The approximation is based on the assumption that migrations are rare, and it only samples from certain regions of the posterior which seem likely given this assumption. The method breaks down if there is a lot of migration. Using simulations, Leaché et al. (2014) showed that migration causes problems for species tree inference using the multispecies coalescent when migration is present but ignored. We re-analyze these simulated data to explore DENIM’s performance, and demonstrate substantial improvements over *BEAST. We also re-analyze an empirical data set. [isolation-with-migration; incomplete lineage sorting; multispecies coalescent; species tree; phylogenetic analysis; Bayesian; Markov chain Monte Carlo]

An ancestral recombination graph of human, Neanderthal, and Denisovan genomes

Science Advances ◽

10.1126/sciadv.abc0776 ◽

2021 ◽

Vol 7 (29) ◽

pp. eabc0776

Author(s):

Nathan K. Schaefer ◽

Beth Shapiro ◽

Richard E. Green

Keyword(s):

Incomplete Lineage Sorting ◽

Simulated Data ◽

Modern Human ◽

Ancestral Recombination Graph ◽

Lineage Sorting ◽

Human Genomes ◽

Genome Wide ◽

A Genome ◽

Graph Inference ◽

And Function

Many humans carry genes from Neanderthals, a legacy of past admixture. Existing methods detect this archaic hominin ancestry within human genomes using patterns of linkage disequilibrium or direct comparison to Neanderthal genomes. Each of these methods is limited in sensitivity and scalability. We describe a new ancestral recombination graph inference algorithm that scales to large genome-wide datasets and demonstrate its accuracy on real and simulated data. We then generate a genome-wide ancestral recombination graph including human and archaic hominin genomes. From this, we generate a map within human genomes of archaic ancestry and of genomic regions not shared with archaic hominins either by admixture or incomplete lineage sorting. We find that only 1.5 to 7% of the modern human genome is uniquely human. We also find evidence of multiple bursts of adaptive changes specific to modern humans within the past 600,000 years involving genes related to brain development and function.

Disentangling Sources of Gene Tree Discordance in Phylogenomic Data Sets: Testing Ancient Hybridizations in Amaranthaceae s.l

Systematic Biology ◽

10.1093/sysbio/syaa066 ◽

2020 ◽

Cited By ~ 1

Author(s):

Diego F Morales-Briones ◽

Gudrun Kadereit ◽

Delphine T Tefarikis ◽

Michael J Moore ◽

Stephen A Smith ◽

...

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Species Tree ◽

Data Sets ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Gene Tree Discordance

Abstract Gene tree discordance in large genomic data sets can be caused by evolutionary processes such as incomplete lineage sorting and hybridization, as well as model violation, and errors in data processing, orthology inference, and gene tree estimation. Species tree methods that identify and accommodate all sources of conflict are not available, but a combination of multiple approaches can help tease apart alternative sources of conflict. Here, using a phylotranscriptomic analysis in combination with reference genomes, we test a hypothesis of ancient hybridization events within the plant family Amaranthaceae s.l. that was previously supported by morphological, ecological, and Sanger-based molecular data. The data set included seven genomes and 88 transcriptomes, 17 generated for this study. We examined gene-tree discordance using coalescent-based species trees and network inference, gene tree discordance analyses, site pattern tests of introgression, topology tests, synteny analyses, and simulations. We found that a combination of processes might have generated the high levels of gene tree discordance in the backbone of Amaranthaceae s.l. Furthermore, we found evidence that three consecutive short internal branches produce anomalous trees contributing to the discordance. Overall, our results suggest that Amaranthaceae s.l. might be a product of an ancient and rapid lineage diversification, and remains, and probably will remain, unresolved. This work highlights the potential problems of identifiability associated with the sources of gene tree discordance including, in particular, phylogenetic network methods. Our results also demonstrate the importance of thoroughly testing for multiple sources of conflict in phylogenomic analyses, especially in the context of ancient, rapid radiations. We provide several recommendations for exploring conflicting signals in such situations. 
[Amaranthaceae; gene tree discordance; hybridization; incomplete lineage sorting; phylogenomics; species network; species tree; transcriptomics.]

Terraces in Gene Tree Reconciliation-Based Species Tree Inference

10.1101/2020.04.17.047092 ◽

2020 ◽

Author(s):

Michael J. Sanderson ◽

Michelle M. McMahon ◽

Mike Steel

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Solution Space ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Tree Reconciliation ◽

The Impact

Abstract Terraces in phylogenetic tree space are sets of trees with identical optimality scores for a given data set, arising from missing data. These were first described for multilocus phylogenetic data sets in the context of maximum parsimony inference and maximum likelihood inference under certain model assumptions. Here we show how the mathematical properties that lead to terraces extend to gene tree - species tree problems in which the gene trees are incomplete. Inference of species trees from either sets of gene family trees subject to duplication and loss, or allele trees subject to incomplete lineage sorting, can exhibit terraces in their solution space. First, we show conditions that lead to a new kind of terrace, which stems from subtree operations that appear in reconciliation problems for incomplete trees. Then we characterize when terraces of both types can occur when the optimality criterion for tree search is based on duplication, loss or deep coalescence scores. Finally, we examine the impact of assumptions about the causes of losses: whether they are due to imperfect sampling or true evolutionary deletion.

Practical Aspects of Phylogenetic Network Analysis Using PhyloNet

10.1101/746362 ◽

2019 ◽

Author(s):

Zhen Cao ◽

Xinhao Liu ◽

Huw A. Ogilvie ◽

Zhi Yan ◽

Luay Nakhleh

Keyword(s):

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Synthetic Data ◽

Simulated Data ◽

Single Species ◽

Phylogenetic Networks ◽

Lineage Sorting ◽

Data Set ◽

Types Of Information ◽

Analyze Data

Abstract Phylogenetic networks extend trees to enable simultaneous modeling of both vertical and horizontal evolutionary processes. PhyloNet is a software package that has been under constant development for over 10 years and includes a wide array of functionalities for inferring and analyzing phylogenetic networks. These functionalities differ in terms of the input data they require, the criteria and models they employ, and the types of information they allow one to infer about the networks beyond their topologies. Furthermore, PhyloNet includes functionalities for simulating synthetic data on phylogenetic networks, quantifying the topological differences between phylogenetic networks, and evaluating evolutionary hypotheses given in the form of phylogenetic networks. In this paper, we use a simulated data set to illustrate the use of several of PhyloNet’s functionalities and make recommendations on how to analyze data sets and interpret the results when using these functionalities. All inference methods that we illustrate are incomplete lineage sorting (ILS) aware; that is, they account for the potential of ILS in the data while inferring the phylogenetic network. While the models do not include gene duplication and loss, we discuss how the methods can be used to analyze data in the presence of polyploidy. The concept of species is irrelevant for the computational analyses enabled by PhyloNet in that species-individual mappings are user-defined. Consequently, none of the functionalities in PhyloNet deals with the task of species delimitation. In this sense, the data being analyzed could come from different individuals within a single species, in which case population structure along with potential gene flow is inferred (assuming the data have sufficient signal), or from different individuals sampled from different species, in which case the species phylogeny is being inferred.

An ABBA-BABA Test for Introgression Using Retroposon Insertion Data

10.1101/709477 ◽

2019 ◽

Cited By ~ 2

Author(s):

Mark S. Springer ◽

John Gatesy

Keyword(s):

Dna Sequences ◽

Incomplete Lineage Sorting ◽

Species Tree ◽

Data Sets ◽

Lineage Sorting ◽

Sequence Alignments ◽

Data Set ◽

Short Branch ◽

Phylogenetic Methods ◽

Branch Lengths

Abstract DNA sequence alignments provide the majority of data for inferring phylogenetic relationships with both concatenation and coalescence methods. However, DNA sequences are susceptible to extensive homoplasy, especially for deep divergences in the Tree of Life. Retroposon insertions have emerged as a powerful alternative to sequences for deciphering evolutionary relationships because these data are nearly homoplasy-free. In addition, retroposon insertions satisfy the ‘no intralocus recombination’ assumption of summary coalescence methods because they are singular events and better approximate neutrality relative to DNA sequences commonly applied in phylogenomic work. Retroposons have traditionally been analyzed with phylogenetic methods that ignore incomplete lineage sorting (ILS). Here, we analyze three retroposon data sets for mammals (Placentalia, Laurasiatheria, Balaenopteroidea) with two different ILS-aware methods. The first approach constructs a species tree from retroposon bipartitions with ASTRAL, and the second is a modification of SVD-Quartets. We also develop a χ² Quartet-Asymmetry Test to detect hybridization using retroposon data. Both coalescence methods recovered the same topology for each of the three data sets. The ASTRAL species tree for Laurasiatheria has consecutive short branch lengths that are consistent with an anomaly zone situation. For the Balaenopteroidea data set, which includes rorquals (Balaenopteridae) and gray whale (Eschrichtiidae), both coalescence methods recovered a topology that supports the paraphyly of Balaenopteridae. Application of the χ² Quartet-Asymmetry Test to this data set detected 16 different quartets of species for which historical hybridization may be inferred, but significant asymmetry was not detected in the placental root and Laurasiatheria analyses.
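The logic behind a quartet-asymmetry test can be sketched as follows. The abstract does not give the statistic, so this is only one plausible form, assuming the usual ABBA-BABA reasoning: under ILS alone the two minority quartet topologies are expected in equal numbers, and a one-degree-of-freedom chi-square test checks for departure from that symmetry. The function name `quartet_asymmetry` and the example counts are hypothetical and not taken from the paper.

```python
import math


def quartet_asymmetry(count_1, count_2):
    """Chi-square test (1 df) for unequal support of two minority topologies.

    count_1, count_2: numbers of insertions supporting each of the two
    quartet resolutions that conflict with the species tree (assumed inputs).
    Returns (statistic, p_value).
    """
    total = count_1 + count_2
    if total == 0:
        raise ValueError("at least one conflicting insertion is required")
    # Expected count for each topology under symmetry is total / 2, so the
    # Pearson statistic simplifies to (count_1 - count_2)^2 / total.
    statistic = (count_1 - count_2) ** 2 / total
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts: 30 insertions support one minor topology, 10 the other.
stat, p = quartet_asymmetry(30, 10)
print(stat, p)
```

A small p-value would suggest asymmetry beyond what ILS alone predicts; when many quartets are tested, as in the 16 quartets reported above, a multiple-testing correction would normally be needed as well.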

Multi-allele species reconstruction using ASTRAL

10.1101/439489 ◽

2018 ◽

Author(s):

Maryam Rabiee ◽

Erfan Sayyari ◽

Siavash Mirarab

Keyword(s):

Incomplete Lineage Sorting ◽

Search Space ◽

Species Tree ◽

Reconstruction Method ◽

Simulation Studies ◽

Gene Trees ◽

Lineage Sorting ◽

Tree Reconstruction ◽

Genome Wide ◽

Quartet Distance

Abstract Genome-wide phylogeny reconstruction is becoming increasingly common, and one driving factor behind these phylogenomic studies is the promise that the potential discordance between gene trees and the species tree can be modeled. Incomplete lineage sorting is one cause of discordance that bridges population genetic and phylogenetic processes. ASTRAL is a species tree reconstruction method that seeks to find the tree with minimum quartet distance to an input set of inferred gene trees. However, the published ASTRAL algorithm only works with one sample per species. To account for polymorphisms in present-day species, one can sample multiple individuals per species to create multi-allele datasets. Here, we introduce how ASTRAL can handle multi-allele datasets. We show that the quartet-based optimization problem extends naturally, and we introduce heuristic methods for building the search space specifically for the case of multi-individual datasets. We study the accuracy and scalability of the multi-individual version of ASTRAL-III using extensive simulation studies and compare it to NJst, the only other scalable method that can handle these datasets. We do not find strong evidence that using multiple individuals dramatically improves accuracy. When we study the trade-off between sampling more genes versus more individuals, we find that sampling more genes is more effective than sampling more individuals, even under conditions that we study where trees are shallow (median length: ≈ 1Ne) and ILS is extremely high.

Partitioned Gene-Tree Analyses and Gene-Based Topology Testing Help Resolve Incongruence in a Phylogenomic Study of Host-Specialist Bees (Apidae: Eucerinae)

Molecular Biology and Evolution ◽

10.1093/molbev/msaa277 ◽

2020 ◽

Author(s):

Felipe V Freitas ◽

Michael G Branstetter ◽

Terry Griswold ◽

Eduardo A B Almeida

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Lineage Sorting ◽

Data Set ◽

Analytical Strategy ◽

Analytical Approaches ◽

Phylogenomic Study

Abstract Incongruence among phylogenetic results has become a common occurrence in analyses of genome-scale data sets. Incongruence originates from uncertainty in underlying evolutionary processes (e.g., incomplete lineage sorting) and from difficulties in determining the best analytical approaches for each situation. To overcome these difficulties, more studies are needed that identify incongruences and demonstrate practical ways to confidently resolve them. Here, we present results of a phylogenomic study based on the analysis of 197 taxa and 2,526 ultraconserved element (UCE) loci. We investigate evolutionary relationships of Eucerinae, a diverse subfamily of apid bees (relatives of honey bees and bumble bees) with >1,200 species. We sampled representatives of all tribes within the group and >80% of genera, including two mysterious South American genera, Chilimalopsis and Teratognatha. Initial analysis of the UCE data revealed two conflicting hypotheses for relationships among tribes. To resolve the incongruence, we tested concatenation and species tree approaches and used a variety of additional strategies including locus filtering, partitioned gene-tree searches, and gene-based topological tests. We show that within-locus partitioning improves gene tree and subsequent species-tree estimation, and that this approach confidently resolves the incongruence observed in our data set. After exploring our proposed analytical strategy on eucerine bees, we validated its efficacy to resolve hard phylogenetic problems by implementing it on a published UCE data set of Adephaga (Insecta: Coleoptera). Our results provide a robust phylogenetic hypothesis for Eucerinae and demonstrate a practical strategy for resolving incongruence in other phylogenomic data sets.

Pervasive selection biases inferences of the species tree

10.1101/2020.07.30.228965 ◽

2020 ◽

Author(s):

Rui Borges ◽

Bastien Boussau ◽

Gergely Szollosi ◽

Carolin Kosiol

Keyword(s):

Simulated Data ◽

Population Level ◽

Fruit Fly ◽

Genetic Distances ◽

Species Tree ◽

Molecular Dating ◽

Genome Wide ◽

Genome Wide Data ◽

Gc Bias ◽

The Impact

Despite the importance of natural selection in species' evolutionary history, phylogenetic methods that take into account population-level processes ignore selection. Assuming neutrality is often based on the idea that selection occurs at a minority of loci in the genome and is unlikely to significantly compromise phylogenetic inferences. However, selection might act more pervasively, as is the case for nearly neutral mutations. Genome-wide processes like GC-bias and some of the variation segregating in coding regions are known to evolve in the nearly neutral range. As we are now using genome-wide data to estimate species trees, it is only natural to ask whether weak, but pervasive, selection is likely to blur species tree inferences. Here, we employed a polymorphism-aware phylogenetic model, specially tailored for measuring signatures of nucleotide usage biases, to test the impact of near neutrality on the substitution process. Analyses with simulated data indicate that while the inferred relationships among species are not significantly compromised, the genetic distances are systematically underestimated, with the deeper nodes suffering more than the younger ones. Such biases have implications for molecular dating. We found signatures of GC-bias considerably affecting the estimated divergence times (up to 21%) of worldwide fruit fly populations. Our findings highlight the need to account for nearly neutral forces (or any other form of pervasive selection) when quantifying divergence or dating species evolution.

PoMo: An Allele Frequency-based Approach for Species Tree Estimation

10.1101/016360 ◽

2015 ◽

Author(s):

Nicola De Maio ◽

Dominik Schrempf ◽

Carolin Kosiol

Keyword(s):

Allele Frequency ◽

Phylogenetic Trees ◽

Incomplete Lineage Sorting ◽

Species Tree ◽

Efficient Estimation ◽

Species Variation ◽

Species Trees ◽

Lineage Sorting ◽

Genome Wide ◽

Tree Estimation

Incomplete lineage sorting can cause incongruencies between the overall species-level phylogenetic tree and the phylogenetic trees for individual genes or genomic segments. If these incongruencies are not accounted for, several biases can arise in species tree estimation. Here, we present a simple maximum likelihood approach that accounts for ancestral variation and incomplete lineage sorting. We use a POlymorphisms-aware phylogenetic MOdel (PoMo) that we have recently shown to efficiently estimate mutation rates and fixation biases from within- and between-species variation data. We extend this model to perform efficient estimation of species trees. We test the performance of PoMo in several different scenarios of incomplete lineage sorting using simulations and compare it with existing methods both in accuracy and computational speed. In contrast to other approaches, our model does not use coalescent theory but is allele-frequency based. We show that PoMo is well suited for genome-wide species tree estimation and that on such data it is more accurate than previous approaches.