species tree inference
Recently Published Documents


2021 ◽  
Author(s):  
Baqiao Liu ◽  
Tandy Warnow

Species tree inference under the multi-species coalescent (MSC) model is a basic step in biological discovery. Despite the developments in recent years of methods that are proven statistically consistent and that have high accuracy, large datasets create computational challenges. Although there is generally some information available about the species trees that could be used to speed up the estimation, only one method, ASTRAL-J, a recent development in the ASTRAL family of methods, is able to use this information. Here we describe two new methods, NJst-J and FASTRAL-J, that can estimate the species tree given partial knowledge of the species tree in the form of a non-binary unrooted constraint tree. We show that both NJst-J and FASTRAL-J are much faster than ASTRAL-J and we prove that all three methods are statistically consistent under the multi-species coalescent model subject to this constraint. Our extensive simulation study shows that both FASTRAL-J and NJst-J provide advantages over ASTRAL-J: both are faster (and NJst-J is particularly fast), and FASTRAL-J is generally at least as accurate as ASTRAL-J. An analysis of the Avian Phylogenomics project dataset with 48 species and 14,446 genes presents additional evidence of the value of FASTRAL-J over ASTRAL-J (and both over ASTRAL), with dramatic reductions in running time (20 hours for default ASTRAL, and minutes or seconds for ASTRAL-J and FASTRAL-J, respectively). Availability: FASTRAL-J and NJst-J are available in open source form at https://github.com/RuneBlaze/FASTRAL-constrained and https://github.com/RuneBlaze/NJst-constrained. Locations of the datasets used in this study and detailed commands needed to reproduce the study are provided in the supplementary materials at http://tandy.cs.illinois.edu/baqiao-suppl.pdf.
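
For readers unfamiliar with the NJst family, the following is a minimal sketch of the distance-matrix step that unconstrained NJst-style methods are generally described as using: average, over gene trees, the internode distance between each pair of species. It is not the authors' code and does not implement the constraint handling of NJst-J. The tree encoding (an edge list plus a leaf list) and both function names are assumptions made for illustration, and the internode distance is counted here as the number of internal nodes on the path; counting edges instead would shift every value by one.

```python
from collections import defaultdict, deque


def internode_distances(edges, leaves):
    """Internal nodes on the path between each leaf pair of one unrooted tree.

    `edges` is a list of (node, node) pairs and `leaves` a list of leaf
    labels (hypothetical encoding chosen for this sketch).
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    dist = {}
    for src in leaves:
        # Breadth-first search gives the number of edges from src to every node.
        seen = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        for dst in leaves:
            if src < dst:
                # A path with e edges passes through e - 1 internal nodes.
                dist[(src, dst)] = seen[dst] - 1
    return dist


def average_internode_distances(gene_trees):
    """Average the per-tree distances over all gene trees containing a pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for edges, leaves in gene_trees:
        for pair, d in internode_distances(edges, leaves).items():
            totals[pair] += d
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Two toy quartet gene trees: AB|CD and AC|BD.
tree_ab_cd = ([("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")],
              ["A", "B", "C", "D"])
tree_ac_bd = ([("A", "x"), ("C", "x"), ("x", "y"), ("B", "y"), ("D", "y")],
              ["A", "B", "C", "D"])
averages = average_internode_distances([tree_ab_cd, tree_ac_bd])
```

The resulting matrix would then be passed to a distance-based tree builder such as neighbor joining, which this sketch leaves out.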


Author(s):  
Philipp Hühn ◽  
Markus S. Dillenberger ◽  
Michael Gerschwitz-Eidt ◽  
Elvira Hörandl ◽  
Jessica A. Los ◽  
...  

2021 ◽  
Author(s):  
Megan L Smith ◽  
Dan Vanderpool ◽  
Matthew W. Hahn

Traditionally, single-copy orthologs have been the gold standard in phylogenomics. Most phylogenomic studies identify putative single-copy orthologs by using clustering approaches and retaining families with a single sequence from each species. However, this approach can severely limit the amount of data available by excluding larger families. Recent methodological advances have suggested several ways to include data from larger families. For instance, tree-based decomposition methods facilitate the extraction of orthologs from large families. Additionally, several popular methods for species tree inference appear to be robust to the inclusion of paralogs, and hence could use all of the data from larger families. Here, we explore the effects of using all families for phylogenetic inference using genomes from 26 primate species. We compare single-copy families, orthologs extracted using tree-based decomposition approaches, and all families with all data (i.e., including orthologs and paralogs). We explore several species tree inference methods, finding that across all nodes of the tree except one, identical trees are returned across nearly all datasets and methods. As in previous studies, the relationships among Platyrrhini remain contentious; however, the tree inference methods matter more than the dataset used. We also assess the effects of each dataset on branch length estimates, measures of phylogenetic uncertainty and concordance, and in detecting introgression. Our results demonstrate that using data from larger gene families drastically increases the number of genes available for phylogenetic inference and leads to consistent estimates of branch lengths, nodal certainty and concordance, and inferences of introgression.


Author(s):  
James Willson ◽  
Mrinmoy Saha Roddur ◽  
Baqiao Liu ◽  
Paul Zaharias ◽  
Tandy Warnow

Species tree inference from gene family trees is a significant problem in computational biology. However, gene tree heterogeneity, which can be caused by several factors including gene duplication and loss, makes the estimation of species trees very challenging. While there have been several species tree estimation methods introduced in recent years to specifically address gene tree heterogeneity due to gene duplication and loss (such as DupTree, FastMulRFS, ASTRAL-Pro, and SpeciesRax), many incur high cost in terms of both running time and memory. We introduce a new approach, DISCO, that decomposes the multi-copy gene family trees into many single copy trees, which allows for methods previously designed for species tree inference in a single copy gene tree context to be used. We prove that using DISCO with ASTRAL (i.e., ASTRAL-DISCO) is statistically consistent under the GDL model, provided that ASTRAL-Pro correctly roots and tags each gene family tree. We evaluate DISCO paired with different methods for estimating species trees from single copy genes (e.g., ASTRAL, ASTRID, and IQ-TREE) under a wide range of model conditions, and establish that high accuracy can be obtained even when ASTRAL-Pro is not able to correctly root and tag the gene family trees. We also compare results using MI, an alternative decomposition strategy from Yang Y. and Smith S.A. (2014), and find that DISCO provides better accuracy, most likely as a result of covering more of the gene family tree leafset in the output decomposition. [Concatenation analysis; gene duplication and loss; species tree inference; summary method.]


Author(s):  
Chao Zhang ◽  
Celine Scornavacca ◽  
Erin K Molloy ◽  
Siavash Mirarab

2021 ◽  
Author(s):  
Benoit Morel ◽  
Paul Schade ◽  
Sarah Lutteropp ◽  
Tom A. Williams ◽  
Gergely J. Szöllösi ◽  
...  

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modelling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated datasets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large datasets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31612 gene families in one hour using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.


2020 ◽  
Author(s):  
Matthew Wascher ◽  
Laura S. Kubatko

A common question that arises when inferring species-level phylogenies from genome-scale data is whether selection acting on certain parts of the genome could create a bias in the inferred phylogeny. While most methods for species tree inference currently assume the multispecies coalescent (MSC), all methods that we are aware of utilize only the neutral coalescent process. If selection is in fact present, failure to adequately model it could introduce substantial bias. We work toward rigorously addressing this question using mathematical theory by deriving a version of the coalescent including selection and mutation as a limiting approximation of the Wright-Fisher model with selection and mutation, and showing that it can be used to closely approximate the distribution of coalescent times in the presence of selection and mutation. We confirm the adequacy of the approximation with a simulation study, and discuss its implications for species tree inference. Our results show that in a general class containing many cases of interest, selection has only a small impact on the coalescent process, and ignoring selection when it is present does not have a substantial negative impact on inference of the species tree topology.
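
As background for the "neutral coalescent process" mentioned in this abstract, the standard neutral (Kingman) coalescent gives the following distribution of coalescent times. This is the textbook baseline only, stated in coalescent time units for a sample of n lineages; it is not the selection-and-mutation version derived in the paper, whose formulas are not reproduced here.

```latex
% Standard neutral (Kingman) coalescent, time measured in coalescent units.
% T_k: waiting time until the first coalescence among k lineages.
T_k \sim \mathrm{Exp}\!\left(\binom{k}{2}\right), \qquad
\mathbb{E}[T_k] = \frac{2}{k(k-1)}, \qquad
\mathbb{E}[T_{\mathrm{MRCA}}] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2\left(1 - \frac{1}{n}\right).
```

For example, with k = 2 lineages the expected waiting time is one coalescent unit, and the expected time to the most recent common ancestor of any sample stays below two units.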


2020 ◽  
Author(s):  
Jeremy M. Brown ◽  
Genevieve G. Mount ◽  
Kyle A. Gallivan ◽  
James Wilgenbusch

All phylogenetic studies are built around sets of trees. Tree sets carry different kinds of information depending on the data and approaches used to generate them, but ultimately the variation they contain and their structure are what drive new phylogenetic insights. In order to better understand the variation in and structure of phylogenetic tree sets, we need tools that are generic, flexible, and exploratory. These tools can serve as natural complements to more formal, statistical investigations and allow us to flag surprising or unexpected observations, better understand the results of model-based studies, as well as build intuition. Here, we describe such a set of tools and provide examples of how they can be applied to relevant questions in phylogenetics, phylogenomics, and species-tree inference. These tools include both visualization techniques and quantitative summaries and are currently implemented in the TreeScaper software package (Huang et al. 2016).
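
One common quantitative summary of variation within a tree set is the matrix of pairwise Robinson-Foulds (RF) distances, which counts the bipartitions present in one tree but not the other. The sketch below shows that calculation on trees written as nested Python tuples. It is an illustration only and is not TreeScaper's interface; the nested-tuple encoding and the function names are assumptions, and the abstract does not say which distances the package uses.

```python
def leaf_set(tree):
    """Collect the leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_set(child)
    return labels


def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree, one canonical side each."""
    all_leaves = frozenset(leaf_set(tree))
    reference = min(all_leaves)  # the side containing this leaf represents the split
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset()
        for child in node:
            below |= visit(child)
        other = all_leaves - below
        # Skip trivial splits (a single leaf, or everything on one side).
        if len(below) > 1 and len(other) > 1:
            splits.add(below if reference in below else other)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Unnormalized RF distance: size of the symmetric difference of splits."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def rf_matrix(trees):
    """Pairwise RF distances for a list of trees on the same leaf set."""
    return [[rf_distance(a, b) for b in trees] for a in trees]


# Toy tree set on five taxa.
trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
    (("A", "B"), ("C", ("D", "E"))),
]
matrix = rf_matrix(trees)
```

A matrix like this can be fed to a dimensionality-reduction or clustering routine to explore how the trees in a set group together.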

