DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition

Systematic Biology ◽

10.1093/sysbio/syab070 ◽

2021 ◽

Cited By ~ 1

Author(s):

James Willson ◽

Mrinmoy Saha Roddur ◽

Baqiao Liu ◽

Paul Zaharias ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Gene Family ◽

Gene Tree ◽

Single Copy ◽

Species Tree ◽

Family Tree ◽

Tree Inference ◽

Gene Duplication And Loss ◽

Species Tree Inference ◽

Family Trees

Abstract Species tree inference from gene family trees is a significant problem in computational biology. However, gene tree heterogeneity, which can be caused by several factors including gene duplication and loss, makes the estimation of species trees very challenging. While there have been several species tree estimation methods introduced in recent years to specifically address gene tree heterogeneity due to gene duplication and loss (such as DupTree, FastMulRFS, ASTRAL-Pro, and SpeciesRax), many incur high cost in terms of both running time and memory. We introduce a new approach, DISCO, that decomposes the multi-copy gene family trees into many single copy trees, which allows for methods previously designed for species tree inference in a single copy gene tree context to be used. We prove that using DISCO with ASTRAL (i.e., ASTRAL-DISCO) is statistically consistent under the GDL model, provided that ASTRAL-Pro correctly roots and tags each gene family tree. We evaluate DISCO paired with different methods for estimating species trees from single copy genes (e.g., ASTRAL, ASTRID, and IQ-TREE) under a wide range of model conditions, and establish that high accuracy can be obtained even when ASTRAL-Pro is not able to correctly root and tag the gene family trees. We also compare results using MI, an alternative decomposition strategy from Yang Y. and Smith S.A. (2014), and find that DISCO provides better accuracy, most likely as a result of covering more of the gene family tree leafset in the output decomposition. [Concatenation analysis; gene duplication and loss; species tree inference; summary method.]
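
The decomposition idea in the abstract above can be illustrated with a minimal Python sketch. It assumes the gene family tree is already rooted and every internal node is tagged as a speciation ("S") or a duplication ("D"). The nested-tuple encoding, the function names, and the rule of keeping the larger child at a duplication node are all assumptions made for illustration; the abstract does not state the splitting rule, and this is not the authors' implementation.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: a leaf is a species label (str); an internal node is
# (tag, left, right) with tag "S" (speciation) or "D" (duplication).
Tree = Union[str, Tuple[str, "Tree", "Tree"]]


def leaf_count(tree: Tree) -> int:
    """Number of leaves below (and including) this node."""
    if isinstance(tree, str):
        return 1
    _, left, right = tree
    return leaf_count(left) + leaf_count(right)


def leaves(tree: Tree) -> List[str]:
    """Species labels at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)


def decompose(tree: Tree) -> List[Tree]:
    """Split a tagged tree into subtrees that contain no duplication nodes.

    Assumption of this sketch: at each duplication node the larger child
    stays attached and the smaller child is detached as its own tree
    (ties keep the left child).
    """
    detached: List[Tree] = []

    def prune(node: Tree) -> Tree:
        if isinstance(node, str):
            return node
        tag, left, right = node
        left, right = prune(left), prune(right)  # handle deeper nodes first
        if tag == "S":
            return ("S", left, right)
        if leaf_count(left) >= leaf_count(right):
            keep, drop = left, right
        else:
            keep, drop = right, left
        detached.append(drop)
        return keep

    main = prune(tree)
    return [main] + detached


# Species A has two copies because of one duplication event.
example: Tree = ("S", ("D", ("S", "A", "B"), "A"), "C")
pieces = decompose(example)
```

With correct tags, each returned piece carries at most one copy per species, so it can be passed to a method that expects single copy gene trees.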

SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss.

10.1101/2021.03.29.437460 ◽

2021 ◽

Author(s):

Benoit Morel ◽

Paul Schade ◽

Sarah Lutteropp ◽

Tom A. Williams ◽

Gergely J. Szöllösi ◽

...

Keyword(s):

Maximum Likelihood ◽

Gene Family ◽

Gene Families ◽

Species Tree ◽

Likelihood Method ◽

Gene Trees ◽

Informative Signal ◽

Tree Inference ◽

Species Tree Inference ◽

Family Trees

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modelling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated datasets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large datasets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31612 gene families in one hour using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.

Impact of gene family evolutionary histories on phylogenetic species tree inference by gene tree parsimony

Molecular Phylogenetics and Evolution ◽

10.1016/j.ympev.2015.12.002 ◽

2016 ◽

Vol 96 ◽

pp. 9-16 ◽

Cited By ~ 1

Author(s):

Tao Shi

Keyword(s):

Gene Family ◽

Gene Tree ◽

Species Tree ◽

Phylogenetic Species ◽

Tree Inference ◽

Species Tree Inference

ASTRAL-Pro: quartet-based species tree inference despite paralogy

10.1101/2019.12.12.874727 ◽

2019 ◽

Cited By ~ 4

Author(s):

Chao Zhang ◽

Celine Scornavacca ◽

Erin K. Molloy ◽

Siavash Mirarab

Keyword(s):

Gene Duplication ◽

Gene Loss ◽

Incomplete Lineage Sorting ◽

Single Copy ◽

Species Tree ◽

Alternative Methods ◽

Gene Trees ◽

Species Trees ◽

Tree Inference ◽

Species Tree Inference

Abstract Species tree inference via summary methods that combine gene trees has become an increasingly common analysis in recent phylogenomic studies. This broad adoption has been partly due to the greater availability of genome-wide data and ample recognition that gene trees and species trees can differ due to biological processes such as gene duplication and gene loss. This increase has also been encouraged by the recent development of accurate and scalable summary methods, such as ASTRAL. However, most of these methods, including ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. In this paper, we introduce a measure of quartet similarity between single-copy and multi-copy trees (accounting for orthology and paralogy relationships) that can be optimized via a scalable dynamic programming similar to the one used by ASTRAL. We then present a new quartet-based species tree inference method: ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs). By studying its performance on an extensive collection of simulated datasets and on a real plant dataset, we show that ASTRAL-Pro is more accurate than alternative methods when gene trees differ from the species tree due to the simultaneous presence of gene duplication, gene loss, incomplete lineage sorting, and estimation errors.

Inference of Ancient Whole-Genome Duplications and the Evolution of Gene Duplication and Loss Rates

Molecular Biology and Evolution ◽

10.1093/molbev/msz088 ◽

2019 ◽

Vol 36 (7) ◽

pp. 1384-1404 ◽

Cited By ~ 16

Author(s):

Arthur Zwaenepoel ◽

Yves Van de Peer

Keyword(s):

Maximum Likelihood ◽

Gene Duplication ◽

Gene Tree ◽

Probabilistic Approach ◽

Species Tree ◽

Rate Variation ◽

Whole Genome ◽

Tree Reconciliation ◽

Gene Duplication And Loss ◽

Loss Rates

Abstract Gene tree–species tree reconciliation methods have been employed for studying ancient whole-genome duplication (WGD) events across the eukaryotic tree of life. Most approaches have relied on using maximum likelihood trees and the maximum parsimony reconciliation thereof to count duplication events on specific branches of interest in a reference species tree. Such approaches do not account for uncertainty in the gene tree and reconciliation, or do so only heuristically. The effects of these simplifications on the inference of ancient WGDs are unclear. In particular, the effects of variation in gene duplication and loss rates across the species tree have not been considered. Here, we developed a full probabilistic approach for phylogenomic reconciliation-based WGD inference, accounting for both gene tree and reconciliation uncertainty using a method based on the principle of amalgamated likelihood estimation. The model and methods are implemented in a maximum likelihood and Bayesian setting and account for variation of duplication and loss rates across the species tree, using methods inspired by phylogenetic divergence time estimation. We applied our newly developed framework to ancient WGDs in land plants and investigated the effects of duplication and loss rate variation on reconciliation and gene count based assessment of these earlier proposed WGDs.

Gene tree correction for reconciliation and species tree inference: Complexity and algorithms

Journal of Discrete Algorithms ◽

10.1016/j.jda.2013.06.001 ◽

2014 ◽

Vol 25 ◽

pp. 51-65 ◽

Cited By ~ 15

Author(s):

Riccardo Dondi ◽

Nadia El-Mabrouk ◽

Krister M. Swenson

Keyword(s):

Gene Tree ◽

Species Tree ◽

Tree Inference ◽

Species Tree Inference

Phylogeny of the cycads based on multiple single-copy nuclear genes: congruence of concatenated parsimony, likelihood and species tree inference methods

Annals of Botany ◽

10.1093/aob/mct192 ◽

2013 ◽

Vol 112 (7) ◽

pp. 1263-1278 ◽

Cited By ~ 60

Author(s):

Dayana E. Salas-Leiva ◽

Alan W. Meerow ◽

Michael Calonje ◽

M. Patrick Griffith ◽

Javier Francisco-Ortega ◽

...

Keyword(s):

Single Copy ◽

Species Tree ◽

Nuclear Genes ◽

Tree Inference ◽

Inference Methods ◽

Species Tree Inference

Phylogenetic conflicts, combinability, and deep phylogenomics in plants

10.1101/371930 ◽

2018 ◽

Cited By ~ 1

Author(s):

Stephen A. Smith ◽

Nathanael Walker-Hale ◽

Joseph F. Walker ◽

Joseph W. Brown

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Signal ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Data Filtering ◽

Tree Inference ◽

Tree Methods ◽

Inference Methods ◽

Species Tree Inference

Abstract Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a dataset in order to resolve recalcitrant relationships and, importantly, identify what the dataset is unable to resolve. These procedures extend upon methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant dataset. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific dataset to address deep phylogenetic relationships while also identifying the inferential boundaries of the dataset.

Species Tree Inference under the Multispecies Coalescent on Data with Paralogs is Accurate

10.1101/498378 ◽

2018 ◽

Cited By ~ 10

Author(s):

Zhi Yan ◽

Peng Du ◽

Matthew W. Hahn ◽

Luay Nakhleh

Keyword(s):

Incomplete Lineage Sorting ◽

Single Copy ◽

Species Tree ◽

Biological Data ◽

Lineage Sorting ◽

Multispecies Coalescent ◽

Gene Copies ◽

Tree Inference ◽

Inference Methods ◽

Species Tree Inference

Abstract The multispecies coalescent (MSC) has emerged as a powerful and desirable framework for species tree inference in phylogenomic studies. Under this framework, the data for each locus is assumed to consist of orthologous, single-copy genes, and heterogeneity across loci is assumed to be due to incomplete lineage sorting (ILS). These assumptions have led biologists who use ILS-aware inference methods, whether based directly on the MSC or proven to be statistically consistent under it (collectively referred to here as MSC-based methods), to exclude all loci that are present in more than a single copy in any of the studied genomes. Furthermore, such analyses entail orthology assignment to avoid the potential of hidden paralogy in the data. The question we seek to answer in this study is: What happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two biological data sets, we show that running such methods on data with paralogs provides very accurate results, either by treating all gene copies within a family as alleles from multiple individuals or by randomly selecting one copy per species. Our results have significant implications for the use of MSC-based phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci, thus greatly increasing the amount of data that can be used. [Multispecies coalescent; incomplete lineage sorting; gene duplication and loss; orthology; paralogy.]
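
One of the two strategies named in the abstract above, randomly selecting one copy per species, can be sketched in a few lines of Python. The dictionary layout (species name mapped to a list of gene copy identifiers), the function name, and the identifiers in the example are hypothetical and chosen only for illustration; the authors' actual pipeline may differ.

```python
import random
from typing import Dict, List, Optional


def sample_one_copy_per_species(
    family: Dict[str, List[str]], seed: Optional[int] = None
) -> Dict[str, str]:
    """Pick one gene copy uniformly at random for each species.

    Species with no copies in this family are skipped (an assumption of
    this sketch), so the result may cover fewer species than the input.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    return {
        species: rng.choice(copies)
        for species, copies in family.items()
        if copies
    }


# Hypothetical gene family: two human paralogs, one chimp copy, none in gorilla.
family = {"human": ["h1", "h2"], "chimp": ["c1"], "gorilla": []}
picked = sample_one_copy_per_species(family, seed=0)
```

The sampled family then looks like a single-copy locus and can be given to an MSC-based method; which human copy is kept depends on the seed.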

Using all gene families vastly expands data available for phylogenomic inference in primates

10.1101/2021.09.22.461252 ◽

2021 ◽

Author(s):

Megan L Smith ◽

Dan Vanderpool ◽

Matthew W. Hahn

Keyword(s):

Branch Length ◽

Gene Families ◽

Phylogenetic Inference ◽

Single Copy ◽

Decomposition Methods ◽

Species Tree ◽

Primate Species ◽

Tree Inference ◽

Inference Methods ◽

Species Tree Inference

Traditionally, single-copy orthologs have been the gold standard in phylogenomics. Most phylogenomic studies identify putative single-copy orthologs by using clustering approaches and retaining families with a single sequence from each species. However, this approach can severely limit the amount of data available by excluding larger families. Recent methodological advances have suggested several ways to include data from larger families. For instance, tree-based decomposition methods facilitate the extraction of orthologs from large families. Additionally, several popular methods for species tree inference appear to be robust to the inclusion of paralogs, and hence could use all of the data from larger families. Here, we explore the effects of using all families for phylogenetic inference using genomes from 26 primate species. We compare single-copy families, orthologs extracted using tree-based decomposition approaches, and all families with all data (i.e., including orthologs and paralogs). We explore several species tree inference methods, finding that across all nodes of the tree except one, identical trees are returned across nearly all datasets and methods. As in previous studies, the relationships among Platyrrhini remain contentious; however, the tree inference methods matter more than the dataset used. We also assess the effects of each dataset on branch length estimates, measures of phylogenetic uncertainty and concordance, and in detecting introgression. Our results demonstrate that using data from larger gene families drastically increases the number of genes available for phylogenetic inference and leads to consistent estimates of branch lengths, nodal certainty and concordance, and inferences of introgression.
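
The conventional filter described at the start of the abstract above, retaining only families with a single sequence from each species, might look like the following Python sketch. The input layout (family identifier mapped to one species label per sequence), the function name, and the requirement that every species in the panel be present exactly once are assumptions made for illustration; real pipelines often tolerate some missing species.

```python
from collections import Counter
from typing import Dict, List


def single_copy_families(
    families: Dict[str, List[str]], species: List[str]
) -> List[str]:
    """Return ids of families with exactly one sequence per listed species."""
    wanted = Counter(species)  # each species expected exactly once
    return [
        family_id
        for family_id, members in families.items()
        if Counter(members) == wanted
    ]


# Hypothetical clustering output: one species label per sequence in a family.
families = {
    "fam1": ["human", "chimp", "gorilla"],           # single copy, kept
    "fam2": ["human", "human", "chimp", "gorilla"],  # human paralog, dropped
    "fam3": ["human", "chimp"],                      # gorilla missing, dropped
}
kept = single_copy_families(families, ["human", "chimp", "gorilla"])
```

In this toy input two of the three families are discarded, which is the data loss the abstract attributes to single-copy filtering.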

Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood

Evolution ◽

10.1111/j.1558-5646.2011.01476.x ◽

2011 ◽

Vol 66 (3) ◽

pp. 763-775 ◽

Cited By ~ 104

Author(s):

Yufeng Wu

Keyword(s):

Maximum Likelihood ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Lineage Sorting ◽

Tree Inference ◽

Tree Topologies ◽

Species Tree Inference
