FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models

Erin K Molloy; Tandy Warnow

doi:10.1093/bioinformatics/btaa444

Bioinformatics ◽

10.1093/bioinformatics/btaa444 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i57-i65 ◽

Cited By ~ 3

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Supplementary Information ◽

Biological Research ◽

Generic Model ◽

Species Trees ◽

Basic Part ◽

Gene Duplication And Loss ◽

Tree Estimation ◽

Scalable Methods

Abstract. Motivation: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. Results: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods. Availability and implementation: FastMulRFS is available on GitHub (https://github.com/ekmolloy/fastmulrfs). Supplementary information: Supplementary data are available at Bioinformatics online.
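The abstract's point that GDL makes a gene "appear more than once in a given genome" can be made concrete with a small sketch. This is an illustration only, not part of FastMulRFS: the leaf-label convention `species_copyid` and the helper names `species_of`, `copy_counts`, and `is_multi_copy` are assumptions introduced here.

```python
from collections import Counter

def species_of(leaf_label):
    # Assumed (hypothetical) label convention: "<species>_<copy id>".
    return leaf_label.rsplit("_", 1)[0]

def copy_counts(leaf_labels):
    # Number of gene copies observed per species in one gene family.
    return Counter(species_of(label) for label in leaf_labels)

def is_multi_copy(leaf_labels):
    # A gene family is multi-copy if any species contributes more than one leaf.
    return any(n > 1 for n in copy_counts(leaf_labels).values())

# Toy gene family: "human" has two copies, so the gene tree is multi-labeled.
family = ["human_1", "human_2", "mouse_1", "chicken_1"]
print(copy_counts(family))
print(is_multi_copy(family))
```

A gene tree built on such a family has two leaves that map to the same species, which is why methods that assume one leaf per species cannot be applied to it directly.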

FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models

10.1101/835553 ◽

2019 ◽

Cited By ~ 2

Author(s):

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Biological Research ◽

Generic Model ◽

Species Trees ◽

Basic Part ◽

Heterogeneous Datasets ◽

Gene Duplication And Loss ◽

Tree Estimation ◽

Scalable Methods

Abstract. Motivation: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. Results: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods. Availability: FastMulRFS is available on GitHub (https://github.com/ekmolloy/fastmulrfs).

Comparing Methods for Species Tree Estimation With Gene Duplication and Loss

10.1101/2021.02.05.429947 ◽

2021 ◽

Author(s):

James Willson ◽

Mrinmoy Saha Roddur ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Data Bank ◽

Species Tree ◽

Biological Research ◽

Gene Trees ◽

Species Trees ◽

Tree Inference ◽

Multiple Copies ◽

Gene Duplication And Loss ◽

Tree Estimation

Abstract. Species tree inference from gene trees is an important part of biological research. One confounding factor in estimating species trees is gene duplication and loss, which can lead to gene trees with multiple copies of the same gene. In recent years, several new methods have been developed to address this problem and have substantially improved on earlier methods; however, the best-performing methods (ASTRAL-Pro, ASTRID-multi, and FastMulRFS) have not yet been directly compared. In this study, we compare ASTRAL-Pro, ASTRID-multi, and FastMulRFS under a wide variety of conditions. Our study shows that while all three have very good accuracy, nearly the same under many conditions, ASTRAL-Pro and ASTRID-multi are more reliably accurate than FastMulRFS, and that ASTRID-multi is often faster than ASTRAL-Pro. The datasets generated for this study are freely available in the Illinois Data Bank at https://databank.illinois.edu/datasets/IDB-2418574

Multispecies Coalescent: Theory and Applications in Phylogenetics

Annual Review of Ecology Evolution and Systematics ◽

10.1146/annurev-ecolsys-012121-095340 ◽

2021 ◽

Vol 52 (1) ◽

Author(s):

Siavash Mirarab ◽

Luay Nakhleh ◽

Tandy Warnow

Keyword(s):

Incomplete Lineage Sorting ◽

Species Tree ◽

Phylogenetic Networks ◽

Biological Research ◽

Annual Review ◽

Publication Date ◽

Gene Trees ◽

Species Trees ◽

Basic Part ◽

Tree Estimation

Species tree estimation is a basic part of many biological research projects, ranging from answering basic evolutionary questions (e.g., how did a group of species adapt to their environments?) to addressing questions in functional biology. Yet, species tree estimation is very challenging, due to processes such as incomplete lineage sorting, gene duplication and loss, horizontal gene transfer, and hybridization, which can make gene trees differ from each other and from the overall evolutionary history of the species. Over the last 10–20 years, there has been tremendous growth in methods and mathematical theory for estimating species trees and phylogenetic networks, and some of these methods are now in wide use. In this survey, we provide an overview of the current state of the art, identify the limitations of existing methods and theory, and propose additional research problems and directions. Expected final online publication date for the Annual Review of Ecology, Evolution, and Systematics, Volume 52 is November 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation

Algorithms ◽

10.3390/a14050148 ◽

2021 ◽

Vol 14 (5) ◽

pp. 148

Author(s):

Minhyuk Park ◽

Paul Zaharias ◽

Tandy Warnow

Keyword(s):

Maximum Likelihood ◽

Gene Tree ◽

Input Sequence ◽

Species Tree ◽

Estimation Methods ◽

Biological Research ◽

Species Trees ◽

Basic Part ◽

Large Trees ◽

Tree Estimation

The estimation of phylogenetic trees for individual genes or multi-locus datasets is a basic part of considerable biological research. In order to enable large trees to be computed, Disjoint Tree Mergers (DTMs) have been developed; these methods operate by dividing the input sequence dataset into disjoint sets, constructing trees on each subset, and then combining the subset trees (using auxiliary information) into a tree on the full dataset. DTMs have been used to advantage for multi-locus species tree estimation, enabling highly accurate species trees at reduced computational effort, compared to leading species tree estimation methods. Here, we evaluate the feasibility of using DTMs to improve the scalability of maximum likelihood (ML) gene tree estimation to large numbers of input sequences. Our study shows distinct differences between the three selected ML codes—RAxML-NG, IQ-TREE 2, and FastTree 2—and shows that good DTM pipeline design can provide advantages over these ML codes on large datasets.

Comparing Methods for Species Tree Estimation with Gene Duplication and Loss

Algorithms for Computational Biology - Lecture Notes in Computer Science ◽

10.1007/978-3-030-74432-8_8 ◽

2021 ◽

pp. 106-117

Author(s):

James Willson ◽

Mrinmoy Saha Roddur ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Gene Duplication And Loss ◽

Tree Estimation

Polynomial-Time Statistical Estimation of Species Trees under Gene Duplication and Loss

10.1101/821439 ◽

2019 ◽

Cited By ~ 3

Author(s):

Brandon Legried ◽

Erin K. Molloy ◽

Tandy Warnow ◽

Sébastien Roch

Keyword(s):

Gene Duplication ◽

Polynomial Time ◽

Incomplete Lineage Sorting ◽

Data Bank ◽

Polynomial Time Algorithm ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Biological Studies ◽

Gene Duplication And Loss

Abstract. Phylogenomics, the estimation of species trees from multilocus datasets, is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.

TreeMerge: a new method for improving the scalability of species tree estimation methods

Bioinformatics ◽

10.1093/bioinformatics/btz344 ◽

2019 ◽

Vol 35 (14) ◽

pp. i417-i426 ◽

Cited By ~ 7

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Large Scale ◽

Species Tree ◽

New Method ◽

Divide And Conquer ◽

Supplementary Information ◽

Estimation Methods ◽

Running Time ◽

Tree Estimation ◽

Computationally Intensive ◽

A Minor

Abstract. Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species. Results: Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework, only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation: TreeMerge is publicly available on GitHub (http://github.com/ekmolloy/treemerge). Supplementary information: Supplementary data are available at Bioinformatics online.
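The gap between a quintic and a quadratic running-time bound can be illustrated with simple arithmetic. The sketch below only compares how the two bounds scale when the number of species grows; it ignores constant factors and lower-order terms, so it says nothing about actual wall-clock times of NJMerge or TreeMerge. The helper name `growth_ratio` is introduced here for illustration.

```python
def growth_ratio(n_small, n_large, exponent):
    # Factor by which an n**exponent bound grows when n goes from n_small to n_large.
    return (n_large / n_small) ** exponent

# Going from 100 to 1000 species (a 10x increase in n):
quintic = growth_ratio(100, 1000, 5)    # growth of an n**5 bound
quadratic = growth_ratio(100, 1000, 2)  # growth of an n**2 bound
print(f"n^5 bound grows by a factor of {quintic:,.0f}")
print(f"n^2 bound grows by a factor of {quadratic:,.0f}")
```

Under these bounds, a tenfold increase in the number of species multiplies the quintic bound by 100,000 but the quadratic bound by only 100.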

Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss

Journal of Computational Biology ◽

10.1089/cmb.2020.0424 ◽

2020 ◽

Author(s):

Brandon Legried ◽

Erin K. Molloy ◽

Tandy Warnow ◽

Sébastien Roch

Keyword(s):

Gene Duplication ◽

Polynomial Time ◽

Statistical Estimation ◽

Species Trees ◽

Gene Duplication And Loss

MSCquartets 1.0: Quartet methods for species trees and networks under the multispecies coalescent model in R

Bioinformatics ◽

10.1093/bioinformatics/btaa868 ◽

2020 ◽

Author(s):

John A Rhodes ◽

Hector Baños ◽

Jonathan D Mitchell ◽

Elizabeth S Allman

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

R Package ◽

Species Tree ◽

Supplementary Information ◽

Species Trees ◽

Lineage Sorting ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Tree Inference

Abstract. Summary: MSCquartets is an R package for species tree hypothesis testing, inference of species trees, and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting and its network analog. Inputs for these analyses are collections of metric or topological locus trees, which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package implements the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model. Availability: MSCquartets is available through the Comprehensive R Archive Network: https://CRAN.R-project.org/package=MSCquartets. Supplementary information: Supplementary materials, including example data and analyses, are incorporated into the package.
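The idea of summarizing a collection of locus trees by the quartets displayed on them can be sketched as follows. This is a minimal Python illustration, not the MSCquartets R API: it assumes single-copy, topological trees written as nested tuples of taxon names, and the function names `clusters`, `quartet_topology`, and `quartet_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def clusters(tree):
    # Return (leaf set of the tree, list of leaf sets below every node).
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)
    return leaves, found

def quartet_topology(tree, quartet):
    # The tree displays ab|cd if some edge separates {a, b} from {c, d}.
    q = frozenset(quartet)
    leaves, found = clusters(tree)
    if not q <= leaves:
        return None  # at least one taxon is missing from this tree
    for c in found:
        inside = c & q
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None  # unresolved on this tree

def quartet_counts(trees, taxa):
    # Tally, over all trees, the resolved topology for every 4-taxon subset.
    counts = Counter()
    for quartet in combinations(sorted(taxa), 4):
        for tree in trees:
            topology = quartet_topology(tree, quartet)
            if topology is not None:
                counts[topology] += 1
    return counts

trees = [
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
counts = quartet_counts(trees, {"A", "B", "C", "D"})
for topology, n in counts.items():
    print(sorted(sorted(side) for side in topology), n)
```

In this toy input, two of the three trees display AB|CD and one displays AC|BD. Counts of this kind, taken per four-taxon set, are the sort of summary that quartet-based methods work from.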

Inference of Ancient Whole-Genome Duplications and the Evolution of Gene Duplication and Loss Rates

Molecular Biology and Evolution ◽

10.1093/molbev/msz088 ◽

2019 ◽

Vol 36 (7) ◽

pp. 1384-1404 ◽

Cited By ~ 16

Author(s):

Arthur Zwaenepoel ◽

Yves Van de Peer

Keyword(s):

Maximum Likelihood ◽

Gene Duplication ◽

Gene Tree ◽

Probabilistic Approach ◽

Species Tree ◽

Rate Variation ◽

Whole Genome ◽

Tree Reconciliation ◽

Gene Duplication And Loss ◽

Loss Rates

Abstract. Gene tree–species tree reconciliation methods have been employed for studying ancient whole-genome duplication (WGD) events across the eukaryotic tree of life. Most approaches have relied on using maximum likelihood trees and the maximum parsimony reconciliation thereof to count duplication events on specific branches of interest in a reference species tree. Such approaches do not account for uncertainty in the gene tree and reconciliation, or do so only heuristically. The effects of these simplifications on the inference of ancient WGDs are unclear. In particular, the effects of variation in gene duplication and loss rates across the species tree have not been considered. Here, we developed a full probabilistic approach for phylogenomic reconciliation-based WGD inference, accounting for both gene tree and reconciliation uncertainty using a method based on the principle of amalgamated likelihood estimation. The model and methods are implemented in a maximum likelihood and Bayesian setting and account for variation of duplication and loss rates across the species tree, using methods inspired by phylogenetic divergence time estimation. We applied our newly developed framework to ancient WGDs in land plants and investigated the effects of duplication and loss rate variation on reconciliation and gene count based assessment of these earlier proposed WGDs.
