Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge

Abstract. Background: Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches. Results: In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and “concatenation” using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases. Conclusions: Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge).
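For readers unfamiliar with the method that NJMerge extends, the following is a minimal sketch of plain Neighbor Joining, not of NJMerge itself: it omits the subset-tree constraints that the abstract describes, and it returns only a topology without branch lengths. The function name `neighbor_joining`, the nested-tuple tree representation, and the `frozenset`-keyed distance dictionary are assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(taxa, dist):
    """Plain NJ topology (no constraint trees, no branch lengths).

    dist maps frozenset({a, b}) -> distance between a and b.
    Returns a tuple of the last three nodes; joined pairs are nested tuples.
    """
    d = dict(dist)
    nodes = list(taxa)
    while len(nodes) > 3:
        n = len(nodes)
        # r[a]: total distance from a to every other active node.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) * d(a, b) - r[a] - r[b].
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        new = (i, j)
        # Distance from the new internal node to each remaining node.
        for k in nodes:
            if k != i and k != j:
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))]
                )
        nodes = [k for k in nodes if k != i and k != j] + [new]
    return tuple(nodes)


# Additive distances for the tree ((A,B),(C,D)) with every branch of length 1.
pairs = {("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
         ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3}
example_dist = {frozenset(p): v for p, v in pairs.items()}
print(neighbor_joining(["A", "B", "C", "D"], example_dist))
# → ('C', 'D', ('A', 'B'))
```

In this sketch every pair of active nodes is a candidate for joining; per the abstract, NJMerge additionally takes the subset trees into account when merging, which this example does not attempt to reproduce.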

Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00189-2 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Xilin Yu ◽

Thien Le ◽

Sarah A. Christensen ◽

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Optimization Problems ◽

Polynomial Time Algorithm ◽

Time Algorithm ◽

Tree Of Life ◽

Divide And Conquer ◽

Mcmc Methods ◽

Supertree Method ◽

Phylogeny Estimation ◽

Source Form ◽

Life On Earth

Abstract: One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. Exact-RFS-2 is available in open source form on Github at https://github.com/yuxilin51/GreedyRFS.
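The RFS criterion named above is built on the Robinson-Foulds (RF) distance between trees. As a minimal sketch, assuming two trees on the same leaf set written as nested tuples of leaf names, the RF distance can be computed as the number of non-trivial bipartitions present in exactly one of the two trees. This is not Exact-RFS-2; the function names `splits` and `rf_distance` are illustrative.

```python
def splits(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names.

    Each bipartition is stored as the side that does not contain the
    alphabetically smallest leaf, so the same split always has one form.
    """
    clades = set()

    def clade(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(clade(child) for child in node))
        clades.add(below)
        return below

    leaves = clade(tree)
    ref = min(leaves)
    result = set()
    for below in clades:
        side = leaves - below if ref in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(leaves) - 1:
            result.add(side)
    return result


def rf_distance(t1, t2):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


t1 = (("A", "B"), ("C", "D"), "E")
t2 = (("A", "B"), ("C", "E"), "D")
print(rf_distance(t1, t2))  # → 2
```

A supertree method under this kind of criterion has to handle input trees with different leaf sets, which this sketch deliberately leaves out.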

Advancing Divide-and-Conquer Phylogeny Estimation using Robinson-Foulds Supertrees

10.1101/2020.05.16.099895 ◽

2020 ◽

Cited By ~ 1

Author(s):

Xilin Yu ◽

Thien Le ◽

Sarah A. Christensen ◽

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Optimization Problems ◽

Polynomial Time Algorithm ◽

Time Algorithm ◽

Tree Of Life ◽

Divide And Conquer ◽

Greedy Heuristic ◽

Mcmc Methods ◽

Supertree Method ◽

Phylogeny Estimation ◽

Source Form

Abstract: One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS. Exact-RFS-2 and GreedyRFS are available in open source form on Github at github.com/yuxilin51/GreedyRFS.

TreeMerge: a new method for improving the scalability of species tree estimation methods

Bioinformatics ◽

10.1093/bioinformatics/btz344 ◽

2019 ◽

Vol 35 (14) ◽

pp. i417-i426 ◽

Cited By ~ 7

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Large Scale ◽

Species Tree ◽

New Method ◽

Divide And Conquer ◽

Supplementary Information ◽

Estimation Methods ◽

Running Time ◽

Tree Estimation ◽

Computationally Intensive ◽

A Minor

Abstract. Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species. Results: Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework, only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation: TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge). Supplementary information: Supplementary data are available at Bioinformatics online.

Big Archive-Assisted Ensemble of Many-Objective Evolutionary Algorithms

Complexity ◽

10.1155/2021/6614283 ◽

2021 ◽

Vol 2021 ◽

pp. 1-17

Author(s):

Wen Zhong ◽

Jian Xiong ◽

Anping Lin ◽

Lining Xing ◽

Feilong Chen ◽

...

Keyword(s):

Evolutionary Algorithms ◽

Large Scale ◽

Optimization Problems ◽

Experimental Studies ◽

Evolutionary Process ◽

Superior Performance ◽

Nondominated Solutions ◽

Environmental Selection ◽

Computational Resources ◽

Better Than

Multiobjective evolutionary algorithms (MOEAs) have witnessed prosperity in solving many-objective optimization problems (MaOPs) over the past three decades. Unfortunately, no single MOEA equipped with given parameter settings, mating-variation operator, and environmental selection mechanism is suitable for obtaining a set of solutions with excellent convergence and diversity for various types of MaOPs. In practice, different MOEAs differ greatly in how well they handle particular types of MaOPs. To exploit these differences, this paper proposes a flexible ensemble framework, namely, ASES, which is highly scalable and can embed any number of MOEAs to combine their advantages. To alleviate the undesirable phenomenon that some promising solutions are discarded during the evolution process, a big archive, whose number of contained solutions is far larger than the population size, is integrated into this ensemble framework to record large-scale nondominated solutions, and an efficient maintenance strategy is developed to update the archive. Furthermore, the knowledge gained from updating the archive is exploited to guide the evolutionary process for different MOEAs, allocating the limited computational resources to the more efficient algorithms. A large number of numerical experimental studies demonstrated superior performance of the proposed ASES. Among 52 test instances, ASES performs better than all six baseline algorithms on at least half of the test instances with respect to both the hypervolume and inverted generational distance metrics.

Solving large-scale optimization problems by divide-and-conquer neural networks

10.1109/ijcnn.1989.118626 ◽

1989 ◽

Cited By ~ 10

Author(s):

Foo ◽

Szu

Keyword(s):

Neural Networks ◽

Large Scale ◽

Optimization Problems ◽

Divide And Conquer ◽

Large Scale Optimization ◽

Scale Optimization

Using Robinson-Foulds Supertrees in Divide-and-Conquer Phylogeny Estimation

10.21203/rs.3.rs-174421/v1 ◽

2021 ◽

Author(s):

Xilin Yu ◽

Thien Le ◽

Sarah A. Christensen ◽

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Optimization Problems ◽

Polynomial Time Algorithm ◽

Time Algorithm ◽

Tree Of Life ◽

Divide And Conquer ◽

Greedy Heuristic ◽

Mcmc Methods ◽

Np Hard ◽

Phylogeny Estimation ◽

Source Form

Abstract: One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS. Exact-RFS-2 and GreedyRFS are available in open source form on Github at github.com/yuxilin51/GreedyRFS.

Cooperative Coevolution with Two-Stage Decomposition for Large-Scale Global Optimization Problems

Discrete Dynamics in Nature and Society ◽

10.1155/2021/2653807 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

H. D. Yue ◽

Y. Sun

Keyword(s):

Global Optimization ◽

Decomposition Method ◽

Large Scale ◽

Optimization Problems ◽

Decomposition Methods ◽

Unit Vector ◽

Cooperative Coevolution ◽

Two Stage ◽

Computational Resources ◽

Dynamic Decomposition

Cooperative coevolution (CC) is an effective framework for solving large-scale global optimization (LSGO) problems. However, CC with a static decomposition method is ineffective for fully nonseparable problems, and CC with a dynamic decomposition method is computationally costly. Therefore, a two-stage decomposition (TSD) method is proposed in this paper to decompose LSGO problems using as few computational resources as possible. In the first stage, to decompose problems at low computational cost, a hybrid-pool differential grouping (HPDG) method is proposed, which contains a hybrid-pool-based detection structure (HPDS) and a unit vector-based perturbation (UVP) strategy. In the second stage, to decompose the fully nonseparable problems, a known information-based dynamic decomposition (KIDD) method is proposed. Analytical methods are used to demonstrate that HPDG has lower decomposition complexity than state-of-the-art static decomposition methods. Experiments show that CC with TSD is a competitive algorithm for solving LSGO problems.

Evolutionary Computation for Large-scale Multi-objective Optimization: A Decade of Progresses

International Journal of Automation and Computing ◽

10.1007/s11633-020-1253-0 ◽

2021 ◽

Author(s):

Wen-Jing Hong ◽

Peng Yang ◽

Ke Tang

Keyword(s):

Evolutionary Computation ◽

Real World ◽

Large Scale ◽

Optimization Problems ◽

Divide And Conquer ◽

Research Progress ◽

Small Scale ◽

Multi Objective Optimization ◽

Multi Objective ◽

Decision Variables

Abstract: Large-scale multi-objective optimization problems (MOPs) that involve a large number of decision variables have emerged from many real-world applications. While evolutionary algorithms (EAs) have been widely acknowledged as a mainstream method for MOPs, most research progress and successful applications of EAs have been restricted to MOPs with small-scale decision variables. More recently, it has been reported that traditional multi-objective EAs (MOEAs) suffer severe deterioration as the number of decision variables increases. As a result, and motivated by the emergence of real-world large-scale MOPs, investigation of MOEAs in this aspect has attracted much more attention in the past decade. This paper reviews the progress of evolutionary computation for large-scale multi-objective optimization from two angles. From the key difficulties of the large-scale MOPs, the scalability analysis is discussed by focusing on the performance of existing MOEAs and the challenges induced by the increase of the number of decision variables. From the perspective of methodology, the large-scale MOEAs are categorized into three classes and introduced respectively: divide-and-conquer-based, dimensionality-reduction-based, and enhanced-search-based approaches. Several future research directions are also discussed.

TreeCluster: clustering biological sequences using phylogenetic trees

10.1101/591388 ◽

2019 ◽

Cited By ~ 3

Author(s):

Metin Balaban ◽

Niema Moshiri ◽

Uyen Mai ◽

Siavash Mirarab

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Hiv Transmission ◽

Optimization Problems ◽

Divide And Conquer ◽

Multiple Sequence ◽

Branch Lengths ◽

Computer Scientists ◽

Minimum Number ◽

Microbiome Data

Abstract: Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given a (not necessarily ultrametric) tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints that limit the diameter of each cluster, the sum of its branch lengths, or chains of pairwise distances. These three versions of the problem can be solved in time that increases linearly with the size of the tree, a fact that has been known by computer scientists for two of these three criteria for decades. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU picking for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
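To make the diameter constraint concrete, here is a simplified greedy sketch for rooted binary trees, assuming a hypothetical representation in which a leaf is a string and an internal node is a pair of `(child, branch_length)` entries. It is one plausible linear-time bottom-up strategy and is not taken from the TreeCluster code; the name `greedy_cluster_count` is illustrative, and optimality is not argued here.

```python
def greedy_cluster_count(tree, max_diameter):
    """Count clusters so that no two leaves in a cluster are farther apart
    than max_diameter, cutting edges greedily in one bottom-up pass.

    tree: a leaf name (str) or ((left, left_len), (right, right_len)).
    """
    cuts = 0

    def height(node):
        # Returns the largest distance from node down to a leaf that is
        # still attached to it after any cuts made below.
        nonlocal cuts
        if isinstance(node, str):
            return 0.0
        (left, left_len), (right, right_len) = node
        hl = height(left) + left_len
        hr = height(right) + right_len
        if hl + hr > max_diameter:
            # The longest leaf-to-leaf path through this node is too long:
            # cut off the deeper side and keep the shallower one.
            cuts += 1
            return min(hl, hr)
        return max(hl, hr)

    height(tree)
    return cuts + 1


cherry_ab = (("A", 1.0), ("B", 1.0))
cherry_cd = (("C", 1.0), ("D", 1.0))
example_tree = ((cherry_ab, 1.0), (cherry_cd, 1.0))
print(greedy_cluster_count(example_tree, 2.0))  # → 2
```

Each node is visited once, so the running time grows linearly with the number of nodes, in line with the linear-time claim in the abstract.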

Hierarchical Solution of Large-Scale Three-Dimensional Topology Optimization Problems

Volume 3: 22nd Design Automation Conference ◽

10.1115/96-detc/dac-1486 ◽

1996 ◽

Author(s):

Giuseppe C. A. DeRose ◽

Alejandro R. Díaz

Keyword(s):

Topology Optimization ◽

Data Structures ◽

Hierarchical Models ◽

Large Scale ◽

Optimization Problems ◽

Three Dimensional ◽

Solution Strategy ◽

Element Discretization ◽

3D Elasticity ◽

Computational Resources

Abstract: A new solution strategy for topology optimization in 3D elasticity is discussed. This solution strategy uses principles from hierarchical data structures and image analysis to reduce the computational resources necessary to solve large-scale topology optimization problems. The savings in computational resources result from successive use of increasingly detailed hierarchical models starting from a coarse approximation. These models, stored using octree data structures, are used to determine the finite element discretization at a given hierarchy. Through the use of the hierarchical models, large-scale topology optimization problems in 3D elasticity may be solved on desktop workstations.
