Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge

2018 ◽  
Author(s):  
Erin K. Molloy ◽  
Tandy Warnow

Abstract Background: Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches. Results: In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and “concatenation” using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases. Conclusions: Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on GitHub (http://github.com/ekmolloy/njmerge).
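Because NJMerge is described as an extension of Neighbor Joining, a minimal sketch of plain NJ may help readers picture the merger step it builds on. This is not NJMerge itself: it ignores the subset (constraint) trees entirely, returns only a topology as nested tuples, and the function name `neighbor_joining` and the dict-of-dicts distance format are assumptions made for illustration.

```python
def neighbor_joining(names, dist):
    """Plain Neighbor Joining on a symmetric distance matrix.

    names: list of taxon labels.
    dist:  dict of dicts, dist[a][b] = distance between a and b.
    Returns the topology as nested tuples (branch lengths are not computed).
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}  # working copy, grows as nodes merge

    while len(nodes) > 2:
        n = len(nodes)
        # Row sums over the currently active nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}

        # Pick the pair minimising the NJ criterion Q(a, b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best

        # Replace a and b by a new internal node and update distances.
        new = (a, b)
        d[new] = {}
        for c in nodes:
            if c in (a, b):
                continue
            d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    return tuple(nodes)


# Additive distances for the tree ((A,B),(C,D)) with all branch lengths 1.
example = {
    "A": {"B": 2, "C": 3, "D": 3},
    "B": {"A": 2, "C": 3, "D": 3},
    "C": {"A": 3, "B": 3, "D": 2},
    "D": {"A": 3, "B": 3, "C": 2},
}
tree = neighbor_joining(["A", "B", "C", "D"], example)
```

According to the abstract, NJMerge additionally takes the subset trees as input when combining them; that part is not modelled in this sketch.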

2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Xilin Yu ◽  
Thien Le ◽  
Sarah A. Christensen ◽  
Erin K. Molloy ◽  
Tandy Warnow

Abstract One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. Exact-RFS-2 is available in open source form on GitHub at https://github.com/yuxilin51/GreedyRFS.
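The RFS criterion is built on the Robinson-Foulds (RF) distance, which counts the bipartitions found in one tree but not the other. The sketch below computes that distance for two trees on the same leaf set, given as nested tuples of leaf names. It is only an illustration of the underlying distance, not Exact-RFS-2 or GreedyRFS; the tuple representation and the function names are assumptions.

```python
def leaf_set(tree):
    """Return the set of leaf names below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions (splits) induced by internal edges."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # canonical form: store the side without the anchor
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - leaves if anchor in leaves else leaves
        # Skip trivial splits (one leaf versus the rest) and the empty root split.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return leaves

    visit(tree)
    return splits


def rf_distance(t1, t2):
    """Robinson-Foulds distance: size of the symmetric difference of split sets."""
    return len(bipartitions(t1) ^ bipartitions(t2))


t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
same = rf_distance(t1, t1)
diff = rf_distance(t1, t2)
```

In the supertree setting the input trees have different leaf sets, so a candidate supertree would first be restricted to each input tree's leaves before the comparison; that restriction step is omitted here.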


Author(s):  
Xilin Yu ◽  
Thien Le ◽  
Sarah A. Christensen ◽  
Erin K. Molloy ◽  
Tandy Warnow

Abstract One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS. Exact-RFS-2 and GreedyRFS are available in open source form on GitHub at github.com/yuxilin51/GreedyRFS.


2019 ◽  
Vol 35 (14) ◽  
pp. i417-i426 ◽  
Author(s):  
Erin K Molloy ◽  
Tandy Warnow

Abstract Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species. Results: Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework: only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation: TreeMerge is publicly available on GitHub (http://github.com/ekmolloy/treemerge). Supplementary information: Supplementary data are available at Bioinformatics online.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Wen Zhong ◽  
Jian Xiong ◽  
Anping Lin ◽  
Lining Xing ◽  
Feilong Chen ◽  
...  

Multiobjective evolutionary algorithms (MOEAs) have been widely and successfully applied to many-objective optimization problems (MaOPs) over the past three decades. Unfortunately, no single MOEA with a given set of parameter settings, mating-variation operator, and environmental selection mechanism is suitable for obtaining a set of solutions with excellent convergence and diversity across the various types of MaOPs. In practice, different MOEAs differ greatly in how well they handle particular types of MaOPs. To address this, the paper proposes a flexible ensemble framework, ASES, which is highly scalable and can embed any number of MOEAs so as to combine their advantages. To reduce the loss of promising solutions during the evolution process, a large archive, holding far more solutions than the population size, is integrated into the ensemble framework to record a large number of nondominated solutions, and an efficient maintenance strategy is developed to update the archive. Furthermore, the knowledge gained from updating the archive is exploited to guide the evolutionary process for the different MOEAs, allocating limited computational resources to the more efficient algorithms. Extensive numerical experiments demonstrate the superior performance of the proposed ASES. Among 52 test instances, ASES performs better than all six baseline algorithms on at least half of the test instances with respect to both the hypervolume and the inverted generational distance metrics.
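The idea of an archive of nondominated solutions can be sketched in a few lines. The code below is a generic Pareto-archive update, assuming every objective is minimized and solutions are represented by tuples of objective values. It is not the ASES maintenance strategy, whose details the abstract does not give; `dominates` and `update_archive` are hypothetical names.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one.

    Assumes all objectives are minimised.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Return a new archive after offering one candidate objective vector."""
    # Reject the candidate if an archived point dominates or duplicates it.
    if any(dominates(member, candidate) or member == candidate for member in archive):
        return archive
    # Otherwise drop every archived point the candidate dominates, then add it.
    return [m for m in archive if not dominates(candidate, m)] + [candidate]


archive = []
for point in [(1, 5), (2, 2), (3, 3), (5, 1), (2, 2)]:
    archive = update_archive(archive, point)
```

A practical archive would also need a size limit and a pruning rule that preserves diversity, which this sketch leaves out.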


2021 ◽  
Author(s):  
Xilin Yu ◽  
Thien Le ◽  
Sarah A. Christensen ◽  
Erin K. Molloy ◽  
Tandy Warnow

Abstract One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS. Exact-RFS-2 and GreedyRFS are available in open source form on GitHub at github.com/yuxilin51/GreedyRFS.


2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
H. D. Yue ◽  
Y. Sun

Cooperative coevolution (CC) is an effective framework for solving large-scale global optimization (LSGO) problems. However, CC with a static decomposition method is ineffective for fully nonseparable problems, and CC with a dynamic decomposition method is computationally costly. Therefore, a two-stage decomposition (TSD) method is proposed in this paper to decompose LSGO problems using as few computational resources as possible. In the first stage, to decompose problems at low computational cost, a hybrid-pool differential grouping (HPDG) method is proposed, which contains a hybrid-pool-based detection structure (HPDS) and a unit vector-based perturbation (UVP) strategy. In the second stage, to decompose fully nonseparable problems, a known information-based dynamic decomposition (KIDD) method is proposed. Analytical methods are used to demonstrate that HPDG has lower decomposition complexity than state-of-the-art static decomposition methods. Experiments show that CC with TSD is a competitive algorithm for solving LSGO problems.
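For readers unfamiliar with differential grouping, the sketch below shows the basic pairwise interaction test that this family of methods is generally built on: two variables are treated as interacting if the change in the objective caused by perturbing one of them depends on the value of the other. This is the classic test, swapped in for illustration; it is not HPDG, HPDS, or UVP, and the function name, perturbation size `delta`, and tolerance `eps` are assumptions.

```python
def interacts(f, x, i, j, delta=1.0, eps=1e-9):
    """Classic differential-grouping check for interaction between x[i] and x[j].

    f:     objective function taking a list of numbers.
    x:     base point (list of numbers).
    delta: perturbation size (assumed value, problem dependent).
    eps:   tolerance below which the two differences count as equal.
    """
    xi = list(x)
    xi[i] += delta              # perturb variable i only
    xj = list(x)
    xj[j] += delta              # perturb variable j only
    xij = list(xj)
    xij[i] += delta             # perturb both i and j

    change_at_base = f(xi) - f(list(x))   # effect of i with j at its base value
    change_shifted = f(xij) - f(xj)       # effect of i with j shifted
    return abs(change_at_base - change_shifted) > eps


# Variable 0 is separable; variables 1 and 2 interact through their product.
def objective(v):
    return v[0] ** 2 + v[1] * v[2]


separable_pair = interacts(objective, [1, 1, 1], 0, 1)
coupled_pair = interacts(objective, [1, 1, 1], 1, 2)
```

Each test costs up to four function evaluations, which is why the number of pairwise checks dominates the decomposition cost in methods of this kind.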


Author(s):  
Wen-Jing Hong ◽  
Peng Yang ◽  
Ke Tang

Abstract Large-scale multi-objective optimization problems (MOPs), which involve a large number of decision variables, have emerged from many real-world applications. While evolutionary algorithms (EAs) have been widely acknowledged as a mainstream method for MOPs, most research progress and successful applications of EAs have been restricted to MOPs with a small number of decision variables. More recently, it has been reported that traditional multi-objective EAs (MOEAs) suffer severe deterioration as the number of decision variables increases. As a result, and motivated by the emergence of real-world large-scale MOPs, the investigation of MOEAs in this setting has attracted much more attention in the past decade. This paper reviews the progress of evolutionary computation for large-scale multi-objective optimization from two angles. From the perspective of the key difficulties of large-scale MOPs, scalability is analyzed by focusing on the performance of existing MOEAs and the challenges induced by the increase in the number of decision variables. From the perspective of methodology, large-scale MOEAs are categorized into three classes and introduced in turn: divide-and-conquer based, dimensionality-reduction based, and enhanced-search based approaches. Several future research directions are also discussed.


2019 ◽  
Author(s):  
Metin Balaban ◽  
Niema Moshiri ◽  
Uyen Mai ◽  
Siavash Mirarab

Abstract Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given a (not necessarily ultrametric) tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints that limit the diameter of each cluster, the sum of its branch lengths, or chains of pairwise distances. These three versions of the problem can be solved in time that increases linearly with the size of the tree, a fact that has been known by computer scientists for two of these three criteria for decades. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU picking for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
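The diameter-constrained variant can be illustrated with a greedy bottom-up pass over a rooted tree: at each node, while the two deepest remaining child subtrees would put two leaves farther apart than the threshold, the deepest one is cut off as its own cluster. The sketch below only counts the resulting clusters. It is an illustration under stated assumptions (a dict-based tree with `children` and `length` keys, a hypothetical `count_clusters` function) and not the TreeCluster implementation.

```python
def count_clusters(tree, max_diameter):
    """Count clusters so that no two leaves in a cluster exceed max_diameter apart.

    tree: nested dicts; internal nodes have "children" (a list), and every
          non-root node has "length" (length of the branch to its parent).
    """
    cuts = 0

    def height(node):
        """Distance from node to the farthest leaf still attached below it."""
        nonlocal cuts
        children = node.get("children", [])
        if not children:
            return 0.0
        # Height contributed through each child, deepest first.
        depths = sorted((height(c) + c["length"] for c in children), reverse=True)
        # Cut the deepest child while the two deepest would violate the limit.
        while len(depths) >= 2 and depths[0] + depths[1] > max_diameter:
            depths.pop(0)
            cuts += 1
        return depths[0]

    height(tree)
    return cuts + 1  # every cut splits off exactly one additional cluster


def leaf(length):
    return {"length": length}


# ((A:1, B:1):1, (C:1, D:1):1): siblings are 2 apart, cousins are 4 apart.
example_tree = {
    "children": [
        {"length": 1.0, "children": [leaf(1.0), leaf(1.0)]},
        {"length": 1.0, "children": [leaf(1.0), leaf(1.0)]},
    ]
}
loose = count_clusters(example_tree, 4.0)
medium = count_clusters(example_tree, 2.0)
tight = count_clusters(example_tree, 1.0)
```

Apart from sorting the child depths at each node, the pass visits every node once, which matches the abstract's point that the problem can be solved in time that grows linearly with the size of the tree.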


Author(s):  
Giuseppe C. A. DeRose ◽  
Alejandro R. Díaz

Abstract A new solution strategy for topology optimization in 3D elasticity is discussed. This solution strategy uses principles from hierarchical data structures and image analysis to reduce the computational resources necessary to solve large-scale topology optimization problems. The savings in computational resources result from successive use of increasingly detailed hierarchical models starting from a coarse approximation. These models, stored using octree data structures, are used to determine the finite element discretization at a given hierarchy. Through the use of the hierarchical models, large-scale topology optimization problems in 3D elasticity may be solved on desktop workstations.

