TreeCluster: clustering biological sequences using phylogenetic trees

AbstractClustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given a (not necessarily ultrametric) tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints that limit the diameter of each cluster, the sum of its branch lengths, or chains of pairwise distances. These three versions of the problem can be solved in time that increases linearly with the size of the tree, a fact that has been known by computer scientists for two of these three criteria for decades. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU picking for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available athttps://github.com/niemasd/TreeCluster.

Download Full-text

A Phylogenomic Supertree of Birds

Diversity ◽

10.3390/d11070109 ◽

2019 ◽

Vol 11 (7) ◽

pp. 109 ◽

Cited By ~ 17

Author(s):

Rebecca T. Kimball ◽

Carl H. Oliveros ◽

Ning Wang ◽

Noor D. White ◽

F. Keith Barker ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bird Species ◽

Divide And Conquer ◽

Clear Understanding ◽

Whole Genome ◽

Efficient Manner ◽

Sequence Capture ◽

Branch Lengths ◽

Supertree Methods

It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.

Download Full-text

A Comparative Study of Redundant Constraints Identification Methods in Linear Programming Problems

Mathematical Problems in Engineering ◽

10.1155/2010/723402 ◽

2010 ◽

Vol 2010 ◽

pp. 1-16 ◽

Cited By ~ 6

Author(s):

Paulraj S. ◽

Sumathi P.

Keyword(s):

Linear Programming ◽

Large Scale ◽

Optimization Problems ◽

Inequality Constraints ◽

Computational Effort ◽

Linear Functions ◽

Redundant Constraints ◽

Computer Scientists ◽

Constraints Solving ◽

Linear Programming Problems

The objective function and the constraints can be formulated as linear functions of independent variables in most of the real-world optimization problems. Linear Programming (LP) is the process of optimizing a linear function subject to a finite number of linear equality and inequality constraints. Solving linear programming problems efficiently has always been a fascinating pursuit for computer scientists and mathematicians. The computational complexity of any linear programming problem depends on the number of constraints and variables of the LP problem. Quite often large-scale LP problems may contain many constraints which are redundant or cause infeasibility on account of inefficient formulation or some errors in data input. The presence of redundant constraints does not alter the optimal solutions(s). Nevertheless, they may consume extra computational effort. Many researchers have proposed different approaches for identifying the redundant constraints in linear programming problems. This paper compares five of such methods and discusses the efficiency of each method by solving various size LP problems and netlib problems. The algorithms of each method are coded by using a computer programming language C. The computational results are presented and analyzed in this paper.

Download Full-text

In SilicoIdentification of Functional Protein Interfaces

Comparative and Functional Genomics ◽

10.1002/cfg.309 ◽

2003 ◽

Vol 4 (4) ◽

pp. 420-423 ◽

Cited By ~ 14

Author(s):

Rachel E. Bell ◽

Nir Ben-Tal

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Hypothetical Protein ◽

Evolutionary Information ◽

Structural Constraints ◽

Functional Protein ◽

Homologous Proteins ◽

Geometrical Properties ◽

Protein Interfaces ◽

Branch Lengths

Proteins perform many of their biological roles through protein–protein, protein–DNA or protein–ligand interfaces. The identification of the amino acids comprising these interfaces often enhances our understanding of the biological function of the proteins. Many methods for the detection of functional interfaces have been developed, and large-scale analyses have provided assessments of their accuracy. Among them are those that consider the size of the protein interface, its amino acid composition and its physicochemical and geometrical properties. Other methods to this effect use statistical potential functions of pairwise interactions, and evolutionary information. The rationale of the evolutionary approach is that functional and structural constraints impose selective pressure; hence, biologically important interfaces often evolve at a slower pace than do other external regions of the protein. Recently, an algorithm, Rate4Site, and a web-server, ConSurf (http://consurf.tau.ac.il/), for the identification of functional interfaces based on the evolutionary relations among homologous proteins as reflected in phylogenetic trees, were developed in our laboratory. The explicit use of the tree topology and branch lengths makes the method remarkably accurate and sensitive. Here we demonstrate its potency in the identification of the functional interfaces of a hypothetical protein, the structure of which was determined as part of the international structural genomics effort. Finally, we propose to combine complementary procedures, in order to enhance the overall performance of methods for the identification of functional interfaces in proteins.

Download Full-text

Solving large-scale optimization problems by divide-and-conquer neural networks

10.1109/ijcnn.1989.118626 ◽

1989 ◽

Cited By ~ 10

Author(s):

Foo ◽

Szu

Keyword(s):

Neural Networks ◽

Large Scale ◽

Optimization Problems ◽

Divide And Conquer ◽

Large Scale Optimization ◽

Scale Optimization

Download Full-text

Efficient Merging of Genome Profile Alignments

10.1101/309047 ◽

2018 ◽

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Data Sets ◽

Whole Genome ◽

Multiple Sequence ◽

Construction Methods ◽

Current Implementation ◽

Whole Genomes

AbstractMotivationWhole-genome alignment methods show insufficient scalability towards the generation of large-scale whole-genome alignments (WGAs). Profile alignment-based approaches revolutionized the fields of multiple sequence alignment construction methods by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole-genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles.ResultsHere, we present GPA, an approach that aligns the profiles of WGAs and is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined to a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profiles alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve (Darling et al., 2010) and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial data sets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool.AvailabilityGPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of [email protected]

Download Full-text

Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge

10.1101/469130 ◽

2018 ◽

Cited By ~ 3

Author(s):

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Large Scale ◽

Optimization Problems ◽

Distance Matrix ◽

Divide And Conquer ◽

Estimation Methods ◽

Supertree Method ◽

Base Method ◽

Time Extension ◽

Phylogeny Estimation ◽

Computational Resources

AbstractBackgroundDivide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches.ResultsIn this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and “concatenation” using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases.ConclusionsTheoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge).

Download Full-text

Efficient merging of genome profile alignments

Bioinformatics ◽

10.1093/bioinformatics/btz377 ◽

2019 ◽

Vol 35 (14) ◽

pp. i71-i80

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Supplementary Information ◽

Whole Genome ◽

Multiple Sequence ◽

Genome Profile ◽

Construction Methods ◽

Profile Alignment

Abstract Motivation Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized the fields of multiple sequence alignment construction methods by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which make the profile-based extension of several whole-genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles. Results Here, we present genome profile alignment, an approach that aligns the profiles of WGAs and that is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined to a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profiles alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial datasets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool. Availability and implementation GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Evolutionary Computation for Large-scale Multi-objective Optimization: A Decade of Progresses

International Journal of Automation and Computing ◽

10.1007/s11633-020-1253-0 ◽

2021 ◽

Author(s):

Wen-Jing Hong ◽

Peng Yang ◽

Ke Tang

Keyword(s):

Evolutionary Computation ◽

Real World ◽

Large Scale ◽

Optimization Problems ◽

Divide And Conquer ◽

Research Progress ◽

Small Scale ◽

Multi Objective Optimization ◽

Multi Objective ◽

Decision Variables

AbstractLarge-scale multi-objective optimization problems (MOPs) that involve a large number of decision variables, have emerged from many real-world applications. While evolutionary algorithms (EAs) have been widely acknowledged as a mainstream method for MOPs, most research progress and successful applications of EAs have been restricted to MOPs with small-scale decision variables. More recently, it has been reported that traditional multi-objective EAs (MOEAs) suffer severe deterioration with the increase of decision variables. As a result, and motivated by the emergence of real-world large-scale MOPs, investigation of MOEAs in this aspect has attracted much more attention in the past decade. This paper reviews the progress of evolutionary computation for large-scale multi-objective optimization from two angles. From the key difficulties of the large-scale MOPs, the scalability analysis is discussed by focusing on the performance of existing MOEAs and the challenges induced by the increase of the number of decision variables. From the perspective of methodology, the large-scale MOEAs are categorized into three classes and introduced respectively: divide and conquer based, dimensionality reduction based and enhanced search-based approaches. Several future research directions are also discussed.

Download Full-text

MAGUS: Multiple sequence Alignment using Graph clUStering

Bioinformatics ◽

10.1093/bioinformatics/btaa992 ◽

2020 ◽

Author(s):

Vladimir Smirnov ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Graph Clustering ◽

Divide And Conquer ◽

Supplementary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Full Dataset ◽

A New Technique

Abstract Motivation The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g. MAFFT), and then merged together into an alignment on the full dataset. Results We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but then merges the subset alignments using the Graph Clustering Merger, a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS produces improved accuracy and is faster than PASTA on large datasets, and matches it on smaller datasets. Availability and implementation MAGUS: https://github.com/vlasmirnov/MAGUS Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Large Scale Plant Optimization Problems

10.23919/acc.1988.4790048 ◽

1988 ◽

Cited By ~ 1

Author(s):

Paul Cronin ◽

Harry Woerde ◽

Rob Vasbinder

Keyword(s):

Large Scale ◽

Optimization Problems ◽

Plant Optimization

Download Full-text