A Bioinformatics Protocol for Quickly Creating Large-Scale Phylogenetic Trees

Author(s):  
Hugo López-Fernández ◽  
Pedro Duque ◽  
Sílvia Henriques ◽  
Noé Vázquez ◽  
Florentino Fdez-Riverola ◽  
...  
2019 ◽  
Author(s):  
Benoit Morel ◽  
Alexey M. Kozlov ◽  
Alexandros Stamatakis ◽  
Gergely J. Szöllősi

AbstractInferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges species tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.


2007 ◽  
Vol 23 (7) ◽  
pp. 785-788 ◽  
Author(s):  
F. Habib ◽  
A. D. Johnson ◽  
R. Bundschuh ◽  
D. Janies

2020 ◽  
Vol 37 (9) ◽  
pp. 2763-2774 ◽  
Author(s):  
Benoit Morel ◽  
Alexey M Kozlov ◽  
Alexandros Stamatakis ◽  
Gergely J Szöllősi

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).  


2003 ◽  
Vol 4 (4) ◽  
pp. 420-423 ◽  
Author(s):  
Rachel E. Bell ◽  
Nir Ben-Tal

Proteins perform many of their biological roles through protein–protein, protein–DNA or protein–ligand interfaces. The identification of the amino acids comprising these interfaces often enhances our understanding of the biological function of the proteins. Many methods for the detection of functional interfaces have been developed, and large-scale analyses have provided assessments of their accuracy. Among them are those that consider the size of the protein interface, its amino acid composition and its physicochemical and geometrical properties. Other methods to this effect use statistical potential functions of pairwise interactions, and evolutionary information. The rationale of the evolutionary approach is that functional and structural constraints impose selective pressure; hence, biologically important interfaces often evolve at a slower pace than do other external regions of the protein. Recently, an algorithm, Rate4Site, and a web-server, ConSurf (http://consurf.tau.ac.il/), for the identification of functional interfaces based on the evolutionary relations among homologous proteins as reflected in phylogenetic trees, were developed in our laboratory. The explicit use of the tree topology and branch lengths makes the method remarkably accurate and sensitive. Here we demonstrate its potency in the identification of the functional interfaces of a hypothetical protein, the structure of which was determined as part of the international structural genomics effort. Finally, we propose to combine complementary procedures, in order to enhance the overall performance of methods for the identification of functional interfaces in proteins.


2011 ◽  
Vol 26 (1) ◽  
pp. 5-42 ◽  
Author(s):  
Peter Bakker ◽  
Aymeric Daval-Markussen ◽  
Mikael Parkvall ◽  
Ingo Plag

In creolist circles, there has been a a long-standing debate whether creoles differ structurally from non-creole languages and thus would form a special class of languages with specific typological properties. This debate about the typological status of creole languages has severely suffered from a lack of systematic empirical study. This paper presents for the first time a number of large-scale empirical investigations of the status of creole languages as a typological class on the basis of different and well-balanced samples of creole and non-creole languages. Using statistical modeling (multiple regression) and recently developed computational tools of quantitative typology (phylogenetic trees and networks), this paper provides robust evidence that creoles indeed form a structurally distinguishable subgroup within the world’s languages. The findings thus seriously challenge approaches that hold that creole languages are structurally indistinguishable from non-creole languages.


2021 ◽  
Vol 25 (02) ◽  
pp. 433-448
Author(s):  
Bruno Eleres Soares ◽  
◽  
Gabriel Nakamura ◽  
◽  

Neotropical stream fishes exhibit a complex evolutionary history and encompass both old and recent lineages. Patterns of species diversity of stream fishes are relatively well-studied for Neotropical streams, but not for patterns of clade distribution and historical factors that structure these assemblages, which are the main interests of phylogenetic ecology. Understanding the evolutionary context of communities provides important insights into large-scale mechanisms that structure them. This review aims to: (i) discuss the main concepts of phylogenetic ecology and its application to Neotropical stream fishes; and (ii) highlight the main methods applied in this background. The first section presents the main phylogenetic hypothesis of fishes and discusses how their gaps in Neotropical stream fishes hinder phylogenetic ecology. Afterward, we discuss the main concepts of phylogenetic ecology (phylogenetic signal, community phylogenetic structure, and phylogenetic diversity), as well as gaps and potential applications of these concepts and tools to understand Neotropical stream fish assemblages. The second section introduces the main methods to address the phylogenetic ecology, including a standardized procedure to edit fish phylogenetic trees, comparative methods, and indices and analytical tools to understand community structure and conservation importance. Finally, we discuss the perspectives to the next years to better understand the Neotropical stream fish assemblages in the light of past and current historical processes.


2019 ◽  
Author(s):  
Ananya Bhattacharjee ◽  
Md. Shamsuzzoha Bayzid

AbstractBackgroundDue to the recent advances in sequencing technologies and species tree estimation methods capable of taking gene tree discordance into account, notable progress has been achieved in constructing large scale phylogenetic trees from genome wide data. However, substantial challenges remain in leveraging this huge amount of molecular data. One of the foremost among these challenges is the need for efficient tools that can handle missing data. Popular distance-based methods such as neighbor joining and UPGMA require that the input distance matrix does not contain any missing values.ResultsWe introduce two highly accurate machine learning based distance imputation techniques. One of our approaches is based on matrix factorization, and the other one is an autoencoder based deep learning technique. We evaluate these two techniques on a collection of simulated and biological datasets, and show that our techniques match or improve upon the best alternate techniques for distance imputation. Moreover, our proposed techniques can handle substantial amount of missing data, to the extent where the best alternate methods fail.ConclusionsThis study shows for the first time the power and feasibility of applying deep learning techniques for imputing distance matrices. The autoencoder based deep learning technique is highly accurate and scalable to large dataset. We have made these techniques freely available as a cross-platform software (available at https://github.com/Ananya-Bhattacharjee/ImputeDistances).


2015 ◽  
Author(s):  
Alexander Zizka ◽  
Alexandre Antonelli

1. Large-scale species occurrence data from geo-referenced observations and collected specimens are crucial for analyses in ecology, evolution and biogeography. Despite the rapidly growing availability of such data, their use in evolutionary analyses is often hampered by tedious manual classification of point occurrences into operational areas, leading to a lack of reproducibility and concerns regarding data quality. 2. Here we present speciesgeocodeR, a user-friendly R-package for data cleaning, data exploration and data visualization of species point occurrences using discrete operational areas, and linking them to analyses invoking phylogenetic trees. 3. The three core functions of the package are 1) automated and reproducible data cleaning, 2) rapid and reproducible classification of point occurrences into discrete operational areas in an adequate format for subsequent biogeographic analyses, and 3) a comprehensive summary and visualization of species distributions to explore large datasets and ensure data quality. In addition, speciesgeocodeR facilitates the access and analysis of publicly available species occurrence data, widely used operational areas and elevation ranges. Other functionalities include the implementation of minimum occurrence thresholds and the visualization of coexistence patterns and range sizes. SpeciesgeocodeR accompanies a richly illustrated and easy-to-follow tutorial and help functions.


2019 ◽  
Author(s):  
Sankar Subramanian ◽  
Umayal Ramasamy ◽  
David Chen

In the past decades a number of software programs have been developed to deduce the phylogenetic relationship between populations. However, these programs are not suited for large-scale whole genome data. Recently, a few standalone or web applications have been developed to handle genome-wide data, but they were either computationally intensive, dependent on third party software or required significant time and resource of a web server. In the post-genomic era, researchers are able to obtain bioinformatically processed high-quality publication-ready whole genome data for many individuals in a population from next generation sequencing companies due to the reduction in the cost of sequencing and analysis. Such genotype data is typically presented in the Variant Call Format (VCF) and there is no simple software available that uses this data to construct the phylogeny of populations in a short time. To address this limitation, we have developed a one-click user-friendly software, VCF2PopTree that uses gnome-wide SNPs to construct and display phylogenetic trees in seconds to minutes. For example, it reads a 1 GB VCF file and draws a tree in less than 5 minutes. VCF2PopTree accepts genotype data from a local machine, constructs a tree using UPGMA and Neighbour-Joining algorithms and displays it on a web-browser. It also produces pairwise-diversity matrix in MEGA and PHYLIP file formats as well as trees in the Newick format which could be directly used by other popular phylogenetic software programs. The software including the source code, a test VCF input file and short documentation are available at: https://github.com/sansubs/vcf2pop.


Sign in / Sign up

Export Citation Format

Share Document