Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data

Summary: We present GENESIS, a library for working with phylogenetic data, and GAPPA, an accompanying command line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies, and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested, and field-proven. Availability and Implementation: Both GENESIS and GAPPA are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa.

Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data

Bioinformatics ◽

10.1093/bioinformatics/btaa070 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3263-3265 ◽

Cited By ~ 14

Author(s):

Lucas Czech ◽

Pierre Barbera ◽

Alexandros Stamatakis

Keyword(s):

Phylogenetic Trees ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Computationally Efficient ◽

Data Types ◽

Low Level ◽

Phylogenetic Placement ◽

Command Line Tool ◽

High Level

Summary: We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. Availability and implementation: Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. Supplementary information: Supplementary data are available at Bioinformatics online.

CONSTAX2: Improved taxonomic classification of environmental DNA markers

10.1101/2021.02.15.430803 ◽

2021 ◽

Author(s):

Julian Liber ◽

Gregory Bonito ◽

Gian Maria Niccolò Benucci

Keyword(s):

Dna Markers ◽

Environmental Dna ◽

Taxonomic Classification ◽

Command Line ◽

Consensus Approach ◽

Link Type ◽

Command Line Tool ◽

High Level ◽

Taxonomic Annotation

Summary: CONSTAX (the CONSensus TAXonomy classifier) was developed for accurate and reproducible taxonomic annotation of fungal rDNA amplicons and is based upon a consensus approach of RDP, SINTAX and UTAX algorithms. CONSTAX2 can be used to classify prokaryotes and incorporates BLAST-based classifiers to reduce classification errors. Additionally, CONSTAX2 implements a conda-installable, command line tool with improved classification metrics, faster training, multithreading support, capacity to incorporate external taxonomic databases, new isolate matching and high-level taxonomy tools, replete with documentation and example tutorials. Availability and Implementation: CONSTAX2 is available at https://github.com/liberjul/CONSTAXv2, and is packaged for Linux and MacOS from Bioconda. A tutorial and documentation are available at https://constax.readthedocs.io/en/latest/.

PhySortR: a fast, flexible tool for sorting phylogenetic trees in R

10.7287/peerj.preprints.1609v1 ◽

2015 ◽

Author(s):

Timothy G Stephens ◽

Debashish Bhattacharya ◽

Mark A Ragan ◽

Cheong Xin Chan

Keyword(s):

Phylogenetic Trees ◽

R Package ◽

Command Line ◽

Flexible Tool ◽

Command Line Tool ◽

Whole Tree

A frequent bottleneck in interpreting phylogenomic output is the need to screen often thousands of trees for features of interest, such as robust clades of specific taxa, as evidence of monophyletic relationship and/or reticulated evolution. Here we present PhySortR, a fast, flexible R package for sorting phylogenetic trees. Unlike existing utilities, PhySortR allows for identification of both exclusive and non-exclusive clades uniting the target taxa, with customisable options to assess clades within the context of the whole tree. PhySortR is a command-line tool that is freely available, highly scalable, and easily automatable.
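The distinction between exclusive and non-exclusive clades can be illustrated with a minimal Python sketch. It is not PhySortR's implementation or API (PhySortR is an R package). The nested-tuple tree encoding, the `classify` function, the `max_foreign_fraction` threshold and the working definitions (exclusive: some subtree holds exactly the target taxa; non-exclusive: some subtree holds all target taxa plus a bounded share of other taxa) are all assumptions made for illustration, and the tree is treated as rooted.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree node."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def subtrees(tree):
    """Yield every node (internal or leaf) of the tree."""
    yield tree
    if not isinstance(tree, str):
        for child in tree:
            yield from subtrees(child)


def classify(tree, is_target, max_foreign_fraction=0.25):
    """Classify a rooted tree as 'exclusive', 'non-exclusive' or 'none'.

    Hypothetical definitions used in this sketch only:
    - exclusive: a subtree whose leaves are exactly the target taxa;
    - non-exclusive: a subtree containing all target taxa in which
      non-target leaves make up at most max_foreign_fraction.
    """
    targets = {name for name in leaves(tree) if is_target(name)}
    if not targets:
        return "none"
    best = "none"
    for node in subtrees(tree):
        members = leaves(node)
        if not targets <= members:
            continue  # this subtree misses at least one target taxon
        foreign = len(members - targets)
        if foreign == 0:
            return "exclusive"
        if foreign / len(members) <= max_foreign_fraction:
            best = "non-exclusive"
    return best


def starts_with_a(name):
    """Example target predicate: taxa whose label starts with 'A_'."""
    return name.startswith("A_")


tree_exclusive = ((("A_1", "A_2"), "A_3"), ("B_1", "B_2"))
tree_mixed = (((("A_1", "A_2"), "B_1"), "A_3"), ("B_2", "B_3"))
tree_scattered = (("A_1", "B_1"), ("A_2", "B_2"))
```

Screening thousands of trees then amounts to calling `classify` on each one and grouping the trees by the returned label.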

CoRC: the COPASI R Connector

Bioinformatics ◽

10.1093/bioinformatics/btab033 ◽

2021 ◽

Author(s):

Jonas Förster ◽

Frank T Bergmann ◽

Jürgen Pahle

Keyword(s):

Graphical User Interface ◽

Academic Research ◽

R Package ◽

Supplementary Information ◽

Command Line ◽

Graphical Interface ◽

Thought Process ◽

Extensive Analysis ◽

Command Line Tool ◽

High Level

Motivation: COPASI is a biochemical simulator and model analyzer which has found widespread use in academic research, teaching and beyond. One of COPASI’s strengths is its graphical user interface, and this is what most users work with. COPASI also provides a command-line tool. So far, however, an intuitive scripting interface that allows the creation and documentation of systems biology workflows has been missing. Results: We have developed CoRC, the COPASI R Connector, an R package which provides a high-level scripting interface for COPASI. It closely mirrors the thought process of a (graphical interface) user and should therefore be very easy to use. This allows for complex workflows to be reproducibly scripted, utilizing COPASI’s powerful analytic toolset in combination with R’s extensive analysis and package ecosystem. Availability and implementation: CoRC is a free and open-source R package, available via GitHub at https://jpahle.github.io/CoRC/ under the Artistic-2.0 license. Supplementary information: We provide tutorial articles as well as several example scripts on the project’s website.

Improving performance of Python code using rewriting rules technique

PROBLEMS IN PROGRAMMING ◽

10.15407/pp2020.02-03.115 ◽

2020 ◽

pp. 115-125

Author(s):

K.A. Zhereb ◽

Keyword(s):

Algebraic Model ◽

Algebraic Models ◽

Data Types ◽

Low Level ◽

Additional Information ◽

Program Comparison ◽

Level Model ◽

Automatic Tool ◽

High Level ◽

Rewriting Rules

Python is a popular programming language used in many areas, but its performance is significantly lower than that of many compiled languages. We propose an approach to increasing the performance of Python code by transforming fragments of code into more efficient languages such as Cython and C++. We use high-level algebraic models and the rewriting-rules technique for semi-automated code transformation. Performance-critical fragments of code are transformed into a low-level syntax model using a Python parser. This low-level model is then further transformed into a high-level algebraic model that is language-independent and easier to work with. The transformation is automated using rewriting rules implemented in the Termware system. We also improve the constructed high-level model by deducing additional information such as data types and constraints. From this enhanced high-level model of the code we generate equivalent fragments of code using code generators for the Cython and C++ languages. Cython code is seamlessly integrated with Python code, and for C++ code we generate a small utility file in Cython that also integrates this code with Python. This way, the bulk of the program code can stay in Python and benefit from its facilities, while performance-critical fragments are transformed into more efficient equivalents, improving the performance of the resulting program. A comparison of execution times between the initial version of the Python code, different versions of the transformed code, and automatic tools such as the Cython compiler and PyPy demonstrates the benefits of our approach: we have achieved performance gains of over 50x compared to the initial version written in Python, and over 2x compared to the best automatic tool we have tested.
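The idea of parsing code into a syntax model and transforming it with a rewriting rule can be sketched with Python's standard `ast` module. This is an illustration only: it does not use Termware or the authors' algebraic models, and the rule shown (rewriting `x ** 2` to `x * x`), the class name `SquareToMultiply` and the helper `rewrite_source` are assumptions made for this example. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


class SquareToMultiply(ast.NodeTransformer):
    """Hypothetical rewriting rule: name ** 2  ->  name * name."""

    def visit_BinOp(self, node):
        # Rewrite children first so nested expressions are handled bottom-up.
        self.generic_visit(node)
        is_square = (
            isinstance(node.op, ast.Pow)
            and isinstance(node.right, ast.Constant)
            and type(node.right.value) is int
            and node.right.value == 2
            and isinstance(node.left, ast.Name)
        )
        if not is_square:
            return node
        # Duplicating a bare name is safe because reading it has no side effects.
        left_copy = ast.Name(id=node.left.id, ctx=ast.Load())
        new_node = ast.BinOp(left=node.left, op=ast.Mult(), right=left_copy)
        return ast.copy_location(new_node, node)


def rewrite_source(source: str) -> str:
    """Parse source, apply the rule to every matching node, and regenerate code."""
    tree = ast.parse(source)
    tree = SquareToMultiply().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(rewrite_source("y = x ** 2 + z ** 3"))
```

The rule deliberately matches only bare names on the left-hand side, so that an expression with side effects is never evaluated twice. A real pipeline of the kind described above would apply many such rules and emit Cython or C++ rather than Python.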

Megadepth: efficient coverage quantification for BigWigs and BAMs

10.1101/2020.12.17.423317 ◽

2020 ◽

Author(s):

Christopher Wilks ◽

Omar Ahmed ◽

Daniel N. Baker ◽

David Zhang ◽

Leonardo Collado-Torres ◽

...

Keyword(s):

Gene Annotation ◽

Command Line ◽

Bioconductor Package ◽

Input File ◽

Link Type ◽

Command Line Tool

Motivation: A common way to summarize sequencing datasets is to quantify data lying within genes or other genomic intervals. This can be slow and can require different tools for different input file types. Results: Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor. Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19,000 GTExV8 BigWig files in approximately one hour using 32 threads. Megadepth is available both as a command-line tool and as an R/Bioconductor package providing much faster quantification compared to the rtracklayer package. Availability: https://github.com/ChristopherWilks/megadepth, https://bioconductor.org/packages/
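Summarizing coverage within intervals can be illustrated with a small prefix-sum sketch in Python. This is a generic technique, not Megadepth's implementation or interface. The function names, the in-memory list of per-base coverage values and the 0-based, half-open interval convention are assumptions made for this example.

```python
from itertools import accumulate


def build_prefix_sums(coverage):
    """Return a list where prefix[i] is the total coverage over positions [0, i)."""
    return [0, *accumulate(coverage)]


def interval_sums(coverage, intervals):
    """Sum per-base coverage within each 0-based, half-open (start, end) interval."""
    prefix = build_prefix_sums(coverage)
    return [prefix[end] - prefix[start] for start, end in intervals]


# Toy per-base coverage for a 7-base region and three disjoint intervals.
coverage = [0, 2, 2, 5, 1, 0, 3]
intervals = [(0, 3), (3, 5), (5, 7)]
print(interval_sums(coverage, intervals))  # prints [4, 6, 3]
```

After the prefix sums are built once, each interval is answered with two lookups, which is why this layout suits annotations with many intervals.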

treedata.table: a wrapper for data.table that enables fast manipulation of large phylogenetic trees matched to data

PeerJ ◽

10.7717/peerj.12450 ◽

2021 ◽

Vol 9 ◽

pp. e12450

Author(s):

Cristian Román Palacios ◽

April Wright ◽

Josef Uyeda

Keyword(s):

Next Generation Sequencing ◽

Phylogenetic Trees ◽

Next Generation ◽

Data Repositories ◽

Data Manipulation ◽

Link Type ◽

Recent Advances ◽

Phylogenetic Data ◽

Public Data ◽

Generation Sequencing

The number of terminals in phylogenetic trees has significantly increased over the last decade. This trend reflects recent advances in next-generation sequencing, accessibility of public data repositories, and the increased use of phylogenies in many fields. Despite R being central to the analysis of phylogenetic data, manipulation of phylogenetic comparative datasets remains slow, complex, and poorly reproducible. Here, we describe the first R package extending the functionality and syntax of data.table to explicitly deal with phylogenetic comparative datasets. treedata.table significantly increases speed and reproducibility during the data manipulation steps involved in the phylogenetic comparative workflow in R. The latest release of treedata.table is currently available through CRAN (https://cran.r-project.org/web/packages/treedata.table/). Additional documentation can be accessed through rOpenSci (https://ropensci.github.io/treedata.table/).

On the automatic annotation of gene functions using observational data and phylogenetic trees

10.1101/2020.05.14.095687 ◽

2020 ◽

Author(s):

George G. Vega Yon ◽

Duncan C. Thomas ◽

John Morrison ◽

Huaiyu Mi ◽

Paul D. Thomas ◽

...

Keyword(s):

Gene Function ◽

Phylogenetic Trees ◽

Evolutionary Model ◽

Computational Prediction ◽

Gene Families ◽

R Package ◽

Biomedical Sciences ◽

Computationally Efficient ◽

Link Type ◽

Gene Functions

Motivation: Gene function annotation is important for a variety of downstream analyses of genetic data. Yet experimental characterization of function remains costly and slow, making computational prediction an important endeavor. In this paper we use a probabilistic evolutionary model built upon phylogenetic trees and experimental Gene Ontology functional annotations that allows automated prediction of function for unannotated genes. Results: We have developed a computationally efficient model of evolution of gene annotations using phylogenies, based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out validation, and we further validated some of the predictions in the experimental scientific literature. Availability: Our method has been implemented as an R package and it is available online at https://github.com/USCBiostats/aphylo. Code needed to reproduce the tables and figures can be found in https://github.com/USCbiostats/aphylo-simulations. Author summary: Understanding the individual role that genes play in life is a key issue in biomedical sciences. While information regarding gene functions is continuously growing, the number of genes with unknown biological purpose is yet greater. Because of this, scientists have dedicated much of their time to build and design tools that automatically infer gene functions. In this paper, we present yet another attempt to do so. While very simple, our model of gene-function evolution has some key features that have the potential to generate an impact in the field: (a) compared to other methods, ours is highly scalable, which means that it is possible to simultaneously analyze hundreds of what are known as gene families, comprising thousands of genes, (b) it supports our biological intuition, as our model’s data-driven results agree with what theory dictates regarding how gene functions evolved, (c) notwithstanding its simplicity, the model’s prediction accuracy is comparable to other more complex alternatives, and (d) perhaps most importantly, it can be used both to support new annotations and to suggest areas in which existing annotations show inconsistencies that may indicate errors or controversies.

Easily phylotyping E. coli via the EzClermont web app and command-line tool

10.1101/317610 ◽

2018 ◽

Cited By ~ 3

Author(s):

Nicholas R. Waters ◽

Florence Abram ◽

Fiona Brennan ◽

Ashleigh Holmes ◽

Leighton Pritchard

Keyword(s):

Supplementary Information ◽

Validation Dataset ◽

Command Line ◽

E Coli ◽

Link Type ◽

Command Line Tool ◽

Pcr Method ◽

Web App ◽

Local Use ◽

Genome Assemblies

Summary: The Clermont PCR method of phylotyping Escherichia coli has remained a useful classification scheme despite the proliferation of higher-resolution sequence typing schemes. We have implemented an in silico Clermont PCR method as both a web app and a command-line tool to allow researchers to easily apply this phylotyping scheme to genome assemblies. Availability and Implementation: EzClermont is available as a web app at http://www.ezclermont.org. For local use, EzClermont can be installed with pip or installed from the source code at https://github.com/nickp60/ezclermont. All analysis was done with version […]. Supplementary information: Table S1: test dataset; S2: validation dataset; S3: results.

Easy phylotyping of Escherichia coli via the EzClermont web app and command-line tool

Access Microbiology ◽

10.1099/acmi.0.000143 ◽

2020 ◽

Vol 2 (9) ◽

Cited By ~ 2

Author(s):

Nicholas R. Waters ◽

Florence Abram ◽

Fiona Brennan ◽

Ashleigh Holmes ◽

Leighton Pritchard

Keyword(s):

Escherichia Coli ◽

Type Species ◽

Whole Genome ◽

Command Line ◽

Content Type ◽

Link Type ◽

Command Line Tool ◽

Pcr Method ◽

Web App ◽

Genome Assemblies

The Clermont PCR method for phylotyping Escherichia coli remains a useful classification scheme even though genome sequencing is now routine, and higher-resolution sequence typing schemes are now available. Relating present-day whole-genome E. coli classifications to legacy phylotyping is essential for harmonizing the historical literature and understanding of this important organism. Therefore, we present EzClermont – a novel in silico Clermont PCR phylotyping tool to enable ready application of this phylotyping scheme to whole-genome assemblies. We evaluate this tool against phylogenomic classifications, and an alternative software implementation of Clermont typing. EzClermont is available as a web app at www.ezclermont.org, and as a command-line tool at https://nickp60.github.io/EzClermont/.
