A fast and memory-efficient implementation of the transfer bootstrap

Sarah Lutteropp; Alexey M Kozlov; Alexandros Stamatakis

doi:10.1093/bioinformatics/btz874

A fast and memory-efficient implementation of the transfer bootstrap

Bioinformatics ◽

10.1093/bioinformatics/btz874 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2280-2281 ◽

Cited By ~ 2

Author(s):

Sarah Lutteropp ◽

Alexey M Kozlov ◽

Alexandros Stamatakis

Keyword(s):

General Public ◽

Efficient Implementation ◽

Supplementary Information ◽

Bootstrap Support ◽

Supplementary Data ◽

Original Algorithm ◽

Parallel Version ◽

Branch Support ◽

General Public License ◽

Memory Efficient

Abstract Motivation: Recently, Lemoine et al. suggested the transfer bootstrap expectation (TBE) branch support metric as an alternative to classical phylogenetic bootstrap support for taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Results: We developed a fast and memory-efficient TBE implementation. We improve upon the original algorithm by Lemoine et al. via several algorithmic and technical optimizations. On empirical as well as on random tree sets with varying taxon counts, our implementation is up to 480 times faster than booster. Furthermore, it only requires memory that is linear in the number of taxa, which leads to 10× to 40× memory savings compared with booster. Availability and implementation: Our implementation has been partially integrated into pll-modules and RAxML-NG and is available under the GNU Affero General Public License v3.0 at https://github.com/ddarriba/pll-modules and https://github.com/amkozlov/raxml-ng. The parallel version that also computes additional TBE-related statistics is available at: https://github.com/lutteropp/raxml-ng/tree/tbe. Supplementary information: Supplementary data are available at Bioinformatics online.
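For readers unfamiliar with the metric, the following is a minimal, naive sketch of how a TBE support value can be computed from its published definition (Lemoine et al.). It is not the optimized algorithm from this paper, nor the booster or RAxML-NG code. The representation (each branch as the frozenset of taxa on one side) and all function names are assumptions made for illustration, and each bootstrap tree is assumed to list all of its bipartitions, including the trivial single-taxon ones.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical representation: a branch is the set of taxa on one of its sides.
Bipartition = FrozenSet[str]


def transfer_distance(a: Bipartition, b: Bipartition, n_taxa: int) -> int:
    """Minimum number of taxa to move so that bipartition a equals b."""
    diff = len(a ^ b)  # taxa on which the two chosen sides disagree
    return min(diff, n_taxa - diff)  # also consider the complement of b


def transfer_index(ref: Bipartition, boot_tree: Iterable[Bipartition], n_taxa: int) -> int:
    """Smallest transfer distance from ref to any branch of one bootstrap tree."""
    return min(transfer_distance(ref, side, n_taxa) for side in boot_tree)


def tbe_support(ref: Bipartition, boot_trees: List[List[Bipartition]], n_taxa: int) -> float:
    """TBE = 1 - mean(transfer index) / (p - 1), p = size of the smaller side."""
    p = min(len(ref), n_taxa - len(ref))
    if p < 2:
        raise ValueError("TBE is only defined for internal branches (p >= 2)")
    mean_index = sum(transfer_index(ref, t, n_taxa) for t in boot_trees) / len(boot_trees)
    return 1.0 - mean_index / (p - 1)


# Toy example with six taxa; the data are made up for illustration.
taxa = ["A", "B", "C", "D", "E", "F"]
trivial = [frozenset([t]) for t in taxa]
reference_branch = frozenset(["A", "B", "C"])
boot_tree_1 = trivial + [frozenset(["A", "B"]), frozenset(["A", "B", "C"]), frozenset(["E", "F"])]
boot_tree_2 = trivial + [frozenset(["A", "B"]), frozenset(["C", "D"]), frozenset(["E", "F"])]

support = tbe_support(reference_branch, [boot_tree_1, boot_tree_2], len(taxa))
print(support)  # 0.75
```

This naive version compares every reference branch with every bootstrap branch, so its cost grows quickly with the number of taxa; it is meant only to make the definition concrete.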


A Fast and Memory-Efficient Implementation of the Transfer Bootstrap

10.1101/734848 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sarah Lutteropp ◽

Alexey M. Kozlov ◽

Alexandros Stamatakis

Keyword(s):

General Public ◽

Efficient Implementation ◽

Random Tree ◽

Bootstrap Support ◽

Original Algorithm ◽

Link Type ◽

Branch Support ◽

General Public License ◽

Memory Efficient

Abstract Recently, Lemoine et al. suggested the Transfer Bootstrap Expectation (TBE) branch support metric as an alternative to the classical phylogenetic bootstrap support metric for taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Therefore, we developed a fast and memory-efficient TBE implementation. We improved upon the original algorithm described by Lemoine et al. by introducing multiple algorithmic and technical optimizations. On empirical as well as on random tree sets with varying taxon counts, our implementation is up to 480 times faster than booster. Furthermore, it only requires memory that is linear in the number of taxa, which leads to 10× to 40× memory savings compared to booster. Our implementation has been partially integrated into pll-modules and RAxML-NG and is available under the GNU Affero General Public License v3.0 at https://github.com/ddarriba/pll-modules and https://github.com/amkozlov/raxml-ng. The parallelized version that also computes additional TBE-related statistics is available in pll-modules and RAxML-NG forks at: https://github.com/lutteropp/pll-modules/tree/tbe and https://github.com/lutteropp/raxml-ng/tree/tbe.


GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database

Bioinformatics ◽

10.1093/bioinformatics/btz848 ◽

2019 ◽

Cited By ~ 161

Author(s):

Pierre-Alain Chaumeil ◽

Aaron J Mussig ◽

Philip Hugenholtz ◽

Donovan H Parks

Keyword(s):

General Public ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Computationally Efficient ◽

Taxonomic Assignments ◽

General Public License

Abstract Summary: The GTDB Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10,156 bacterial and archaeal metagenome-assembled genomes. Availability: GTDB-Tk is implemented in Python and licensed under the GNU General Public License v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk. Supplementary information: Supplementary data are available at Bioinformatics online.


ARBitR: an overlap-aware genome assembly scaffolder for linked reads

Bioinformatics ◽

10.1093/bioinformatics/btaa975 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Genomic Sequencing ◽

Supplementary Data ◽

Genome Assemblies ◽

General Public License

Abstract Summary: Linked genomic sequencing reads contain information that can be used to join sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences with a gap between them, not considering potential overlaps of contigs. We developed ARBitR to create scaffolds where overlaps are taken into account and show that it can accurately recreate regions where draft assemblies are broken. Availability and implementation: ARBitR is written and implemented in Python3 for Unix-based operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License v3. Supplementary information: Supplementary data are available at Bioinformatics online.


BIOLITMAP: a web-based geolocated, temporal and thematic visualization of the evolution of bioinformatics publications

Bioinformatics ◽

10.1093/bioinformatics/bty967 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2518-2520

Author(s):

Adrián Bazaga ◽

Alfonso Valencia ◽

María-José Rementeria

Keyword(s):

General Public ◽

Fast Growth ◽

Supplementary Information ◽

Supplementary Data ◽

Web Based ◽

Research Publications

Abstract Motivation: The fast growth of bioinformatics makes it significantly harder to assess the contribution and the geographical and thematic distribution of research publications. Results: To help researchers, grant agencies and the general public assess progress in bioinformatics, we have developed BIOLITMAP, a web-based geolocation system that allows easy and sensible exploration of publications by institution, year and topic. Availability and implementation: BIOLITMAP is available at http://socialanalytics.bsc.es/biolitmap and the sources have been deposited at https://github.com/inab/BIOLITMAP. Supplementary information: Supplementary data are available at Bioinformatics online.


Crosslink: A fast, scriptable genetic mapper for outcrossing species

10.1101/135277 ◽

2017 ◽

Cited By ~ 6

Author(s):

Robert J. Vickerstaff ◽

Richard J. Harrison

Keyword(s):

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Mapping Software ◽

Outcrossing Species ◽

Supplementary Material ◽

Novel Approaches ◽

Similar Accuracy ◽

General Public License

Abstract Summary: Crosslink is genetic mapping software for outcrossing species, designed to run efficiently on large datasets by combining the best of existing tools with novel approaches. Tests show it runs much faster than several comparable programs whilst retaining similar accuracy. Availability and implementation: Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.


Issues of Z-factor and an approach to avoid them for quality control in high-throughput screening studies

Bioinformatics ◽

10.1093/bioinformatics/btaa1049 ◽

2020 ◽

Author(s):

Xiaohua Douglas Zhang ◽

Dandan Wang ◽

Shixue Sun ◽

Heping Zhang

Keyword(s):

Quality Control ◽

High Throughput ◽

High Throughput Screening ◽

Theoretical Basis ◽

Sampling Error ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Automation Technology ◽

General Public License

Abstract Motivation: High-throughput screening (HTS) is a vital automation technology in biomedical research in both industry and academia. The well-known Z-factor has been widely used as a gatekeeper to assure assay quality in an HTS study. However, many researchers and users may not have realized that the Z-factor has major issues. Results: In this article, the following four major issues are explored and demonstrated so that researchers may use the Z-factor appropriately. First, the Z-factor violates the Pythagorean theorem of statistics. Second, there is no adjustment of sampling error in the application of the Z-factor for quality control (QC) in HTS studies. Third, the expectation of the sample-based Z-factor does not exist. Fourth, the thresholds in the Z-factor-based criterion lack a theoretical basis. Here, an approach to avoid these issues was proposed and new QC criteria under homoscedasticity were constructed so that researchers can choose a statistically grounded criterion for QC in HTS studies. We implemented this approach in an R package and demonstrated its utility in multiple CRISPR/CAS9 or siRNA HTS studies. Availability and implementation: The R package qcSSMDhomo is freely available from GitHub: https://github.com/Karena6688/qcSSMDhomo. The file qcSSMDhomo_1.0.0.tar.gz (for Windows) containing qcSSMDhomo is also available at Bioinformatics online. qcSSMDhomo is distributed under the GNU General Public License. Supplementary information: Supplementary data are available at Bioinformatics online.
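As background for the metric being criticized, here is a minimal sketch of the commonly used sample-based Z-factor computed from control wells, 1 - 3(s_pos + s_neg) / |mean_pos - mean_neg|. It does not reproduce the qcSSMDhomo package or the new QC criteria proposed in the paper; the function name and the control readings are made up for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def z_factor(positive: Sequence[float], negative: Sequence[float]) -> float:
    """Sample-based Z-factor from positive and negative control readings."""
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        # The ratio is undefined when the two control means coincide.
        raise ValueError("control means are identical; Z-factor is undefined")
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return 1.0 - 3.0 * (stdev(positive) + stdev(negative)) / separation


# Hypothetical plate readings, not taken from the paper.
positive_controls = [100.0, 102.0, 98.0, 101.0, 99.0]
negative_controls = [10.0, 12.0, 8.0, 11.0, 9.0]

z = z_factor(positive_controls, negative_controls)
print(round(z, 3))
```

Because the estimate is a ratio of sample statistics, it has no built-in adjustment for sampling error, which is one of the issues the abstract raises.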


PyRanges: efficient comparison of genomic intervals in Python

10.1101/609396 ◽

2019 ◽

Cited By ~ 1

Author(s):

Endre Bakken Stovner ◽

Pål Sætrom

Keyword(s):

Supplementary Information ◽

Supplementary Data ◽

Genomic Libraries ◽

Link Type ◽

Simple Set ◽

Set Operations ◽

Wide Range ◽

Genomic Analyses ◽

Associated Data ◽

Memory Efficient

Abstract Summary: Complex genomic analyses often use sequences of simple set operations like intersection, overlap, and nearest on genomic intervals. These operations, coupled with some custom programming, allow a wide range of analyses to be performed. To this end, we have written PyRanges, a data structure for representing and manipulating genomic intervals and their associated data in Python. Run single-threaded on binary set operations, PyRanges is, in median, 2.3–9.6 times faster than the popular R GenomicRanges library and is equally memory efficient; run multi-threaded on 8 cores, our library is up to 123 times faster. PyRanges is therefore ideally suited both for individual analyses and as a foundation for future genomic libraries in Python. Availability: PyRanges is available open-source under the MIT license at https://github.com/biocore-NTNU/pyranges and documentation exists at https://biocore-NTNU.github.io/pyranges/[email protected] Supplementary information: Supplementary data are available.


polyDFEv2.0: testing for invariance of the distribution of fitness effects within and across species

Bioinformatics ◽

10.1093/bioinformatics/bty1060 ◽

2019 ◽

Vol 35 (16) ◽

pp. 2868-2869 ◽

Cited By ~ 10

Author(s):

Paula Tataru ◽

Thomas Bataillon

Keyword(s):

Source Code ◽

Likelihood Ratio Tests ◽

Supplementary Information ◽

Supplementary Data ◽

Post Processing ◽

Fitness Effects ◽

Site Frequency Spectrum ◽

Genomic Regions ◽

General Public License ◽

R Functions

Abstract Summary: The distribution of fitness effects (DFE) of mutations can be inferred from site frequency spectrum (SFS) data. There is mounting interest in determining whether distinct genomic regions and/or species share a common DFE, or whether evidence exists for differences among them. polyDFEv2.0 fits multiple SFS datasets at once and provides likelihood ratio tests for DFE invariance across datasets. Simulations show that testing for DFE invariance across genomic regions within a species requires models accounting for distinct sources of heterogeneity (chance and genuine difference in DFE) underlying differences in SFS data in these regions. Not accounting for this will result in the spurious detection of DFE differences. Availability and implementation: polyDFEv2.0 is implemented in C and is accompanied by a series of R functions that facilitate post-processing of the output. It is available as source code and compiled binaries under a GNU General Public License v3.0 from https://github.com/paula-tataru/polyDFE. Supplementary information: Supplementary data are available at Bioinformatics online.


PyRanges: efficient comparison of genomic intervals in Python

Bioinformatics ◽

10.1093/bioinformatics/btz615 ◽

2019 ◽

Cited By ~ 2

Author(s):

Endre Bakken Stovner ◽

Pål Sætrom

Keyword(s):

Data Structure ◽

Supplementary Information ◽

Supplementary Data ◽

Genomic Libraries ◽

Simple Set ◽

Set Operations ◽

Wide Range ◽

Genomic Analyses ◽

Associated Data ◽

Memory Efficient

Abstract Summary: Complex genomic analyses often use sequences of simple set operations like intersection, overlap and nearest on genomic intervals. These operations, coupled with some custom programming, allow a wide range of analyses to be performed. To this end, we have written PyRanges, a data structure for representing and manipulating genomic intervals and their associated data in Python. Run single-threaded on binary set operations, PyRanges is, in median, 2.3–9.6 times faster than the popular R GenomicRanges library and is equally memory efficient; run multi-threaded on 8 cores, our library is up to 123 times faster. PyRanges is therefore ideally suited both for individual analyses and as a foundation for future genomic libraries in Python. Availability and implementation: PyRanges is available as open source under the MIT license at https://github.com/biocore-NTNU/pyranges and the documentation exists at https://biocore-NTNU.github.io/pyranges/. Supplementary information: Supplementary data are available at Bioinformatics online.
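To make the set operations named in the abstract concrete, here is a small plain-Python sketch of intersection and overlap on genomic intervals. It deliberately does not use the PyRanges API and makes no attempt at its performance; the tuple representation (chromosome, start, end) with half-open coordinates, the function names, and the example data are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open [start, end).
Interval = Tuple[str, int, int]


def _group_by_chromosome(intervals: List[Interval]) -> Dict[str, List[Tuple[int, int]]]:
    """Index intervals by chromosome so only same-chromosome pairs are compared."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in intervals:
        grouped[chrom].append((start, end))
    return grouped


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the pieces of intervals in a that are covered by intervals in b."""
    other = _group_by_chromosome(b)
    pieces: List[Interval] = []
    for chrom, start, end in a:
        for o_start, o_end in other.get(chrom, []):
            lo, hi = max(start, o_start), min(end, o_end)
            if lo < hi:  # non-empty shared region
                pieces.append((chrom, lo, hi))
    return pieces


def overlap(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the intervals in a (unclipped) that overlap any interval in b."""
    other = _group_by_chromosome(b)
    return [
        (chrom, start, end)
        for chrom, start, end in a
        if any(start < o_end and o_start < end for o_start, o_end in other.get(chrom, []))
    ]


# Made-up example intervals.
genes = [("chr1", 10, 50), ("chr1", 100, 150), ("chr2", 5, 20)]
peaks = [("chr1", 40, 120), ("chr2", 30, 40)]

print(intersect(genes, peaks))  # [('chr1', 40, 50), ('chr1', 100, 120)]
print(overlap(genes, peaks))    # [('chr1', 10, 50), ('chr1', 100, 150)]
```

The pairwise comparison here is quadratic per chromosome; dedicated libraries avoid that cost with sorted or indexed interval structures.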


ARBitR: An overlap-aware genome assembly scaffolder for linked reads

10.1101/2020.04.29.065847 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Ltr Retrotransposons ◽

Sequencing Data ◽

Long Read ◽

Genome Assemblies ◽

General Public License

Abstract 10X Genomics Chromium linked reads contain information that can be used to link sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences together with a gap between them, not considering potential contig overlaps. Such overlaps can be particularly prominent in genome drafts assembled from long-read sequencing data where an overlap-layout-consensus (OLC) algorithm has been used. Ignoring overlapping contig ends may result in genes and other features being incomplete or fragmented in the resulting scaffolds. We developed the application ARBitR to generate scaffolds from genome drafts using 10X Chromium data, with a focus on minimizing the number of gaps in the resulting scaffolds by incorporating an OLC step to resolve junctions between linked contigs. We tested the performance of ARBitR on three published and simulated datasets and compared it to the previously published tools ARCS and ARKS. The results revealed that ARBitR performed similarly in terms of contiguity statistics, and the advantage of the overlapping step was revealed by fewer long and short variants in ARBitR-produced scaffolds, in addition to a higher proportion of completely assembled LTR retrotransposons. We expect ARBitR to have broad applicability in genome assembly projects that utilize 10X Chromium linked reads. Availability and implementation: ARBitR is written and implemented in Python3 for Unix-like operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License [email protected] Supplementary information: available online.
