A fast and memory-efficient implementation of the transfer bootstrap

2019 ◽  
Vol 36 (7) ◽  
pp. 2280-2281 ◽  
Author(s):  
Sarah Lutteropp ◽  
Alexey M Kozlov ◽  
Alexandros Stamatakis

Abstract Motivation Recently, Lemoine et al. suggested the transfer bootstrap expectation (TBE) branch support metric as an alternative to classical phylogenetic bootstrap support for taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Results We developed a fast and memory-efficient TBE implementation. We improve upon the original algorithm by Lemoine et al. via several algorithmic and technical optimizations. On empirical as well as on random tree sets with varying taxon counts, our implementation is up to 480 times faster than booster. Furthermore, it only requires memory that is linear in the number of taxa, which leads to 10× to 40× memory savings compared with booster. Availability and implementation Our implementation has been partially integrated into pll-modules and RAxML-NG and is available under the GNU Affero General Public License v3.0 at https://github.com/ddarriba/pll-modules and https://github.com/amkozlov/raxml-ng. The parallel version that also computes additional TBE-related statistics is available at: https://github.com/lutteropp/raxml-ng/tree/tbe. Supplementary information Supplementary data are available at Bioinformatics online.
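To make the metric concrete, the following is a minimal, naive sketch of how a TBE value can be computed for one reference branch, following the definition by Lemoine et al.: the transfer index is the smallest number of taxa that must be moved to turn the reference bipartition into some bipartition of a bootstrap tree, and TBE is one minus the mean transfer index divided by p - 1, where p is the size of the smaller side of the reference bipartition. This is not the optimized algorithm from the paper and not the pll-modules or RAxML-NG API. The function names and the representation of a tree as a list of taxon sets (trivial leaf bipartitions included) are assumptions made for illustration.

```python
from typing import FrozenSet, List

Bipartition = FrozenSet[str]  # the taxa on one side of a branch


def transfer_distance(b: Bipartition, b_star: Bipartition, n_taxa: int) -> int:
    """Minimum number of taxa to move so that the two bipartitions coincide."""
    d = len(b ^ b_star)  # Hamming distance between the two chosen sides
    # The complement of b_star describes the same branch, so take the smaller value.
    return min(d, n_taxa - d)


def transfer_index(b: Bipartition, tree: List[Bipartition], n_taxa: int) -> int:
    """Smallest transfer distance from b to any branch of one bootstrap tree."""
    return min(transfer_distance(b, b_star, n_taxa) for b_star in tree)


def tbe(b: Bipartition, bootstrap_trees: List[List[Bipartition]], n_taxa: int) -> float:
    """Transfer bootstrap expectation of reference branch b (naive, quadratic)."""
    p = min(len(b), n_taxa - len(b))  # size of the smaller side of b
    if p < 2:
        raise ValueError("TBE is only defined for internal branches (p >= 2)")
    mean_index = sum(transfer_index(b, t, n_taxa) for t in bootstrap_trees) / len(bootstrap_trees)
    return 1.0 - mean_index / (p - 1)


# Hypothetical toy data: six taxa, two bootstrap trees given as bipartition lists.
leaves = [frozenset(x) for x in "ABCDEF"]
tree_1 = leaves + [frozenset("AB"), frozenset("ABC"), frozenset("EF")]
tree_2 = leaves + [frozenset("AB"), frozenset("ABD"), frozenset("EF")]
support = tbe(frozenset("ABC"), [tree_1, tree_2], n_taxa=6)
```

In this toy input the reference branch ABC|DEF is present in the first tree (transfer index 0) and one taxon away from a branch in the second (transfer index 1). The naive loop compares every reference branch against every bootstrap branch, which is the kind of cost an optimized implementation would aim to avoid.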

2019 ◽  
Author(s):  
Sarah Lutteropp ◽  
Alexey M. Kozlov ◽  
Alexandros Stamatakis

Abstract Recently, Lemoine et al. suggested the Transfer Bootstrap Expectation (TBE) branch support metric as an alternative to the classical phylogenetic bootstrap support metric on taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Therefore, we developed a fast and memory-efficient TBE implementation. We improved upon the original algorithm described by Lemoine et al. by introducing multiple algorithmic and technical optimizations. On empirical as well as on random tree sets with varying taxon counts, our implementation is up to 480 times faster than booster. Furthermore, it only requires memory that is linear in the number of taxa, which leads to 10× to 40× memory savings compared to booster. Our implementation has been partially integrated into pll-modules and RAxML-NG and is available under the GNU Affero General Public License v3.0 at https://github.com/ddarriba/pll-modules and https://github.com/amkozlov/raxml-ng. The parallelized version that also computes additional TBE-related statistics is available in pll-modules and RAxML-NG forks at: https://github.com/lutteropp/pll-modules/tree/tbe and https://github.com/lutteropp/raxml-ng/tree/tbe.


Author(s):  
Pierre-Alain Chaumeil ◽  
Aaron J Mussig ◽  
Philip Hugenholtz ◽  
Donovan H Parks

Abstract Summary The GTDB Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB). GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10,156 bacterial and archaeal metagenome-assembled genomes. Availability GTDB-Tk is implemented in Python and licensed under the GNU General Public License v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Markus Hiltunen ◽  
Martin Ryberg ◽  
Hanna Johannesson

Abstract Summary Linked genomic sequencing reads contain information that can be used to join sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences with a gap between them, not considering potential overlaps of contigs. We developed ARBitR to create scaffolds where overlaps are taken into account and show that it can accurately recreate regions where draft assemblies are broken. Availability and implementation ARBitR is written and implemented in Python3 for Unix-based operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License v3. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (14) ◽  
pp. 2518-2520
Author(s):  
Adrián Bazaga ◽  
Alfonso Valencia ◽  
María-José Rementeria

Abstract Motivation The fast growth of bioinformatics makes it increasingly difficult to assess the contribution and the geographical and thematic distribution of research publications. Results To help researchers, grant agencies and the general public assess progress in bioinformatics, we have developed BIOLITMAP, a web-based geolocation system that allows easy and sensible exploration of publications by institution, year and topic. Availability and implementation BIOLITMAP is available at http://socialanalytics.bsc.es/biolitmap and the sources have been deposited at https://github.com/inab/BIOLITMAP. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Robert J. Vickerstaff ◽  
Richard J. Harrison

Abstract Summary Crosslink is genetic mapping software for outcrossing species designed to run efficiently on large datasets by combining the best from existing tools with novel approaches. Tests show it runs much faster than several comparable programs whilst retaining a similar accuracy. Availability and implementation Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/[email protected] Supplementary information Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.


Author(s):  
Xiaohua Douglas Zhang ◽  
Dandan Wang ◽  
Shixue Sun ◽  
Heping Zhang

Abstract Motivation High-throughput screening (HTS) is a vital automation technology in biomedical research in both industry and academia. The well-known Z-factor has been widely used as a gatekeeper to assure assay quality in an HTS study. However, many researchers and users may not have realized that Z-factor has major issues. Results In this article, the following four major issues are explored and demonstrated so that researchers may use the Z-factor appropriately. First, the Z-factor violates the Pythagorean theorem of statistics. Second, there is no adjustment of sampling error in the application of the Z-factor for quality control (QC) in HTS studies. Third, the expectation of the sample-based Z-factor does not exist. Fourth, the thresholds in the Z-factor-based criterion lack a theoretical basis. Here, an approach to avoid these issues was proposed and new QC criteria under homoscedasticity were constructed so that researchers can choose a statistically grounded criterion for QC in the HTS studies. We implemented this approach in an R package and demonstrated its utility in multiple CRISPR/CAS9 or siRNA HTS studies. Availability and implementation The R package qcSSMDhomo is freely available from GitHub: https://github.com/Karena6688/qcSSMDhomo. The file qcSSMDhomo_1.0.0.tar.gz (for Windows) containing qcSSMDhomo is also available at Bioinformatics online. qcSSMDhomo is distributed under the GNU General Public License. Supplementary information Supplementary data are available at Bioinformatics online.
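As a point of reference for the quantities discussed above, here is a minimal sketch of the sample-based Z-factor in its commonly cited form, 1 - 3(s_pos + s_neg)/|m_pos - m_neg|, next to the strictly standardized mean difference (SSMD) for two independent groups, (m_pos - m_neg)/sqrt(s_pos^2 + s_neg^2). It does not reproduce the QC criteria or the API of the qcSSMDhomo package. The function names and the control readouts are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def z_factor(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Sample-based Z-factor computed from positive and negative control wells."""
    # Note that the standard deviations are added directly, not their squares.
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))


def ssmd(pos: Sequence[float], neg: Sequence[float]) -> float:
    """SSMD for independent groups: mean difference over the SD of the difference."""
    # Variances add for independent groups, so the denominator uses squared SDs.
    return (mean(pos) - mean(neg)) / sqrt(stdev(pos) ** 2 + stdev(neg) ** 2)


# Hypothetical control readouts from one plate.
positive_controls = [10.0, 12.0, 14.0]
negative_controls = [1.0, 2.0, 3.0]
z = z_factor(positive_controls, negative_controls)
s = ssmd(positive_controls, negative_controls)
```

The contrast in the two denominators illustrates the first issue named in the abstract: the Z-factor sums standard deviations, whereas the variability of a difference between independent groups is governed by the sum of variances.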


2019 ◽  
Author(s):  
Endre Bakken Stovner ◽  
Pål Sætrom

Abstract Summary Complex genomic analyses often use sequences of simple set operations like intersection, overlap, and nearest on genomic intervals. These operations, coupled with some custom programming, allow a wide range of analyses to be performed. To this end, we have written PyRanges, a data structure for representing and manipulating genomic intervals and their associated data in Python. Run single-threaded on binary set operations, PyRanges is in median 2.3-9.6 times faster than the popular R GenomicRanges library and is equally memory efficient; run multi-threaded on 8 cores, our library is up to 123 times faster. PyRanges is therefore ideally suited both for individual analyses and as a foundation for future genomic libraries in Python. Availability PyRanges is available open-source under the MIT license at https://github.com/biocore-NTNU/pyranges and documentation exists at https://biocore-NTNU.github.io/pyranges/[email protected] Supplementary information Supplementary data are available.


2019 ◽  
Vol 35 (16) ◽  
pp. 2868-2869 ◽  
Author(s):  
Paula Tataru ◽  
Thomas Bataillon

Abstract Summary Distribution of fitness effects (DFE) of mutations can be inferred from site frequency spectrum (SFS) data. There is mounting interest to determine whether distinct genomic regions and/or species share a common DFE, or whether evidence exists for differences among them. polyDFEv2.0 fits multiple SFS datasets at once and provides likelihood ratio tests for DFE invariance across datasets. Simulations show that testing for DFE invariance across genomic regions within a species requires models accounting for distinct sources of heterogeneity (chance and genuine difference in DFE) underlying differences in SFS data in these regions. Not accounting for this will result in the spurious detection of DFE differences. Availability and Implementation polyDFEv2.0 is implemented in C and is accompanied by a series of R functions that facilitate post-processing of the output. It is available as source code and compiled binaries under a GNU General Public License v3.0 from https://github.com/paula-tataru/polyDFE. Supplementary information Supplementary data are available at Bioinformatics online.
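The likelihood ratio test mentioned above can be sketched generically as follows. This is not polyDFE code: it assumes two nested models (a null model with a DFE shared across datasets and an alternative with dataset-specific DFEs), assumes the usual asymptotic chi-squared reference distribution, and uses hypothetical log-likelihood values. The chi-squared tail probability is computed with the standard library only, via the recurrence for the regularized upper incomplete gamma function.

```python
import math
from typing import Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper tail probability of a chi-squared distribution with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    # Closed-form starting values: Q(1/2, y) for odd df, Q(1, y) for even df.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null: float, lnl_alt: float, extra_params: int) -> Tuple[float, float]:
    """Return the LRT statistic and its asymptotic p-value for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(max(stat, 0.0), extra_params)


# Hypothetical fits: shared DFE (null) versus dataset-specific DFEs (alternative),
# where the alternative has two additional free parameters.
stat, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-997.0, extra_params=2)
```

A small p-value would suggest rejecting DFE invariance, subject to the caveat in the abstract that the compared models must account for the other sources of heterogeneity between datasets.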


Author(s):  
Endre Bakken Stovner ◽  
Pål Sætrom

Abstract Summary Complex genomic analyses often use sequences of simple set operations like intersection, overlap and nearest on genomic intervals. These operations, coupled with some custom programming, allow a wide range of analyses to be performed. To this end, we have written PyRanges, a data structure for representing and manipulating genomic intervals and their associated data in Python. Run single threaded on binary set operations, PyRanges is in median 2.3–9.6 times faster than the popular R GenomicRanges library and is equally memory efficient; run multi-threaded on 8 cores, our library is up to 123 times faster. PyRanges is therefore ideally suited both for individual analyses and as a foundation for future genomic libraries in Python. Availability and implementation PyRanges is available as open source under the MIT license at https://github.com/biocore-NTNU/pyranges and the documentation exists at https://biocore-NTNU.github.io/pyranges/ Supplementary information Supplementary data are available at Bioinformatics online.
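For readers unfamiliar with interval set operations, the sketch below shows what an intersection of two sets of genomic intervals computes. It deliberately uses plain Python rather than the PyRanges API, and it is a naive pairwise comparison, not the data structure or algorithm PyRanges uses. The tuple layout (chromosome, start, end) with half-open coordinates and the function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open [start, end)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping parts of every pair of intervals from a and b."""
    # Group the second set by chromosome so only same-chromosome pairs are compared.
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in b:
        by_chrom[chrom].append((start, end))

    overlaps: List[Interval] = []
    for chrom, start, end in a:
        for b_start, b_end in by_chrom.get(chrom, []):
            lo, hi = max(start, b_start), min(end, b_end)
            if lo < hi:  # touching intervals do not overlap in half-open coordinates
                overlaps.append((chrom, lo, hi))
    return sorted(overlaps)


# Hypothetical intervals.
genes = [("chr1", 0, 10), ("chr1", 20, 30), ("chr2", 5, 15)]
peaks = [("chr1", 5, 25), ("chr2", 15, 20)]
shared = intersect(genes, peaks)
```

The pairwise loop is quadratic per chromosome, so this form is only suitable for small inputs.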


2020 ◽  
Author(s):  
Markus Hiltunen ◽  
Martin Ryberg ◽  
Hanna Johannesson

Abstract 10X Genomics Chromium linked reads contain information that can be used to link sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences together with a gap between them, not considering potential contig overlaps. Such overlaps can be particularly prominent in genome drafts assembled from long-read sequencing data where an overlap-layout-consensus (OLC) algorithm has been used. Ignoring overlapping contig ends may result in genes and other features being incomplete or fragmented in the resulting scaffolds. We developed the application ARBitR to generate scaffolds from genome drafts using 10X Chromium data, with a focus on minimizing the number of gaps in resulting scaffolds by incorporating an OLC step to resolve junctions between linked contigs. We tested the performance of ARBitR on three published and simulated datasets and compared it to the previously published tools ARCS and ARKS. The results revealed that ARBitR performed similarly considering contiguity statistics, and the advantage of the overlapping step was revealed by fewer long and short variants in ARBitR-produced scaffolds, in addition to a higher proportion of completely assembled LTR retrotransposons. We expect ARBitR to have broad applicability in genome assembly projects that utilize 10X Chromium linked reads. Availability and implementation ARBitR is written and implemented in Python3 for Unix-like operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License [email protected] Supplementary information available online.
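The difference between gap-based joining and overlap-aware joining can be illustrated with a deliberately simplified sketch. It is not ARBitR's implementation: it only detects exact suffix-prefix matches, whereas a real OLC step would need to tolerate sequencing errors, and the function names, the minimum overlap length, and the gap size are hypothetical.

```python
def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest exact match between a suffix of left and a prefix of right."""
    min_len = max(min_len, 1)  # an overlap of length zero is not meaningful
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def join_contigs(left: str, right: str, gap: int = 10, min_len: int = 3) -> str:
    """Merge two linked contigs on their overlap, or fall back to an N-filled gap."""
    k = longest_overlap(left, right, min_len)
    if k:
        return left + right[k:]  # keep the shared bases only once
    return left + "N" * gap + right  # no overlap found: join with a gap


# Hypothetical contig ends.
merged = join_contigs("ACGTTGCA", "TGCAGGA")
gapped = join_contigs("AAAA", "CCCC", gap=2)
```

In the first call the contigs share the four bases TGCA, so the merged scaffold contains them once and no gap is introduced. In the second call no overlap exists and the result contains a run of N characters.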

