PyRanges: efficient comparison of genomic intervals in Python

Bioinformatics ◽

10.1093/bioinformatics/btz615 ◽

2019 ◽

Cited By ~ 2

Author(s):

Endre Bakken Stovner ◽

Pål Sætrom

Keyword(s):

Data Structure ◽

Supplementary Information ◽

Supplementary Data ◽

Genomic Libraries ◽

Simple Set ◽

Set Operations ◽

Wide Range ◽

Genomic Analyses ◽

Associated Data ◽

Memory Efficient

Abstract Summary Complex genomic analyses often use sequences of simple set operations like intersection, overlap and nearest on genomic intervals. These operations, coupled with some custom programming, allow a wide range of analyses to be performed. To this end, we have written PyRanges, a data structure for representing and manipulating genomic intervals and their associated data in Python. Run single-threaded on binary set operations, PyRanges is a median 2.3–9.6 times faster than the popular R GenomicRanges library and is equally memory efficient; run multi-threaded on 8 cores, our library is up to 123 times faster. PyRanges is therefore ideally suited both for individual analyses and as a foundation for future genomic libraries in Python. Availability and implementation PyRanges is available as open source under the MIT license at https://github.com/biocore-NTNU/pyranges and the documentation is at https://biocore-NTNU.github.io/pyranges/ Supplementary information Supplementary data are available at Bioinformatics online.
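
To make the set operations named in the abstract concrete, here is a minimal pure-Python sketch of intersection and nearest on half-open intervals written as (chromosome, start, end) tuples. It is an illustration only: the function names `intersect` and `nearest_distance` are hypothetical, and this is not the PyRanges API or its implementation.

```python
from collections import defaultdict


def intersect(a, b):
    """Return the overlapping parts of intervals in `a` with intervals in `b`.

    Intervals are half-open (chrom, start, end) tuples.
    Hypothetical helper for illustration, not part of PyRanges.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in b:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    out = []
    for chrom, start, end in a:
        for s, e in by_chrom.get(chrom, []):
            if s >= end:
                break  # sorted by start, so nothing later can overlap
            if e > start:
                out.append((chrom, max(start, s), min(end, e)))
    return out


def nearest_distance(query, b):
    """Return the smallest gap between `query` and any interval in `b` on the
    same chromosome (0 if they overlap or touch), or None if there is none."""
    chrom, start, end = query
    gaps = [max(0, s - end, start - e) for c, s, e in b if c == chrom]
    return min(gaps) if gaps else None


a = [("chr1", 10, 20), ("chr2", 5, 8)]
b = [("chr1", 15, 30), ("chr1", 0, 12), ("chr2", 8, 10)]
print(intersect(a, b))                       # [('chr1', 10, 12), ('chr1', 15, 20)]
print(nearest_distance(("chr1", 40, 50), b))  # 10
```

The inner loop is a linear scan per query; a library built for speed would typically use sorted sweeps or interval trees instead.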

PyRanges: efficient comparison of genomic intervals in Python

10.1101/609396 ◽

2019 ◽

Cited By ~ 1

Author(s):

Endre Bakken Stovner ◽

Pål Sætrom

Keyword(s):

Supplementary Information ◽

Supplementary Data ◽

Genomic Libraries ◽

Link Type ◽

Simple Set ◽

Set Operations ◽

Wide Range ◽

Genomic Analyses ◽

Associated Data ◽

Memory Efficient

Abstract Summary Complex genomic analyses often use sequences of simple set operations like intersection, overlap, and nearest on genomic intervals. These operations, coupled with some custom programming, allow a wide range of analyses to be performed. To this end, we have written PyRanges, a data structure for representing and manipulating genomic intervals and their associated data in Python. Run single-threaded on binary set operations, PyRanges is a median 2.3–9.6 times faster than the popular R GenomicRanges library and is equally memory efficient; run multi-threaded on 8 cores, our library is up to 123 times faster. PyRanges is therefore ideally suited both for individual analyses and as a foundation for future genomic libraries in Python. Availability PyRanges is available open-source under the MIT license at https://github.com/biocore-NTNU/pyranges and documentation exists at https://biocore-NTNU.github.io/pyranges/ Contact [email protected] Supplementary information Supplementary data are available.

Epidemiological modeling in StochSS Live!

Bioinformatics ◽

10.1093/bioinformatics/btab061 ◽

2021 ◽

Author(s):

Richard Jiang ◽

Bruno Jacob ◽

Matthew Geiger ◽

Sean Matthew ◽

Bryan Rumsey ◽

...

Keyword(s):

Stochastic Model ◽

Epidemiological Model ◽

Supplementary Information ◽

Supplementary Data ◽

Web Based ◽

Epidemiological Modeling ◽

Modeling Simulation ◽

Wide Range ◽

Biochemical Systems

Abstract Summary We present StochSS Live!, a web-based service for modeling, simulation and analysis of a wide range of mathematical, biological and biochemical systems. Using an epidemiological model of COVID-19, we demonstrate the power of StochSS Live! to enable researchers to quickly develop a deterministic or a discrete stochastic model, infer its parameters and analyze the results. Availability and implementation StochSS Live! is freely available at https://live.stochss.org/ Supplementary information Supplementary data are available at Bioinformatics online.
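
As a rough illustration of what a discrete stochastic epidemiological model involves, the sketch below runs a Gillespie-style simulation of a basic SIR model in plain Python. Everything here is an assumption for illustration: the SIR structure, the function name `gillespie_sir` and the parameter values are hypothetical and are not taken from StochSS Live! or from the COVID-19 model the abstract mentions.

```python
import random


def gillespie_sir(s, i, r, beta, gamma, t_max, seed=0):
    """Simulate one stochastic SIR trajectory with the direct Gillespie method.

    Returns a list of (time, S, I, R) states. Hypothetical sketch only.
    """
    rng = random.Random(seed)
    n = s + i + r
    t = 0.0
    trajectory = [(t, s, i, r)]
    while i > 0:
        infection_rate = beta * s * i / n   # S + I -> 2 I
        recovery_rate = gamma * i           # I -> R
        total_rate = infection_rate + recovery_rate
        t += rng.expovariate(total_rate)    # waiting time to the next event
        if t > t_max:
            break
        # Pick which event fires, in proportion to its rate.
        if rng.random() * total_rate < infection_rate:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        trajectory.append((t, s, i, r))
    return trajectory


# Illustrative parameter values, not fitted to any data.
traj = gillespie_sir(s=990, i=10, r=0, beta=0.3, gamma=0.1, t_max=50.0, seed=1)
print(len(traj), traj[-1])
```

A deterministic counterpart would integrate the corresponding ordinary differential equations; the stochastic version differs most when the number of infected individuals is small.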

DrawGlycan-SNFG and gpAnnotate: rendering glycans and annotating glycopeptide mass spectra

Bioinformatics ◽

10.1093/bioinformatics/btz819 ◽

2019 ◽

Cited By ~ 4

Author(s):

Kai Cheng ◽

Gabrielle Pawlowski ◽

Xinheng Yu ◽

Yusen Zhou ◽

Sriram Neelamegham

Keyword(s):

Mass Spectrometry ◽

Open Source ◽

Mass Spectra ◽

Supplementary Information ◽

Supplementary Data ◽

International Union ◽

Open Source Program ◽

Source Program ◽

Wide Range ◽

Peptide Modifications

Abstract Summary This manuscript describes an open-source program, DrawGlycan-SNFG (version 2), that accepts IUPAC (International Union of Pure and Applied Chemistry)-condensed inputs to render Symbol Nomenclature For Glycans (SNFG) drawings. A wide range of local and global options enable display of various glycan/peptide modifications including bond breakages, adducts, repeat structures, ambiguous identifications, etc. These facilities make DrawGlycan-SNFG ideal for integration into various glycoinformatics software, including glycomics and glycoproteomics mass spectrometry (MS) applications. As a demonstration of such usage, we incorporated DrawGlycan-SNFG into gpAnnotate, a standalone application to score and annotate individual MS/MS glycopeptide spectra in different fragmentation modes. Availability and implementation DrawGlycan-SNFG and gpAnnotate are platform independent. While originally coded using MATLAB, compiled packages are also provided to enable DrawGlycan-SNFG implementation in Python and Java. All programs are available from https://virtualglycome.org/drawglycan; https://virtualglycome.org/gpAnnotate. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

A fast and memory-efficient implementation of the transfer bootstrap

Bioinformatics ◽

10.1093/bioinformatics/btz874 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2280-2281 ◽

Cited By ~ 2

Author(s):

Sarah Lutteropp ◽

Alexey M Kozlov ◽

Alexandros Stamatakis

Keyword(s):

General Public ◽

Efficient Implementation ◽

Supplementary Information ◽

Bootstrap Support ◽

Supplementary Data ◽

Original Algorithm ◽

Parallel Version ◽

Branch Support ◽

General Public License ◽

Memory Efficient

Abstract Motivation Recently, Lemoine et al. suggested the transfer bootstrap expectation (TBE) branch support metric as an alternative to classical phylogenetic bootstrap support for taxon-rich datasets. However, the original TBE implementation in the booster tool is compute- and memory-intensive. Results We developed a fast and memory-efficient TBE implementation. We improve upon the original algorithm by Lemoine et al. via several algorithmic and technical optimizations. On empirical as well as on random tree sets with varying taxon counts, our implementation is up to 480 times faster than booster. Furthermore, it only requires memory that is linear in the number of taxa, which leads to 10× to 40× memory savings compared with booster. Availability and implementation Our implementation has been partially integrated into pll-modules and RAxML-NG and is available under the GNU Affero General Public License v3.0 at https://github.com/ddarriba/pll-modules and https://github.com/amkozlov/raxml-ng. The parallel version that also computes additional TBE-related statistics is available at: https://github.com/lutteropp/raxml-ng/tree/tbe. Supplementary information Supplementary data are available at Bioinformatics online.
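
For readers unfamiliar with the metric, the following is a naive sketch of a transfer-bootstrap style support value, assuming the usual definition: the transfer distance between two bipartitions is the minimum number of taxa that must be moved to make them identical, and the support of a reference branch is one minus the average minimum distance over the bootstrap trees, normalised by p - 1, where p is the size of the smaller side of the branch. The representation (each tree as a list of bipartitions, each bipartition as the frozenset of taxa on one side) and the names `transfer_distance` and `tbe_support` are hypothetical. This brute-force version is for illustration only and is neither the booster algorithm nor the optimised implementation the abstract describes.

```python
def transfer_distance(b, other, taxa):
    """Minimum number of taxa to move so that bipartition `b` equals `other`.

    Each bipartition is given as the frozenset of taxa on one of its sides.
    """
    diff = len(b ^ other)                # taxa placed on different sides
    return min(diff, len(taxa) - diff)   # also allow swapping the side labels


def tbe_support(b, bootstrap_trees, taxa):
    """Naive transfer-bootstrap support for reference bipartition `b`.

    `bootstrap_trees` is a list of trees, each a list of bipartitions.
    Assumes the smaller side of `b` has at least two taxa.
    """
    p = min(len(b), len(taxa) - len(b))
    total = 0
    for tree in bootstrap_trees:
        best = min(transfer_distance(b, other, taxa) for other in tree)
        total += min(best, p - 1)        # the distance is bounded by p - 1
    return 1 - (total / len(bootstrap_trees)) / (p - 1)


taxa = frozenset("ABCDEF")
branch = frozenset("ABC")
trees = [
    [frozenset("ABC"), frozenset("AB")],   # contains the branch exactly
    [frozenset("ABD"), frozenset("AB")],   # closest bipartition is one move away
]
print(tbe_support(branch, trees, taxa))    # 0.75
```

Each support value here costs time proportional to the number of bootstrap bipartitions times the number of taxa, which is the kind of cost that the optimisations in the abstract aim to reduce.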

LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve its performance. Availability and implementation LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information Supplementary data are available at Bioinformatics Advances.
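
The two core ideas in the abstract, counting barcodes shared between genomic regions and indexing records by barcode, can be sketched in a few lines. The example below is written in Python for consistency with the other examples on this page rather than in C++, and it works on in-memory (chromosome, position, barcode) tuples instead of BAM or FASTQ files. The function names and the data are hypothetical; none of this is the LRez API.

```python
from collections import defaultdict


def barcodes_in_region(alignments, chrom, start, end):
    """Collect the barcodes of alignments that start inside [start, end)."""
    return {bc for c, pos, bc in alignments if c == chrom and start <= pos < end}


def common_barcodes(alignments, region_a, region_b):
    """Return the barcodes observed in both regions (chrom, start, end)."""
    return (barcodes_in_region(alignments, *region_a)
            & barcodes_in_region(alignments, *region_b))


def index_by_barcode(alignments):
    """Map each barcode to its alignments so that lookups avoid a full scan."""
    index = defaultdict(list)
    for record in alignments:
        index[record[2]].append(record)
    return index


# Made-up records: (chromosome, position, barcode)
alignments = [
    ("chr1", 100, "AAAC"),
    ("chr1", 150, "AAGT"),
    ("chr1", 5000, "AAAC"),
    ("chr1", 5100, "CCCC"),
    ("chr2", 100, "AAGT"),
]
print(common_barcodes(alignments, ("chr1", 0, 1000), ("chr1", 4000, 6000)))  # {'AAAC'}
print(len(index_by_barcode(alignments)["AAAC"]))                             # 2
```

A shared barcode between two distant regions suggests that reads from both came from the same long molecule, which is the signal used in scaffolding and structural variant calling.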

Kmer-db: instant evolutionary distance estimation

10.1101/263590 ◽

2018 ◽

Author(s):

Sebastian Deorowicz ◽

Adam Gudys ◽

Maciej Dlugosz ◽

Marek Kokot ◽

Agnieszka Danek

Keyword(s):

Data Structure ◽

Web Site ◽

Parallel Implementation ◽

Evolutionary Relationship ◽

Distance Estimation ◽

Evolutionary Distance ◽

Supplementary Information ◽

Supplementary Data ◽

Efficient Data ◽

Evolutionary Distance Estimation

Abstract Summary Kmer-db is a new tool for estimating evolutionary relationships on the basis of k-mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and parallel implementation, our software estimates distances between 40,715 pathogens in less than 4 minutes (on a modern workstation), 44 times faster than Mash, its main competitor. Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at publisher’s Web site.
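
The general idea of estimating distance from shared k-mers can be shown with a short sketch: extract the k-mer sets of two sequences, compute their Jaccard index, and convert it to a distance. The conversion used below is the Mash distance formula, chosen here as an assumption because the abstract names Mash as the comparison point; the abstract does not state which measure or data structure Kmer-db itself uses, and the function names are hypothetical.

```python
import math


def kmers(sequence, k):
    """Return the set of all length-k substrings of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two k-mer sets: shared k-mers over all k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j, k):
    """Convert a Jaccard index into the Mash distance estimate.

    Assumption for illustration: D = -(1/k) * ln(2j / (1 + j)), with D = 1
    when no k-mers are shared.
    """
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


# Toy sequences and a tiny k, only to keep the example readable.
x = kmers("ACGTAC", 3)
y = kmers("ACGTTT", 3)
j = jaccard(x, y)
print(len(x & y), len(x | y))      # 2 6
print(round(mash_distance(j, 3), 4))
```

Comparing every pair of k-mer sets directly scales quadratically with the number of genomes, which is why tools in this area rely on specialised data structures and parallelism.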

ensembldb: an R package to create and use Ensembl-based annotation resources

Bioinformatics ◽

10.1093/bioinformatics/btz031 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3151-3153 ◽

Cited By ~ 13

Author(s):

Johannes Rainer ◽

Laurent Gatto ◽

Christian X Weichenberger

Keyword(s):

Coordinate System ◽

Positional Information ◽

R Package ◽

Research Data ◽

Supplementary Information ◽

Reproducible Research ◽

Bioconductor Package ◽

Supplementary Data ◽

Reference Coordinate System ◽

Associated Data

Abstract Summary Bioinformatics research frequently involves handling gene-centric data such as exons, transcripts, proteins and their positions relative to a reference coordinate system. The ensembldb Bioconductor package retrieves and stores Ensembl-based genetic annotations and positional information, and furthermore offers identifier conversion and coordinate mappings for gene-associated data. In support of reproducible research, data are tied to Ensembl releases and are kept separate from the software. Premade data packages are available for a variety of genomes and Ensembl releases. Three examples demonstrate typical use cases of this software. Availability and implementation ensembldb is part of Bioconductor (https://bioconductor.org/packages/ensembldb). Supplementary information Supplementary data are available at Bioinformatics online.

Triplet-based similarity score for fully multilabeled trees with poly-occurring labels

Bioinformatics ◽

10.1093/bioinformatics/btaa676 ◽

2020 ◽

Author(s):

Simone Ciccolella ◽

Giulia Bernardini ◽

Luca Denti ◽

Paola Bonizzoni ◽

Marco Previtali ◽

...

Keyword(s):

Open Source ◽

Evolutionary History ◽

Similarity Measures ◽

Real Data ◽

Similarity Score ◽

Supplementary Information ◽

Supplementary Data ◽

Wide Range ◽

Golden Standard ◽

History Of

Abstract Motivation The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. Results To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to correctly classify similar and dissimilar trees, both on simulated and on real data. Availability and implementation An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. Supplementary information Supplementary data are available at Bioinformatics online.

TRTools: a toolkit for genome-wide analysis of tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa736 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nima Mousavi ◽

Jonathan Margoliash ◽

Neha Pusarla ◽

Shubham Saini ◽

Richard Yanicky ◽

...

Keyword(s):

Quality Control ◽

Tandem Repeats ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Genome Wide Analysis ◽

Genome Wide ◽

Wide Range ◽

Downstream Analysis

Abstract Summary A rich set of tools has recently been developed for performing genome-wide genotyping of tandem repeats (TRs). However, standardized tools for downstream analysis of these results are lacking. To facilitate TR analysis applications, we present TRTools, a Python library and suite of command line tools for filtering, merging and quality control of TR genotype files. TRTools utilizes an internal harmonization module, making it compatible with outputs from a wide range of TR genotypers. Availability and implementation TRTools is freely available at https://github.com/gymreklab/TRTools. Detailed documentation is available at https://trtools.readthedocs.io. Supplementary information Supplementary data are available at Bioinformatics online.

PSiTE: a Phylogeny guided Simulator for Tumor Evolution

Bioinformatics ◽

10.1093/bioinformatics/btz028 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3148-3150 ◽

Cited By ~ 2

Author(s):

Hechuan Yang ◽

Bingxin Lu ◽

Lan Huong Lai ◽

Abner Herbert Lim ◽

Jacob Josiah Santiago Alvarez ◽

...

Keyword(s):

Cancer Genomics ◽

Clonal Evolution ◽

Cell Tumor ◽

Supplementary Information ◽

Tumor Evolution ◽

Supplementary Data ◽

Efficient Tool ◽

Wide Range ◽

Different Types ◽

Evolutionary Trajectories

Abstract Summary Simulating realistic clonal dynamics of tumors is an important topic in cancer genomics. Here, we present Phylogeny guided Simulator for Tumor Evolution (PSiTE), a tool that can simulate different types of tumor samples, including single-sector and multi-sector bulk tumor data as well as single-cell tumor data, under a wide range of evolutionary trajectories. PSiTE provides an efficient tool for understanding clonal evolution of cancer. Availability and implementation PSiTE is implemented in Python and is available at https://github.com/hchyang/PSiTE. Supplementary information Supplementary data are available at Bioinformatics online.
