PEWO: a collection of workflows to benchmark phylogenetic placement

Bioinformatics ◽

10.1093/bioinformatics/btaa657 ◽

2020 ◽

Cited By ~ 1

Author(s):

Benjamin Linard ◽

Nikolai Romashchenko ◽

Fabio Pardi ◽

Eric Rivals

Keyword(s):

Parameter Optimization ◽

Genomic Data ◽

Supplementary Information ◽

Taxonomic Identification ◽

Supplementary Data ◽

Phylogenetic Placement ◽

Future Developments ◽

Community Effort ◽

Standard Support ◽

Selection Of

Abstract Motivation Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is more adapted to particular genomic data or a particular reference taxonomy. We developed Placement Evaluation WOrkflows (PEWO), the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimization for a particular tool, to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is that PEWO will become a community effort and a standard support for future developments and applications of PP. Availability and implementation https://github.com/phylo42/PEWO. Supplementary information Supplementary data are available at Bioinformatics online.


PINSPlus: a tool for tumor subtype discovery in integrated genomic data

Bioinformatics ◽

10.1093/bioinformatics/bty1049 ◽

2018 ◽

Vol 35 (16) ◽

pp. 2843-2846 ◽

Cited By ~ 15

Author(s):

Hung Nguyen ◽

Sangam Shrestha ◽

Sorin Draghici ◽

Tin Nguyen

Keyword(s):

Personal Computer ◽

Genomic Data ◽

Supplementary Information ◽

Omics Data ◽

Tumor Subtype ◽

Supplementary Data ◽

Significant Survival ◽

Survival Differences

Abstract Summary Since cancer is a heterogeneous disease, tumor subtyping is crucial for improved treatment and prognosis. We have developed a subtype discovery tool, called PINSPlus, that is: (i) robust against noise and unstable quantitative assays, (ii) able to integrate multiple types of omics data in a single analysis and (iii) dramatically superior to established approaches in identifying known subtypes and novel subgroups with significant survival differences. Our validation on 12,158 samples from 44 datasets shows that PINSPlus vastly outperforms other approaches. The software is easy-to-use and can partition hundreds of patients in a few minutes on a personal computer. Availability and implementation The package is available at https://cran.r-project.org/package=PINSPlus. Data and R script used in this manuscript are available at https://bioinformatics.cse.unr.edu/software/PINSPlus/. Supplementary information Supplementary data are available at Bioinformatics online.


Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data

Bioinformatics ◽

10.1093/bioinformatics/btaa070 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3263-3265 ◽

Cited By ~ 14

Author(s):

Lucas Czech ◽

Pierre Barbera ◽

Alexandros Stamatakis

Keyword(s):

Phylogenetic Trees ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Computationally Efficient ◽

Data Types ◽

Low Level ◽

Phylogenetic Placement ◽

Command Line Tool ◽

High Level

Abstract Summary We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. Availability and implementation Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. Supplementary information Supplementary data are available at Bioinformatics online.


tugHall: a simulator of cancer-cell evolution based on the hallmarks of cancer and tumor-related genes

Bioinformatics ◽

10.1093/bioinformatics/btaa182 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3597-3599 ◽

Cited By ~ 1

Author(s):

Iurii S Nagornov ◽

Mamoru Kato

Keyword(s):

Cancer Cell ◽

Tumor Heterogeneity ◽

Clonal Evolution ◽

Source Code ◽

Genomic Data ◽

Supplementary Information ◽

Cell Behavior ◽

Supplementary Data ◽

Hallmarks Of Cancer ◽

Cell Evolution

Abstract Summary The flood of recent cancer genomic data requires a coherent model that can sort out the findings to systematically explain clonal evolution and the resultant intra-tumor heterogeneity (ITH). Here, we present a new mathematical model designed to computationally simulate the evolution of cancer cells. The model connects the well-known hallmarks of cancer with the specific mutational states of tumor-related genes. The cell behavior phenotypes are stochastically determined, and the hallmarks probabilistically interfere with the phenotypic probabilities. In turn, the hallmark variables depend on the mutational states of tumor-related genes. Thus, our software can deepen our understanding of cancer-cell evolution and generation of ITH. Availability and implementation The open-source code is available in the repository https://github.com/nagornovys/Cancer_cell_evolution. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


pyGenomeTracks: reproducible plots for multivariate genomic data sets

Bioinformatics ◽

10.1093/bioinformatics/btaa692 ◽

2020 ◽

Cited By ~ 7

Author(s):

Lucille Lopez-Delisle ◽

Leily Rabbani ◽

Joachim Wolff ◽

Vivek Bhardwaj ◽

Rolf Backofen ◽

...

Keyword(s):

Genomic Data ◽

Supplementary Information ◽

Data Sets ◽

Command Line ◽

Graphical Interface ◽

Supplementary Data ◽

Considerable Effort ◽

Vector Graphic ◽

Graphic Software

Abstract Motivation Generating publication ready plots to display multiple genomic tracks can pose a serious challenge. Making desirable and accurate figures requires considerable effort. This is usually done by hand or by using a vector graphic software. Results pyGenomeTracks (PGT) is a modular plotting tool that easily combines multiple tracks. It enables a reproducible and standardized generation of highly customizable and publication ready images. Availability PGT is available through a graphical interface on https://usegalaxy.eu and through the command line. It is provided on conda via the bioconda channel, on pip and it is openly developed on github: https://github.com/deeptools/pyGenomeTracks. Supplementary information Supplementary data are available at Bioinformatics online.


dbgap2x: An R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

Bioinformatics ◽

10.1093/bioinformatics/btz680 ◽

2019 ◽

Cited By ~ 1

Author(s):

Grégoire Versmée ◽

Laura Versmée ◽

Mikaël Dusenne ◽

Niloofar Jalali ◽

Paul Avillach

Keyword(s):

Data Sharing ◽

Large Scale ◽

Genomic Data ◽

R Package ◽

National Institutes Of Health ◽

Supplementary Information ◽

Supplementary Data ◽

Complex Procedure ◽

Range Of Functions ◽

The Relationship

Abstract Summary Based on the Genomic Data Sharing Policy issued in August 2007, the National Institutes of Health (NIH) has supported several repositories such as the database of Genotypes and Phenotypes (dbGaP). dbGaP is an online repository that provides access to large-scale genetic and phenotypic datasets with more than 1,000 studies. However, navigating the website and understanding the relationship between the studies are not easy tasks. Moreover, the decryption of the files is a complex procedure. In this study we propose the dbgap2x R package that covers a broad range of functions for searching dbGaP studies, exploring the characteristics of a study and easily decrypting the files from dbGaP. Availability and implementation dbgap2x is an R package with the code available at https://github.com/gversmee/dbgap2x. A containerized version including the package, a Jupyter server and with a Notebook example is available at https://hub.docker.com/r/gversmee/dbgap2x. Supplementary information Supplementary data are available at Bioinformatics online.


FQSqueezer: k-mer-based compression of sequencing data

10.1101/559807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz

Keyword(s):

Data Compression ◽

State Of The Art ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Partial Matching ◽

Supplementary Material ◽

Better Than

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than offered by the state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at publisher’s Web site.


Rapid screening and detection of inter-type viral recombinants using phylo-k-mers

Bioinformatics ◽

10.1093/bioinformatics/btaa1020 ◽

2020 ◽

Author(s):

Guillaume E Scholz ◽

Benjamin Linard ◽

Nikolai Romashchenko ◽

Eric Rivals ◽

Fabio Pardi

Keyword(s):

Source Code ◽

Rapid Screening ◽

Supplementary Information ◽

Evolutionary Significance ◽

Supplementary Data ◽

Large Database ◽

Recombinant Viruses ◽

Phylogenetic Placement ◽

Whole Genomes ◽

Complex Models

Abstract Motivation Novel recombinant viruses may have important medical and evolutionary significance, as they sometimes display new traits not present in the parental strains. This is particularly concerning when the new viruses combine fragments coming from phylogenetically distinct viral types. Here, we consider the task of screening large collections of sequences for such novel recombinants. A number of methods already exist for this task. However, these methods rely on complex models and heavy computations that are not always practical for a quick scan of a large number of sequences. Results We have developed SHERPAS, a new program to detect novel recombinants and provide a first estimate of their parental composition. Our approach is based on the precomputation of a large database of ‘phylogenetically-informed k-mers’, an idea recently introduced in the context of phylogenetic placement in metagenomics. Our experiments show that SHERPAS is hundreds to thousands of times faster than existing software, and enables the analysis of thousands of whole genomes, or long-sequencing reads, within minutes or seconds, and with limited loss of accuracy. Availability and implementation The source code is freely available for download at https://github.com/phylo42/sherpas. Supplementary information Supplementary data are available at Bioinformatics online.


MsPAC: a tool for haplotype-phased structural variant detection

Bioinformatics ◽

10.1093/bioinformatics/btz618 ◽

2019 ◽

Vol 36 (3) ◽

pp. 922-924 ◽

Cited By ~ 3

Author(s):

Oscar L Rodriguez ◽

Anna Ritz ◽

Andrew J Sharp ◽

Ali Bashir

Keyword(s):

Genomic Data ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Structural Variant ◽

Long Read ◽

One Step ◽

Variant Detection ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Abstract Summary While next-generation sequencing (NGS) has dramatically increased the availability of genomic data, phased genome assembly and structural variant (SV) analyses are limited by NGS read lengths. Long-read sequencing from Pacific Biosciences and NGS barcoding from 10x Genomics hold the potential for far more comprehensive views of individual genomes. Here, we present MsPAC, a tool that combines both technologies to partition reads, assemble haplotypes (via existing software) and convert assemblies into high-quality, phased SV predictions. MsPAC represents a framework for haplotype-resolved SV calls that moves one step closer to fully resolved, diploid genomes. Availability and implementation https://github.com/oscarlr/MsPAC. Supplementary information Supplementary data are available at Bioinformatics online.


hts-nim: scripting high-performance genomic analyses

10.1101/261735 ◽

2018 ◽

Author(s):

Brent S. Pedersen ◽

Aaron R. Quinlan

Keyword(s):

High Performance ◽

Genomic Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scripting Languages ◽

Link Type ◽

Custom Software ◽

Genomic Analyses ◽

Biological Insight ◽

Supplementary Material

Abstract Motivation Extracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute with the scale of modern genomics datasets. Results We present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance. Availability hts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools both under the MIT [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


GABAC: an arithmetic coding solution for genomic data

Bioinformatics ◽

10.1093/bioinformatics/btz922 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2275-2277 ◽

Cited By ~ 1

Author(s):

Jan Voges ◽

Tom Paridaens ◽

Fabian Müntefering ◽

Liudmila S Mainzer ◽

Brian Bliss ◽

...

Keyword(s):

Genomic Data ◽

International Organization ◽

Arithmetic Coding ◽

Supplementary Information ◽

Genomic Sequencing ◽

Command Line ◽

Supplementary Data ◽

Sequencing Data ◽

Binary Arithmetic ◽

Straightforward Solution

Abstract Motivation In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. Results We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. Availability and implementation The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. Supplementary information Supplementary data are available at Bioinformatics online.
