RiboFlow, RiboR and RiboPy: an ecosystem for analyzing ribosome profiling data at read length resolution

Abstract Summary Ribosome occupancy measurements enable protein abundance estimation and infer mechanisms of translation. Recent studies have revealed that sequence read lengths in ribosome profiling data are highly variable and carry critical information. Consequently, data analyses require the computation and storage of multiple metrics for a wide range of ribosome footprint lengths. We developed a software ecosystem including a new efficient binary file format named ‘ribo’. Ribo files store all essential data grouped by ribosome footprint lengths. Users can assemble ribo files using our RiboFlow pipeline that processes raw ribosomal profiling sequencing data. RiboFlow is highly portable and customizable across a large number of computational environments with built-in capabilities for parallelization. We also developed interfaces for writing and reading ribo files in the R (RiboR) and Python (RiboPy) environments. Using RiboR and RiboPy, users can efficiently access ribosome profiling quality control metrics, generate essential plots and carry out analyses. Altogether, these components create a software ecosystem for researchers to study translation through ribosome profiling. Availability and implementation For a quickstart, please see https://ribosomeprofiling.github.io. Source code, installation instructions and links to documentation are available on GitHub: https://github.com/ribosomeprofiling. Supplementary information Supplementary data are available at Bioinformatics online.

RiboFlow, RiboR and RiboPy: An ecosystem for analyzing ribosome profiling data at read length resolution

10.1101/855445 ◽ 2019

Author(s): Hakan Ozadam ◽ Michael Geng ◽ Can Cenik

Keyword(s): Ribosome Profiling ◽ Read Length ◽ Sequencing Data ◽ Software Ecosystem ◽ Binary File ◽ Link Type ◽ Wide Range ◽ Quality Control Metrics ◽ Multiple Metrics ◽ And Storage

Abstract Summary Ribosome occupancy measurements enable protein abundance estimation and infer mechanisms of translation. Recent studies have revealed that sequence read lengths in ribosome profiling data are highly variable and carry critical information. Consequently, data analyses require the computation and storage of multiple metrics for a wide range of ribosome footprint lengths. We developed a software ecosystem including a new efficient binary file format named ‘ribo’. Ribo files store all essential data grouped by ribosome footprint lengths. Users can assemble ribo files using our RiboFlow pipeline that processes raw ribosomal profiling sequencing data. RiboFlow is highly portable and customizable across a large number of computational environments with built-in capabilities for parallelization. We also developed interfaces for writing and reading ribo files in the R (RiboR) and Python (RiboPy) environments. Using RiboR and RiboPy, users can efficiently access ribosome profiling quality control metrics, generate essential plots, and carry out analyses. Altogether, these components create a complete software ecosystem for researchers to study translation through ribosome profiling. Availability and implementation For a quickstart, please see https://ribosomeprofiling.github.io. Source code, installation instructions and links to documentation are available on GitHub: https://github.com/ribosomeprofiling

SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽ 2019 ◽ Cited By ~ 1

Author(s): Alemu Takele Assefa ◽ Jo Vandesompele ◽ Olivier Thas

Keyword(s): Gene Expression ◽ Single Cell ◽ RNA Sequencing ◽ Empirical Distribution ◽ Supplementary Information ◽ RNA Seq ◽ Sequencing Data ◽ Actual Distribution ◽ Wide Range ◽ Single Cell RNA Sequencing

Summary SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information Supplementary data are available at bioRχiv online.

sangeranalyseR: simple and interactive analysis of Sanger sequencing data in R

10.1101/2020.05.18.102459 ◽ 2020

Author(s): Kuan-Hao Chao ◽ Kirston Barton ◽ Sarah Palmer ◽ Robert Lanfear

Keyword(s): Sanger Sequencing ◽ Reference Sequence ◽ Supplementary Information ◽ File Format ◽ Bioconductor Package ◽ Sequencing Data ◽ Interactive Analysis ◽ Link Type ◽ Online Documentation ◽ Wide Range

Abstract Summary sangeranalyseR is an interactive R/Bioconductor package and two associated Shiny applications designed for analysing Sanger sequencing data from the ABIF file format in R. It allows users to go from loading reads to saving aligned contigs in a few lines of R code. sangeranalyseR provides a wide range of options for a number of commonly performed actions, including read trimming, detecting secondary peaks, viewing chromatograms, and detecting indels using a reference sequence. All parameters can be adjusted interactively either in R or in the associated Shiny applications. sangeranalyseR comes with extensive online documentation and outputs detailed interactive HTML reports. Availability and implementation sangeranalyseR is implemented in R and released under an MIT license. It is available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR) and on GitHub (https://github.com/roblanf/sangeranalyseR). Supplementary information Documentation at https://sangeranalyser.readthedocs.io/.

An ultra-sensitive T-cell receptor detection method for TCR-Seq and RNA-Seq data

Bioinformatics ◽ 10.1093/bioinformatics/btaa432 ◽ 2020 ◽ Vol 36 (15) ◽ pp. 4255-4262

Author(s): Si-Yi Chen ◽ Chun-Jie Liu ◽ Qiong Zhang ◽ An-Yuan Guo

Keyword(s): T Cell ◽ Single Cell ◽ High Performance ◽ Cell Receptor ◽ Computational Method ◽ Read Length ◽ Supplementary Information ◽ De Bruijn Graph ◽ RNA Seq ◽ Sequencing Data

Abstract Motivation T-cell receptors (TCRs) recognize antigens and play vital roles in T-cell immunology. Surveying TCR repertoires by characterizing complementarity-determining region 3 (CDR3) is a key issue. Due to the high diversity of CDR3 and technological limitations, accurate characterization of CDR3 repertoires remains a great challenge. Results We propose a computational method named CATT for ultra-sensitive and precise detection of TCR CDR3 sequences. CATT can be applied to TCR sequencing, RNA-Seq and single-cell TCR(RNA)-Seq data to characterize CDR3 repertoires. CATT integrates a de Bruijn graph-based micro-assembly algorithm, a data-driven error correction model and a Bayesian inference algorithm to characterize CDR3 repertoires self-adaptively and ultra-sensitively with high performance. Benchmark results on in silico and experimental datasets demonstrated that CATT showed superior recall and precision compared with existing tools, especially for data with short read length and small size and for single-cell sequencing data. Thus, CATT will be a useful tool for TCR analysis in cancer and immunology research. Availability and implementation http://bioinfo.life.hust.edu.cn/CATT or https://github.com/GuoBioinfoLab/CATT. Supplementary information Supplementary data are available at Bioinformatics online.

Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data

Bioinformatics ◽ 10.1093/bioinformatics/btaa171 ◽ 2020 ◽ Vol 36 (11) ◽ pp. 3561-3562 ◽ Cited By ~ 8

Author(s): Kun Sun

Keyword(s): Data Preprocessing ◽ Poor Quality ◽ Read Length ◽ Supplementary Information ◽ Sequencing Data ◽ Efficient Tool ◽ Source Codes ◽ Next Generation Sequencing NGS ◽ NGS Data ◽ Generation Sequencing

Abstract Motivation Next-generation sequencing (NGS) data frequently suffer from poor-quality cycles and adapter contamination and therefore need to be preprocessed before downstream analyses. With the ever-growing throughput and read length of modern sequencers, the preprocessing step has become a bottleneck in data analysis because of the insufficient performance of current tools. Extra-fast and accurate adapter- and quality-trimming tools for sequencing data preprocessing are therefore still in urgent demand. Results Ktrim was developed in this work. Key features of Ktrim include built-in support for the adapters of common library preparation kits; support for user-supplied, customized adapter sequences; support for both paired-end and single-end data; and parallelization to accelerate the analysis. Ktrim was ∼2–18 times faster than current tools and also showed high accuracy when applied to the testing datasets. Ktrim could thus serve as a valuable and efficient tool for short-read NGS data preprocessing. Availability and implementation Source code and scripts to reproduce the results described in this article are freely available at https://github.com/hellosunking/Ktrim/, distributed under the GPL v3 license. Supplementary information Supplementary data are available at Bioinformatics online.

Canvas SPW: calling de novo copy number variants in pedigrees

10.1101/121939 ◽ 2017

Author(s): Sergii Ivakhno ◽ Eric Roller ◽ Camilla Colombo ◽ Philip Tedder ◽ Anthony J. Cox

Keyword(s): Copy Number ◽ De Novo ◽ Late Onset ◽ Genetic Diseases ◽ Copy Number Variants ◽ Variant Calling ◽ Supplementary Information ◽ Sequencing Data ◽ Pedigree Structure ◽ Wide Range

Abstract Motivation Whole genome sequencing is becoming a diagnostic of choice for the identification of rare inherited and de novo copy number variants in families with various pediatric and late-onset genetic diseases. However, joint variant calling in pedigrees is hampered by the complexity of consensus breakpoint alignment across samples within an arbitrary pedigree structure. Results We have developed a new tool, Canvas SPW, for the identification of inherited and de novo copy number variants from pedigree sequencing data. Canvas SPW supports a number of family structures and provides a wide range of scoring and filtering options to automate and streamline identification of de novo variants. Availability Canvas SPW is available for download from https://github.com/Illumina/ Supplementary information Supplementary data are available at Bioinformatics online.

Variant calling tool evaluation for variable size indel calling from next generation whole genome and targeted sequencing data

10.1101/2021.07.15.452444 ◽ 2021

Author(s): Ning Wang ◽ Vladislav Lysenkov ◽ Katri Orte ◽ Veli Kairisto ◽ Juhani Aakko ◽ ...

Keyword(s): Variant Calling ◽ Read Length ◽ Next Generation ◽ Sequencing Data ◽ Single Nucleotide Variants ◽ High Coverage ◽ Wide Range ◽ Long Read ◽ Indel Calling ◽ Next Generation Sequencing NGS

Insertions and deletions (indels) in human genomes are associated with a wide range of phenotypes, including various clinical disorders. High-throughput, next generation sequencing (NGS) technologies enable detection of short genetic variants, such as single nucleotide variants (SNVs) and indels. However, the variant calling accuracy for indels remains considerably lower than for SNVs. Here we present a comparative study of the performance of variant calling tools on indel calling, evaluated with a wide repertoire of NGS datasets. While there is no single optimal tool to suit all circumstances, our results demonstrate that the choice of variant calling tool greatly impacts the precision and recall of indel calling. Furthermore, to reliably detect indels, it is essential to choose NGS technologies that offer a long read length and high coverage, coupled with specific variant calling tools.

SPsimSeq: semi-parametric simulation of bulk and single-cell RNA-sequencing data

Bioinformatics ◽ 10.1093/bioinformatics/btaa105 ◽ 2020 ◽ Vol 36 (10) ◽ pp. 3276-3278 ◽ Cited By ~ 2

Author(s): Alemu Takele Assefa ◽ Jo Vandesompele ◽ Olivier Thas

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Real Data ◽ Simulation Method ◽ R Package ◽ Supplementary Information ◽ Expression Data ◽ Sequencing Data ◽ Wide Range ◽ Single Cell RNA Sequencing

Abstract Summary SPsimSeq is a semi-parametric simulation method to generate bulk and single-cell RNA-sequencing data. It is designed to simulate gene expression data with maximal retention of the characteristics of real data. It is flexible enough to accommodate a wide range of experimental scenarios, including different sample sizes, biological signals (differential expression) and confounding batch effects. Availability and implementation The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information Supplementary data are available at Bioinformatics online.

BHap: a novel approach for bacterial haplotype reconstruction

Bioinformatics ◽ 10.1093/bioinformatics/btz280 ◽ 2019 ◽ Vol 35 (22) ◽ pp. 4624-4631 ◽ Cited By ~ 1

Author(s): Xin Li ◽ Samaneh Saadat ◽ Haiyan Hu ◽ Xiaoman Li

Keyword(s): Supplementary Information ◽ Next Generation Sequencing Data ◽ Accurate Estimation ◽ Haplotype Reconstruction ◽ Sequencing Data ◽ Bacterial Populations ◽ Sequencing Errors ◽ Novel Approach ◽ Wide Range ◽ Low Coverage

Abstract Motivation Bacterial haplotype reconstruction is critical for selecting proper treatments for diseases caused by unknown haplotypes. Existing methods and tools do not work well on this task, because they are usually developed for viral rather than bacterial populations. Results In this study, we developed BHap, a novel algorithm based on fuzzy flow networks, for reconstructing bacterial haplotypes from next generation sequencing data. Testing on simulated and experimental datasets, we showed that BHap was capable of reconstructing haplotypes of bacterial populations with an average F1 score of 0.87, an average precision of 0.87 and an average recall of 0.88. We also demonstrated that BHap had a low susceptibility to sequencing errors, was capable of reconstructing haplotypes with low coverage and could handle a wide range of mutation rates. BHap outperformed existing approaches, with higher F1 scores, better precision, better recall and more accurate estimation of the number of haplotypes. Availability and implementation The BHap tool is available at http://www.cs.ucf.edu/~xiaoman/BHap/. Supplementary information Supplementary data are available at Bioinformatics online.

SnapperDB: A database solution for routine sequencing analysis of bacterial isolates

10.1101/189118 ◽ 2017 ◽ Cited By ~ 1

Author(s): Timothy Dallman ◽ Philip Ashton ◽ Ulf Schafer ◽ Aleksey Jironkin ◽ Anais Painset ◽ ...

Keyword(s): Supplementary Information ◽ Whole Genome Sequencing Data ◽ SNP Analysis ◽ Sequencing Analysis ◽ Sequencing Data ◽ Bacterial Populations ◽ Nucleotide Resolution ◽ Scalable Analysis ◽ The Relationship ◽ And Storage

Abstract Real-time surveillance of infectious disease using whole genome sequencing data poses challenges in both result generation and communication. SnapperDB represents a set of tools to store bacterial variant data and facilitate reproducible and scalable analysis of bacterial populations. We also introduce the ‘SNP address’ nomenclature to describe the relationship between isolates in a population at single nucleotide resolution. Summary We announce the release of SnapperDB v1.0, a program for scalable routine SNP analysis and storage of microbial populations. Availability SnapperDB is implemented as a Python application under the open source BSD license. All code and user guides are available at https://github.com/phe-bioinformatics/ Supplementary information Supplementary data are available at Bioinformatics online.
