Shark: fishing relevant reads in an RNA-Seq sample

Abstract Motivation Recent advances in high-throughput RNA-Seq technologies make it possible to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. Results We introduce a novel computational problem, called gene assignment, and propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample the reads that were most probably sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark significantly improves the performance of RNA-Seq analysis tools without any impact on the final results. Availability and implementation The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
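To make the gene assignment problem concrete, the following is a minimal Python sketch of one possible alignment-free filter. The abstract does not say how Shark decides that a read belongs to a gene; the k-mer sharing criterion, the function names and the parameters `k` and `min_fraction` are all assumptions made for illustration and should not be read as Shark's actual method.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_reads(reads, genes, k=5, min_fraction=0.6):
    """Assign each read to the gene sharing the largest fraction of its k-mers.

    reads: dict mapping read id -> read sequence
    genes: dict mapping gene id -> gene sequence
    Reads whose best fraction is below min_fraction are discarded.
    (Hypothetical criterion, chosen only to illustrate the problem.)
    """
    gene_kmers = {gene_id: kmers(seq, k) for gene_id, seq in genes.items()}
    assignment = {}
    for read_id, seq in reads.items():
        read_kmers = kmers(seq, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to compare
        best_gene, best_score = None, 0.0
        for gene_id, gk in gene_kmers.items():
            score = len(read_kmers & gk) / len(read_kmers)
            if score > best_score:
                best_gene, best_score = gene_id, score
        if best_gene is not None and best_score >= min_fraction:
            assignment[read_id] = best_gene
    return assignment


genes = {"g1": "ACGTACGTTAGCCGAT", "g2": "TTTTGGGGCCCCAAAA"}
reads = {"r1": "ACGTTAGC", "r2": "GGGGCCCC", "r3": "ATATATATAT"}
kept = assign_reads(reads, genes)
```

In this toy input, `r1` and `r2` are substrings of `g1` and `g2` respectively, while `r3` shares no k-mer with either gene and is dropped. A read spanning a novel splice junction would share only part of its k-mers with the gene, which is one way to see why a threshold below 1 might be needed.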


Shark: fishing in a sample to discard useless RNA-Seq reads

10.1101/836130 ◽

2019 ◽

Author(s):

Paola Bonizzoni ◽

Tamara Ceccato ◽

Gianluca Della Vedova ◽

Luca Denti ◽

Yuri Pirola ◽

...

Keyword(s):

Alternative Splicing ◽

High Throughput ◽

Rna Seq ◽

Differential Splicing ◽

Massive Datasets ◽

Computational Problem ◽

Gene Assignment ◽

Alignment Free ◽

Input Dataset ◽

Alternative Splicing Events

Recent advances in high-throughput RNA-Seq technologies make it possible to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing such useless reads from the input dataset leads to improved efficiency without compromising the results of the study. To this aim, in this paper we introduce a novel computational problem, called gene assignment, and propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample the reads that were most probably sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark significantly improves the performance of RNA-Seq analysis tools without any impact on the final results. The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark.


DEsingle for detecting three types of differential expression in single-cell RNA-seq data

10.1101/173997 ◽

2017 ◽

Cited By ~ 1

Author(s):

Zhun Miao ◽

Ke Deng ◽

Xiaowo Wang ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Differential Expression ◽

Negative Binomial ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Binomial Model ◽

Supplementary Data ◽

Rna Seq ◽

Real Zeros

Abstract Summary The excessive number of zeros in single-cell RNA-seq data includes “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy. Availability and Implementation The R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration. Contact [email protected] Supplementary information Supplementary data are available at bioRxiv online.
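The zero-inflated negative binomial (ZINB) distribution named in the abstract can be written down in a few lines. The sketch below is plain Python and is not the DEsingle R package or its API; the parameter names `theta` (zero-inflation weight), `r` (NB size) and `p` (NB success probability) are one common parameterization chosen here as an assumption. It shows how the probability of observing a zero splits into a zero-inflation part and a negative binomial part; which part corresponds to "real" and which to "dropout" zeros is a modelling interpretation that the abstract does not spell out.

```python
import math


def nb_pmf(k, r, p):
    """Negative binomial pmf: Gamma(k+r) / (k! Gamma(r)) * p^r * (1-p)^k."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def zinb_pmf(k, theta, r, p):
    """ZINB pmf: a point mass at zero with weight theta, mixed with an NB."""
    nb_part = (1.0 - theta) * nb_pmf(k, r, p)
    return theta + nb_part if k == 0 else nb_part


def zero_components(theta, r, p):
    """Return (zero-inflation share, NB share) of the total P(X = 0)."""
    return theta, (1.0 - theta) * nb_pmf(0, r, p)


inflation_zeros, nb_zeros = zero_components(theta=0.3, r=2.0, p=0.5)
total_zero_probability = zinb_pmf(0, theta=0.3, r=2.0, p=0.5)
```

With `theta = 0.3`, `r = 2` and `p = 0.5`, the NB component alone gives P(0) = 0.25, so the two zero sources contribute 0.3 and 0.175 to a total zero probability of 0.475.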


LIONS: analysis suite for detecting and quantifying transposable element initiated transcription from RNA-seq

Bioinformatics ◽

10.1093/bioinformatics/btz130 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3839-3841 ◽

Cited By ~ 6

Author(s):

Artem Babaian ◽

I Richard Thompson ◽

Jake Lever ◽

Liane Gagnier ◽

Mohammad M Karimi ◽

...

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Test Data ◽

Source Code ◽

Supplementary Information ◽

Transcriptional Networks ◽

Supplementary Data ◽

Rna Seq ◽

Transcriptional Initiation ◽

Instruction Manual

Abstract Summary Transposable elements (TEs) influence the evolution of novel transcriptional networks, yet the specific and meaningful interpretation of how TE-derived transcriptional initiation contributes to the transcriptome has been marred by computational and methodological deficiencies. We developed LIONS for the analysis of RNA-seq data to specifically detect and quantify TE-initiated transcripts. Availability and implementation Source code, container, test data and instruction manual are freely available at www.github.com/ababaian/LIONS. Supplementary information Supplementary data are available at Bioinformatics online.


Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract Summary Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed by downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. Availability and implementation https://github.com/yuansliu/minirmd. Supplementary information Supplementary data are available at Bioinformatics online.
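As a rough illustration of minimizer-based clustering, the Python sketch below buckets reads by the lexicographically smallest k-mer found in the read or its reverse complement, then drops a read if it lies within a small Hamming distance of a read already kept in the same bucket, in either orientation. It performs a single round with one k; the abstract describes multiple rounds with different minimizer lengths, whose details are not given there. The function names and the `max_mismatches` parameter are hypothetical and this is not minirmd's implementation.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(seq, k):
    """Smallest k-mer (lexicographic) over seq and its reverse complement."""
    rc = reverse_complement(seq)
    return min(s[i:i + k] for s in (seq, rc) for i in range(len(s) - k + 1))


def hamming(a, b):
    """Number of mismatching positions; unequal lengths count as far apart."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def remove_near_duplicates(reads, k, max_mismatches=1):
    """Keep the first read of each near-duplicate group within a bucket."""
    buckets = {}  # minimizer -> reads already kept with that minimizer
    kept = []
    for read in reads:
        bucket = buckets.setdefault(minimizer(read, k), [])
        rc = reverse_complement(read)
        is_duplicate = any(
            hamming(read, other) <= max_mismatches
            or hamming(rc, other) <= max_mismatches
            for other in bucket
        )
        if not is_duplicate:
            bucket.append(read)
            kept.append(read)
    return kept


reads = ["AACCGGTTAC", "GTAACCGGTT", "AACCGGTTAG", "TTTTTTTTTT"]
unique_reads = remove_near_duplicates(reads, k=4)
```

Here the second read is the reverse complement of the first and the third differs from it by one base, so only the first and last reads survive. A mismatch that falls inside the minimizer itself would send two near-duplicates to different buckets, which is plausibly one reason to repeat the clustering with other minimizer lengths.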


IsoformSwitchAnalyzeR: analysis of changes in genome-wide patterns of alternative splicing and its functional consequences

Bioinformatics ◽

10.1093/bioinformatics/btz247 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4469-4471 ◽

Cited By ~ 21

Author(s):

Kristoffer Vitting-Seerup ◽

Albin Sandelin

Keyword(s):

Alternative Splicing ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Rna Seq ◽

Genome Wide ◽

Functional Consequences ◽

Cancer Genome Atlas ◽

Health And Disease ◽

Splicing Patterns

Abstract Summary Alternative splicing is an important mechanism involved in health and disease. Recent work highlights the importance of investigating genome-wide changes in splicing patterns and the subsequent functional consequences. Current computational methods only support such analysis on a gene-by-gene basis. Therefore, we extended the IsoformSwitchAnalyzeR R library to enable analysis of genome-wide changes in specific types of alternative splicing and predicted functional consequences of the resulting isoform switches. As a case study, we analyzed RNA-seq data from The Cancer Genome Atlas and found systematic changes in alternative splicing and the consequences of the associated isoform switches. Availability and implementation Windows, Linux and Mac OS: http://bioconductor.org/packages/IsoformSwitchAnalyzeR. Supplementary information Supplementary data are available at Bioinformatics online.


MAJIQ-SPEL: web-tool to interrogate classical and complex splicing variations from RNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btx565 ◽

2017 ◽

Vol 34 (2) ◽

pp. 300-302 ◽

Cited By ~ 2

Author(s):

Christopher J Green ◽

Matthew R Gazzara ◽

Yoseph Barash

Keyword(s):

Experimental Validation ◽

Ucsc Genome Browser ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Web Tool ◽

Rt Pcr ◽

Design Algorithm ◽

Gene Isoforms ◽

Downstream Analysis

Abstract Summary Analysis of RNA sequencing (RNA-Seq) data has highlighted the fact that most genes undergo alternative splicing (AS) and that these patterns are tightly regulated. Many of these events are complex, resulting in numerous possible isoforms that quickly become difficult to visualize, interpret and experimentally validate. To address these challenges we developed MAJIQ-SPEL, a web-tool that takes as input local splicing variations (LSVs) quantified from RNA-Seq data and provides users with visualization and quantification of the gene isoforms associated with them. Importantly, MAJIQ-SPEL is able to handle both classical (binary) and complex, non-binary, splicing variations. Using a matching primer design algorithm, it also suggests possible primers for experimental validation by RT-PCR and displays them, along with the matching protein domains affected by the LSV, on the UCSC Genome Browser for further downstream analysis. Availability and implementation Program and code will be available at http://majiq.biociphers.org/majiq-spel. Supplementary information Supplementary data are available at Bioinformatics online.


Per-sample standardization and asymmetric winsorization lead to accurate clustering of RNA-seq expression profiles

Bioinformatics ◽

10.1093/bioinformatics/btab091 ◽

2021 ◽

Author(s):

Davide Risso ◽

Stefano Maria Pagnotta

Keyword(s):

Single Cell ◽

Expression Profiles ◽

Unsupervised Clustering ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Data Transformations ◽

The Impact

Abstract Motivation Data transformations are an important step in the analysis of RNA-seq data. Nonetheless, the impact of transformation on the outcome of unsupervised clustering procedures is still unclear. Results Here, we present an Asymmetric Winsorization per Sample Transformation (AWST), which is robust to data perturbations and removes the need for selecting the most informative genes prior to sample clustering. Our procedure leads to robust and biologically meaningful clusters both in bulk and in single-cell applications. Availability The AWST method is available at https://github.com/drisso/awst. The code to reproduce the analyses is available at https://github.com/drisso/awst_analysis. Supplementary information Supplementary data are available at Bioinformatics online.


Issues of Z-factor and an approach to avoid them for quality control in high-throughput screening studies

Bioinformatics ◽

10.1093/bioinformatics/btaa1049 ◽

2020 ◽

Author(s):

Xiaohua Douglas Zhang ◽

Dandan Wang ◽

Shixue Sun ◽

Heping Zhang

Keyword(s):

Quality Control ◽

High Throughput ◽

High Throughput Screening ◽

Theoretical Basis ◽

Sampling Error ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Automation Technology ◽

General Public License

Abstract Motivation High-throughput screening (HTS) is a vital automation technology in biomedical research in both industry and academia. The well-known Z-factor has been widely used as a gatekeeper to assure assay quality in an HTS study. However, many researchers and users may not have realized that the Z-factor has major issues. Results In this article, the following four major issues are explored and demonstrated so that researchers may use the Z-factor appropriately. First, the Z-factor violates the Pythagorean theorem of statistics. Second, there is no adjustment of sampling error in the application of the Z-factor for quality control (QC) in HTS studies. Third, the expectation of the sample-based Z-factor does not exist. Fourth, the thresholds in the Z-factor-based criterion lack a theoretical basis. Here, an approach to avoid these issues is proposed and new QC criteria under homoscedasticity are constructed so that researchers can choose a statistically grounded criterion for QC in HTS studies. We implemented this approach in an R package and demonstrated its utility in multiple CRISPR/CAS9 or siRNA HTS studies. Availability and implementation The R package qcSSMDhomo is freely available from GitHub: https://github.com/Karena6688/qcSSMDhomo. The file qcSSMDhomo_1.0.0.tar.gz (for Windows) containing qcSSMDhomo is also available at Bioinformatics online. qcSSMDhomo is distributed under the GNU General Public License. Supplementary information Supplementary data are available at Bioinformatics online.
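For readers unfamiliar with the quantity under discussion, the sketch below computes the sample Z-factor in its usual form, 1 - 3(s_p + s_n) / |m_p - m_n|, from positive- and negative-control readouts. For contrast it also computes the strictly standardized mean difference (SSMD) in its basic two-group form, (m_p - m_n) / sqrt(s_p^2 + s_n^2). The abstract gives neither formula; both are the standard textbook definitions, and the code is plain Python rather than the qcSSMDhomo package, whose homoscedastic criteria are not reproduced here.

```python
import math
import statistics


def z_factor(positive, negative):
    """Sample Z-factor: 1 - 3 * (sd_p + sd_n) / |mean_p - mean_n|.

    Standard deviations are added directly, not combined via variances.
    """
    mean_p, mean_n = statistics.mean(positive), statistics.mean(negative)
    sd_p, sd_n = statistics.stdev(positive), statistics.stdev(negative)
    return 1.0 - 3.0 * (sd_p + sd_n) / abs(mean_p - mean_n)


def ssmd(positive, negative):
    """Basic SSMD for independent groups: mean difference over the
    square root of the summed sample variances."""
    mean_p, mean_n = statistics.mean(positive), statistics.mean(negative)
    var_p = statistics.variance(positive)
    var_n = statistics.variance(negative)
    return (mean_p - mean_n) / math.sqrt(var_p + var_n)


positive_controls = [10.0, 12.0, 14.0]
negative_controls = [1.0, 2.0, 3.0]
z = z_factor(positive_controls, negative_controls)
s = ssmd(positive_controls, negative_controls)
```

With these toy controls the means are 12 and 2 and the sample standard deviations are 2 and 1, giving a Z-factor of 0.1 and an SSMD of about 4.47. The Z-factor sums standard deviations where the variance of a difference would call for summing variances, which appears to be what the abstract's first issue refers to.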


SpliceLauncher: a tool for detection, annotation and relative quantification of alternative junctions from RNAseq data

Bioinformatics ◽

10.1093/bioinformatics/btz784 ◽

2019 ◽

Author(s):

Raphaël Leman ◽

Valentin Harter ◽

Alexandre Atkinson ◽

Grégoire Davy ◽

Antoine Rousselin ◽

...

Keyword(s):

Alternative Splicing ◽

High Resolution ◽

Alternative Splice ◽

Biological Process ◽

Supplementary Information ◽

Molecular Diagnostic ◽

Relative Quantification ◽

Supplementary Data ◽

Rnaseq Data ◽

Splice Junctions

Abstract Summary Alternative splicing is an important biological process widely analyzed in molecular diagnostic settings. Indeed, a variant can be pathogenic by splicing alteration and a suspected pathogenic variant (e.g. truncating variant) can be rescued by splicing. In this context, detecting and quantifying alternative splicing is challenging. We developed SpliceLauncher, a fast and easy-to-use open-source tool that aims at detecting, annotating and quantifying alternative splice junctions at high resolution. Availability and implementation SpliceLauncher is available at https://github.com/raphaelleman/SpliceLauncher. Supplementary information Supplementary data are available at Bioinformatics online.


GPress: a framework for querying general feature format (GFF) files and expression files in a compressed form

Bioinformatics ◽

10.1093/bioinformatics/btaa604 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4810-4812

Author(s):

Qingxi Meng ◽

Idoia Ochoa ◽

Mikel Hernaez

Keyword(s):

Single Cell ◽

Data Streams ◽

General Feature ◽

Supplementary Information ◽

Storage Space ◽

Supplementary Data ◽

Rna Seq ◽

Sequencing Data ◽

General Feature Format ◽

Original File

Abstract Motivation Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data, which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. Results We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average. Availability and implementation GPress is freely available at https://github.com/qm2/gpress. Supplementary information Supplementary data are available at Bioinformatics online.
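The idea of compressing in blocks and keeping index tables, so that a query only decompresses the blocks it needs, can be sketched as follows. This is an illustration under several assumptions: it uses Python's standard `bz2` module as a stand-in for BSC, the records are toy tab-separated lines whose first column is the identifier (not real GFF columns), and the function names and `block_size` parameter are hypothetical. It does not reproduce GPress's data transformations or file layout.

```python
import bz2


def compress_blocks(lines, block_size=2):
    """Compress records in fixed-size blocks and index identifiers by block.

    Returns (blocks, index): blocks is a list of compressed byte strings,
    index maps an identifier to the set of block numbers containing it.
    """
    blocks, index = [], {}
    for start in range(0, len(lines), block_size):
        chunk = lines[start:start + block_size]
        block_id = len(blocks)
        for line in chunk:
            identifier = line.split("\t")[0]
            index.setdefault(identifier, set()).add(block_id)
        blocks.append(bz2.compress("\n".join(chunk).encode("utf-8")))
    return blocks, index


def query(blocks, index, identifier):
    """Decompress only the blocks listed in the index for this identifier."""
    hits = []
    for block_id in sorted(index.get(identifier, ())):
        text = bz2.decompress(blocks[block_id]).decode("utf-8")
        hits.extend(line for line in text.split("\n")
                    if line.split("\t")[0] == identifier)
    return hits


records = [
    "g1\tgene\t1\t100",
    "g2\tgene\t200\t300",
    "g1\texon\t1\t50",
    "g3\tgene\t400\t500",
]
blocks, index = compress_blocks(records)
g1_records = query(blocks, index, "g1")
```

Smaller blocks mean less data to decompress per query but usually a worse compression ratio, so the block size is a trade-off between retrieval time and storage.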
