Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples

Mapping Intimacies ◽

10.1101/097881 ◽

2017 ◽

Cited By ~ 2

Author(s):

Christopher Wilks ◽

Phani Gaddipati ◽

Abhinav Nellore ◽

Ben Langmead

Keyword(s):

Tissue Specificity ◽

Rna Seq ◽

Sequencing Data ◽

Transcription Start ◽

Link Type ◽

Alternative Transcription ◽

Web App ◽

Inverted Indexing ◽

Splice Junctions ◽

Splicing Patterns

Abstract:

As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.
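The kind of query described above (restrict junctions to a genomic region, then keep those seen in enough samples) can be illustrated with a toy index. This is a minimal sketch and not Snaptron's implementation or API: the `Junction` and `JunctionIndex` names are hypothetical, and a sorted array searched with `bisect` stands in for the ordered (B-tree-like) index the abstract mentions.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class Junction:
    """Hypothetical junction record, invented for this illustration."""
    chrom: str
    start: int      # first intronic base
    end: int        # last intronic base
    samples: dict   # sample id -> number of reads spanning the junction


class JunctionIndex:
    """Toy per-chromosome index: junctions sorted by start coordinate."""

    def __init__(self, junctions):
        self._by_chrom = {}
        for j in junctions:
            self._by_chrom.setdefault(j.chrom, []).append(j)
        self._starts = {}
        for chrom, js in self._by_chrom.items():
            js.sort(key=lambda j: j.start)
            self._starts[chrom] = [j.start for j in js]

    def query(self, chrom, lo, hi, min_samples=1):
        """Return junctions contained in [lo, hi] that occur in at least
        min_samples samples."""
        js = self._by_chrom.get(chrom, [])
        starts = self._starts.get(chrom, [])
        # Binary search narrows the scan to junctions starting in [lo, hi].
        first = bisect_left(starts, lo)
        last = bisect_right(starts, hi)
        return [
            j for j in js[first:last]
            if j.end <= hi and len(j.samples) >= min_samples
        ]


junctions = [
    Junction("chr1", 100, 200, {"s1": 3, "s2": 1}),
    Junction("chr1", 150, 400, {"s1": 2}),
    Junction("chr1", 500, 600, {"s3": 5}),
    Junction("chr2", 100, 200, {"s1": 1}),
]
index = JunctionIndex(junctions)
hits = index.query("chr1", 90, 450, min_samples=2)
```

Here `hits` holds only the chr1 junction at 100-200, because the 150-400 junction is supported by a single sample. A production system at the scale the abstract describes would need disk-backed indexes rather than in-memory lists.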

GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data

10.1101/771063 ◽

2019 ◽

Cited By ~ 2

Author(s):

Bastian Seelbinder ◽

Thomas Wolf ◽

Steffen Priebe ◽

Sylvie McNamara ◽

Silvia Gerber ◽

...

Keyword(s):

Gene Expression ◽

Single Species ◽

Gene Expression Omnibus ◽

Rna Seq ◽

Sequencing Data ◽

Interacting Species ◽

Link Type ◽

Fastq Format ◽

Standard Tool ◽

Processing Steps

Abstract:

In transcriptomics, the study of the total set of RNAs transcribed by the cell, RNA sequencing (RNA-seq) has become the standard tool for analysing gene expression. The primary goal is the detection of genes whose expression changes significantly between two or more conditions, either for a single species or for two or more interacting species at the same time (dual RNA-seq, triple RNA-seq and so forth). The analysis of RNA-seq can be simplified as many steps of the data pre-processing can be standardised in a pipeline.

In this publication we present the “GEO2RNAseq” pipeline for complete, quick and concurrent pre-processing of single, dual, and triple RNA-seq data. It covers all pre-processing steps starting from raw sequencing data to the analysis of differentially expressed genes, including various tables and figures to report intermediate and final results. Raw data may be provided in FASTQ format or can be downloaded automatically from the Gene Expression Omnibus repository. GEO2RNAseq strongly incorporates experimental as well as computational metadata. GEO2RNAseq is implemented in R, lightweight, easy to install via Conda and easy to use, but still very flexible through using modular programming and offering many extensions and alternative workflows.

GEO2RNAseq is publicly available at https://anaconda.org/xentrics/r-geo2rnaseq and https://bitbucket.org/thomas_wolf/geo2rnaseq/overview, including source code, installation instructions, and comprehensive package documentation.

NASQAR: A web-based platform for high-throughput sequencing data analysis and visualization

10.1101/709980 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ayman Yousif ◽

Nizar Drou ◽

Jillian Rowe ◽

Mohammed Khalfan ◽

Kristin C Gunsalus

Keyword(s):

New York ◽

Data Analysis ◽

Open Source ◽

High Throughput ◽

High Throughput Sequencing ◽

Web Applications ◽

Rna Seq ◽

Sequencing Data ◽

Web Based ◽

Link Type

Abstract:

Background: As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of many researchers. To ease this computational barrier, we have created a dynamic web-based platform, NASQAR (Nucleic Acid SeQuence Analysis Resource).

Results: NASQAR offers a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization. The platform is publicly accessible at http://nasqar.abudhabi.nyu.edu/. Open-source code is on GitHub at https://github.com/nasqar/NASQAR, and the system is also available as a Docker image at https://hub.docker.com/r/aymanm/nasqarall. NASQAR is a collaboration between the core bioinformatics teams of the NYU Abu Dhabi and NYU New York Centers for Genomics and Systems Biology.

Conclusions: NASQAR empowers non-programming experts with a versatile and intuitive toolbox to easily and efficiently explore, analyze, and visualize their transcriptomics data interactively. Popular tools for a variety of applications are currently available, including Transcriptome Data Preprocessing, RNA-seq Analysis (including Single-cell RNA-seq), Metagenomics, and Gene Enrichment.

Thousands of exon skipping events differentiate among splicing patterns in sixteen human tissues

F1000Research ◽

10.12688/f1000research.2-188.v1 ◽

2013 ◽

Vol 2 ◽

pp. 188 ◽

Cited By ~ 131

Author(s):

Liliana Florea ◽

Li Song ◽

Steven L Salzberg

Keyword(s):

Alternative Splicing ◽

Gene Diversity ◽

Exon Skipping ◽

Human Tissues ◽

Rna Seq ◽

Gene Splicing ◽

Link Type ◽

Normal Human ◽

Splicing Patterns

Alternative splicing is widely recognized for its roles in regulating genes and creating gene diversity. However, despite many efforts, the repertoire of gene splicing variation is still incompletely characterized, even in humans. Here we describe a new computational system, ASprofile, and its application to RNA-seq data from Illumina’s Human Body Map project (>2.5 billion reads). Using the system, we identified putative alternative splicing events in 16 different human tissues, which provide a dynamic picture of splicing variation across the tissues. We detected 26,989 potential exon skipping events representing differences in splicing patterns among the tissues. A large proportion of the events (>60%) were novel, involving new exons (~3000), new introns (~16000), or both. When tracing these events across the sixteen tissues, only a small number (4-7%) appeared to be differentially expressed (‘switched’) between two tissues, while 30-45% showed little variation, and the remaining 50-65% were not present in one or both tissues compared. Novel exon skipping events appeared to be slightly less variable than known events, but were more tissue-specific. Our study represents the first effort to build a comprehensive catalog of alternative splicing in normal human tissues from RNA-seq data, while providing insights into the role of alternative splicing in shaping tissue transcriptome differences. The catalog of events and the ASprofile software are freely available from the Zenodo repository (http://zenodo.org/record/7068; doi:10.5281/zenodo.7068) and from our web site http://ccb.jhu.edu/software/ASprofile.
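To make the notion of an exon skipping event concrete, the sketch below flags a junction as a skipping candidate when two shorter junctions share its donor and its acceptor and leave room for an exon between them. This is an illustrative assumption about how such events can be recognized from junction coordinates alone, not ASprofile's algorithm; the function name and the coordinate convention (intron boundaries on a single chromosome and strand) are hypothetical.

```python
def exon_skipping_candidates(junctions):
    """Find candidate exon skipping events among splice junctions.

    junctions: iterable of (donor, acceptor) intron boundary pairs on one
    chromosome and strand, with donor < acceptor.

    Returns a sorted list of (skipping_junction, skipped_exon) tuples, where
    skipped_exon is (a1, d2): the acceptor of the upstream intron and the
    donor of the downstream intron that flank the skipped exon.
    """
    juncs = set(junctions)
    by_donor, by_acceptor = {}, {}
    for donor, acceptor in juncs:
        by_donor.setdefault(donor, set()).add(acceptor)
        by_acceptor.setdefault(acceptor, set()).add(donor)

    events = []
    for d1, a2 in juncs:
        # Look for junctions (d1, a1) and (d2, a2) with d1 < a1 < d2 < a2.
        for a1 in by_donor[d1]:
            if a1 >= a2:
                continue
            for d2 in by_acceptor[a2]:
                if d2 > a1:
                    events.append(((d1, a2), (a1, d2)))
    return sorted(events)


example = [(100, 200), (300, 400), (100, 400), (500, 600)]
events = exon_skipping_candidates(example)
```

In this example the junction (100, 400) skips the exon bounded by 200 and 300, so `events` contains that single candidate. Real data would additionally call for read-support thresholds and per-tissue comparison, which this sketch leaves out.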

CIDANE: Comprehensive isoform discovery and abundance estimation

10.1101/017939 ◽

2015 ◽

Cited By ~ 1

Author(s):

Stefan Canzar ◽

Sandro Andreotti ◽

David Weese ◽

Knut Reinert ◽

Gunnar W. Klau

Keyword(s):

Boundary Data ◽

Model Organisms ◽

Integrated Analysis ◽

Abundance Estimation ◽

Rna Seq ◽

Splice Sites ◽

Transcription Start ◽

Transcript Reconstruction ◽

Splice Junctions ◽

Higher Sensitivity

We present CIDANE, a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. CIDANE assembles transcripts with significantly higher sensitivity and precision than existing tools, while competing in speed with the fastest methods. In addition to reconstructing transcripts ab initio, the algorithm can also make use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms. CIDANE supports the integrated analysis of RNA-seq and additional gene-boundary data and recovers splice junctions that are invisible to other methods. CIDANE is available at http://ccb.jhu.edu/software/cidane/.

mountainClimber Identifies Alternative Transcription Start and Polyadenylation Sites in RNA-Seq

Cell Systems ◽

10.1016/j.cels.2019.07.011 ◽

2019 ◽

Vol 9 (4) ◽

pp. 393-400.e6 ◽

Cited By ~ 1

Author(s):

Ashley A. Cass ◽

Xinshu Xiao

Keyword(s):

Rna Seq ◽

Transcription Start ◽

Alternative Transcription ◽

Polyadenylation Sites

Quark enables semi-reference-based compression of RNA-seq data

10.1101/085878 ◽

2016 ◽

Author(s):

Hirak Sarkar ◽

Rob Patro

Keyword(s):

State Of The Art ◽

Reference Sequence ◽

Rna Seq ◽

Sequencing Data ◽

The Past ◽

Link Type ◽

Exponential Increase

Abstract:

Motivation: The past decade has seen an exponential increase in biological sequencing capacity, and there has been a simultaneous effort to help organize and archive some of the vast quantities of sequencing data that are being generated. While these developments are tremendous from the perspective of maximizing the scientific utility of available data, they come with heavy costs. The storage and transmission of such vast amounts of sequencing data is expensive.

Results: We present Quark, a semi-reference-based compression tool designed for RNA-seq data. Quark makes use of a reference sequence when encoding reads, but produces a representation that can be decoded independently, without the need for a reference. This allows Quark to achieve markedly better compression rates than existing reference-free schemes, while still relieving the burden of assuming a specific, shared reference sequence between the encoder and decoder. We demonstrate that Quark achieves state-of-the-art compression rates, and that, typically, only a small fraction of the reference sequence must be encoded along with the reads to allow reference-free decompression.

Availability: Quark is implemented in C++11, and is available under a GPLv3 license at www.github.com/COMBINE-lab/[email protected]

The Lair: A resource for exploratory analysis of published RNA-Seq data

10.1101/056200 ◽

2016 ◽

Author(s):

Harold Pimentel ◽

Pascal Sturmfels ◽

Nicolas Bray ◽

Páll Melsted ◽

Lior Pachter

Keyword(s):

Large Scale ◽

Exploratory Analysis ◽

Technical Expertise ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Link Type ◽

Short Read Archive ◽

Published Research

Abstract:

Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is typically not easily usable in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Short Read Archive, that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair

Full-Length Envelope Analyzer (FLEA): A tool for longitudinal analysis of viral amplicons

10.1101/230474 ◽

2017 ◽

Cited By ~ 1

Author(s):

Kemal Eren ◽

Steven Weaver ◽

Robert Ketteringham ◽

Morné Valentyn ◽

Melissa Laird Smith ◽

...

Keyword(s):

Web Application ◽

Evolutionary Dynamics ◽

Full Length ◽

Viral Population ◽

Sequencing Data ◽

Link Type ◽

Long Read ◽

Web App ◽

Client Side ◽

Hiv 1

Abstract:

Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV-1 env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data.

FLEA consists of both a pipeline (optionally run on a high-performance cluster), and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser.

We demonstrate how FLEA may be used to process Pacific Biosciences HIV-1 env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV-1 env populations.

A public instance of FLEA is hosted at http://flea.datamonkey.org. The Python source code for the FLEA pipeline can be found at https://github.com/veg/flea-pipeline. The client-side application is available at https://github.com/veg/flea-web-app. A live demo of the P018 results can be found at http://flea.murrell.group/view/P018.

Genome-wide Identification of Zero Nucleotide Recursive Splicing in Drosophila

10.1101/006163 ◽

2014 ◽

Cited By ~ 1

Author(s):

Michael O Duff ◽

Sara Olson ◽

Xintao Wei ◽

Ahmad Osman ◽

Alex Plocik ◽

...

Keyword(s):

Cultured Cells ◽

Developmental Time ◽

Rna Seq ◽

Splice Sites ◽

Sequencing Data ◽

Genome Wide ◽

Evolutionarily Conserved ◽

And Function ◽

Splice Junctions ◽

Recursive Splicing

Recursive splicing is a process in which large introns are removed in multiple steps by resplicing at ratchet points - 5′ splice sites recreated after splicing. Recursive splicing was first identified in the Drosophila Ultrabithorax (Ubx) gene and only three additional Drosophila genes have since been experimentally shown to undergo recursive splicing. Here, we identify 196 zero nucleotide exon ratchet points in 130 introns of 115 Drosophila genes from total RNA sequencing data generated from developmental time points, dissected tissues, and cultured cells. Recursive splicing events were identified by splice junctions that map to annotated 5′ splice sites and unannotated intronic 3′ splice sites, the presence of the sequence AG/GT at the 3′ splice site, and a 5′ to 3′ gradient of decreasing RNA-Seq read density indicative of co-transcriptional splicing. The sequential nature of recursive splicing was confirmed by identification of lariat introns generated by splicing to and from the ratchet points. We also show that recursive splicing is a constitutive process, and that the sequence and function of ratchet points are evolutionarily conserved. Together these results indicate that recursive splicing is commonly used in Drosophila and provides insight into the mechanisms by which some introns are removed.
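Of the three criteria listed above, the sequence motif is the simplest to illustrate. The sketch below scans an intron sequence for AG immediately followed by GT and reports the boundary position between the two dinucleotides. It is an assumption-laden toy, not the authors' pipeline: the function name is hypothetical, and the motif alone would yield many false positives without the junction-read and read-density evidence the abstract describes.

```python
def ratchet_point_candidates(intron_seq):
    """Return 0-based positions where 'AG' is immediately followed by 'GT'.

    Each reported position is the index of the 'G' that starts 'GT', i.e. the
    boundary between the 3' splice site (AG) and the regenerated 5' splice
    site (GT) of a zero nucleotide exon.
    """
    seq = intron_seq.upper()
    hits = []
    pos = seq.find("AGGT")
    while pos != -1:
        hits.append(pos + 2)
        # Advance by one base so adjacent motif occurrences are not missed.
        pos = seq.find("AGGT", pos + 1)
    return hits


candidates = ratchet_point_candidates("ccAGGTaaAGGTAGGT")
```

For this made-up sequence the motif occurs three times, so `candidates` is `[4, 10, 14]`.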

S-IRFindeR: stable and accurate measurement of intron retention

10.1101/2020.06.25.164699 ◽

2020 ◽

Author(s):

Lucile Broseus ◽

William Ritchie

Keyword(s):

Intron Retention ◽

Rna Seq ◽

Sequencing Data ◽

New Approach ◽

Retention Ratio ◽

Link Type ◽

Long Read ◽

Retained Introns ◽

Accurate Quantification ◽

Novel Algorithm

Abstract:

Accurate quantification of intron retention levels is currently the crux for detecting and interpreting the function of retained introns. Using both simulated and real RNA-seq datasets, we show that current methods suffer from several biases and artefacts, which impair the analysis of intron retention. We designed a new approach to measure intron retention levels called the Stable Intron Retention ratio that we have implemented in a novel algorithm to detect and measure intron retention called S-IRFindeR. We demonstrate that it provides a significant improvement in accuracy, higher consistency between replicates and agreement with IR-levels computed from long-read sequencing data.

S-IRFindeR is freely available at: https://github.com/lbroseus/SIRFindeR/.