NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data

Abstract: Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error rate and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) and a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.

Mapping and differential expression analysis from short-read RNA-Seq data in model organisms

Quantitative Biology ◽

10.1007/s40484-016-0060-7 ◽

2016 ◽

Vol 4 (1) ◽

pp. 22-35 ◽

Cited By ~ 2

Author(s):

Qiong-Yi Zhao ◽

Jacob Gratten ◽

Restuadi Restuadi ◽

Xuan Li

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Model Organisms ◽

Rna Seq ◽

Short Read

BlockClust: efficient clustering and classification of non-coding RNAs from short read RNA-seq profiles

Bioinformatics ◽

10.1093/bioinformatics/btu270 ◽

2014 ◽

Vol 30 (12) ◽

pp. i274-i282 ◽

Cited By ~ 13

Author(s):

Pavankumar Videm ◽

Dominic Rose ◽

Fabrizio Costa ◽

Rolf Backofen

Keyword(s):

Rna Seq ◽

Short Read ◽

Clustering And Classification ◽

Non Coding Rnas

RNA-Seq Analysis and Gene Discovery of Andrias davidianus Using Illumina Short Read Sequencing

PLoS ONE ◽

10.1371/journal.pone.0123730 ◽

2015 ◽

Vol 10 (4) ◽

pp. e0123730 ◽

Cited By ~ 16

Author(s):

Fenggang Li ◽

Lixin Wang ◽

Qingjing Lan ◽

Hui Yang ◽

Yang Li ◽

...

Keyword(s):

Gene Discovery ◽

Rna Seq ◽

Andrias Davidianus ◽

Short Read ◽

Short Read Sequencing

NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data

10.1101/2020.11.06.371724 ◽

2020 ◽

Author(s):

Eliah G. Overbey ◽

Amanda M. Saravia-Butler ◽

Zhe Zhang ◽

Komal S. Rathi ◽

Homer Fogle ◽

...

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Rna Seq ◽

Sequencing Data ◽

Analysis Pipeline ◽

Short Read ◽

Gene Quantification ◽

Working Groups ◽

Using Data ◽

Data Analysis Pipeline

Summary: With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility and reusability of pipeline data, to provide a template for data processing of future spaceflight-relevant datasets, and to encourage cross-analysis of data from other databases with the data available in GeneLab.

Full Length Transcriptome Highlights the Coordination of Plastid Transcript Processing

10.20944/preprints202108.0571.v1 ◽

2021 ◽

Author(s):

Marine Guilcher ◽

Arnaud Liehrmann ◽

Chloé Seyman ◽

Thomas Blein ◽

Guillem Rigaill ◽

...

Keyword(s):

Gene Expression ◽

Molecular Mechanisms ◽

Full Length ◽

Nanopore Sequencing ◽

Rna Seq ◽

Plastid Gene ◽

Plastid Gene Expression ◽

Short Read ◽

Transcript Processing

Plastid gene expression involves many post-transcriptional maturation steps, resulting in a complex transcriptome composed of multiple isoforms. Although short-read RNA-seq has considerably improved our understanding of the molecular mechanisms controlling these processes, it is unable to sequence full-length transcripts. This information is, however, crucial for understanding the interplay between the various steps of plastid gene expression. Here, the study of the Arabidopsis leaf plastid transcriptome using Nanopore sequencing showed that many splicing and editing events were not independent but co-occurring. For a given transcript, maturation events also appeared to be chronologically ordered, with splicing happening after most sites are edited.

Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees

10.1101/017087 ◽

2015 ◽

Cited By ~ 6

Author(s):

Brad Solomon ◽

Carleton Kingsford

Keyword(s):

Large Scale ◽

Rna Seq ◽

Large Collection ◽

Short Read ◽

Indexing Scheme ◽

Short Read Sequencing ◽

Sequence Read Archive ◽

Gene Isoform ◽

Expressed Sequence ◽

Novel Isoforms

Enormous databases of short-read RNA-seq experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts. The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
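
The indexing idea in the abstract above can be illustrated with a toy sketch. It assumes a binary tree in which each leaf holds a Bloom filter over the k-mers of one experiment and each interior node holds the union of its children's filters; a query descends only into subtrees whose filter contains at least a fraction `theta` of the query k-mers. The abstract does not spell out these details, so the tree layout, the names (`Bloom`, `Node`, `make_leaf`, `build`, `query`), the k-mer length, the filter size, and the hashing are all illustrative assumptions rather than the authors' C++ implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set

K = 5      # toy k-mer length (assumption; real indexes use longer k-mers)
M = 1024   # bits per Bloom filter (assumption)


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer: str, m: int = M, n_hashes: int = 2) -> List[int]:
    """Derive n_hashes bit positions from one SHA-256 digest of the k-mer."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m
            for i in range(n_hashes)]


class Bloom:
    """Bloom filter stored as a Python integer bit set."""

    def __init__(self, m: int = M) -> None:
        self.m = m
        self.bits = 0

    def add(self, kmer: str) -> None:
        for p in bit_positions(kmer, self.m):
            self.bits |= 1 << p

    def __contains__(self, kmer: str) -> bool:
        return all((self.bits >> p) & 1 for p in bit_positions(kmer, self.m))

    def union(self, other: "Bloom") -> "Bloom":
        out = Bloom(self.m)
        out.bits = self.bits | other.bits
        return out


@dataclass
class Node:
    bloom: Bloom
    name: Optional[str] = None           # set only on leaves (experiments)
    children: List["Node"] = field(default_factory=list)


def make_leaf(name: str, reads: Iterable[str]) -> Node:
    """Index all k-mers of one experiment's reads in a leaf filter."""
    bloom = Bloom()
    for read in reads:
        for km in kmers(read):
            bloom.add(km)
    return Node(bloom, name=name)


def build(leaves: List[Node]) -> Node:
    """Merge nodes pairwise until a single root remains."""
    level = list(leaves)
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                merged.append(pair[0])
            else:
                merged.append(Node(pair[0].bloom.union(pair[1].bloom),
                                   children=pair))
        level = merged
    return level[0]


def query(node: Node, query_kmers: Set[str], theta: float) -> List[str]:
    """Return leaf names whose filters hold >= theta of the query k-mers."""
    hits = sum(km in node.bloom for km in query_kmers)
    if hits < theta * len(query_kmers):
        return []                        # prune this whole subtree
    if not node.children:
        return [node.name]
    return [name for child in node.children
            for name in query(child, query_kmers, theta)]
```

For example, after `root = build([make_leaf("A", reads_a), make_leaf("B", reads_b)])`, calling `query(root, kmers(transcript), 0.9)` lists the experiments that appear to contain the transcript. Because Bloom filters have no false negatives, a subtree is only pruned when none of its leaves can reach the threshold; false positives remain possible and depend on the filter size.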

Modeling non-uniformity in short-read rates in RNA-Seq data

Genome Biology ◽

10.1186/gb-2010-11-5-r50 ◽

2010 ◽

Vol 11 (5) ◽

pp. R50 ◽

Cited By ~ 127

Author(s):

Jun Li ◽

Hui Jiang ◽

Wing Wong

Keyword(s):

Rna Seq ◽

Short Read

FastqPuri: high-performance preprocessing of RNA-seq data

10.1101/480707 ◽

2018 ◽

Author(s):

Paula Pérez-Rubio ◽

Claudio Lottaz ◽

Julia C Engelmann

Keyword(s):

High Performance ◽

Sequence Data ◽

Transcript Expression ◽

Memory Usage ◽

Rna Seq ◽

Sequencing Data ◽

Total Size ◽

Short Read ◽

Quality Filtering ◽

Sequence Quality

Abstract. Background: RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. Now, the most time demanding step in the analysis of RNA-seq data is preprocessing the raw sequence data, such as running quality control and adapter, contamination and quality filtering before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software that covers all preprocessing steps is currently missing. Results: We here present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports on the sample and dataset level with new plots which facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone and is suitable to be run within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness. Conclusions: FastqPuri is a new tool which covers all aspects of short read sequence data preprocessing. It was designed for RNA-seq data to meet the needs for fast preprocessing of fastq data to allow transcript and gene counting, but it is suitable to process any short read sequencing data of which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection. FastqPuri is most flexible in filtering undesired biological sequences by offering two approaches to optimize speed and memory usage dependent on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3.
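
As a rough illustration of what quality filtering of fastq data involves, the sketch below drops reads whose mean Phred score falls under a threshold. It is not FastqPuri's algorithm or interface: the function names (`parse_fastq`, `mean_phred`, `filter_reads`), the Phred+33 encoding, and the default threshold of 20 are assumptions made for this example only.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from well-formed four-line fastq entries."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)                  # the '+' separator line is ignored
        qual = next(it)
        yield header.strip()[1:], seq.strip(), qual.strip()


def mean_phred(qual: str) -> float:
    """Mean base quality, assuming Phred+33 encoding (an assumption)."""
    return sum(ord(c) - 33 for c in qual) / len(qual)


def filter_reads(records: Iterable[Record],
                 min_mean_q: float = 20.0) -> Iterator[Record]:
    """Keep only reads whose mean quality reaches min_mean_q."""
    for name, seq, qual in records:
        if qual and mean_phred(qual) >= min_mean_q:
            yield name, seq, qual
```

With an open file handle `fh`, `filter_reads(parse_fastq(fh))` streams the retained reads one at a time, so memory use stays independent of file size. A production tool would also handle adapters, contamination, and paired-end data, none of which this sketch attempts.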

A long read optimized de novo transcriptome pipeline reveals novel ocular developmentally regulated gene isoforms and disease targets

10.1101/2020.08.21.261644 ◽

2020 ◽

Author(s):

Vinay S. Swamy ◽

Temesgen D. Fufa ◽

Robert B. Hufnagel ◽

David M. McGaughey

Keyword(s):

De Novo ◽

Induced Pluripotent Stem Cell ◽

Ground Truth ◽

Rna Seq ◽

Developmentally Regulated ◽

Short Read ◽

Third Generation Sequencing ◽

Long Read ◽

Gene Isoforms ◽

Induced Pluripotent

Abstract: De novo transcriptome construction from short-read RNA-seq is a common method for reconstructing mRNA transcripts within a given sample. However, the precision of this process is unclear as it is difficult to obtain a ground-truth measure of transcript expression. With advances in third generation sequencing, full length transcripts of whole transcriptomes can be accurately sequenced to generate a ground-truth transcriptome. We generated long-read PacBio and short-read Illumina RNA-seq data from a human induced pluripotent stem cell-derived retinal pigmented epithelium (iPSC-RPE) cell line. We use long-read data to identify simple metrics for assessing de novo transcriptome construction and optimize a short-read based de novo transcriptome construction pipeline. We apply this pipeline to construct transcriptomes for 340 short-read RNA-seq samples originating from healthy adult and fetal human retina, cornea, and RPE. We identify hundreds of novel gene isoforms and examine their significance in the context of ocular development and disease.
