NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data

iScience ◽  
2021 ◽  
pp. 102361
Author(s):  
Eliah G. Overbey ◽  
Amanda M. Saravia-Butler ◽  
Zhe Zhang ◽  
Komal S. Rathi ◽  
Homer Fogle ◽  
...  
Keyword(s):  
RNA-Seq ◽  
2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Xueyi Dong ◽  
Luyi Tian ◽  
Quentin Gouil ◽  
Hasaru Kariyawasam ◽  
Shian Su ◽  
...  

Abstract: Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.


2014 ◽  
Vol 30 (12) ◽  
pp. i274-i282 ◽  
Author(s):  
Pavankumar Videm ◽  
Dominic Rose ◽  
Fabrizio Costa ◽  
Rolf Backofen

PLoS ONE ◽  
2015 ◽  
Vol 10 (4) ◽  
pp. e0123730 ◽  
Author(s):  
Fenggang Li ◽  
Lixin Wang ◽  
Qingjing Lan ◽  
Hui Yang ◽  
Yang Li ◽  
...  

2020 ◽  
Author(s):  
Eliah G. Overbey ◽  
Amanda M. Saravia-Butler ◽  
Zhe Zhang ◽  
Komal S. Rathi ◽  
Homer Fogle ◽  
...  

Summary: With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility and reusability of pipeline data, to provide a template for data processing of future spaceflight-relevant datasets, and to encourage cross-analysis of data from other databases with the data available in GeneLab.


Author(s):  
Marine Guilcher ◽  
Arnaud Liehrmann ◽  
Chloé Seyman ◽  
Thomas Blein ◽  
Guillem Rigaill ◽  
...  

Plastid gene expression involves many post-transcriptional maturation steps resulting in a complex transcriptome composed of multiple isoforms. Although short-read RNA-seq has considerably improved our understanding of the molecular mechanisms controlling these processes, it is unable to sequence full-length transcripts. This information is, however, crucial for understanding the interplay between the various steps of plastid gene expression. Here, the study of the Arabidopsis leaf plastid transcriptome using Nanopore sequencing showed that many splicing and editing events were not independent but co-occurring. For a given transcript, maturation events also appeared to be chronologically ordered, with splicing happening after most sites are edited.


2015 ◽  
Author(s):  
Brad Solomon ◽  
Carleton Kingsford

Enormous databases of short-read RNA-seq experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts. The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
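
The indexing idea in the abstract above can be illustrated with a toy sketch. It is not the authors' C++ implementation. It assumes a common reading of the design: each experiment is summarised as a Bloom filter of its k-mers, an internal node stores the bitwise union of its children, and a query descends only into nodes that contain at least a fraction `theta` of the query's k-mers. The constants `K`, `M`, `NUM_HASHES`, the threshold `theta`, and all function names are hypothetical choices made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Set

K = 5            # toy k-mer length (assumption; real indexes use larger k)
M = 1024         # bits per Bloom filter (assumption)
NUM_HASHES = 3   # hash functions per k-mer (assumption)


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer: str) -> List[int]:
    """Derive NUM_HASHES bit positions from salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{kmer}".encode()).digest()[:8], "big") % M
        for i in range(NUM_HASHES)
    ]


class BloomFilter:
    """Bloom filter stored as a Python integer used as a bit vector."""

    def __init__(self, bits: int = 0) -> None:
        self.bits = bits

    def add(self, kmer: str) -> None:
        for p in bit_positions(kmer):
            self.bits |= 1 << p

    def __contains__(self, kmer: str) -> bool:
        return all((self.bits >> p) & 1 for p in bit_positions(kmer))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        return BloomFilter(self.bits | other.bits)


@dataclass
class Node:
    bloom: BloomFilter
    name: Optional[str] = None               # set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf(name: str, reads: List[str]) -> Node:
    """Build a leaf whose filter holds every k-mer of one experiment."""
    bf = BloomFilter()
    for read in reads:
        for km in kmers(read):
            bf.add(km)
    return Node(bf, name)


def merge(a: Node, b: Node) -> Node:
    """Build an internal node whose filter is the union of its children."""
    return Node(a.bloom.union(b.bloom), None, [a, b])


def query(node: Node, query_kmers: Set[str], theta: float = 0.8) -> List[str]:
    """Return leaf names whose filters contain >= theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []                             # prune this whole subtree
    if not node.children:
        return [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


# Usage: two toy "experiments" under one root.
root = merge(leaf("blood", ["ACGTACGTGGA"]), leaf("brain", ["TTTTTTTTTTT"]))
matches = query(root, kmers("ACGTACGT"))
```

Because Bloom filters never give false negatives, an experiment that truly contains the query k-mers is always reported; false positives are possible and depend on the filter size and number of hash functions.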


2010 ◽  
Vol 11 (5) ◽  
pp. R50 ◽  
Author(s):  
Jun Li ◽  
Hui Jiang ◽  
Wing Wong
Keyword(s):  
RNA-Seq ◽  

2018 ◽  
Author(s):  
Paula Pérez-Rubio ◽  
Claudio Lottaz ◽  
Julia C Engelmann

Abstract. Background: RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. Now, the most time demanding step in the analysis of RNA-seq data is preprocessing the raw sequence data, such as running quality control and adapter, contamination and quality filtering before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software that covers all preprocessing steps is currently missing. Results: We here present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports on the sample and dataset level with new plots which facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone and is suitable to be run within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness. Conclusions: FastqPuri is a new tool which covers all aspects of short read sequence data preprocessing. It was designed for RNA-seq data to meet the needs for fast preprocessing of fastq data to allow transcript and gene counting, but it is suitable to process any short read sequencing data of which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection.
FastqPuri is most flexible in filtering undesired biological sequences by offering two approaches to optimize speed and memory usage dependent on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3.
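
As a rough illustration of the quality-filtering step mentioned in the abstract above, the following sketch drops reads whose mean Phred score falls below a threshold. It does not reproduce FastqPuri's implementation or command-line interface. The function names, the threshold of 20, and the Phred+33 encoding are assumptions made for the example, and the input is assumed to consist of well-formed four-line FASTQ records.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue                      # skip blank lines between records
        seq = next(it, "")
        next(it, "")                      # the '+' separator line
        qual = next(it, "")
        yield header.strip(), seq.strip(), qual.strip()


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def quality_filter(records: Iterable[Record], min_mean_q: float = 20.0) -> List[Record]:
    """Keep records with a non-empty quality string and mean score >= min_mean_q."""
    return [r for r in records if r[2] and mean_phred(r[2]) >= min_mean_q]


# Usage: one high-quality and one low-quality read.
fastq_lines = [
    "@read1", "ACGT", "+", "IIII",   # 'I' encodes Phred 40
    "@read2", "ACGT", "+", "!!!!",   # '!' encodes Phred 0
]
kept = quality_filter(parse_fastq(fastq_lines))
```

A production tool would also trim adapters and screen for contaminating sequences, which this sketch leaves out.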


2020 ◽  
Author(s):  
Vinay S. Swamy ◽  
Temesgen D. Fufa ◽  
Robert B. Hufnagel ◽  
David M. McGaughey

Abstract: De novo transcriptome construction from short-read RNA-seq is a common method for reconstructing mRNA transcripts within a given sample. However, the precision of this process is unclear as it is difficult to obtain a ground-truth measure of transcript expression. With advances in third generation sequencing, full length transcripts of whole transcriptomes can be accurately sequenced to generate a ground-truth transcriptome. We generated long-read PacBio and short-read Illumina RNA-seq data from a human induced pluripotent stem cell-derived retinal pigmented epithelium (iPSC-RPE) cell line. We use long-read data to identify simple metrics for assessing de novo transcriptome construction and optimize a short-read based de novo transcriptome construction pipeline. We apply this pipeline to construct transcriptomes for 340 short-read RNA-seq samples originating from healthy adult and fetal human retina, cornea, and RPE. We identify hundreds of novel gene isoforms and examine their significance in the context of ocular development and disease.

