Impact of Gene Annotation Choice on the Quantification of RNA-Seq Data

Abstract Background: RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results: In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion: In conclusion, our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification.
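The map-then-count approach described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the gene names, coordinates, reads, and the `count_reads` helper are all hypothetical, and the rule of discarding reads that overlap more than one gene is an assumption made for the example. It shows only how the same mapped reads can yield different per-gene counts under a smaller annotation and a larger one.

```python
from collections import Counter


def count_reads(reads, annotation):
    """Count reads that overlap exons of exactly one gene.

    reads: list of (chrom, start, end) tuples, 1-based inclusive.
    annotation: dict mapping gene name -> list of (chrom, start, end) exons.
    Reads overlapping exons of more than one gene are treated as ambiguous
    and left unassigned (an assumption of this sketch).
    """
    counts = Counter({gene: 0 for gene in annotation})
    for chrom, r_start, r_end in reads:
        hits = {
            gene
            for gene, exons in annotation.items()
            for e_chrom, e_start, e_end in exons
            if e_chrom == chrom and r_start <= e_end and e_start <= r_end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return dict(counts)


# Hypothetical mapped reads on one chromosome.
reads = [
    ("chr1", 100, 150),
    ("chr1", 180, 230),
    ("chr1", 200, 245),
    ("chr1", 400, 450),
    ("chr1", 900, 950),
]

# Hypothetical "conservative" annotation: fewer genes and exons.
conservative = {
    "GENE_A": [("chr1", 90, 250)],
    "GENE_B": [("chr1", 880, 1000)],
}

# Hypothetical "comprehensive" annotation: an extra exon for GENE_A
# and an extra gene (GENE_C) that overlaps GENE_A.
comprehensive = {
    "GENE_A": [("chr1", 90, 250), ("chr1", 380, 460)],
    "GENE_B": [("chr1", 880, 1000)],
    "GENE_C": [("chr1", 170, 240)],
}

conservative_counts = count_reads(reads, conservative)
comprehensive_counts = count_reads(reads, comprehensive)
print(conservative_counts)
print(comprehensive_counts)
```

In this toy case the overlapping GENE_C makes two reads ambiguous under the larger annotation, while the extra GENE_A exon captures a read the smaller annotation misses, so the GENE_A count changes even though the reads are identical.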


Impact of gene annotation choice on the quantification of RNA-seq data

10.1101/2021.01.07.425794 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Rna Seq ◽

Microarray Expression Data ◽

Refseq Annotation ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Microarray Expression ◽

Expression Quantification


XenoCP: Cloud-based BAM cleansing tool for RNA and DNA from Xenograft

10.1101/843250 ◽

2019 ◽

Cited By ~ 2

Author(s):

Michael Rusch ◽

Liang Ding ◽

Sasi Arunachalam ◽

Andrew Thrasher ◽

Hongjian Jin ◽

...

Keyword(s):

Gene Expression ◽

Tumor Heterogeneity ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Sequencing Data ◽

Link Type ◽

Gene Expression Quantification ◽

Expression Quantification ◽

Generation Sequencing ◽

Rna And Dna

Abstract Summary: Xenografts are important models for cancer research and the presence of mouse reads in xenograft next generation sequencing data can potentially confound interpretation of experimental results. We present an efficient, cloud-based BAM-to-BAM cleaning tool called XenoCP to remove mouse reads from xenograft BAM files. We show application of XenoCP in obtaining accurate gene expression quantification in RNA-seq and tumor heterogeneity in WGS of xenografts derived from brain and solid tumors. Availability and Implementation: St. Jude Cloud (https://pecan.stjude.cloud/permalink/xenocp) and St. Jude GitHub (https://github.com/stjude/XenoCP)


Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments

Molecular Ecology ◽

10.1111/mec.12014 ◽

2012 ◽

Vol 22 (3) ◽

pp. 620-634 ◽

Cited By ~ 167

Author(s):

Nagarjun Vijay ◽

Jelmer W. Poelstra ◽

Axel Künstner ◽

Jochen B. W. Wolf

Keyword(s):

Gene Expression ◽

Differential Gene Expression ◽

Transcriptome Assembly ◽

Rna Seq ◽

Gene Expression Quantification ◽

Differential Gene ◽

Expression Quantification ◽

Challenges And Strategies


The effect of human genome annotation complexity on RNA-Seq gene expression quantification

2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops ◽

10.1109/bibmw.2012.6470224 ◽

2012 ◽

Cited By ~ 3

Author(s):

Po-Yen Wu ◽

John H. Phan ◽

May D. Wang

Keyword(s):

Gene Expression ◽

Human Genome ◽

Genome Annotation ◽

Rna Seq ◽

Gene Expression Quantification ◽

Expression Quantification


Read trimming is not required for mapping and quantification of RNA-seq reads

10.1101/833962 ◽

2019 ◽

Cited By ~ 3

Author(s):

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Rna Seq ◽

Read Mapping ◽

Genome Wide ◽

Sequencing Quality ◽

Total Data ◽

Order Of Magnitude ◽

Gene Expression Quantification ◽

The Impact ◽

Genome Wide Gene Expression

Abstract RNA sequencing (RNA-seq) is currently the standard method for genome-wide gene expression profiling. RNA-seq reads often need to be mapped to a reference genome before read counts can be produced for genes. Read trimming methods have been developed to assist read mapping by removing adapter sequences and low-sequencing-quality bases. It is, however, unclear what impact read trimming has on the quantification of RNA-seq gene expression, an important task in the analysis of RNA-seq data. In this study, we used a benchmark RNA-seq dataset generated in the SEQC project to assess the impact of read trimming on mapping and quantification of RNA-seq reads. We found that adapter sequences can be effectively removed by the read aligner via its ‘soft-clipping’ procedure and many low-sequencing-quality bases, which would be removed by read trimming tools, were rescued by the aligner. Accuracy of gene expression quantification from using untrimmed reads was found to be comparable to or slightly better than that from using trimmed reads, based on expression of >900 genes measured by real-time PCR. Total data analysis time was reduced by up to an order of magnitude when read trimming was not performed. Our study suggests that read trimming is a redundant process in the quantification of RNA-seq expression data.
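Soft-clipping, mentioned in the abstract, is recorded in SAM/BAM alignments as `S` operations at the ends of the CIGAR string: the clipped bases stay in the read record but are excluded from the alignment. The sketch below only shows how to read those values from a CIGAR string. The `soft_clipped` helper and the example strings are illustrative and are not taken from the study or from any particular aligner.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string.

    Hard clips (H) may appear outside soft clips, so they are skipped
    before inspecting the first and last operations.
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    core = [item for item in ops if item[1] != "H"]
    leading = core[0][0] if core and core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return leading, trailing


# Hypothetical 100-base read: 5 bases clipped at the start, 25 at the end
# (for instance, adapter sequence at the 3' end), 70 bases aligned.
example = soft_clipped("5S70M25S")
print(example)
```

A read aligned end to end, such as `100M`, has no soft-clipped bases, so the helper returns zero for both ends.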


Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial

Molecular Ecology Resources ◽

10.1111/1755-0998.12109 ◽

2013 ◽

Vol 13 (4) ◽

pp. 559-572 ◽

Cited By ~ 104

Author(s):

Jochen B. W. Wolf

Keyword(s):

Gene Expression ◽

Transcriptome Analysis ◽

Rna Seq ◽

Gene Expression Quantification ◽

Expression Quantification


PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform read distribution

Nucleic Acids Research ◽

10.1093/nar/gkt1304 ◽

2013 ◽

Vol 42 (3) ◽

pp. e20-e20 ◽

Cited By ~ 29

Author(s):

Yu Hu ◽

Yichuan Liu ◽

Xianyun Mao ◽

Cheng Jia ◽

Jane F. Ferguson ◽

...

Keyword(s):

Gene Expression ◽

Specific Gene ◽

Rna Seq ◽

Specific Gene Expression ◽

Gene Expression Quantification ◽

Read Distribution ◽

Expression Quantification


Evaluation of RNA-Seq software in gene expression quantification

Journal of Biomedical Science and Engineering ◽

10.4236/jbise.2013.64059 ◽

2013 ◽

Vol 06 (04) ◽

pp. 473-477

Author(s):

Yan Ji ◽

Ziliang Qian ◽

Jia Wei

Keyword(s):

Gene Expression ◽

Rna Seq ◽

Gene Expression Quantification ◽

Expression Quantification


Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Human Genomics ◽

10.1186/s40246-021-00336-1 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Zeeshan Ahmed ◽

Eduard Gibert Renart ◽

Saman Zeeshan ◽

XinQi Dong

Keyword(s):

Data Analysis ◽

Patient Care ◽

Expression Analysis ◽

High Throughput ◽

Gene Annotation ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Sequencing Data ◽

Complex Disorders ◽

Transcriptomics Data

Abstract Background: Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing and high and low expressed genes can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results: In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform, and database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with the public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions: We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.


Impact of RNA-seq data analysis algorithms on gene expression estimation and downstream prediction

Scientific Reports ◽

10.1038/s41598-020-74567-y ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Li Tong ◽


Po-Yen Wu ◽

John H. Phan ◽

Hamid R. Hassanzadeh ◽

...

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Disease Outcome ◽

Rna Seq ◽

Next Generation Sequencing Technology ◽

Normalization Methods ◽

The Us ◽

Sequencing Quality ◽

Improved Accuracy ◽

The Impact

Abstract To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the impact of the joint effects of RNA-seq pipelines on gene expression estimation as well as the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline’s performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and its impact was extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimation tended to perform better in the prediction of disease outcome. In the end, we provided scenarios as guidelines for users to use these three metrics to select sensible RNA-seq pipelines for the improved accuracy, precision, and reliability of gene expression estimation, which lead to the improved downstream gene expression-based prediction of disease outcome.
