Impact of RNA-seq data analysis algorithms on gene expression estimation and downstream prediction

Abstract To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the impact of the joint effects of RNA-seq pipelines on gene expression estimation as well as the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline’s performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and its impact was extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimation tended to perform better in the prediction of disease outcome. In the end, we provided scenarios as guidelines for users to use these three metrics to select sensible RNA-seq pipelines for the improved accuracy, precision, and reliability of gene expression estimation, which lead to the improved downstream gene expression-based prediction of disease outcome.

Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa068 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Yang Liao ◽

Wei Shi

Keyword(s):

Data Analysis ◽

Pearson Correlation ◽

Rna Seq ◽

Genome Wide ◽

Gene Level ◽

Sequencing Quality ◽

Total Data ◽

Order Of Magnitude ◽

Gene Expression Quantification ◽

The Impact

Abstract RNA sequencing (RNA-seq) is currently the standard method for genome-wide expression profiling. RNA-seq reads often need to be mapped to a reference genome before read counts can be produced for genes. Read trimming methods have been developed to assist read mapping by removing adapter sequences and low-sequencing-quality bases. It is, however, unclear what impact read trimming has on the quantification of RNA-seq data, an important task in RNA-seq data analysis. In this study, we used a benchmark RNA-seq dataset and simulation data to assess the impact of read trimming on mapping and quantification of RNA-seq reads. We found that adapter sequences can be effectively removed by the read aligner via ‘soft-clipping’ and that many low-sequencing-quality bases, which would be removed by read trimming tools, were rescued by the aligner. Accuracy of gene expression quantification from using untrimmed reads was found to be comparable to or slightly better than that from using trimmed reads, based on Pearson correlation with reverse transcriptase-polymerase chain reaction data and simulation truth. Total data analysis time was reduced by up to an order of magnitude when read trimming was not performed. Our study suggests that read trimming is a redundant process in the quantification of RNA-seq expression data.
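
The accuracy comparison described in this abstract rests on the Pearson correlation between RNA-seq expression estimates and a reference measurement. The following is a minimal sketch of that calculation, not the authors' code. The function name `pearson` and all expression values are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 expression values for five genes (illustrative only).
pcr = [2.0, 4.5, 6.1, 8.3, 10.0]        # reference measurement
untrimmed = [2.2, 4.4, 6.0, 8.5, 9.8]   # RNA-seq estimate, untrimmed reads
trimmed = [2.6, 4.0, 6.5, 8.0, 9.5]     # RNA-seq estimate, trimmed reads

r_untrimmed = pearson(untrimmed, pcr)
r_trimmed = pearson(trimmed, pcr)
print(f"untrimmed vs PCR: r = {r_untrimmed:.4f}")
print(f"trimmed vs PCR:   r = {r_trimmed:.4f}")
```

A higher coefficient indicates closer agreement with the reference. In practice the comparison would cover hundreds of genes rather than five.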

Read trimming is not required for mapping and quantification of RNA-seq reads

10.1101/833962 ◽

2019 ◽

Cited By ~ 3

Author(s):

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Rna Seq ◽

Read Mapping ◽

Genome Wide ◽

Sequencing Quality ◽

Total Data ◽

Order Of Magnitude ◽

Gene Expression Quantification ◽

The Impact ◽

Genome Wide Gene Expression

Abstract RNA sequencing (RNA-seq) is currently the standard method for genome-wide gene expression profiling. RNA-seq reads often need to be mapped to a reference genome before read counts can be produced for genes. Read trimming methods have been developed to assist read mapping by removing adapter sequences and low-sequencing-quality bases. It is however unclear what is the impact of read trimming on the quantification of RNA-seq gene expression, an important task in the analysis of RNA-seq data. In this study, we used a benchmark RNA-seq dataset generated in the SEQC project to assess the impact of read trimming on mapping and quantification of RNA-seq reads. We found that adapter sequences can be effectively removed by the read aligner via its ‘soft-clipping’ procedure and many low-sequencing-quality bases, which would be removed by read trimming tools, were rescued by the aligner. Accuracy of gene expression quantification from using untrimmed reads was found to be comparable to or slightly better than that from using trimmed reads, based on expression of >900 genes measured by real-time PCR. Total data analysis time was reduced by up to an order of magnitude when read trimming was not performed. Our study suggests that read trimming is a redundant process in the quantification of RNA-seq expression data.

Impact of DNA microarray data transformation on gene expression analysis - comparison of two normalization methods.

Acta Biochimica Polonica ◽

10.18388/abp.2011_2227 ◽

2011 ◽

Vol 58 (4) ◽

Cited By ~ 8

Author(s):

Marcin T Schmidt ◽

Luiza Handschuh ◽

Joanna Zyprych ◽

Alicja Szabelska ◽

Agnieszka K Olejnik-Schmidt ◽

...

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Dna Microarray ◽

Microarray Data ◽

Normalization Method ◽

Differentially Expressed ◽

Microarray Data Analysis ◽

Data Set ◽

Normalization Methods ◽

The Impact

Two-color DNA microarrays are commonly used for the analysis of global gene expression. They provide information on relative abundance of thousands of mRNAs. However, the generated data need to be normalized to minimize systematic variations so that biologically significant differences can be more easily identified. A large number of normalization procedures have been proposed and many software packages for microarray data analysis are available. Here, we have applied two normalization methods (median and loess) from two microarray data analysis software packages. They were examined using a sample data set. We found that the number of genes identified as differentially expressed varied significantly depending on the method applied. The obtained results, i.e. lists of differentially expressed genes, were consistent only when we used median normalization methods. Loess normalization implemented in the two software packages provided less coherent and for some probes even contradictory results. In general, our results provide an additional piece of evidence that the normalization method can profoundly influence final results of DNA microarray-based analysis. The impact of the normalization method depends greatly on the algorithm employed. Consequently, the normalization procedure must be carefully considered and optimized for each individual data set.
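
Median normalization, one of the two methods compared in this abstract, can be sketched as global centering of the log ratios. The abstract does not describe how either software package implements it, so this sketch assumes the simplest form: subtract the array-wide median from every log2(red/green) ratio. The function name and the spot intensities are hypothetical. Loess normalization, which fits an intensity-dependent curve, is not shown.

```python
import math
import statistics


def median_normalize(red, green):
    """Return log2(red/green) ratios shifted so that their median is zero."""
    if len(red) != len(green) or not red:
        raise ValueError("need two non-empty intensity lists of equal length")
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    center = statistics.median(log_ratios)
    return [m - center for m in log_ratios]


# Hypothetical background-corrected spot intensities (illustrative only).
red = [200.0, 400.0, 800.0, 1600.0, 100.0]
green = [100.0, 200.0, 400.0, 200.0, 100.0]

normalized = median_normalize(red, green)
print(normalized)
```

The centering assumes that most genes are not differentially expressed, so the typical log ratio should be zero. A systematic dye bias then shows up as a nonzero median and is removed by the shift.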

Spaceflight Induces Novel Regulatory Responses in Arabidopsis Seedling as Revealed by Combined Proteomic and Transcriptomic Analyses

10.21203/rs.2.21481/v2 ◽

2020 ◽

Author(s):

Colin Peter Singer Kruse ◽

Alexander D Meyers ◽

Proma Basu ◽

Sarahann Hutchinson ◽

Darron R Luesse ◽

...

Keyword(s):

Gene Expression ◽

Cell Wall ◽

Membrane Proteins ◽

Plant Growth ◽

Gene Transcription ◽

Space Station ◽

Growth Efficiency ◽

Rna Seq ◽

Plastid Gene ◽

The Impact

Abstract Background: Understanding of gravity sensing and response is critical to long-term human habitation in space and can provide new advantages for terrestrial agriculture. To this end, the altered gene expression profile induced by microgravity has been repeatedly queried by microarray and RNA-seq experiments to understand gravitropism. However, the quantification of altered protein abundance in space has been minimally investigated. Results: Proteomic (iTRAQ-labelled LC-MS/MS) and transcriptomic (RNA-seq) analyses simultaneously quantified protein and transcript differential expression of three-day old, etiolated Arabidopsis thaliana seedlings grown aboard the International Space Station along with their ground control counterparts. Protein extracts were fractionated to isolate soluble and membrane proteins and analyzed to detect differentially phosphorylated peptides. In total, 968 RNAs, 107 soluble proteins, and 103 membrane proteins were identified as differentially expressed. In addition, the proteomic analyses identified 16 differential phosphorylation events. Proteomic data delivered novel insights and simultaneously provided new context to previously made observations of gene expression in microgravity. There is a sweeping shift in post-transcriptional mechanisms of gene regulation including RNA-decapping protein DCP5, the splicing factors GRP7 and GRP8, and AGO4. These data also indicate AHA2 and FERONIA as well as CESA1 and SHOU4 as central to the cell wall adaptations seen in spaceflight. Patterns of tubulin-a 1, 3, 4 and 6 phosphorylation further reveal an interaction of microtubule and redox homeostasis that mirrors osmotic response signaling elements. The absence of gravity also results in a seemingly wasteful dysregulation of plastid gene transcription.
Conclusions: The datasets gathered from Arabidopsis seedlings exposed to microgravity revealed marked impacts on post-transcriptional regulation, cell wall synthesis, redox/microtubule dynamics, and plastid gene transcription. The impact of post-transcriptional regulatory alterations represents an unstudied element of the plant microgravity response with the potential to significantly impact plant growth efficiency and beyond. What’s more, addressing the effects of microgravity on AHA2, CESA1, and alpha tubulins has the potential to enhance cytoskeletal organization and cell wall composition, thereby enhancing biomass production and growth in microgravity. Finally, understanding and manipulating the dysregulation of plastid gene transcription has further potential to address the goal of enhancing plant growth in the stressful conditions of microgravity.

IRIS-EDA: An integrated RNA-Seq interpretation system for gene expression data analysis

PLoS Computational Biology ◽

10.1371/journal.pcbi.1006792 ◽

2019 ◽

Vol 15 (2) ◽

pp. e1006792 ◽

Cited By ~ 11

Author(s):

Brandon Monier ◽

Adam McDermaid ◽

Cankun Wang ◽

Jing Zhao ◽

Allison Miller ◽

...

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Gene Expression Data ◽

Expression Data ◽

Rna Seq ◽

Gene Expression Data Analysis ◽

Interpretation System

Bayesian inference of the gene expression states of single cells from scRNA-seq data

10.1101/2019.12.28.889956 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jérémie Breda ◽

Mihaela Zavolan ◽

Erik van Nimwegen

Keyword(s):

Gene Expression ◽

Single Cell ◽

Single Cells ◽

Downstream Processing ◽

Noise Removal ◽

Rna Seq ◽

Expression Of Genes ◽

Normalization Methods ◽

Quantify Gene Expression ◽

Selection Of

Abstract In spite of a large investment in the development of methodologies for analysis of single-cell RNA-seq data, there is still little agreement on how to best normalize such data, i.e. how to quantify gene expression states of single cells from such data. Starting from a few basic requirements such as that inferred expression states should correct for both intrinsic biological fluctuations and measurement noise, and that changes in expression state should be measured in terms of fold-changes rather than changes in absolute levels, we here derive a unique Bayesian procedure for normalizing single-cell RNA-seq data from first principles. Our implementation of this normalization procedure, called Sanity (SAmpling Noise corrected Inference of Transcription activitY), estimates log expression values and associated error bars directly from raw UMI counts without any tunable parameters. Comparison of Sanity with other recent normalization methods on a selection of scRNA-seq datasets shows that Sanity outperforms other methods on basic downstream processing tasks such as clustering cells into subtypes and identification of differentially expressed genes. More importantly, we show that all other normalization methods present severely distorted pictures of the data. By failing to account for biological and technical Poisson noise, many methods systematically predict the lowest expressed genes to be most variable in expression, whereas in reality these genes provide least evidence of true biological variability. In addition, by confounding noise removal with lower-dimensional representation of the data, many methods introduce strong spurious correlations of expression levels with the total UMI count of each cell as well as spurious co-expression of genes.

Impact of Gene Annotation Choice on the Quantification of RNA-Seq Data

10.21203/rs.3.rs-421080/v1 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Refseq Gene ◽

Rna Seq ◽

Sequencing Data ◽

Microarray Expression Data ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Expression Quantification

Abstract Background: RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results: In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion: In conclusion, our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification.
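
The quantification step described in this abstract (count mapped reads against annotated exon coordinates) can be sketched as follows, to show why the choice of annotation can change the resulting counts. This is a deliberately naive illustration and not the authors' pipeline. The function name, the gene names, the coordinates, and the policy of skipping reads that overlap more than one gene are all assumptions made for this example.

```python
from collections import Counter


def count_reads(reads, exons):
    """Count reads per gene from exon overlaps.

    reads: list of (chrom, start, end) tuples, 1-based inclusive.
    exons: list of (gene, chrom, start, end) tuples, 1-based inclusive.
    A read is counted only when it overlaps exons of exactly one gene;
    reads overlapping no gene or several genes are left unassigned
    (one possible policy, assumed here).
    """
    counts = Counter()
    for chrom, start, end in reads:
        hits = {
            gene
            for gene, ex_chrom, ex_start, ex_end in exons
            if ex_chrom == chrom and start <= ex_end and end >= ex_start
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


# Hypothetical annotations: the second adds a gene overlapping geneA.
conservative = [("geneA", "chr1", 100, 200), ("geneA", "chr1", 300, 400)]
comprehensive = conservative + [("geneB", "chr1", 350, 450)]

# Hypothetical mapped reads.
reads = [("chr1", 150, 199), ("chr1", 360, 409), ("chr1", 420, 469)]

counts_conservative = count_reads(reads, conservative)
counts_comprehensive = count_reads(reads, comprehensive)
print(dict(counts_conservative))
print(dict(counts_comprehensive))
```

With the same reads, the extra overlapping gene in the second annotation makes one read ambiguous and assigns another to the new gene, so the count for geneA differs between the two annotations. The nested loop is quadratic and only suitable for toy input.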

Impact of gene annotation choice on the quantification of RNA-seq data

10.1101/2021.01.07.425794 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Rna Seq ◽

Microarray Expression Data ◽

Refseq Annotation ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Microarray Expression ◽

Expression Quantification

RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods.

Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

Genome Biology ◽

10.1186/s13059-021-02568-9 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Kayla A. Johnson ◽

Arjun Krishnan

Keyword(s):

Gene Expression ◽

Expression Data ◽

Rna Seq ◽

Functional Relationships ◽

Gene Coexpression ◽

Transformation Methods ◽

Network Transformation ◽

Almost All ◽

Coexpression Networks ◽

The Impact

Abstract Background Constructing gene coexpression networks is a powerful approach for analyzing high-throughput gene expression data towards module identification, gene function prediction, and disease-gene prioritization. While optimal workflows for constructing coexpression networks, including good choices for data pre-processing, normalization, and network transformation, have been developed for microarray-based expression data, such well-tested choices do not exist for RNA-seq data. Almost all studies that compare data processing and normalization methods for RNA-seq focus on the end goal of determining differential gene expression. Results Here, we present a comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets. We test these workflows on both large, homogenous datasets and small, heterogeneous datasets from various labs. We analyze the workflows in terms of aggregate performance, individual method choices, and the impact of multiple dataset experimental factors. Our results demonstrate that between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships. Conclusions Based on this work, we provide concrete recommendations on robust procedures for building an accurate coexpression network from an RNA-seq dataset. In addition, researchers can examine all the results in great detail at https://krishnanlab.github.io/RNAseq_coexpression to make appropriate choices for coexpression analysis based on the experimental factors of their RNA-seq dataset.
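
The abstract reports that between-sample normalization with counts adjusted by size factors worked best, but it does not say how the size factors are computed. The sketch below assumes one widely used estimator, the median-of-ratios method, purely for illustration. It should not be read as the procedure benchmarked in the paper. The function names and the count matrix are hypothetical.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of per-gene rows, each a list of raw counts per sample.
    Genes with a zero count in any sample are excluded from the estimate.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    n_samples = len(counts[0])
    # Log of each usable gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


def adjust_counts(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical counts: sample 2 was sequenced about twice as deeply.
counts = [[10, 20], [100, 200], [50, 100], [0, 4]]

factors = size_factors(counts)
adjusted = adjust_counts(counts)
print(factors)
print(adjusted)
```

After adjustment, genes whose counts differ only because of sequencing depth have equal values across samples, which is the property a coexpression analysis needs before correlations are computed.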

Association of novel rare coding variants with juvenile idiopathic arthritis

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2020-218359 ◽

2021 ◽

pp. annrheumdis-2020-218359

Author(s):

Xinyi Meng ◽

Xiaoyuan Hou ◽

Ping Wang ◽

Joseph T Glessner ◽

Hui-Qi Qu ◽

...

Keyword(s):

Gene Expression ◽

Juvenile Idiopathic Arthritis ◽

Autoimmune Diseases ◽

Rare Variants ◽

Variant Calling ◽

Rna Seq ◽

Common Variants ◽

Association Analyses ◽

Coding Variants ◽

The Impact

Objective Juvenile idiopathic arthritis (JIA) is the most common type of arthritis among children, but a few studies have investigated the contribution of rare variants to JIA. In this study, we aimed to identify rare coding variants associated with JIA for the genome-wide landscape. Methods We established a rare variant calling and filtering pipeline and performed rare coding variant and gene-based association analyses on three RNA-seq datasets composed of 228 JIA patients in the Gene Expression Omnibus against different sets of controls, and further conducted replication in our whole-exome sequencing (WES) data of 56 JIA patients. Then we conducted differential gene expression analysis and assessed the impact of recurrent functional coding variants on gene expression and signalling pathway. Results By the RNA-seq data, we identified variants in two genes reported in literature as JIA causal variants, as well as additional 63 recurrent rare coding variants seen only in JIA patients. Among the 44 recurrent rare variants found in polyarticular patients, 10 were replicated by our WES of patients with the same JIA subtype. Several genes with recurrent functional rare coding variants also have common variants associated with autoimmune diseases. We observed immune pathways enriched for the genes with rare coding variants and differentially expressed genes. Conclusion This study elucidated a novel landscape of recurrent rare coding variants in JIA patients and uncovered significant associations with JIA at the gene pathway level. The convergence of common variants and rare variants for autoimmune diseases is also highlighted in this study.
