Transcriptome diversity is a systematic source of bias in RNA-sequencing data

Gene expression is the fundamental level at which the result of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Download Full-text

SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Empirical Distribution ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Actual Distribution ◽

Wide Range ◽

Single Cell Rna Sequencing

SummarySPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts.Availability and implementationThe R package and associated documentation is available from https://github.com/CenterForStatistics-UGent/SPsimSeq.Supplementary informationSupplementary data are available at bioRχiv online.

Download Full-text

Machine Learning-Assisted Identification of Factors Contributing to the Technical Variability Between Bulk and Single-Cell RNA-Seq Experiments

10.21203/rs.3.rs-1247889/v1 ◽

2022 ◽

Author(s):

Sofya Lipnitskaya ◽

Yang Shen ◽

Stefan Legewie ◽

Holger Klein ◽

Kolja Becker

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Quantitative Difference ◽

Rna Seq ◽

Sequencing Data ◽

Factors Affecting ◽

Expression Variability ◽

Technical Variability

Abstract Background: Recent studies in the area of transcriptomics performed on single-cell and population levels reveal noticeable variability in gene expression measurements provided by different RNA sequencing technologies. Due to increased noise and complexity of single-cell RNA-Seq (scRNA-Seq) data over the bulk experiment, there is a substantial number of variably-expressed genes and so-called dropouts, challenging the subsequent computational analysis and potentially leading to false positive discoveries. In order to investigate factors affecting technical variability between RNA sequencing experiments of different technologies, we performed a systematic assessment of single-cell and bulk RNA-Seq data, which have undergone the same pre-processing and sample preparation procedures. Results: Our analysis indicates that variability between gene expression measurements as well as dropout events are not exclusively caused by biological variability, low expression levels, or random variation. Furthermore, we propose FAVSeq, a machine learning-assisted pipeline for detection of factors contributing to gene expression variability in matched RNA-Seq data provided by two technologies. Based on the analysis of the matched bulk and single-cell dataset, we found the 3'-UTR and transcript lengths as the most relevant effectors of the observed variation between RNA-Seq experiments, while the same factors together with cellular compartments were shown to be associated with dropouts. Conclusions: Here, we investigated the sources of variation in RNA-Seq profiles of matched single-cell and bulk experiments. In addition, we proposed the FAVSeq pipeline for analyzing multimodal RNA sequencing data, which allowed to identify factors affecting quantitative difference in gene expression measurements as well as the presence of dropouts. Hereby, the derived knowledge can be employed further in order to improve the interpretation of RNA-Seq data and identify genes that can be affected by assay-based deviations. Source code is available under the MIT license at https://github.com/slipnitskaya/FAVSeq.

Download Full-text

OUTRIDER: A statistical method for detecting aberrantly expressed genes in RNA sequencing data

10.1101/322149 ◽

2018 ◽

Cited By ~ 2

Author(s):

Felix Brechtmann ◽

Agnė Matusevičiūtė ◽

Christian Mertes ◽

Vicente A Yépez ◽

Žiga Avsec ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Negative Binomial ◽

Statistical Significance ◽

P Value ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Aberrant Gene Expression ◽

Aberrant Gene

AbstractRNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (OUTlier in RNA-seq fInDER), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read count expectations according to the co-variation among genes resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best correction of artificially corrupted data. Precision–recall analyses using simulated outlier read counts demonstrated the importance of combining correction for co-variation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a data set, for identifying outlier samples with too many aberrantly expressed genes, and for the P-value-based detection of aberrant gene expression, with false discovery rate adjustment. Overall, OUTRIDER provides a computationally fast and scalable end-to-end solution for identifying aberrantly expressed genes, suitable for use by rare disease diagnostic platforms.

Download Full-text

CCSN: Single Cell RNA Sequencing Data Analysis by Conditional Cell-specific Network

10.1101/2020.01.25.919829 ◽

2020 ◽

Author(s):

Lin Li ◽

Hao Dai ◽

Zhaoyuan Fang ◽

Luonan Chen

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Network Flow ◽

Single Cells ◽

Cellular Heterogeneity ◽

Rna Seq ◽

Sequencing Data ◽

Cell Clustering ◽

A Cell

AbstractThe rapid advancement of single cell technologies has shed new light on the complex mechanisms of cellular heterogeneity. However, compared with bulk RNA sequencing (RNA-seq), single-cell RNA-seq (scRNA-seq) suffers from higher noise and lower coverage, which brings new computational difficulties. Based on statistical independence, cell-specific network (CSN) is able to quantify the overall associations between genes for each cell, yet suffering from a problem of overestimation related to indirect effects. To overcome this problem, we propose the “conditional cell-specific network” (CCSN) method, which can measure the direct associations between genes by eliminating the indirect associations. CCSN can be used for cell clustering and dimension reduction on a network basis of single cells. Intuitively, each CCSN can be viewed as the transformation from less “reliable” gene expression to more “reliable” gene-gene associations in a cell. Based on CCSN, we further design network flow entropy (NFE) to estimate the differentiation potency of a single cell. A number of scRNA-seq datasets were used to demonstrate the advantages of our approach: (1) one direct association network for one cell; (2) most existing scRNA-seq methods designed for gene expression matrices are also applicable to CCSN-transformed degree matrices; (3) CCSN-based NFE helps resolving the direction of differentiation trajectories by quantifying the potency of each cell. CCSN is publicly available at http://sysbio.sibcb.ac.cn/cb/chenlab/soft/CCSN.zip.

Download Full-text

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283v1 ◽

2018 ◽

Cited By ~ 1

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the result of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Download Full-text

RNA-sequencing data-driven dissection of human plasma cell differentiation reveals new potential transcription regulators

Leukemia ◽

10.1038/s41375-021-01234-0 ◽

2021 ◽

Author(s):

Alboukadel Kassambara ◽

Laurie Herviou ◽

Sara Ovejero ◽

Michel Jourdan ◽

Coraline Thibaut ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Continuous Production ◽

Expression Patterns ◽

Gene Clusters ◽

Transcriptional Networks ◽

Pathway Enrichment Analysis ◽

Rna Seq ◽

Sequencing Data

AbstractPlasma cells (PCs) play an important role in the adaptive immune system through a continuous production of antibodies. We have demonstrated that PC differentiation can be modeled in vitro using complex multistep culture systems reproducing sequential differentiation process occurring in vivo. Here we present a comprehensive, temporal program of gene expression data encompassing human PC differentiation (PCD) using RNA sequencing (RNA-seq). Our results reveal 6374 differentially expressed genes classified into four temporal gene expression patterns. A stringent pathway enrichment analysis of these gene clusters highlights known pathways but also pathways largely unknown in PCD, including the heme biosynthesis and the glutathione conjugation pathways. Additionally, our analysis revealed numerous novel transcriptional networks with significant stage-specific overexpression and potential importance in PCD, including BATF2, BHLHA15/MIST1, EZH2, WHSC1/MMSET, and BLM. We have experimentally validated a potent role for BLM in regulating cell survival and proliferation during human PCD. Taken together, this RNA-seq analysis of PCD temporal stages helped identify coexpressed gene modules with associated up/downregulated transcription regulator genes that could represent major regulatory nodes for human PC maturation. These data constitute a unique resource of human PCD gene expression programs in support of future studies for understanding the underlying mechanisms that control PCD.

Download Full-text

RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-072018-021255 ◽

2019 ◽

Vol 2 (1) ◽

pp. 139-173 ◽

Cited By ~ 23

Author(s):

Koen Van den Berge ◽

Katharina M. Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Data Sets ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approachesfor differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Download Full-text

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283v2 ◽

2018 ◽

Cited By ~ 1

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the result of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Download Full-text

Genotype-free individual genome reconstruction of Multiparental Population Models by RNA sequencing data

10.1101/2020.10.11.335323 ◽

2020 ◽

Author(s):

Kwangbom Choi ◽

Hao He ◽

Daniel M. Gatti ◽

Vivek M. Philip ◽

Narayanan Raghupathy ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Gene Expression Regulation ◽

Agricultural Research ◽

Model Systems ◽

Specific Gene ◽

Rna Seq ◽

Sequencing Data ◽

Individual Genome ◽

Genome Reconstruction

AbstractMulti-parent populations (MPPs), genetically segregating model systems derived from two or more inbred founder strains, are widely used in biomedical and agricultural research. Gene expression profiling by direct RNA sequencing (RNA-Seq) is commonly applied to MPPs to investigate gene expression regulation and to identify candidate genes. In genetically diverse populations, including most MPPs, quantification of gene expression is improved when the RNA-Seq reads are aligned to individualized transcriptomes that incorporate known polymorphic loci. However, the process of constructing and analyzing individual genomes can be computationally demanding and error prone. We propose a new approach, genome reconstruction by RNA-Seq (GBRS), that relies on simultaneous alignment of RNA-Seq reads to the founder strain transcriptomes. GBRS can reconstruct the diploid genome of each individual and quantify both total and allele-specific gene expression. We demonstrate that GBRS performs as well as methods that rely on high-density genotyping arrays to reconstruct the founder haplotype mosaic of MPP individuals. Using GBRS in addition to other genotyping methods provides quality control for detecting sample mix-ups and improves power to detect expression quantitative trait loci. GBRS software is freely available at https://github.com/churchill-lab/gbrs.

Download Full-text