On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments

2020 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Jo Vandesompele ◽  
Olivier Thas

Abstract In gene expression studies, RNA sample pooling is sometimes considered because of budget constraints or lack of sufficient input material. Using microarray technology, RNA sample pooling strategies have been reported to optimize both the cost of data generation as well as the statistical power for differential gene expression (DGE) analysis. For RNA sequencing, with its different quantitative output in terms of counts and tunable dynamic range, the adequacy and empirical validation of RNA sample pooling strategies have not yet been evaluated. In this study, we comprehensively assessed the utility of pooling strategies in RNA-seq experiments using empirical and simulated data. Mathematical descriptions of the data generating mechanism in pooled experiments are used to reinforce our interpretations from the empirical and simulation studies. The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined. For high within-group gene expression variability, small RNA sample pools are effective to reduce the variability and compensate for the loss of the number of replicates. Unlike the typical cost-saving strategies, such as reducing sequencing depth or number of biological samples, an adequate pooling strategy is effective in maintaining the power of testing for DGE for low to medium abundance levels, along with a substantial reduction of the total cost of the experiment.
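The variance argument in this abstract can be illustrated with a minimal sketch. It assumes a simple additive model on a log-expression scale, in which each library pools individuals in equal amounts, biological variance is averaged within a pool, and technical variance is incurred once per library. This is a textbook-style simplification and not the count-based data generating model the authors define; the function name `group_mean_variance` and the variance values are hypothetical.

```python
def group_mean_variance(bio_var, tech_var, n_libraries, pool_size=1):
    """Variance of a group mean under a simple additive model (an assumption).

    Each library pools `pool_size` individuals in equal amounts, so the
    biological variance is averaged within the pool, while the technical
    variance is incurred once per sequenced library.
    """
    if n_libraries < 1 or pool_size < 1:
        raise ValueError("n_libraries and pool_size must be at least 1")
    per_library_var = bio_var / pool_size + tech_var
    return per_library_var / n_libraries


# Hypothetical variance components, chosen only for illustration.
BIO_VAR, TECH_VAR = 1.0, 0.1

designs = {
    "12 individuals, 12 libraries": group_mean_variance(BIO_VAR, TECH_VAR, 12),
    "12 individuals, 4 pools of 3": group_mean_variance(BIO_VAR, TECH_VAR, 4, 3),
    "4 individuals, 4 libraries": group_mean_variance(BIO_VAR, TECH_VAR, 4),
}

for name, variance in designs.items():
    print(f"{name}: variance of group mean = {variance:.4f}")
```

Under these made-up values, pooling 12 individuals into 4 libraries gives a group-mean variance close to that of 12 separate libraries and well below that of sequencing only 4 individuals, which is the qualitative pattern the abstract describes for high within-group variability.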

2020 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Jo Vandesompele ◽  
Olivier Thas

Abstract Background: In gene expression studies, RNA sample pooling is sometimes considered because of budget constraints or lack of sufficient input material. Using microarray technology, RNA sample pooling strategies have been reported to optimize both the cost of data generation as well as the statistical power for differential gene expression (DGE) analysis. For RNA sequencing, with its different quantitative output in terms of counts and tunable dynamic range, the adequacy and empirical validation of RNA sample pooling strategies have not yet been evaluated. In this study, we comprehensively assessed the utility of pooling strategies in RNA-seq experiments using empirical and simulated RNA-seq datasets. Results: The data generating model in pooled experiments is defined mathematically to evaluate the mean and variability of gene expression estimates. The model is further used to examine the trade-off between the statistical power of testing for DGE and the data generating costs. Empirical assessment of pooling strategies is done through analysis of RNA-seq datasets under various pooling and non-pooling experimental settings. A simulation study is also used to rank experimental scenarios with respect to the rate of false and true discoveries in DGE analysis. The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined. Conclusion: For high within-group gene expression variability, small RNA sample pools are effective to reduce the variability and compensate for the loss of the number of replicates. Unlike the typical cost-saving strategies, such as reducing sequencing depth or number of biological samples, an adequate pooling strategy is effective in maintaining the power of testing for DGE for genes with low to medium abundance levels, along with a substantial reduction of the total cost of the experiment.
In general, pooling RNA samples or pooling RNA samples in conjunction with moderate reduction of the sequencing depth can be good options to optimize cost and maintain power.


Author(s):  
Alemu Takele Assefa ◽  
Jo Vandesompele ◽  
Olivier Thas

Abstract Background: In gene expression studies, RNA sample pooling is sometimes considered because of budget constraints or lack of sufficient input material. Using microarray technology, RNA sample pooling strategies have been reported to optimize both the cost of data generation as well as the statistical power for differential gene expression (DGE) analysis. For RNA sequencing, with its different quantitative output in terms of counts and tunable dynamic range, the adequacy and empirical validation of RNA sample pooling strategies have not yet been evaluated. In this study, we comprehensively assessed the utility of pooling strategies in RNA-seq experiments using empirical and simulated RNA-seq datasets. Results: The data generating model in pooled experiments is defined mathematically to evaluate the mean and variability of gene expression estimates. The model is further used to examine the trade-off between the statistical power of testing for DGE and the data generating costs. Empirical assessment of pooling strategies is done through analysis of RNA-seq datasets under various pooling and non-pooling experimental settings. A simulation study is also used to rank experimental scenarios with respect to the rate of false and true discoveries in DGE analysis. The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined. Conclusion: For high within-group gene expression variability, small RNA sample pools are effective to reduce the variability and compensate for the loss of the number of replicates. Unlike the typical cost-saving strategies, such as reducing sequencing depth or number of RNA samples (replicates), an adequate pooling strategy is effective in maintaining the power of testing DGE for genes with low to medium abundance levels, along with a substantial reduction of the total cost of the experiment.
In general, pooling RNA samples or pooling RNA samples in conjunction with moderate reduction of the sequencing depth can be good options to optimize the cost and maintain the power.


GigaScience ◽  
2021 ◽  
Vol 10 (3) ◽  
Author(s):  
Holly C Beale ◽  
Jacquelyn M Roger ◽  
Matthew A Cattle ◽  
Liam T McKay ◽  
Drew K A Thompson ◽  
...  

Abstract Background The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) is dependent on the sequencing depth. While unmapped or non-exonic reads do not contribute to gene expression quantification, duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis. Findings In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1–77% of all reads (median [IQR], 3% [3–6%]); duplicate reads constitute 3–100% of mapped reads (median [IQR], 27% [13–43%]); and non-exonic reads constitute 4–97% of mapped, non-duplicate reads (median [IQR], 25% [16–37%]). MEND reads constitute 0–79% of total reads (median [IQR], 50% [30–61%]). Conclusions Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
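Because the three read categories above are reported as nested fractions (unmapped out of all reads, duplicates out of mapped reads, non-exonic out of mapped non-duplicate reads), the MEND fraction of a single dataset follows by multiplying the complements. The sketch below shows that arithmetic only; the function name `mend_fraction` is hypothetical and this is not the authors' script or Docker image.

```python
def mend_fraction(unmapped, duplicate_of_mapped, nonexonic_of_mapped_nondup):
    """Fraction of total reads that are mapped, exonic, and non-duplicate.

    Each argument is a fraction in [0, 1] with a nested denominator:
    unmapped / all reads, duplicates / mapped reads, and
    non-exonic / mapped non-duplicate reads.
    """
    fractions = (unmapped, duplicate_of_mapped, nonexonic_of_mapped_nondup)
    for value in fractions:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each fraction must lie between 0 and 1")
    return (
        (1.0 - unmapped)
        * (1.0 - duplicate_of_mapped)
        * (1.0 - nonexonic_of_mapped_nondup)
    )


# Illustration only: plugging in the reported cohort medians (3%, 27%, 25%).
# Medians of separate distributions need not combine into the median MEND
# fraction, so this is a rough consistency check, not a reported result.
example = mend_fraction(0.03, 0.27, 0.25)
print(f"MEND fraction for the illustrative dataset: {example:.3f}")
```

For these illustrative inputs the product is about 0.53, which is in line with the reported median MEND fraction of 50%.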


2021 ◽  
Author(s):  
Tommer Schwarz ◽  
Toni Boltz ◽  
Kangcheng Hou ◽  
Merel Bot ◽  
Chenda Duan ◽  
...  

Mapping genetic variants that regulate gene expression (eQTLs) in large-scale RNA sequencing (RNA-seq) studies is often employed to understand functional consequences of regulatory variants. However, the high cost of RNA-Seq limits sample size, sequencing depth, and therefore, discovery power. In this work, we demonstrate that, given a fixed budget, eQTL discovery power can be increased by lowering the sequencing depth per sample and increasing the number of individuals sequenced in the assay. We perform RNA-Seq of whole blood tissue across 1490 individuals at low-coverage (5.9 million reads/sample) and show that the effective power is higher than an RNA-Seq study of 570 individuals at high-coverage (13.9 million reads/sample). Next, we leverage synthetic datasets derived from real RNA-Seq data to explore the interplay of coverage and number of individuals in eQTL studies, and show that a 10-fold reduction in coverage leads to only a 2.5-fold reduction in statistical power. Our study suggests that lowering coverage while increasing the number of individuals is an effective approach to increase discovery power in RNA-Seq studies.
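The two designs compared in this abstract can be put on a common footing by computing total reads sequenced. The sketch below does only that arithmetic; it treats total reads as a crude proxy for sequencing effort, ignores per-sample library preparation costs, and does not imply that the two studies had equal budgets. The function name `total_reads_millions` is hypothetical.

```python
def total_reads_millions(n_individuals, reads_per_sample_millions):
    """Total reads (in millions) for a design with uniform depth per sample."""
    if n_individuals < 0 or reads_per_sample_millions < 0:
        raise ValueError("inputs must be non-negative")
    return n_individuals * reads_per_sample_millions


# Figures taken from the abstract above.
low_coverage_total = total_reads_millions(1490, 5.9)
high_coverage_total = total_reads_millions(570, 13.9)
ratio = low_coverage_total / high_coverage_total

print(f"Low-coverage design:  {low_coverage_total:.0f} million reads")
print(f"High-coverage design: {high_coverage_total:.0f} million reads")
print(f"Ratio of total reads (low / high): {ratio:.2f}")
```

By this measure the low-coverage design sequences roughly 11% more reads in total while covering about 2.6 times as many individuals.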


2018 ◽  
Author(s):  
Eric Reed ◽  
Elizabeth Moses ◽  
Xiaohui Xiao ◽  
Gang Liu ◽  
Joshua Campbell ◽  
...  

Abstract The need to reduce per sample cost of RNA-seq profiling for scalable data generation has led to the emergence of highly multiplexed RNA-seq. These technologies utilize barcoding of cDNA sequences in order to combine samples into a single sequencing lane to be separated during data processing. In this study, we report the performance of one such technique denoted as sparse full length sequencing (SFL), a ribosomal RNA depletion-based RNA sequencing approach that allows for the simultaneous sequencing of 96 or more samples. We offer comparisons to well established single-sample techniques, including full coverage Poly-A capture RNA-seq and microarray, as well as another low-cost highly multiplexed technique known as 3’ digital gene expression (3’DGE). Data was generated for a set of exposure experiments on immortalized human lung epithelial (AALE) cells in a two-by-two study design, in which samples received both genetic and chemical perturbations of known oncogenes/tumor suppressors and lung carcinogens. SFL demonstrated improved performance over 3’DGE in terms of coverage, power to detect differential gene expression, and biological recapitulation of patterns of differential gene expression from in vivo lung cancer mutation signatures.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11875
Author(s):  
Tomoko Matsuda

Large volumes of high-throughput sequencing data have been submitted to the Sequence Read Archive (SRA). The lack of experimental metadata associated with the data makes reuse and understanding data quality very difficult. In the case of RNA sequencing (RNA-Seq), which reveals the presence and quantity of RNA in a biological sample at any moment, it is necessary to consider that gene expression responds over a short time interval (several seconds to a few minutes) in many organisms. Therefore, to isolate RNA that accurately reflects the transcriptome at the point of harvest, raw biological samples should be processed by freezing in liquid nitrogen, immersing in RNA stabilization reagent or lysing and homogenizing in RNA lysis buffer containing guanidine thiocyanate as soon as possible. As the number of samples handled simultaneously increases, the time until the RNA is protected can increase. Here, to evaluate the effect of different lag times in RNA protection on RNA-Seq data, we harvested CHO-S cells after 3, 5, 6, and 7 days of cultivation, added RNA lysis buffer in a time course of 15, 30, 45, and 60 min after harvest, and conducted RNA-Seq. These RNA samples showed high RNA integrity number (RIN) values indicating non-degraded RNA, and sequence data from libraries prepared with these RNA samples was of high quality according to FastQC. We observed that, at the same cultivation day, global trends of gene expression were similar across the time course of addition of RNA lysis buffer; however, the expression of some genes was significantly different between the time-course samples of the same cultivation day; most of these differentially expressed genes were related to apoptosis. We conclude that the time lag between sample harvest and RNA protection influences gene expression of specific genes.
It is, therefore, necessary to know not only RIN values of RNA and the quality of the sequence data but also how the experiment was performed when acquiring RNA-Seq data from the database.


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Alemu Takele Assefa ◽  
Jo Vandesompele ◽  
Olivier Thas

An amendment to this paper has been published and can be accessed via the original article.


2019 ◽  
Vol 97 (Supplement_3) ◽  
pp. 135-135
Author(s):  
Shengfa F Liao ◽  
Shamimul Hasan ◽  
Jean M Feugang

Abstract Animal life essentially is a set of gene expression processes. Thorough understanding of these processes driven by dietary nutrients and other environmental factors can be regarded as a bottom line of modern advanced animal nutrition research for improving animal growth, development, health, production, and reproduction performance. Nutrigenomics, a genome-wide approach using the knowledge and techniques obtained from the disciplines of genomics (including transcriptomics) and molecular biology, is to study the effects of dietary nutrients on cellular gene expression, cellular metabolic responses and, ultimately, the phenotypic changes of a living organism. Transcriptomics can be applied to investigate animal tissue transcriptome at a defined physiological or nutritional state, which provides a holistic view of the intracellular expression of RNA, especially mRNA. As a novel, promising transcriptomics approach, RNA sequencing (RNA-Seq) technology can monitor all-gene expressions simultaneously in response to dietary intervention. The principle and history of RNA-Seq technology will be briefly reviewed, and the three principal steps of this methodology, including the laboratory analysis of tissue samples, the bioinformatics analysis of the generated sequence data, and the subsequent biological interpretation of the data, will be described. The application of RNA-Seq technology in different areas of animal nutrition research, which include maternal nutrition, feeding strategy and gut microbiota, will be summarized. Lastly, the application of RNA-Seq technology in swine science and nutrition research will also be discussed. 
In short, to further improve animal feeding or production efficiency, RNA-Seq technology holds a great potential to be employed to explore the new insights into better understanding of nutrient-gene interactions in agricultural animals, and it is expected that the application of this cutting-edge technology in animal nutrition research will continue to grow in the foreseeable future. This research was supported in part by a USDA-NIFA Multistate Project (No. 1007691).


2012 ◽  
Vol 30 (30_suppl) ◽  
pp. 56-56
Author(s):  
Byung-In Lee ◽  
Kahuku Oades ◽  
Lien Vo ◽  
Jerry Lee ◽  
Mark Landers ◽  
...  

Background: Gene expression profiling has been shown to be effective in analyzing postoperative tumor samples in various cancers. However, in analyzing small specimens such as core biopsies, the limited amount of available material makes multi-gene analyses difficult or impossible. Microarray-based analyses also provide limited dynamic range. We describe the development of targeted RNA-sequencing methodology which combines the power of a universal RNA amplification with NGS for an ultra-deep expression analysis of multiple target genes, enabling <100 ng of sample input for multi-gene analysis in a single tube format. Methods: The gene expression patterns of triple-negative breast cancer FFPE samples were analyzed using a 96-gene breast cancer biomarker panel across three different platforms: Affymetrix Human Gene ST 1.0 microarrays, a pre-developed OncoScore qRT-PCR panel, and targeted RNA-seq. For targeted RNA-seq analysis, the 96-gene panel was amplified using a universal, single-tube “XP-PCR” amplification strategy followed by sequence analysis using the Ion-Torrent Personal Genome Machine. Results: Targeted RNA-seq provided the most sensitivity in terms of detection rates with <100 ng FFPE RNA input and provides unlimited dynamic range with increased sequencing depth. Expression ratio compression issues typically associated with a high number of pre-amplification cycles in standard multiplex-primed methods were not observed here. Low expressing genes, undetectable by qRT-PCR analysis from 1,000 ng input FFPE RNA, were detected and eligible for expression analysis with a significant number of sequencing reads. Alternative transcription/splicing analysis is also possible from sequence analysis of the target transcripts using targeted RNA-seq. Conclusions: By combining universally primed pre-amplification and NGS in multi-gene expression analysis, targeted RNA-seq provides the most sensitive gene expression analysis methodology.


2021 ◽  
Author(s):  
Lin Di ◽  
Bo Liu ◽  
Yuzhu Lyu ◽  
Shihui Zhao ◽  
Yuhong Pang ◽  
...  

Many single cell RNA-seq applications aim to probe a wide dynamic range of gene expression, but most still struggle to accurately quantify low-abundance transcripts. Based on our previous finding that Tn5 transposase can directly cut-and-tag DNA/RNA hetero-duplexes, we present SHERRY2, an optimized protocol for sequencing transcriptomes of single cells or single nuclei. SHERRY2 is robust and scalable, and it has higher sensitivity and more uniform coverage in comparison with prevalent scRNA-seq methods. With throughput of a few thousand cells per batch, SHERRY2 can reveal the subtle transcriptomic differences between cells and facilitate important biological discoveries.

