Finding a suitable library size to call variants in RNA-seq

Abstract Background RNA-Seq allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the overall cost and the quality of the results. While several studies analyse the effect that library size has on differential expression analyses, sensitivity analysis for variant detection has received far less attention. Results We simulated shallower sequencing depths by downsampling 45 AML samples that are part of the Leucegene project, which were originally sequenced at high depth. We compared the sensitivity of six methods of recovering validated mutations on the same samples. The methods compared are a combination of three popular callers (MuTect, VarScan, and VarDict) and two filtering strategies. We observed an incremental loss in sensitivity when simulating libraries of 80M, 50M, 40M, 30M and 20M fragments, with the largest loss detected with less than 30M fragments (below 90%). The sensitivity in recovering indels varied markedly between callers, with VarDict showing the highest sensitivity (60%). Single nucleotide variant sensitivity is relatively consistent across methods, apart from MuTect, whose default filters need adjustment when using RNA-Seq. We also analysed 136 RNA-Seq samples from the TCGA-LAML cohort, assessing the change in sensitivity between the initial libraries (average 59M fragments) and after downsampling to 40M fragments. When considering single nucleotide variants in recurrently mutated myeloid genes we found a comparable performance, with a 3% average loss in sensitivity using 40M fragments. Conclusions Between 30M and 40M fragments are needed to recover 90–95% of the initial variants on recurrently mutated myeloid genes. To extend this result to another cancer type, an exploration of the characteristics of its mutations and gene expression patterns is suggested.

Finding a suitable library size to call variants in RNA-Seq

BMC Bioinformatics ◽

10.1186/s12859-020-03860-4 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Anna Quaglieri ◽

Christoffer Flensburg ◽

Terence P. Speed ◽

Ian J. Majewski

Keyword(s):

Gene Expression ◽

Acute Myeloid Leukaemia ◽

Myeloid Leukaemia ◽

Cancer Biology ◽

Cancer Type ◽

Rna Seq ◽

Library Size ◽

Single Nucleotide ◽

Average Loss ◽

Acute Myeloid

Abstract Background RNA sequencing allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the overall cost and the quality of the results. Here we specifically address how overall library size influences the detection of somatic mutations in RNA-seq data in two acute myeloid leukaemia datasets. Results We simulated shallower sequencing depths by downsampling 45 acute myeloid leukaemia samples (100 bp paired-end) from the Leucegene project, which were originally sequenced at high depth. We compared the sensitivity of six methods in recovering validated mutations on the same samples. The six methods combine three popular callers (MuTect, VarScan, and VarDict) with two filtering strategies. We observed an incremental loss in sensitivity when simulating libraries of 80M, 50M, 40M, 30M and 20M fragments, with the largest loss detected with fewer than 30M fragments (below 90%, average loss of 7%). The sensitivity in recovering insertions and deletions varied markedly between callers, with VarDict showing the highest sensitivity (60%). Single nucleotide variant sensitivity is relatively consistent across methods, apart from MuTect, whose default filters need adjustment when using RNA-Seq. We also analysed 136 RNA-Seq samples from the TCGA-LAML cohort (50 bp paired-end) and assessed the change in sensitivity between the initial libraries (average 59M fragments) and the same libraries downsampled to 40M fragments. When considering single nucleotide variants in recurrently mutated myeloid genes we found a comparable performance, with a 6% average loss in sensitivity using 40M fragments. Conclusions Between 30M and 40M 100 bp paired-end reads are needed to recover 90–95% of the initial variants on recurrently mutated myeloid genes. To extend this result to another cancer type, an exploration of the characteristics of its mutations and gene expression patterns is suggested.
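
The two quantities this abstract relies on, the fraction of fragments kept when downsampling to a target library size and the sensitivity against a validated variant set, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the 100M starting library and the placeholder variant keys are all assumptions made for illustration.

```python
def keep_fraction(total_fragments: int, target_fragments: int) -> float:
    """Fraction of fragments to keep so a library shrinks to the target size."""
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    # Never upsample: a library already at or below the target is kept whole.
    return min(1.0, target_fragments / total_fragments)


def sensitivity(called: set, validated: set) -> float:
    """Share of validated variants that a caller recovered."""
    if not validated:
        raise ValueError("validated set is empty")
    return len(called & validated) / len(validated)


# Hypothetical starting library of 100M fragments (not a figure from the paper),
# reduced to the simulated depths listed in the abstract.
initial = 100_000_000
targets = [80_000_000, 50_000_000, 40_000_000, 30_000_000, 20_000_000]
fractions = {t: keep_fraction(initial, t) for t in targets}

# Placeholder variant keys (chrom, pos, ref, alt); purely illustrative.
validated = {("chrA", 10, "C", "T"), ("chrA", 20, "G", "A"),
             ("chrB", 30, "T", "C"), ("chrB", 40, "A", "G")}
called = {("chrA", 10, "C", "T"), ("chrA", 20, "G", "A"),
          ("chrB", 30, "T", "C"), ("chrC", 50, "G", "T")}
recovered = sensitivity(called, validated)
```

A fraction computed this way is the kind of value a read subsampling tool takes as input. Variants that are called but not in the validated set, like the last placeholder above, do not affect sensitivity.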

High heterogeneity undermines generalization of differential expression results in RNA-Seq analysis

Human Genomics ◽

10.1186/s40246-021-00308-5 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Weitong Cui ◽

Huaru Xue ◽

Lei Wei ◽

Jinghua Jin ◽

Xuewen Tian ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Small Sample ◽

Differentially Expressed ◽

Cancer Type ◽

Rna Seq ◽

Sample Sizes ◽

Large Sample ◽

Expression Levels ◽

Gene Expression Levels

Abstract Background RNA sequencing (RNA-Seq) has been widely applied in oncology for monitoring transcriptome changes. However, the emerging problem that high variation of gene expression levels caused by tumor heterogeneity may affect the reproducibility of differential expression (DE) results has rarely been studied. Here, we investigated the reproducibility of DE results for any given number of biological replicates between 3 and 24 and explored why a great many differentially expressed genes (DEGs) were not reproducible. Results Our findings demonstrate that poor reproducibility of DE results exists not only for small sample sizes, but also for relatively large sample sizes. Quite a few of the DEGs detected are specific to the samples in use, rather than genuinely differentially expressed under different conditions. Poor reproducibility of DE results is mainly caused by high variation of gene expression levels for the same gene in different samples. Even though biological variation may account for much of the high variation of gene expression levels, the effect of outlier count data also needs to be treated seriously, as outlier data severely interfere with DE analysis. Conclusions High heterogeneity exists not only in tumor tissue samples of each cancer type studied, but also in normal samples. High heterogeneity leads to poor reproducibility of DEGs, undermining generalization of differential expression results. Therefore, it is necessary to use large sample sizes (at least 10 if possible) in RNA-Seq experimental designs to reduce the impact of biological variability and DE results should be interpreted cautiously unless soundly validated.
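
One simple way to put a number on the reproducibility of DE results is the overlap between the DEG lists obtained from repeated subsamples of the same size. The abstract does not state which metric the authors used, so the mean pairwise Jaccard index below is an assumption, and the gene identifiers are placeholders.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two gene sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_overlap(deg_sets: list) -> float:
    """Average Jaccard index over all pairs of DEG sets."""
    pairs = list(combinations(deg_sets, 2))
    if not pairs:
        raise ValueError("need at least two DEG sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder DEG lists from three hypothetical subsamples of equal size.
runs = [{"G1", "G2", "G3"}, {"G2", "G3", "G4"}, {"G1", "G2", "G3"}]
overlap = mean_pairwise_overlap(runs)
```

Values near 1 mean the same genes are recovered regardless of which samples were drawn. Low values correspond to the sample-specific DEGs the abstract describes.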

scSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

Genome Biology ◽

10.1186/s13059-021-02364-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gavin W. Wilson ◽

Mathieu Derouet ◽

Gail E. Darling ◽

Jonathan C. Yeung

Keyword(s):

Genetic Variants ◽

False Positive ◽

Variant Calling ◽

Call Rate ◽

Rna Seq ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Variant Call ◽

Two Samples ◽

Co Detection

Abstract Identifying single nucleotide variants has become common practice for droplet-based single-cell RNA-seq experiments; however, presently, a pipeline does not exist to maximize variant calling accuracy. Furthermore, molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression. Herein, we introduce scSNV designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. We demonstrate that scSNV is fast, with a reduced false-positive variant call rate, and enables the co-detection of genetic variants and A>G RNA edits across twenty-two samples.

Prediction of genome-wide effects of single nucleotide variants on transcription factor binding

Scientific Reports ◽

10.1038/s41598-020-74793-4 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Sebastian Carrasco Pro ◽

Katia Bulekova ◽

Brian Gregor ◽

Adam Labadorf ◽

Juan Ignacio Fuxman Bass

Keyword(s):

Binding Sites ◽

Cancer Type ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Regulatory Regions ◽

Genome Wide ◽

Transcriptional Regulatory ◽

Gene Regulatory ◽

The Impact ◽

The Relationship

Abstract Single nucleotide variants (SNVs) located in transcriptional regulatory regions can result in gene expression changes that lead to adaptive or detrimental phenotypic outcomes. Here, we predict gain or loss of binding sites for 741 transcription factors (TFs) across the human genome. We calculated ‘gainability’ and ‘disruptability’ scores for each TF that represent the likelihood of binding sites being created or disrupted, respectively. We found that functional cis-eQTL SNVs are more likely to alter TF binding sites than rare SNVs in the human population. In addition, we show that cancer somatic mutations have different effects on TF binding sites from different TF families on a cancer-type basis. Finally, we discuss the relationship between these results and cancer mutational signatures. Altogether, we provide a blueprint to study the impact of SNVs derived from genetic variation or disease association on TF binding to gene regulatory regions.

A gain-of-function single nucleotide variant creates a new promoter which acts as an orientation-dependent enhancer-blocker

Nature Communications ◽

10.1038/s41467-021-23980-6 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Yavor K. Bozhilov ◽

Damien J. Downes ◽

Jelena Telenius ◽

A. Marieke Oudelaar ◽

Emmanuel N. Olivier ◽

...

Keyword(s):

Gene Expression ◽

Genetic Diseases ◽

Regulatory Elements ◽

Base Change ◽

Dependent Manner ◽

Globin Genes ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Super Enhancer ◽

Single Base Change

Abstract Many single nucleotide variants (SNVs) associated with human traits and genetic diseases are thought to alter the activity of existing regulatory elements. Some SNVs may also create entirely new regulatory elements which change gene expression, but the mechanism by which they do so is largely unknown. Here we show that a single base change in an otherwise unremarkable region of the human α-globin cluster creates an entirely new promoter and an associated unidirectional transcript. This SNV downregulates α-globin expression causing α-thalassaemia. Of note, the new promoter lying between the α-globin genes and their associated super-enhancer disrupts their interaction in an orientation-dependent manner. Together these observations show how both the order and orientation of the fundamental elements of the genome determine patterns of gene expression and support the concept that active genes may act to disrupt enhancer-promoter interactions in mammals as in Drosophila. Finally, these findings should prompt others to fully evaluate SNVs lying outside of known regulatory elements as causing changes in gene expression by creating new regulatory elements.

Combined Metabolome and Transcriptome Profiling Reveal Optimal Harvest Strategy Model Based on Different Production Purposes in Olive

Foods ◽

10.3390/foods10020360 ◽

2021 ◽

Vol 10 (2) ◽

pp. 360

Author(s):

Guodong Rao ◽

Jianguo Zhang ◽

Xiaoxia Liu ◽

Xue Li ◽

Chenhe Wang

Keyword(s):

Gene Expression ◽

Olive Oil ◽

Expression Patterns ◽

Transcriptome Profiling ◽

Minor Components ◽

Harvest Time ◽

Rna Seq ◽

Optimal Harvest ◽

Harvest Strategy ◽

Different Color

Olive oil has been favored as a high-quality edible oil because it contains balanced fatty acids (FAs) and high levels of minor components. The contents of FAs and minor components vary among olive fruits of different color at harvest time, which makes it difficult to determine the optimal harvest strategy for olive oil production. Here, we combined metabolome, PacBio Iso-seq, and Illumina RNA-seq transcriptome data to investigate the association between metabolites and gene expression of olive fruits at harvest time. A total of 34 FAs, 12 minor components, and 181 other metabolites (including organic acids, polyols, amino acids, and sugars) were identified in this study. Moreover, we proposed optimal olive harvesting strategy models based on different production purposes. In addition, we used the combined PacBio Iso-seq and Illumina RNA-seq gene expression data to identify genes related to the biosynthetic pathways of hydroxytyrosol and oleuropein. These data lay the foundation for future investigations of olive fruit metabolism and gene expression patterns, and provide a method to obtain olive harvesting strategies for different production purposes.

Single Nucleotide Variants in Transcription Factors Associate More Tightly with Phenotype than with Gene Expression

PLoS Genetics ◽

10.1371/journal.pgen.1004325 ◽

2014 ◽

Vol 10 (5) ◽

pp. e1004325 ◽

Cited By ~ 10

Author(s):

Priya Sudarsanam ◽

Barak A. Cohen

Keyword(s):

Gene Expression ◽

Transcription Factors ◽

Single Nucleotide Variants ◽

Single Nucleotide

Gene Expression Patterns, Prognostic and Diagnostic Markers, and Lung Cancer Biology

CHEST Journal ◽

10.1378/chest.125.5_suppl.111s-a ◽

2004 ◽

Vol 125 (5) ◽

pp. 111S-115S ◽

Cited By ~ 11

Author(s):

Naftali Kaminski ◽

Meir Krupsky

Keyword(s):

Gene Expression ◽

Lung Cancer ◽

Cancer Biology ◽

Expression Patterns ◽

Diagnostic Markers ◽

Gene Expression Patterns

Bayesian Framework for Detecting Gene Expression Outliers in Individual Samples

JCO Clinical Cancer Informatics ◽

10.1200/cci.19.00095 ◽

2020 ◽

pp. 160-170

Author(s):

John Vivian ◽

Jordan M. Eizenga ◽

Holly C. Beale ◽

Olena M. Vaske ◽

Benedict Paten

Keyword(s):

Gene Expression ◽

Expression Patterns ◽

Rna Seq ◽

Single Patient ◽

Tissue Samples ◽

Composite Tissue ◽

Patient Sample ◽

Statistical Framework ◽

Therapeutic Leads ◽

Upregulated Genes

PURPOSE Many antineoplastics are designed to target upregulated genes, but quantifying upregulation in a single patient sample requires an appropriate set of samples for comparison. In cancer, the most natural comparison set is unaffected samples from the matching tissue, but there are often too few available unaffected samples to overcome high intersample variance. Moreover, some cancer samples have misidentified tissues of origin or even composite-tissue phenotypes. Even if an appropriate comparison set can be identified, most differential expression tools are not designed to accommodate comparisons to a single patient sample. METHODS We propose a Bayesian statistical framework for gene expression outlier detection in single samples. Our method uses all available data to produce a consensus background distribution for each gene of interest without requiring the researcher to manually select a comparison set. The consensus distribution can then be used to quantify over- and underexpression. RESULTS We demonstrate this method on both simulated and real gene expression data. We show that it can robustly quantify overexpression, even when the set of comparison samples lacks ideally matched tissue samples. Furthermore, our results show that the method can identify appropriate comparison sets from samples of mixed lineage and rediscover numerous known gene-cancer expression patterns. CONCLUSION This exploratory method is suitable for identifying expression outliers from comparative RNA sequencing (RNA-seq) analysis for individual samples, and Treehouse, a pediatric precision medicine group that leverages RNA-seq to identify potential therapeutic leads for patients, plans to explore this method for processing its pediatric cohort.
