Impact of Variable RNA-Sequencing Depth on Gene Expression Signatures and Target Compound Robustness: Case Study Examining Brain Tumor (Glioma) Disease Progression

2018 ◽  
pp. 1-17 ◽  
Author(s):  
Alexey Stupnikov ◽  
Paul G. O’Reilly ◽  
Caitriona E. McInerney ◽  
Aideen C. Roddy ◽  
Philip D. Dunne ◽  
...  

Purpose: Gene expression profiling can uncover biologic mechanisms underlying disease and is important in drug development. RNA sequencing (RNA-seq) is routinely used to assess gene expression, but costs remain high. Sample multiplexing reduces RNA-seq costs; however, multiplexed samples have lower cDNA sequencing depth, which can hinder accurate differential gene expression detection. The impact of altered sequencing depth on RNA-seq–based downstream analyses is not known; one such analysis is gene expression connectivity mapping, which is used to identify potential therapeutic compounds for repurposing. Methods: In this study, published RNA-seq profiles from patients with brain tumor (glioma) were assembled into two disease progression gene signature contrasts for astrocytoma. Available treatments for glioma have limited effectiveness, rendering this a disease of poor clinical outcome. Gene signatures were subsampled to simulate sequencing alterations and analyzed in connectivity mapping to investigate target compound robustness. Results: Data loss to gene signatures led to the loss, gain, and consistent identification of significant connections. The most accurate gene signature contrast with consistent patient gene expression profiles was more resilient to data loss and identified robust target compounds. Target compounds lost included candidate compounds of potential clinical utility in glioma (eg, suramin, dasatinib). Lost connections may have been linked to low-abundance genes in the gene signature that closely characterized the disease phenotype. Consistently identified connections may have been related to highly expressed abundant genes that were ever-present in gene signatures, despite data reductions. Potential noise surrounding findings included false-positive connections that were gained as a result of gene signature modification with data loss.
Conclusion: Findings highlight the necessity for gene signature accuracy for connectivity mapping, which should improve the clinical utility of future target compound discoveries.
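
The abstract above describes subsampling to simulate reduced sequencing depth but does not give the procedure. As a rough illustration only, the sketch below uses binomial thinning of read counts, which is one common way to mimic shallower sequencing and is an assumption here, not the authors' method. The gene names and counts are hypothetical.

```python
import random


def thin_counts(counts, keep_fraction, seed=0):
    """Binomially thin read counts: each read is kept independently
    with probability keep_fraction (an assumed model of lower depth)."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)
    return {
        gene: sum(1 for _ in range(n) if rng.random() < keep_fraction)
        for gene, n in counts.items()
    }


# Hypothetical counts for one sample: a high-, mid-, and low-abundance gene.
counts = {"GENE_HIGH": 5000, "GENE_MID": 200, "GENE_LOW": 3}

for fraction in (1.0, 0.5, 0.1):
    print(fraction, thin_counts(counts, fraction))
```

Under this model a gene with only a few reads can easily drop to zero at low keep fractions while abundant genes remain detectable, which fits the abstract's suggestion that lost connections may involve low-abundance genes.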

Blood ◽  
2011 ◽  
Vol 118 (21) ◽  
pp. 805-805
Author(s):  
Carolina Terragna ◽  
Daniel Remondini ◽  
Sandra Durante ◽  
Marina Martello ◽  
Francesca Patriarca ◽  
...  

Abstract 805. Background. Achievement of CR is generally associated with improved clinical outcomes for patients (pts) with MM and represents a primary endpoint of current clinical trials. The GIMEMA Italian Myeloma Network designed a phase 3 study to demonstrate that the triplet VTD regimen was superior to a doublet such as thalidomide-dexamethasone (TD) as induction therapy prior to double ASCT for newly diagnosed MM. On an intention-to-treat basis, the rate of complete or near complete response (CR/nCR) was 31% for the 236 pts on VTD induction therapy, while it was 11% (p<0.0001) for the 238 pts on TD induction therapy. Since enhanced rates of CR/nCR achieved with VTD incorporated into ASCT resulted in extended progression-free survival, prediction of CR by pharmacogenomic tools is likely to be an important goal to prospectively select those pts who are more likely to benefit from a given therapy. Methods. For this purpose, in a molecular substudy to the main clinical study we assessed the ability of gene expression profile (GEP) to predict attainment of CR/nCR in 122 pts enrolled in the VTD arm of the study. Their characteristics at baseline, including cytogenetic abnormalities, were comparable with those of the whole population of 236 pts. Highly purified CD138+ plasma cells were obtained at diagnosis from each of these pts and were profiled for gene expression using the Affymetrix U133 Plus2.0 platform. In order to build a low-dimensional signature with optimal performance, genomic data were analyzed with an original algorithm that exploits quadratic discriminant analysis with a bottom-up approach that builds N-gene signatures starting from two-dimensional signatures. Gene models were applied to test datasets to predict achievement of either CR/nCR or less than nCR, and classification performances were validated by a leave-one-out crossvalidation procedure. Results.
Thirty-four pts out of the 122 (28%) who were included in the present analysis achieved a CR/nCR, while the remaining 88 patients failed this objective. The molecular approach described above allowed us to identify several gene signatures, among which we chose a 163-gene signature that provided a predictive capability of 79% sensitivity, 87% specificity, 71% positive predictive value (PPV) and 92% negative predictive value (NPV). These expression values were used in an unsupervised hierarchical clustering to stratify the population of 122 profiled pts into 3 well defined subgroups. Seventy-nine pts were included in subgroup A, while the remaining 43 pts were included in either subgroup B (n=22) or subgroup C (n=21). Notably, 19 out of the 34 CR/nCR pts (56%) clustered in subgroup B, whereas the remaining 15 pts were randomly distributed within subgroup A. Analysis of demographic and disease characteristics of the pts belonging to the 3 major subgroups revealed that in subgroup B the frequencies of pts carrying del(13q) (78%) or del(17p) (22%) or with an IgA isotype (54%) were significantly higher in comparison with the corresponding values found in subgroup A (47%, 4%, and 10%, respectively) and subgroup C (38%, 10%, and 5%, respectively). In order to obtain a more feasible set of genes predictive of CR/nCR, several smaller signatures originating from the 163-gene signature were further analyzed by means of the same algorithm described above. The best predictive capability was obtained with a 41-gene signature that provided 88% sensitivity, 97% specificity, 91% PPV and 95% NPV. A GeneGo ® network analysis of genes included in the signatures showed that the most relevant network nodes included tumour suppressor genes (FBXW7 and MAD), genes involved in inflammatory response (TREM1 and TLR4) and genes involved in B cell development (IKZF1, IL10 and NFAM1).
Genes included in the signatures do not gather in specific chromosomes, thus confirming the absence of bias in the selection of signature genes, potentially due to prevalence of MM typical chromosomal aberrations. Conclusions. GEP analysis of a subgroup of pts who received VTD induction therapy yielded a 41-gene signature that was able to predict attainment of CR/nCR and, conversely, failure to achieve at least nCR in 91% and 95% of cases, respectively. These favorable results might represent a first step towards the possible application of a tailored therapy based on the single patient's genetic background. Supported by: Fondazione Del Monte di Bologna e Ravenna, Ateneo RFO grants (M.C.) BolognAIL. Disclosures: Bringhen: Celgene: Honoraria; Janssen-Cilag: Honoraria; Novartis: Honoraria; Merck Sharp & Dohme: Membership on an entity's Board of Directors or advisory committees. Offidani: Janssen: Honoraria; Celgene: Honoraria.
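
The four performance figures quoted for the 41-gene signature all follow from one 2x2 confusion table. The abstract does not report that table, so the counts below are an inferred assumption: with 34 CR/nCR and 88 non-CR/nCR patients, 30 true positives and 85 true negatives reproduce the reported percentages to within about one point. A minimal sketch of the arithmetic:

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard diagnostic metrics from confusion-table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # responders correctly predicted
        "specificity": tn / (tn + fp),  # non-responders correctly predicted
        "ppv": tp / (tp + fp),          # predicted responders who responded
        "npv": tn / (tn + fn),          # predicted non-responders who did not
    }


# Inferred (not reported) counts: 34 CR/nCR pts and 88 pts with less than nCR.
metrics = classification_metrics(tp=30, fn=4, tn=85, fp=3)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because PPV and NPV depend on the 28% response rate in this cohort, the same sensitivity and specificity would give different predictive values in a population with a different CR/nCR rate.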


Author(s):  
Alemu Takele Assefa ◽  
Jo Vandesompele ◽  
Olivier Thas

Abstract Background: In gene expression studies, RNA sample pooling is sometimes considered because of budget constraints or lack of sufficient input material. Using microarray technology, RNA sample pooling strategies have been reported to optimize both the cost of data generation as well as the statistical power for differential gene expression (DGE) analysis. For RNA sequencing, with its different quantitative output in terms of counts and tunable dynamic range, the adequacy and empirical validation of RNA sample pooling strategies have not yet been evaluated. In this study, we comprehensively assessed the utility of pooling strategies in RNA-seq experiments using empirical and simulated RNA-seq datasets. Results: The data generating model in pooled experiments is defined mathematically to evaluate the mean and variability of gene expression estimates. The model is further used to examine the trade-off between the statistical power of testing for DGE and the data generating costs. Empirical assessment of pooling strategies is done through analysis of RNA-seq datasets under various pooling and non-pooling experimental settings. A simulation study is also used to rank experimental scenarios with respect to the rate of false and true discoveries in DGE analysis. The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined. Conclusion: For high within-group gene expression variability, small RNA sample pools are effective to reduce the variability and compensate for the loss of the number of replicates. Unlike the typical cost-saving strategies, such as reducing sequencing depth or number of RNA samples (replicates), an adequate pooling strategy is effective in maintaining the power of testing DGE for genes with low to medium abundance levels, along with a substantial reduction of the total cost of the experiment.
In general, pooling RNA samples or pooling RNA samples in conjunction with moderate reduction of the sequencing depth can be good options to optimize the cost and maintain the power.
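
The trade-off described above can be illustrated with a simple textbook-style variance model. This is an assumption for illustration, not the data generating model defined in the paper. It supposes each library's measurement is the average of `pool_size` individuals with biological variance `bio_var`, plus independent technical variance `tech_var`; the numeric values are hypothetical.

```python
def group_mean_variance(n_libraries, pool_size, bio_var, tech_var):
    """Variance of the estimated group mean when n_libraries are sequenced,
    each library being an equal-contribution pool of pool_size individuals."""
    if n_libraries < 1 or pool_size < 1:
        raise ValueError("n_libraries and pool_size must be at least 1")
    per_library = bio_var / pool_size + tech_var
    return per_library / n_libraries


# Hypothetical variance components (biological variability dominates).
BIO_VAR, TECH_VAR = 1.0, 0.1

designs = {
    "6 individuals, 6 libraries": (6, 1),
    "6 individuals, 3 pools of 2": (3, 2),
    "3 individuals, 3 libraries": (3, 1),
}
for label, (n_lib, q) in designs.items():
    print(label, round(group_mean_variance(n_lib, q, BIO_VAR, TECH_VAR), 4))
```

Under these assumed values, three pools of two recover most of the precision of six separate libraries at half the library count, whereas simply dropping to three individuals roughly doubles the variance.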


Blood ◽  
2021 ◽  
Vol 138 (Supplement 1) ◽  
pp. 1305-1305
Author(s):  
Kirk Cahill ◽  
Linchen Wang ◽  
Guanghao Liang ◽  
Qiancheng You ◽  
Chuanyuan Chen ◽  
...  

Abstract Introduction Acute myeloid leukemia (AML) is an aggressive disease with genetic and phenotypic heterogeneity that results in a highly variable response to standard chemotherapy. Azacitidine (AZA) is a hypomethylating agent (HMA) and has been investigated in combination with intensive chemotherapy as an epigenetic primer to sensitize leukemic cells to treatment. In a phase 1 trial, this regimen was safe and well-tolerated with overall response rate (CR+CRi) of 61% and complete remission rate of 41% (Cahill et al, Blood Adv 2020). Predictive biomarkers for response to this treatment strategy have not yet been identified. Since 5-hydroxymethylcytosine (5hmC) is an epigenetic biomarker in cancer, we hypothesized that Nano-5hmC-Seal sequencing technology may serve as a novel approach to identifying 5hmC profiles predictive of treatment response to epigenetic priming. Methods We performed RNA-seq gene expression and Nano-5hmC-Seal DNA profiling from peripheral blood/bone marrow samples of patients with high-risk AML to identify potential 5hmC profile biomarkers and gene expression changes (Figure 1A). Patients (n=46) were treated in a 3+3 dose-escalation scheme of AZA (37.5 mg/m², 50 mg/m², or 75 mg/m²) on days 1-5 followed by high-dose cytarabine (3000 mg/m²) and mitoxantrone (30 mg/m²) (AZA-HiDAC-Mito) on day 6 and day 10 in a phase 1 trial previously reported (Cahill et al, Blood Adv 2020). We compared pre-treatment RNA-seq gene expression and 5hmC DNA profiles between responders (CR+CRi) and non-responders, as well as between pre-treatment and after 5 days of AZA for individual patients. We used an XGBoost machine learning model in Python based on a training set of patients to develop a 5hmC gene signature to predict response to AZA-HiDAC-Mito in an independent test set of patients. We compared continuous variables with two-tailed Student's t-test and used the Kaplan-Meier method with log-rank test for survival analysis.
Results Thirty-three patients (72%) had adequate RNA samples for RNA-seq gene expression analysis. Eighteen responded to treatment (CR+CRi) and were enriched with gene expression patterns involved in cell-cell interaction and activation of cell cycle, while non-responders (n=15) had a higher expression of leukemic stem cell (LSC) signatures. There was no difference in gene expression profile when comparing pre-treatment samples to day 5 samples after AZA exposure. From the 5hmC profiling [n=40 (87%) patients with adequate samples], increased 5hmC in LSC genes was associated with treatment resistance to AZA-HiDAC-Mito (p=0.044). The number of differentially hydroxy-methylated genes (DhMGs) increased with higher doses of AZA exposure, suggesting a dose-dependent epigenetic effect from AZA. Patients with a greater number of DhMGs following 5 days of AZA treatment had improved survival (p=0.015) (Figure 1B). Using the 5hmC-based XGBoost machine learning model comparing 5hmC profiles between responders and non-responders from a training set of patients (n=22), we developed an 11-gene 5hmC pre-treatment signature (including SKP1, WNT8A, CYP2E1, and NBPF9) to predict treatment response. The model was highly effective in predicting response to therapy, with an area under the curve (AUC) of 0.86 in an independent test set of patients (n=18) treated with AZA-HiDAC-Mito (Figure 1C). Conclusion In patients with AML treated with AZA-HiDAC-Mito, a pre-treatment LSC gene expression signature enriched with 5hmC was associated with treatment resistance. More DhMGs at day 5 appear to be a dose-dependent epigenetic effect that is induced by AZA and is associated with longer survival despite the absence of an immediate change in gene expression levels. An 11-gene 5hmC pre-treatment signature may be a predictive biomarker for AZA-HiDAC-Mito therapy and other HMA-based approaches. These findings warrant validation in a larger prospective trial.
Disclosures Zhang: Bristol-Myers Squibb: Current Employment. Stock: Pfizer: Consultancy, Honoraria, Research Funding; Amgen: Honoraria; Agios: Honoraria; Jazz: Honoraria; Kura: Honoraria; Kite: Honoraria; MorphoSys: Honoraria; Servier: Honoraria; Syndax: Consultancy, Honoraria; Pluristem: Consultancy, Honoraria. Odenike: Celgene, Incyte, AstraZeneca, Astex, NS Pharma, AbbVie, Gilead, Janssen, Oncotherapy, Agios, CTI/Baxalta, Aprea: Research Funding; AbbVie, Celgene, Impact Biomedicines, Novartis, Taiho Oncology, Takeda: Consultancy. He: Epican Genetech: Current holder of individual stocks in a privately-held company, Current holder of stock options in a privately-held company.
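
The AUC quoted in the abstract above summarizes how well predicted scores rank responders above non-responders. The sketch below computes AUC directly from that pairwise-ranking definition. It is not the study's XGBoost pipeline, and the labels and scores are hypothetical placeholders, not trial data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical test set: 1 = responder, 0 = non-responder.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(roc_auc(labels, scores))
```

With a test set as small as the one reported (n=18), a single re-ranked pair shifts the AUC noticeably, which is one reason the authors call for validation in a larger trial.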


2017 ◽  
Author(s):  
Andrew Dhawan ◽  
Alessandro Barberis ◽  
Wei-Chen Cheng ◽  
Enric Domingo ◽  
Catharine West ◽  
...  

Abstract With the increase in next generation sequencing generating large amounts of genomic data, gene expression signatures are becoming critically important tools, poised to make a large impact on the diagnosis, management and prognosis for a number of diseases. Increasingly, it is becoming necessary to determine whether a gene expression signature may apply to a dataset, but no standard quality control methodology exists. In this work, we introduce the first protocol, implemented in an R package sigQC, enabling a streamlined methodological and standardised approach for the quality control validation of gene signatures on independent data sets. The emphasis in this work is in showing the critical quality control steps involved in the generation of a clinically and biologically useful, transportable gene signature, including ensuring sufficient expression, variability, and autocorrelation of a signature. We demonstrate the application of the protocol in this work, showing how the outputs created from sigQC may be used for the evaluation of gene signatures on large-scale gene expression data in cancer.


2019 ◽  
Author(s):  
Yuumi Okuzono ◽  
Takashi Hoshino

Abstract The recent rise of microarray and next-generation sequencing in genome-related fields has simplified obtaining gene expression data at whole gene level, and biological interpretation of gene signatures related to life phenomena and diseases has become very important. However, the conventional method is numerical comparison of gene signature, pathway, and gene ontology (GO) overlap and distribution bias, and it is not possible to compare the specificity and importance of genes contained in gene signatures as humans do. This study proposes the gene signature vector (GsVec), a unique method for interpreting gene signatures that clarifies the semantic relationship between gene signatures by incorporating a method of distributed document representation from natural language processing (NLP). In the proposed algorithm, a gene-topic vector is created by multiplying the feature vector based on the gene’s distributed representation by the probability of the gene signature topic and the low frequency of occurrence of the corresponding gene in all gene signatures. These vectors are concatenated for genes included in each gene signature to create a signature vector. The degrees of similarity between signature vectors are obtained from the cosine distances, and the levels of relevance between gene signatures are quantified. Using the above algorithm, GsVec learned approximately 5,000 types of canonical pathway and GO biological process gene signatures published in the Molecular Signatures Database (MSigDB). Then, validation of the pathway database BioCarta with known biological significance and validation using actual gene expression data (differentially expressed genes) were performed, and both were able to obtain biologically valid results. In addition, the results compared with the pathway enrichment analysis in Fisher’s exact test used in the conventional method resulted in equivalent or more biologically valid signatures.
Furthermore, although NLP is generally developed in Python, GsVec can execute the entire process in only the R language, the main language of bioinformatics.
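
The final comparison step described above, scoring how similar two signature vectors are, can be sketched with plain cosine similarity. This illustrates only that one step and does not reproduce GsVec's gene-topic weighting or its R implementation. The vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical signature vectors in a shared embedding space.
signature_a = [1.0, 2.0, 3.0]
signature_b = [2.0, 4.0, 6.0]   # same direction as signature_a
signature_c = [3.0, 0.0, -1.0]  # orthogonal to signature_a

print(cosine_similarity(signature_a, signature_b))
print(cosine_similarity(signature_a, signature_c))
```

Cosine similarity ignores vector length, so two signatures pointing in the same direction score as maximally similar regardless of how many genes contributed to each.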


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2124 ◽  
Author(s):  
Iman Rezaeian ◽  
Eliseos J. Mucaki ◽  
Katherina Baranova ◽  
Huy Q. Pham ◽  
Dimo Angelov ◽  
...  

Genomic aberrations and gene expression-defined subtypes in the large METABRIC patient cohort have been used to stratify and predict survival. The present study used normalized gene expression signatures of paclitaxel drug response to predict outcome for different survival times in METABRIC patients receiving hormone (HT) and, in some cases, chemotherapy (CT) agents. This machine learning method, which distinguishes sensitivity vs. resistance in breast cancer cell lines and validates predictions in patients, was also used to derive gene signatures of other HT (tamoxifen) and CT agents (methotrexate, epirubicin, doxorubicin, and 5-fluorouracil) used in METABRIC. Paclitaxel gene signatures exhibited the best performance; however, the other agents also predicted survival with acceptable accuracies. A support vector machine (SVM) model of paclitaxel response containing the ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2, SLCO1B3, TUBB1, TUBB4A, TUBB4B genes was 78.6% accurate in 84 patients treated with both HT and CT (median survival ≥ 4.4 yr). Accuracy was lower (73.4%) in 304 untreated patients. The performance of other machine learning approaches was also evaluated at different survival thresholds. Minimum redundancy maximum relevance feature selection of a paclitaxel-based SVM classifier based on expression of ABCB11, ABCC1, BAD, BBC3 and BCL2L1 was 79% accurate in 53 CT patients. A random forest (RF) classifier produced a gene signature (ABCB11, ABCC1, BAD, BCL2, CYP2C8, CYP3A4, MAP4, MAPT, NR1I2, TUBB1, GBP1, OPRK1) that predicted >3 year survival with 82.4% accuracy in 420 HT patients. A similar RF gene signature showed 79.6% accuracy in 504 patients treated with CT and/or HT. These results suggest that tumor gene expression signatures refined by machine learning techniques can be useful for predicting survival after drug therapies.


Blood ◽  
2008 ◽  
Vol 112 (11) ◽  
pp. 1442-1442
Author(s):  
Damian Silbermins ◽  
Laura M. De Castro ◽  
Jude C Jonassaint ◽  
Shiaowen David Hsu ◽  
Marilyn J. Telen ◽  
...  

Abstract Pulmonary artery hypertension (PAH) occurs in 30–50% of adult patients with sickle cell disease (SCD), with mortality ranging from 16 to 50% and a median survival of 25 months. Our objective was to use gene expression profiling to develop a gene signature predictor for PAH through the analysis of gene expression of blood cells from SCD patients with or without PAH. We hypothesized that these gene signatures could allow us to identify patients at risk for PAH, as well as to generate hypotheses as to the pathophysiology of PAH in SCD. We used Affymetrix U133A2 GeneChip to determine the RNA expression of both whole blood and leukocytes using PAXgene and Leukolock methods, respectively. The study population included patients homozygous for HbS or with HbSβ0 thalassemia. Subjects with PAH were ≥18 years old, in steady state, and had PAH either by 2D echo (TR jet ≥ 2.7 m/sec) or right-sided catheterization (mean PA pressure ≥ 30 mmHg). Patients were excluded if they were pregnant, had co-existing rheumatologic conditions or other inflammatory diseases, were on chronic transfusion therapy or had had a vaso-occlusive episode in the previous 4 weeks. The control subjects were patients with SCD but without PAH (TR jet ≤ 1.8 m/sec or mean PA pressure <25 mmHg). Hierarchical clustering based on the gene expression pattern from 7 patients with PAH and 6 controls showed a trend for the clustering of SCD patients with PAH away from SCD patients without PAH. This trend was present for the gene expression in both whole blood and leukocytes. A Bayesian regression analysis was then performed to identify a set of predictor gene signatures for the PAH phenotype (Figure 1) in SCD. Finally, using gene set enrichment analysis, we found that the leukocytes from patients with PAH were highly enriched in the gene sets deriving from hematopoietic stem cells, corroborating the hypothesis of hyperhemolysis and higher blood cell turnover in this population.
Other pathways showing upregulation in PAH were PTEN, TGFβ, cyclin D1, WNT and PPAR. Although these data are preliminary, they suggest that PAH in SCD does indeed have a distinct gene signature profile that may become useful in identifying risk for PAH prospectively, as well as in directing further investigation into the pathogenesis of PAH in SCD.


2017 ◽  
Vol 35 (15_suppl) ◽  
pp. e17527-e17527
Author(s):  
Sandra Seby ◽  
Michael R. Rossi ◽  
Kelly R. Magliocca ◽  
Mihir Patel ◽  
Christopher C Griffith ◽  
...  

e17527 Background: Whole human exome sequencing (WES) has identified well characterized somatic mutations (such as TP53, CDKN2A, PIK3CA and HRAS) in patients with squamous cell carcinoma of the head and neck (SCCHN). We sought to optimize a combined RNA-seq and WES approach for identifying actionable mutations and gene expression signatures in p16+ versus p16- oropharyngeal squamous cell carcinoma (OPSCC). Methods: Relying on formalin-fixed, paraffin-embedded (FFPE) samples, we applied a minimal mutation and copy number content (151 genes) on DNA and an extensive RNA panel on a total of 27 OPSCC (22 p16+ and 5 p16-). SAMSeq was used to identify the differentially expressed genes. Unsupervised hierarchical clustering of the TCGA OPSCC samples with available p16 status (n = 31) was performed for external validation of the results. Statistical significance was further tested by Fisher’s exact test. Results: We identified a gene signature differentially expressed in p16+ and p16- OPSCC. External validation showed a significant association between gene expression and p16 status (P = 0.00033). We did not, however, find an association with mutation burden and smoking history. A number of pathways associated with this gene signature such as NCAM1 may have relevant biologic implications in OPSCC. Conclusions: Our results underscore the reliability of integrating data from FFPE samples in distinguishing gene signatures characteristic of p16+ versus p16- OPSCC; these signatures need to be further explored for their biologic relevance in OPSCC (This research was supported by a grant NCI R21 CA182661-01A1 to NFS and GZC).


2018 ◽  
Vol 20 (suppl_5) ◽  
pp. v358-v358
Author(s):  
Dr Alexey Stupnikov ◽  
Dr Caitríona McInerney ◽  
Dr Paul O’reilly ◽  
Aideen Roddy ◽  
Dr Philip Dunne ◽  
...  

GigaScience ◽  
2021 ◽  
Vol 10 (3) ◽  
Author(s):  
Holly C Beale ◽  
Jacquelyn M Roger ◽  
Matthew A Cattle ◽  
Liam T McKay ◽  
Drew K A Thompson ◽  
...  

Abstract Background The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) is dependent on the sequencing depth. While unmapped or non-exonic reads do not contribute to gene expression quantification, duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis. Findings In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1–77% of all reads (median [IQR], 3% [3–6%]); duplicate reads constitute 3–100% of mapped reads (median [IQR], 27% [13–43%]); and non-exonic reads constitute 4–97% of mapped, non-duplicate reads (median [IQR], 25% [16–37%]). MEND reads constitute 0–79% of total reads (median [IQR], 50% [30–61%]). Conclusions Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
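
The nested fractions reported above can be chained to estimate MEND reads from a total read count. The sketch below is a back-of-the-envelope calculation, not the authors' Docker-based pipeline. Plugging in the cohort medians is only a rough illustration, since medians of separate distributions do not multiply exactly, and the 50 million total is a hypothetical depth.

```python
def mend_reads(total_reads, unmapped_frac, duplicate_frac, non_exonic_frac):
    """Estimate mapped, exonic, non-duplicate (MEND) reads.

    unmapped_frac   : fraction of all reads that are unmapped
    duplicate_frac  : fraction of mapped reads that are duplicates
    non_exonic_frac : fraction of mapped, non-duplicate reads that are non-exonic
    """
    for frac in (unmapped_frac, duplicate_frac, non_exonic_frac):
        if not 0.0 <= frac <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    mapped = total_reads * (1.0 - unmapped_frac)
    mapped_non_duplicate = mapped * (1.0 - duplicate_frac)
    return mapped_non_duplicate * (1.0 - non_exonic_frac)


# Medians reported in the abstract: 3% unmapped, 27% duplicate, 25% non-exonic.
total = 50_000_000  # hypothetical sequencing depth
estimate = mend_reads(total, 0.03, 0.27, 0.25)
print(f"{estimate:,.0f} MEND reads ({estimate / total:.1%} of total)")
```

The product of the three median retention rates comes out near the reported median MEND fraction of 50%, which gives a quick sanity check on how nominal depth translates into informative depth.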

