Reference genome and annotation updates lead to contradictory prognostic predictions in gene expression signatures: a case study of resected stage I lung adenocarcinoma

Author(s):  
Zheyang Zhang ◽  
Sainan Zhang ◽  
Xin Li ◽  
Zhangxiang Zhao ◽  
Changjing Chen ◽  
...  

Abstract RNA-sequencing enables accurate and low-cost transcriptome-wide detection. However, expression estimates vary as reference genomes and gene annotations are updated, confounding existing expression-based prognostic signatures. Herein, a prognostic 9-gene pair signature (9-GPS) was applied to 197 patients with stage I lung adenocarcinoma, using the previous and latest versions of data from The Cancer Genome Atlas (TCGA), which were processed with different reference genomes and annotations. For 9-GPS, 6.6% of patients exhibited discordant risk classifications between the two TCGA versions. Similar results were observed for other prognostic signatures, including IRGPI, 15-gene and ORACLE. We found that conflicting annotations for gene length and overlap were the major cause of the discordant risk classifications. Therefore, we constructed a prognostic 40-GPS based on genes that were stable across GENCODE v20–v30 and validated it using public data from 471 stage I samples (log-rank P < 0.0010). Risk classification remained stable in RNA-sequencing data processed with the newest GENCODE v32 versus GENCODE v20–v30. Specifically, 40-GPS could predict survival for 30 stage I samples from formalin-fixed paraffin-embedded tissues (log-rank P = 0.0177). In conclusion, this method overcomes the vulnerability of existing prognostic signatures to reference genome and annotation updates. 40-GPS may offer individualized clinical applications owing to its prognostic accuracy and classification stability.
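To make the notion of a discordant risk classification concrete, the following is a minimal sketch of how a gene pair signature might assign risk and how discordance between two data versions could be measured. The abstract does not state the decision rule, so the majority vote, the direction of each vote (first gene higher than second counts as a high-risk vote), and the gene names G1 to G6 are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]


def classify_patient(expr: Dict[str, float], pairs: List[GenePair]) -> str:
    """Classify one patient by majority vote over gene pairs (assumed rule).

    A pair (a, b) casts a high-risk vote when expr[a] > expr[b].
    Only the within-sample ordering matters, not the absolute values.
    """
    votes = sum(1 for a, b in pairs if expr[a] > expr[b])
    return "high" if votes > len(pairs) / 2 else "low"


def discordance_rate(labels_old: List[str], labels_new: List[str]) -> float:
    """Fraction of patients whose risk label differs between two versions."""
    if len(labels_old) != len(labels_new) or not labels_old:
        raise ValueError("label lists must be non-empty and of equal length")
    mismatches = sum(1 for o, n in zip(labels_old, labels_new) if o != n)
    return mismatches / len(labels_old)


# Hypothetical three-pair signature and one patient quantified twice.
pairs = [("G1", "G2"), ("G3", "G4"), ("G5", "G6")]
expr_old = {"G1": 5.0, "G2": 1.0, "G3": 4.0, "G4": 2.0, "G5": 1.0, "G6": 3.0}
# Suppose a re-annotation changes the estimate for G3, flipping one pair.
expr_new = {"G1": 5.0, "G2": 1.0, "G3": 1.0, "G4": 2.0, "G5": 1.0, "G6": 3.0}

label_old = classify_patient(expr_old, pairs)
label_new = classify_patient(expr_new, pairs)
```

In this toy case a single reversed pair is enough to move the patient from one risk group to the other, which is the kind of instability the abstract reports. Under the same definition, a 6.6% discordance rate among 197 patients corresponds to roughly 13 patients.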

Genes ◽  
2021 ◽  
Vol 12 (1) ◽  
pp. 120
Author(s):  
Yiyun Sun ◽  
Dandan Xu ◽  
Chundong Zhang ◽  
Yitao Wang ◽  
Lian Zhang ◽  
...  

We previously demonstrated that proline-rich protein 11 (PRR11) and spindle and kinetochore associated 2 (SKA2) constitute a head-to-head gene pair driven by a prototypical bidirectional promoter. This gene pair synergistically promoted the development of non-small cell lung cancer. However, the signaling pathways leading to the ectopic expression of this gene pair remain obscure. In the present study, we first analyzed lung squamous cell carcinoma (LSCC) RNA sequencing data from The Cancer Genome Atlas (TCGA) database using correlation analysis of gene expression and gene set enrichment analysis (GSEA), which revealed that the list of genes correlated with PRR11-SKA2 closely resembled the Hedgehog (Hh) pathway activation-related gene set. Subsequently, the GLI1/2 inhibitor GANT-61 or GLI1/2-siRNA inhibited the Hh pathway in LSCC cells, concomitantly decreasing the expression levels of PRR11 and SKA2. Furthermore, the mRNA expression profile of LSCC cells treated with GANT-61 was assessed using RNA sequencing, revealing 397 differentially expressed genes (203 upregulated and 194 downregulated). Among them, one gene set, including BIRC5, NCAPG, CCNB2, and BUB1, was involved in cell division and interacted with both PRR11 and SKA2. These genes were verified as downregulated via RT-PCR, and their high expression correlated significantly with shorter overall survival of LSCC patients. Taken together, our results indicate that GLI1/2 mediates the expression of the PRR11-SKA2-centric gene set, which serves as an unfavorable prognostic indicator for LSCC patients and suggests new combinatorial diagnostic and therapeutic strategies in LSCC.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


2018 ◽  
Vol 35 (15) ◽  
pp. 2654-2656 ◽  
Author(s):  
Guoli Ji ◽  
Wenbin Ye ◽  
Yaru Su ◽  
Moliang Chen ◽  
Guangzao Huang ◽  
...  

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity. However, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types with a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events than the competing method, with comparable or higher accuracy. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.


Lung Cancer ◽  
2020 ◽  
Vol 147 ◽  
pp. 56-63
Author(s):  
Yoshiteru Kidokoro ◽  
Tomohiko Sakabe ◽  
Tomohiro Haruki ◽  
Taichi Kadonaga ◽  
Kanae Nosaka ◽  
...  

2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Qidong Cai ◽  
Boxue He ◽  
Pengfei Zhang ◽  
Zhenyu Zhao ◽  
Xiong Peng ◽  
...  

Abstract Background Alternative splicing (AS) plays critical roles in generating protein diversity and complexity. Dysregulation of AS underlies the initiation and progression of tumors. Machine learning approaches have emerged as efficient tools to identify promising biomarkers. It is meaningful to explore pivotal AS events (ASEs) via machine learning algorithms to deepen understanding and improve prognostic assessment of lung adenocarcinoma (LUAD). Method RNA sequencing data and AS data were extracted from The Cancer Genome Atlas (TCGA) database and the TCGA SpliceSeq database. Using several machine learning methods, we identified 24 pairs of LUAD-related ASEs implicated in splicing switches and a random forest-based classifier, consisting of 12 ASEs, for identifying lymph node metastasis (LNM). Furthermore, we identified key prognosis-related ASEs and established a 16-ASE-based prognostic model to predict overall survival for LUAD patients using a Cox regression model, random survival forest analysis, and a forward selection model. Bioinformatics analyses were also applied to identify underlying mechanisms and associated upstream splicing factors (SFs). Results Each pair of ASEs was spliced from the same parent gene and exhibited perfect inverse intrapair correlation (correlation coefficient = −1). The 12-ASE-based classifier showed a robust ability to evaluate the LNM status of LUAD patients, with an area under the receiver operating characteristic (ROC) curve (AUC) of more than 0.7 in fivefold cross-validation. The prognostic model performed well at 1, 3, 5, and 10 years in both the training cohort and the internal test cohort. Univariate and multivariate Cox regression indicated that the prognostic model could be used as an independent prognostic factor for patients with LUAD. Further analysis revealed correlations between the prognostic model and American Joint Committee on Cancer stage, T stage, N stage, and living status. The splicing network constructed from survival-related SFs and ASEs depicts the regulatory relationships between them. Conclusion In summary, our study provides insight into LUAD research and management based on these AS biomarkers.
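For readers unfamiliar with the AUC figure quoted above, here is a minimal sketch of what the metric measures: the probability that a randomly chosen positive case (for example, a patient with LNM) receives a higher classifier score than a randomly chosen negative case, with ties counted as one half. The labels and scores below are invented for illustration and are not taken from the study.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive cases, 0 for negative cases.
    scores: classifier outputs, higher meaning more likely positive.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two positives, two negatives.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value above 0.7 indicates that positives outrank negatives in more than 70% of such pairs.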


Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Abstract Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the “reference flow” alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance, but with 14% of the memory footprint and 5.5 times the speed.


2021 ◽  
Vol 118 (20) ◽  
pp. e2101056118
Author(s):  
Danang Crysnanto ◽  
Alexander S. Leonard ◽  
Zih-Hua Fang ◽  
Hubert Pausch

Many genomic analyses start by aligning sequencing reads to a linear reference genome. However, linear reference genomes are imperfect: they lack millions of bases of unknown relevance and cannot reflect the genetic diversity of populations. This makes reference-guided methods susceptible to reference-allele bias. To overcome such limitations, we build a pangenome from six reference-quality assemblies from taurine and indicine cattle as well as yak. The pangenome contains an additional 70,329,827 bases compared to the Bos taurus reference genome. Our multiassembly approach reveals 30 and 10.1 million bases private to yak and indicine cattle, respectively, and between 3.3 and 4.4 million bases unique to each taurine assembly. Utilizing transcriptomes from 56 cattle, we show that these nonreference sequences encode transcripts that hitherto remained undetected from the B. taurus reference genome. We uncover genes, primarily encoding proteins contributing to immune response and pathogen-mediated immunomodulation, that are differentially expressed between Mycobacterium bovis–infected and noninfected cattle and are also undetectable in the B. taurus reference genome. Using whole-genome sequencing data of cattle from five breeds, we show that reads that were previously misaligned against the B. taurus reference genome now align accurately to the pangenome sequences. This enables us to discover 83,250 polymorphic sites that segregate within and between breeds of cattle and capture genetic differentiation across breeds. Our work makes a so-far unused source of variation amenable to genetic investigations and provides methods and a framework for establishing and exploiting a more diverse reference genome.


2018 ◽  
Vol 50 (4) ◽  
pp. 1474-1495 ◽  
Author(s):  
Rong-Quan He ◽  
Zhi-Guang Huang ◽  
Tian-Yu Li ◽  
Yan-Ping Wei ◽  
Gang Chen ◽  
...  

Background/Aims: Recent advances in high-throughput data technologies, including RNA-sequencing, have led to the identification of long non-coding RNAs (lncRNAs) with potential clinical application in bladder urothelial cancer (BLCA). However, most previous studies focused on the clinical value of individual lncRNAs, which has limited the potential for future clinical application. Methods: In this study, RNA-sequencing data of lncRNAs were downloaded from The Cancer Genome Atlas database. A risk score was constructed based on survival-associated lncRNAs identified using differential expression analysis as well as univariate and multivariate Cox proportional hazards regression analysis. Receiver operating characteristic and Kaplan-Meier curve analyses were employed to evaluate the clinical and prognostic value of the risk score. Bioinformatics analyses were used to investigate the potential mechanisms of newly identified lncRNAs. Results: Among 2,127 differentially expressed lncRNAs (DELs), four new lncRNAs (AC145124.1, AC010168.2, MIR200CHG, and AC098613.1) showed valuable prognostic effects in BLCA patients. More importantly, the four-DEL-based risk score had the potential to become an independent marker for predicting the survival status of BLCA patients. Distinct co-expressed genes and signaling pathways were identified when BLCA was categorized into low- and high-risk groups. Furthermore, a protein-coding gene, HIST4H4, was found only 68 bp from the AC010168.2 DEL. The HIST4H4 expression level was evidently up-regulated and positively correlated with AC010168.2 in BLCA patients. Conclusion: This in silico investigation lays the groundwork for future investigation of the utility of prognostic lncRNAs for BLCA.
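As an illustration of how a Cox-based risk score of this kind is commonly built, the sketch below computes a linear combination of expression values weighted by regression coefficients and splits patients at the median score. The abstract does not give the formula, the coefficients, or the cutoff, so the linear form, the median split, the feature names lnc_A and lnc_B, and all numbers are assumptions for illustration only.

```python
from statistics import median
from typing import Dict


def risk_score(expr: Dict[str, float], coefs: Dict[str, float]) -> float:
    """Linear risk score: sum of coefficient * expression over signature genes."""
    return sum(coefs[gene] * expr[gene] for gene in coefs)


def stratify(scores: Dict[str, float]) -> Dict[str, str]:
    """Assign 'high' to patients strictly above the median score, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients (not from the study) and one patient's expression.
coefs = {"lnc_A": 0.5, "lnc_B": -1.0}
example_score = risk_score({"lnc_A": 2.0, "lnc_B": 1.0}, coefs)

# Hypothetical scores for three patients.
groups = stratify({"p1": 0.0, "p2": 1.0, "p3": 2.0})
```

A positive coefficient means higher expression raises the score, and a negative one lowers it. The resulting low- and high-risk groups are what Kaplan-Meier analysis would then compare.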

