Benchmark of lncRNA Quantification for RNA-Seq of Cancer Samples

2018 ◽  
Author(s):  
Hong Zheng ◽  
Kevin Brennan ◽  
Mikel Hernaez ◽  
Olivier Gevaert

ABSTRACT Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. Many lncRNAs with tumor-suppressor or oncogenic functions in cancer have been discovered. While many studies have exploited public resources such as RNA-Seq data in The Cancer Genome Atlas (TCGA) to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification of lncRNAs. In this benchmarking study, we compared the performance of pseudoalignment methods Kallisto and Salmon, and alignment-based methods HTSeq, featureCounts, and RSEM, in lncRNA quantification, by applying them to a simulated RNA-Seq dataset and a pan-cancer RNA-Seq dataset from TCGA. We observed that full transcriptome annotation, including both protein coding and noncoding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment-based methods detect more lncRNAs than alignment-based methods and correlate highly with simulated ground truth. On the contrary, alignment-based methods tend to underestimate lncRNA expression or even fail to capture lncRNA signal in the ground truth. These underestimated genes include cancer-relevant lncRNAs such as TERC and ZEB2-AS1. Overall, 10–16% of lncRNAs can be detected in the samples, with antisense and lincRNAs the two most abundant categories. A higher proportion of antisense RNAs are detected than lincRNAs. Moreover, among the expressed lncRNAs, more antisense RNAs are discordant from ground truth than lincRNAs when measured by alignment-based methods, indicating that antisense RNAs are more susceptible to mis-quantification. In addition, the lncRNAs with fewer transcripts, fewer than three exons, and lower sequence uniqueness tend to be more discordant.
In summary, pseudoalignment methods Kallisto or Salmon in combination with the full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.

AUTHOR SUMMARY Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. Our benchmarking work on both a simulated RNA-Seq dataset and a pan-cancer dataset provides timely and useful recommendations for the wide research community studying lncRNAs, especially for those who are exploring public resources such as TCGA RNA-Seq data. We demonstrate that using full transcriptome annotation in RNA-Seq analysis is strongly recommended, as it greatly improves the specificity of lncRNA quantification. What’s more, pseudoalignment methods Kallisto and Salmon outperform alignment-based methods in lncRNA quantification. It is worth noting that the default workflow for TCGA RNA-Seq data stored in the Genomic Data Commons (GDC) data portal uses HTSeq, an alignment-based method. Thus, reanalyzing the data might be considered when checking gene expression in TCGA datasets. In summary, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.


GigaScience ◽  
2019 ◽  
Vol 8 (12) ◽  
Author(s):  
Hong Zheng ◽  
Kevin Brennan ◽  
Mikel Hernaez ◽  
Olivier Gevaert

Abstract Background Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. Results In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. Conclusions Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
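The comparison against simulated ground truth described in this abstract can be illustrated with a rank correlation between true and estimated expression. The following is a minimal sketch only: the abstract does not state which correlation measure the authors used, so Spearman correlation is an assumption here, and the expression values and the names `estimate_a` and `estimate_b` are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression values for five lncRNAs (not from the paper).
truth = [0.0, 2.0, 5.0, 10.0, 40.0]
estimate_a = [0.1, 1.8, 5.5, 9.0, 42.0]  # tracks the ground truth closely
estimate_b = [0.0, 0.0, 0.0, 4.0, 30.0]  # misses low-expression lncRNAs

rho_a = spearman(truth, estimate_a)
rho_b = spearman(truth, estimate_b)
print(f"method A: {rho_a:.3f}, method B: {rho_b:.3f}")
```

In this toy example, the method that reports zero for lowly expressed lncRNAs loses rank information and therefore correlates less well with the ground truth, which is the kind of underestimation the abstract attributes to HTSeq and featureCounts.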



2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Fu-Yu Hung ◽  
Chen Chen ◽  
Ming-Ren Yen ◽  
Jo-Wei Allison Hsieh ◽  
Chenlong Li ◽  
...  

Abstract In recent years, eukaryotic long non-coding RNAs (lncRNAs) have been identified as important factors involved in a wide variety of biological processes, including histone modification, alternative splicing and transcription enhancement. The expression of lncRNAs is highly tissue-specific and is regulated by environmental stresses. Recently, a large number of plant lncRNAs have been identified, but very few of them have been studied in detail. Furthermore, the mechanism of lncRNA expression regulation remains largely unknown. Arabidopsis HISTONE DEACETYLASE 6 (HDA6) and LSD1-LIKE 1/2 (LDL1/2) can repress gene expression synergistically by regulating H3Ac/H3K4me. In this research, we performed RNA-seq and ChIP-seq analyses to further clarify the function of HDA6-LDL1/2. Our results indicated that the global expression of lncRNAs is increased in hda6/ldl1/2 and that this increased lncRNA expression is particularly associated with H3Ac/H3K4me2 changes. In addition, we found that HDA6-LDL1/2 is important for repressing lncRNAs that are not expressed or show low expression, which may be strongly associated with plant development. GO-enrichment analysis also revealed that the neighboring genes of the lncRNAs that are upregulated in hda6/ldl1/2 are associated with various developmental processes. Collectively, our results revealed that the expression of lncRNAs is associated with H3Ac/H3K4me2 changes regulated by the HDA6-LDL1/2 histone modification complex.



BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jinyu Zhang ◽  
Huanqing Xu ◽  
Yuming Yang ◽  
Xiangqian Zhang ◽  
Zhongwen Huang ◽  
...  

Abstract Background Phosphorus (P) is essential for plant growth and development, and low-phosphorus (LP) stress is a major factor limiting the growth and yield of soybean. Long noncoding RNAs (lncRNAs) have recently been reported to be key regulators in the responses of plants to stress conditions, but the mechanism through which LP stress mediates the biogenesis of lncRNAs in soybean remains unclear. Results In this study, to explore the response mechanisms of lncRNAs to LP stress, we used the roots of two representative soybean genotypes that present opposite responses to P deficiency, namely, a P-sensitive genotype (Bogao) and a P-tolerant genotype (NN94156), for the construction of RNA sequencing (RNA-seq) libraries. In total, 4,166 novel lncRNAs, including 525 differentially expressed (DE) lncRNAs, were identified from the two genotypes at different P levels. GO and KEGG analyses indicated that numerous DE lncRNAs might be involved in diverse biological processes related to phosphate, such as lipid metabolic processes, catalytic activity, cell membrane formation, signal transduction, and nitrogen fixation. Moreover, lncRNA-mRNA-miRNA and lncRNA-mRNA networks were constructed, and the results identified several promising lncRNAs that might be highly valuable for further analysis of the mechanism underlying the response of soybean to LP stress. Conclusions These results revealed that LP stress can significantly alter the genome-wide profiles of lncRNAs, particularly those of the P-sensitive genotype Bogao. Our findings increase the understanding of and provide new insights into the function of lncRNAs in the responses of soybean to P stress.



2018 ◽  
Vol 19 (10) ◽  
pp. 3250 ◽  
Author(s):  
Anna Sorrentino ◽  
Antonio Federico ◽  
Monica Rienzo ◽  
Patrizia Gazzerro ◽  
Maurizio Bifulco ◽  
...  

The PR/SET domain gene family (PRDM) encodes 19 different transcription factors that share a subtype of the SET domain [Su(var)3-9, enhancer-of-zeste and trithorax] known as the PRDF1-RIZ (PR) homology domain. This domain, with its potential methyltransferase activity, is followed by a variable number of zinc-finger motifs, which likely mediate protein–protein, protein–RNA, or protein–DNA interactions. Intriguingly, almost all PRDM family members express different isoforms, which likely play opposite roles in oncogenesis. Remarkably, several studies have described alterations in most of the family members in malignancies. Here, to obtain a pan-cancer overview of the genomic and transcriptomic alterations of PRDM genes, we reanalyzed the Exome- and RNA-Seq public datasets available at The Cancer Genome Atlas portal. Overall, PRDM2, PRDM3/MECOM, PRDM9, PRDM16 and ZFPM2/FOG2 were the most mutated genes, with pan-cancer frequencies of protein-affecting mutations higher than 1%. Moreover, we observed heterogeneity in the mutation frequencies of these genes across tumors, with some cancer types reaching about 20% of mutated samples for a specific PRDM gene. Of note, ZFPM1/FOG1 mutations occurred in 50% of adrenocortical carcinoma patients and were localized in a hotspot region. These findings, together with OncodriveCLUST results, suggest it could be putatively considered a cancer driver gene in this malignancy. Finally, transcriptome analysis from RNA-Seq data of paired samples revealed that transcription of PRDMs was significantly altered in several tumors. Specifically, PRDM12 and PRDM13 were largely overexpressed in many cancers whereas PRDM16 and ZFPM2/FOG2 were often downregulated. Some of these findings were also confirmed by real-time PCR on primary tumors.



PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e8797 ◽  
Author(s):  
Matthew Ung ◽  
Evelien Schaafsma ◽  
Daniel Mattox ◽  
George L. Wang ◽  
Chao Cheng

Background The “dark matter” of the genome harbors several non-coding RNA species including long non-coding RNAs (lncRNAs), which have been implicated in neoplasia but remain understudied. RNA-seq has provided deep insights into the nature of lncRNAs in cancer, but current RNA-seq data are rarely accompanied by longitudinal patient survival information. In contrast, a plethora of microarray studies have collected these clinical metadata, which can be leveraged to identify novel associations between gene expression and clinical phenotypes. Methods In this study, we developed an analysis framework that computationally integrates RNA-seq and microarray data to systematically screen 9,463 lncRNAs for association with mortality risk across 20 cancer types. Results In total, we identified a comprehensive list of associations between lncRNAs and patient survival and demonstrate that these prognostic lncRNAs are under selective pressure and may be functional. Our results provide valuable insights that facilitate further exploration of lncRNAs and their potential as cancer biomarkers and drug targets.



PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6388 ◽  
Author(s):  
Asanigari Saleembhasha ◽  
Seema Mishra

Despite years of research, we are still unraveling crucial stages of gene expression regulation in cancer. On the basis of major biological hallmarks, we hypothesized that there must be a uniform gene expression pattern and regulation across cancer types. Among non-coding genes, long non-coding RNAs (lncRNAs) are emerging as key gene regulators playing powerful roles in cancer. Using TCGA RNAseq data, we analyzed coding (mRNA) and non-coding (lncRNA) gene expression across 15 and 9 common cancer types, respectively. We identified 70 significantly differentially expressed genes common to all 15 cancer types. Correlating with protein expression levels from the Human Protein Atlas, we observed 34 positively correlated gene sets which are enriched in gene expression, transcription from RNA Pol-II, regulation of transcription and mitotic cell cycle biological processes. Further, 24 lncRNAs were among the common significantly differentially expressed non-coding genes. Using a guilt-by-association method, we predicted lncRNAs to be involved in the same biological processes. Combining RNA-RNA interaction prediction and transcription regulatory networks, we identified the E2F1, FOXM1 and PVT1 regulatory path as a recurring pan-cancer regulatory entity. PVT1 is predicted to interact with SYNE1 at the 3′-UTR; DNAJC9 and RNPS1 at the 5′-UTR; and ATXN2L, ALAD, FOXM1 and IRAK1 at CDS sites. The key finding is that, through the E2F1, FOXM1 and PVT1 regulatory axis and possible interactions with different coding genes, PVT1 may play a prominent role in pan-cancer development and progression.



PeerJ ◽  
2016 ◽  
Vol 3 ◽  
pp. e1499 ◽  
Author(s):  
Jordan Anaya ◽  
Brian Reon ◽  
Wei-Min Chen ◽  
Stefan Bekiranov ◽  
Anindya Dutta

Numerous studies have identified prognostic genes in individual cancers, but a thorough pan-cancer analysis has not been performed. In addition, previous studies have mostly used microarray data instead of RNA-SEQ, and have not published comprehensive lists of associations with survival. Using recently available RNA-SEQ and clinical data from The Cancer Genome Atlas for 6,495 patients, we have investigated every annotated and expressed gene’s association with survival across 16 cancer types. The most statistically significant harmful and protective genes were not shared across cancers, but were enriched in distinct gene sets which were shared across certain groups of cancers. These groups of cancers were independently recapitulated by both unsupervised clustering of Cox coefficients (a measure of association with survival) for individual genes, and for gene programs. This analysis has revealed unappreciated commonalities among cancers which may provide insights into cancer pathogenesis and rationales for co-opting treatments between cancers.



2021 ◽  
Author(s):  
Jinyu Zhang ◽  
Huanqing Xu ◽  
Yuming Yang ◽  
Xiangqian Zhang ◽  
Zhongwen Huang ◽  
...  

Abstract Background: Phosphorus (P) is essential for plant growth and development, and low-phosphorus (LP) stress is a major factor limiting the growth and yield of soybean. Long noncoding RNAs (lncRNAs) have recently been reported to be key regulators in the responses of plants to stress conditions, but the mechanism through which LP stress mediates the biogenesis of lncRNAs in soybean remains unclear. Results: In this study, to explore the response mechanisms of lncRNAs to LP stress, we used the roots of two representative soybean genotypes that present opposite responses to P deficiency, namely, a P-sensitive genotype (Bogao) and a P-tolerant genotype (NN94156), for the construction of RNA sequencing (RNA-seq) libraries. In total, 4,166 novel lncRNAs, including 525 differentially expressed (DE) lncRNAs, were identified from the two genotypes at different P levels. GO and KEGG analyses indicated that numerous DE lncRNAs might be involved in diverse biological processes related to phosphate, such as lipid metabolic processes, catalytic activity, cell membrane formation, signal transduction, and nitrogen fixation. Moreover, lncRNA-mRNA-miRNA and lncRNA-mRNA networks were constructed, and the results identified several promising lncRNAs that might be highly valuable for further analysis of the mechanism underlying the response of soybean to LP stress. Conclusions: These results revealed that LP stress can significantly alter the genome-wide profiles of lncRNAs, particularly those of the P-sensitive genotype Bogao. Our findings increase the understanding of and provide new insights into the function of lncRNAs in the responses of soybean to P stress.



2018 ◽  
Vol 21 (2) ◽  
pp. 395-407 ◽  
Author(s):  
Tony C Y Kuo ◽  
Masaomi Hatakeyama ◽  
Toshiaki Tameshige ◽  
Kentaro K Shimizu ◽  
Jun Sese

Abstract Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi and plants, and is especially common in crop plants. There is an increasing interest in the study of allopolyploids because of advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches currently used for diploid species against subgenome-classification approaches, which map reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference in methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those detected by any other method in wheat. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed uncertainties in genome sequences and annotation, which can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches, with EAGLE-RC having the highest precision.



2020 ◽  
Author(s):  
Xiao-Han Cui ◽  
Qiu-Ju Peng ◽  
Peng Gao ◽  
Xu-Dong Zhang ◽  
Ren-Zhi Li ◽  
...  

Abstract Background: Cancer is one of the most common causes of death, and its morbidity and mortality are gradually increasing worldwide. KIF20A plays an important role in tumors, but its immune relevance in pan-cancer needs further study. Methods: KIF20A-related information was downloaded from The Cancer Genome Atlas (TCGA). The RNA-seq data were collected in fragments per kilobase million (FPKM) format. The ESTIMATE algorithm was used to estimate the stromal and immune scores for 33 tumors. We then analyzed the correlation between KIF20A expression and immune checkpoints in pan-cancer and performed gene set enrichment analysis (GSEA) on the genes co-expressed with KIF20A in pan-cancer. Results: We confirmed that KIF20A expression correlates strongly with prognosis in 33 tumor types. KIF20A expression was related to a variety of immune cells and immune checkpoints. Further analysis based on the GSEA results showed that KIF20A is related to immune-related pathways in multiple tumors. Conclusion: We have demonstrated that KIF20A plays an important role in pan-cancer and could affect the occurrence or development of a variety of tumors. Moreover, KIF20A is related to immunity, and KIF20A-related immune research in pan-cancer needs to be further pursued.
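For readers unfamiliar with the FPKM unit mentioned in this abstract, the standard definition normalizes a gene's fragment count by gene length (in kilobases) and by sequencing depth (in millions of mapped fragments). The snippet below is a minimal sketch of that general formula, not the pipeline used by the authors or by TCGA; the function name `fpkm` and the example numbers are hypothetical.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (gene_length_bp * total_mapped_fragments)
    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp gene
# in a library of 25 million mapped fragments.
example = fpkm(fragments=500, gene_length_bp=2000, total_mapped_fragments=25_000_000)
print(example)
```

Because FPKM depends on the length assigned to each gene, the values obtained for a gene can differ between annotations that define its transcripts differently.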


