Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples

Abstract Background: Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. Results: In this study, we compared the performance of the pseudoalignment methods Kallisto and Salmon, the alignment-based transcript quantification method RSEM, and the alignment-based gene quantification methods HTSeq and featureCounts, in combination with the read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification in both sample- and gene-level comparisons, regardless of RNA-Seq protocol type, choice of aligner, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. In contrast, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods; their quantification can be improved by using stranded protocols and pseudoalignment methods. Conclusions: Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
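The comparison against simulated ground truth described in this abstract can be illustrated with a rank correlation. The following is a minimal sketch only: the abstract does not state which correlation measure was used, so Spearman correlation is an assumption here, and the TPM values and variable names are hypothetical illustrations rather than data from the study.

```python
from math import sqrt

def ranks(values):
    """Return 1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical TPM values for five lncRNAs (illustrative only).
truth = [0.0, 1.5, 4.0, 10.0, 25.0]
estimate_a = [0.1, 1.2, 4.5, 9.0, 30.0]  # preserves the true ordering
estimate_b = [0.0, 0.0, 0.0, 12.0, 3.0]  # misses signal for several genes

rho_a = spearman(truth, estimate_a)
rho_b = spearman(truth, estimate_b)
```

In this toy setting, an estimate that keeps the true ordering reaches a correlation of 1, while one that drops signal for low-abundance genes scores lower, which mirrors the kind of underestimation the abstract reports for gene-counting methods.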


Benchmark of lncRNA Quantification for RNA-Seq of Cancer Samples

10.1101/241869 ◽

2018 ◽

Author(s):

Hong Zheng ◽

Kevin Brennan ◽

Mikel Hernaez ◽

Olivier Gevaert

Keyword(s):

Ground Truth ◽

The Cancer Genome Atlas ◽

Biological Processes ◽

Rna Seq ◽

Lncrna Expression ◽

Antisense Rnas ◽

Public Resources ◽

Non Coding Rnas ◽

Pan Cancer ◽

Expression Quantification

ABSTRACT Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. Many lncRNAs with tumor-suppressor or oncogenic functions in cancer have been discovered. While many studies have exploited public resources such as RNA-Seq data in The Cancer Genome Atlas (TCGA) to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification of lncRNAs. In this benchmarking study, we compared the performance of the pseudoalignment methods Kallisto and Salmon and the alignment-based methods HTSeq, featureCounts, and RSEM in lncRNA quantification, by applying them to a simulated RNA-Seq dataset and a pan-cancer RNA-Seq dataset from TCGA. We observed that full transcriptome annotation, including both protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment-based methods detect more lncRNAs than alignment-based methods and correlate highly with simulated ground truth. In contrast, alignment-based methods tend to underestimate lncRNA expression or even fail to capture lncRNA signal present in the ground truth. These underestimated genes include cancer-relevant lncRNAs such as TERC and ZEB2-AS1. Overall, 10–16% of lncRNAs can be detected in the samples, with antisense RNAs and lincRNAs the two most abundant categories. A higher proportion of antisense RNAs are detected than lincRNAs. Moreover, among the expressed lncRNAs, more antisense RNAs are discordant from ground truth than lincRNAs when measured by alignment-based methods, indicating that antisense RNAs are more susceptible to mis-quantification. In addition, lncRNAs with fewer transcripts, fewer than three exons, and lower sequence uniqueness tend to be more discordant. In summary, pseudoalignment methods Kallisto or Salmon in combination with the full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
AUTHOR SUMMARY Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. Our benchmarking work on both a simulated RNA-Seq dataset and a pan-cancer dataset provides timely and useful recommendations for the wider research community studying lncRNAs, especially for those exploring public resources such as TCGA RNA-Seq data. We demonstrate that using full transcriptome annotation in RNA-Seq analysis is strongly recommended, as it greatly improves the specificity of lncRNA quantification. Moreover, the pseudoalignment methods Kallisto and Salmon outperform alignment-based methods in lncRNA quantification. It is worth noting that the default workflow for TCGA RNA-Seq data stored in the Genomic Data Commons (GDC) data portal uses HTSeq, an alignment-based method. Thus, reanalyzing the data might be considered when checking gene expression in TCGA datasets. In summary, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.


Emerging Roles of Estrogen-Regulated Enhancer and Long Non-Coding RNAs

International Journal of Molecular Sciences ◽

10.3390/ijms21103711 ◽

2020 ◽

Vol 21 (10) ◽

pp. 3711

Author(s):

Melina J. Sedano ◽

Alana L. Harrison ◽

Mina Zilaie ◽

Chandrima Das ◽

Ramesh Choudhari ◽

...

Keyword(s):

Rna Sequencing ◽

Expression Patterns ◽

Biological Significance ◽

Rna Seq ◽

Biological Functions ◽

Protein Coding ◽

Rna Molecules ◽

Non Coding Rna ◽

Genome Wide ◽

Non Coding Rnas

Genome-wide RNA sequencing has shown that only a small fraction of the human genome is transcribed into protein-coding mRNAs. While once thought to be “junk” DNA, recent findings indicate that the rest of the genome encodes many types of non-coding RNA molecules with a myriad of functions still being determined. Among the non-coding RNAs, long non-coding RNAs (lncRNA) and enhancer RNAs (eRNA) are found to be most copious. While their exact biological functions and mechanisms of action are currently unknown, technologies such as next-generation RNA sequencing (RNA-seq) and global nuclear run-on sequencing (GRO-seq) have begun deciphering their expression patterns and biological significance. In addition to their identification, it has been shown that the expression of long non-coding RNAs and enhancer RNAs can vary due to spatial, temporal, developmental, or hormonal variations. In this review, we explore newly reported information on estrogen-regulated eRNAs and lncRNAs and their associated biological functions to help outline their markedly prominent roles in estrogen-dependent signaling.


Homeolog expression quantification methods for allopolyploids

Briefings in Bioinformatics ◽

10.1093/bib/bby121 ◽

2018 ◽

Vol 21 (2) ◽

pp. 395-407 ◽

Cited By ~ 6

Author(s):

Tony C Y Kuo ◽

Masaomi Hatakeyama ◽

Toshiaki Tameshige ◽

Kentaro K Shimizu ◽

Jun Sese

Keyword(s):

Sequence Similarity ◽

Diploid Species ◽

Ground Truth ◽

Error Rates ◽

Rna Seq ◽

Gene Copies ◽

The Difference ◽

Downstream Analysis ◽

High Level ◽

Expression Quantification

Abstract Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi and plants, and is especially common in crop plants. There is an increasing interest in the study of allopolyploids because of advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches currently used for diploid species against subgenome-classification approaches, which map reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference between methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those detected by any other method in wheat. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that there are uncertainties in genome sequences and annotation that can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches, with EAGLE-RC having the highest precision.


Possible Human Papillomavirus 38 Contamination of Endometrial Cancer RNA Sequencing Samples in The Cancer Genome Atlas Database

Journal of Virology ◽

10.1128/jvi.00822-15 ◽

2015 ◽

Vol 89 (17) ◽

pp. 8967-8973 ◽

Cited By ~ 14

Author(s):

Majid Kazemian ◽

Min Ren ◽

Jian-Xin Lin ◽

Wei Liao ◽

Rosanne Spolski ◽

...

Keyword(s):

Endometrial Cancer ◽

Human Papillomavirus ◽

Rna Sequencing ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Clinical Samples ◽

Cross Contamination ◽

Rna Seq ◽

Cancer Genome Atlas ◽

Genome Atlas

ABSTRACT Viruses are causally associated with a number of human malignancies. In this study, we sought to identify new virus-cancer associations by searching RNA sequencing data sets from >2,000 patients, encompassing 21 cancers from The Cancer Genome Atlas (TCGA), for the presence of viral sequences. In agreement with previous studies, we found human papillomavirus 16 (HPV16) and HPV18 in oropharyngeal cancer and hepatitis B and C viruses in liver cancer. Unexpectedly, however, we found HPV38, a cutaneous form of HPV associated with skin cancer, in 32 of 168 samples from endometrial cancer. In 12 of the HPV38-positive (HPV38+) samples, we observed at least one paired read that mapped to both human and HPV38 genomes, indicative of viral integration into the host DNA, something not previously demonstrated for HPV38. The expression levels of HPV38 transcripts were relatively low, and all 32 HPV38+ samples belonged to the same experimental batch of 40 samples, whereas none of the other 128 endometrial carcinoma samples were HPV38+, raising doubts about the significance of the HPV38 association. Moreover, the HPV38+ samples contained the same 10 novel single nucleotide variations (SNVs), leading us to hypothesize that one patient was infected with this new isolate of HPV38, which was integrated into his/her genome and may have cross-contaminated other TCGA samples within batch 228. Based on our analysis, we propose guidelines to examine the batch effect, virus expression level, and SNVs as part of next-generation sequencing (NGS) data analysis for evaluating the significance of viral/pathogen sequences in clinical samples.
IMPORTANCE High-throughput RNA sequencing (RNA-Seq), followed by computational analysis, has vastly accelerated the identification of viral and other pathogenic sequences in clinical samples, but cross-contamination during the processing of the samples remains a major problem that can lead to erroneous conclusions. We found HPV38 sequences specifically present in RNA-Seq samples from endometrial cancer patients from TCGA, a virus not previously associated with this type of cancer. However, multiple lines of evidence suggest possible cross-contamination in these samples, which were processed together in the same batch. Despite this potential cross-contamination, our data indicate that we have detected a new isolate of HPV38 that appears to be integrated into the human genome. We also provide general guidelines for computational detection and interpretation of pathogen-disease associations.
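The batch concentration reported above (all 32 HPV38+ samples within one batch of 40, none among the other 128) can be checked against chance with a hypergeometric tail probability. This is a sketch under an assumption: the abstract does not say which statistical test, if any, the authors applied, and the function name below is hypothetical.

```python
from math import comb

def hypergeom_tail(total, positives, batch_size, observed):
    """P(X >= observed), where X is the number of positive samples that
    land in a batch of batch_size drawn at random from total samples."""
    upper = min(positives, batch_size)
    favourable = sum(
        comb(positives, k) * comb(total - positives, batch_size - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(total, batch_size)

# Counts taken from the abstract: 168 endometrial samples, 32 HPV38+,
# one batch of 40 samples containing all 32 positives.
p_value = hypergeom_tail(total=168, positives=32, batch_size=40, observed=32)
```

A vanishingly small probability under random assignment supports treating batch membership, rather than tumor type, as the likely explanation, which is the reasoning behind the proposed guideline to examine batch effects.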


Exact transcript quantification over splice graphs

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00184-7 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Cong Ma ◽

Hongyu Zheng ◽

Carl Kingsford

Keyword(s):

Transcriptome Assembly ◽

Generation Model ◽

Rna Seq ◽

New Approach ◽

Transcript Quantification ◽

Splice Graph ◽

Quantification Model ◽

Splice Junctions ◽

Model Graph ◽

Expression Quantification

Abstract Background: The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (Bioinformatics 30:2447–55, 2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes the length of single-end reads or paired-end fragments is fixed. Results: We provide an improvement of this model to handle variable-length reads or fragments and incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-the-art transcript quantification method. Conclusion: We propose a new approach for graph quantification, which is useful for modeling scenarios where the reference transcriptome is incomplete or not available and can be further used in transcriptome assembly or alternative splicing analysis.
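The statement that every full path in the splice graph is a possible transcript can be made concrete with a toy example. The sketch below only enumerates source-to-sink paths in a small directed acyclic graph; it is not the quantification model of the paper, and the graph, node labels, and function name are hypothetical.

```python
def full_paths(graph, source, sink):
    """Enumerate every source-to-sink path in a DAG given as an
    adjacency dict mapping each node to a list of successor nodes."""
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for successor in graph.get(node, []):
            stack.append((successor, path + [successor]))
    return sorted(paths)

# Hypothetical splice graph: S and T are artificial source and sink nodes,
# e1..e4 are exons, and edges are splice junctions.
toy_graph = {
    "S": ["e1"],
    "e1": ["e2", "e3"],
    "e2": ["e3", "e4"],
    "e3": ["e4"],
    "e4": ["T"],
}

candidate_transcripts = full_paths(toy_graph, "S", "T")
```

Even this four-exon graph yields three candidate transcripts, and the count can grow exponentially with the number of alternative junctions, which is why modeling junction abundances directly is attractive compared with listing transcripts explicitly.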


Nanopore sequencing of RNA and cDNA molecules expands the transcriptomic toolbox in prokaryotes

10.1101/2021.06.14.448286 ◽

2021 ◽

Author(s):

Felix Gruenberger ◽

Sebastien Ferreira-Cerca ◽

Dina Grohmann

Keyword(s):

Rna Sequencing ◽

Single Molecule ◽

High Throughput Sequencing ◽

Model Organism ◽

Cost Effective ◽

Rna Seq ◽

Sequencing Platform ◽

Transcript Quantification ◽

Oxford Nanopore ◽

Bacterial Model

High-throughput sequencing dramatically changed our view of transcriptome architectures and allowed for ground-breaking discoveries in RNA biology. Recently, sequencing of full-length transcripts based on the single-molecule sequencing platform from Oxford Nanopore Technologies (ONT) was introduced and is widely employed to sequence eukaryotic and viral RNAs. However, experimental approaches implementing this technique for prokaryotic transcriptomes remain scarce. Here, we present an experimental and bioinformatic workflow for ONT RNA-seq in the bacterial model organism Escherichia coli, which can be applied to any microorganism. Our study highlights critical steps of library preparation and computational analysis and compares the results to gold standards in the field. Furthermore, we comprehensively evaluate the applicability and advantages of different ONT-based RNA sequencing protocols, including direct RNA, direct cDNA, and PCR-cDNA. We find that cDNA-seq offers improved yield and accuracy without bias in quantification compared to direct RNA sequencing. Notably, cDNA-seq can be readily used for simultaneous transcript quantification, accurate detection of transcript 5′ and 3′ boundaries, and analysis of transcriptional units and transcriptional heterogeneity. In summary, we establish Nanopore RNA-seq as a ready-to-use tool allowing rapid, cost-effective, and accurate annotation of multiple transcriptomic features, thereby advancing it to become a standard method for RNA analysis in prokaryotes.


Frequent POLE-Driven Hypermutation in Ovarian Endometrioid Cancer Revealed by Mutational Signatures in RNA Sequencing

10.21203/rs.3.rs-145368/v1 ◽

2021 ◽

Author(s):

Jaime Davila ◽

Pritha Chanana ◽

Vivekananda Sarangi ◽

Zach Fogarty ◽

John Weroha ◽

...

Keyword(s):

Rna Sequencing ◽

Mayo Clinic ◽

Somatic Mutations ◽

Age At Onset ◽

The Cancer Genome Atlas ◽

Driver Mutations ◽

Rna Seq ◽

Mutational Signatures ◽

Mutational Signature ◽

Hotspot Mutations

Abstract Background: DNA polymerase epsilon (POLE) is encoded by the POLE gene, and POLE-driven tumors are characterized by high mutational rates. POLE-driven tumors are relatively common in endometrial and colorectal cancer, and their presence is increasingly recognized in ovarian cancer (OC) of endometrioid type. POLE-driven cases possess an abundance of TCT>TAT and TCG>TTG somatic mutations characterized by mutational signature 10 from the Catalog of Somatic Mutations in Cancer (COSMIC). By quantifying the contribution of COSMIC mutational signature 10 in RNA sequencing (RNA-seq) we set out to identify POLE-driven tumors in a set of unselected Mayo Clinic OC. Methods: Mutational profiles were calculated using expressed single-nucleotide variants (eSNV) in the Mayo Clinic OC tumors (n=195), The Cancer Genome Atlas (TCGA) OC tumors (n=419), and the Genotype-Tissue Expression (GTEx) normal ovarian tissues (n=84). Non-negative Matrix Factorization (NMF) of the mutational profiles inferred the contribution per sample of four distinct mutational signatures, one of which corresponds to COSMIC mutational signature 10. Results: In the Mayo Clinic OC cohort we identified six tumors with a predicted contribution from COSMIC mutational signature 10 of over five mutations per megabase. These six cases harbored known POLE hotspot mutations (P286R, S297F, V411L, and A456P) and were of endometrioid histotype (P=5e-04). These six tumors were hypermutated with a higher tumor mutation load (mean, 54.02 mutations per megabase) compared to non-POLE endometrioid OC cases (mean, 7.69 mutations per megabase; P=5e-04), and had an early onset (average age of patients at onset, 48.33 years) when compared to non-POLE endometrioid OC cohort (average age at onset, 60.13 years; P=.008). 
Samples from TCGA and GTEx had a low COSMIC signature 10 contribution (median 0.16 mutations per megabase; maximum 1.78 mutations per megabase) and carried no POLE hotspot mutations. Conclusions: From the largest cohort of RNA-seq from endometrioid OC to date (n=53), we identified six hypermutated samples likely driven by POLE (frequency, 11%). Our results suggest a clinical need to screen for POLE driver mutations in endometrioid OC, which can guide enrollment in immunotherapy clinical trials.
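The summary figures in this abstract follow from simple arithmetic, sketched below. The cohort counts and mean loads are taken from the abstract; the helper names and the example inputs to `mutations_per_megabase` are hypothetical, since the abstract does not give per-sample mutation counts or the size of the region over which variants were called.

```python
def percent(part, whole):
    """Share of part in whole, expressed as a percentage."""
    return 100.0 * part / whole

def mutations_per_megabase(n_mutations, callable_bases):
    """Mutation load normalised to one million callable bases."""
    return n_mutations / (callable_bases / 1_000_000)

# From the abstract: six POLE-driven tumors among 53 endometrioid OC cases.
frequency_pct = percent(6, 53)

# From the abstract: mean load of 54.02 versus 7.69 mutations per megabase.
fold_load = 54.02 / 7.69

# Hypothetical inputs, for illustration only.
example_load = mutations_per_megabase(540, 10_000_000)
```

The frequency rounds to the 11% quoted in the conclusions, and the POLE-driven tumors carry roughly seven times the mean mutation load of the non-POLE endometrioid cases.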


Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion

Scientific Reports ◽

10.1038/s41598-018-23226-4 ◽

2018 ◽

Vol 8 (1) ◽

Cited By ~ 41

Author(s):

Shanrong Zhao ◽

Ying Zhang ◽

Ramya Gamini ◽

Baohong Zhang ◽

David von Schack

Keyword(s):

Rna Sequencing ◽

Rna Seq ◽

Gene Quantification ◽

Rrna Depletion


The expression of long non-coding RNAs is associated with H3Ac and H3K4me2 changes regulated by the HDA6-LDL1/2 histone modification complex in Arabidopsis

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa066 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Fu-Yu Hung ◽

Chen Chen ◽

Ming-Ren Yen ◽

Jo-Wei Allison Hsieh ◽

Chenlong Li ◽

...

Keyword(s):

Histone Modification ◽

Enrichment Analysis ◽

Biological Processes ◽

Rna Seq ◽

Histone Deacetylase 6 ◽

Lncrna Expression ◽

Go Enrichment ◽

Non Coding Rnas ◽

Go Enrichment Analysis ◽

Low Expression

Abstract In recent years, eukaryotic long non-coding RNAs (lncRNAs) have been identified as important factors involved in a wide variety of biological processes, including histone modification, alternative splicing and transcription enhancement. The expression of lncRNAs is highly tissue-specific and is regulated by environmental stresses. Recently, a large number of plant lncRNAs have been identified, but very few of them have been studied in detail. Furthermore, the mechanism of lncRNA expression regulation remains largely unknown. Arabidopsis HISTONE DEACETYLASE 6 (HDA6) and LSD1-LIKE 1/2 (LDL1/2) can repress gene expression synergistically by regulating H3Ac/H3K4me. In this research, we performed RNA-seq and ChIP-seq analyses to further clarify the function of HDA6-LDL1/2. Our results indicated that the global expression of lncRNAs is increased in hda6/ldl1/2 and that this increased lncRNA expression is particularly associated with H3Ac/H3K4me2 changes. In addition, we found that HDA6-LDL1/2 is important for repressing lncRNAs that are not expressed or show low expression, which may be strongly associated with plant development. GO-enrichment analysis also revealed that the neighboring genes of the lncRNAs that are upregulated in hda6/ldl1/2 are associated with various developmental processes. Collectively, our results revealed that the expression of lncRNAs is associated with H3Ac/H3K4me2 changes regulated by the HDA6-LDL1/2 histone modification complex.


Homeolog expression quantification methods for allopolyploids

10.1101/426437 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tony Kuo ◽

Masaomi Hatakeyama ◽

Toshiaki Tameshige ◽

Kentaro K. Shimizu ◽

Jun Sese

Keyword(s):

Sequence Similarity ◽

Diploid Species ◽

Ground Truth ◽

Error Rates ◽

Rna Seq ◽

Gene Copies ◽

The Difference ◽

Downstream Analysis ◽

High Level ◽

Expression Quantification

Abstract Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi, and plants, and is especially common in crop plants. There is increasing interest in the study of allopolyploids due to advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches currently used for diploid species against subgenome-classification approaches, which map reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference between methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those detected by any other method. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that there are uncertainties in genome sequences and annotation that can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches, with EAGLE-RC having the highest precision.
