scholarly journals Exact transcript quantification over splice graphs

2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Cong Ma ◽  
Hongyu Zheng ◽  
Carl Kingsford

Abstract Background The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (Bioinformatics 30:2447–55, 2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes the length of single-end reads or paired-end fragments is fixed. Results We provide an improvement of this model to handle variable-length reads or fragments and incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-art transcript quantification method. Conclusion We propose a new approach for graph quantification, which is useful for modeling scenarios where reference transcriptome is incomplete or not available and can be further used in transcriptome assembly or alternative splicing analysis.

2019 ◽  
Author(s):  
Cong Ma ◽  
Carl Kingsford

AbstractAlgorithms to infer isoform expression abundance from RNA-seq have been greatly improved in accuracy during the past ten years. However, due to incomplete reference transcriptomes, mapping errors, incomplete sequencing bias models, or mistakes made by the algorithm, the quantification model sometimes could not explain all aspects of the input read data, and misquantification can occur. Here, we develop a computational method to detect instances where a quantification model could not thoroughly explain the input. Specifically, our approach identifies transcripts where the read coverage has significant deviations from the expectation. We call these transcripts “expression anomalies”, and they represent instances where the quantification estimates may be in doubt. We further develop a method to attribute the cause of anomalies to either the incompleteness of the reference transcriptome or the algorithmic mistakes, and we show that our method precisely detects misquantifications with both causes. By correcting the misquantifications that are labeled as algorithmic mistakes, the number of false predictions of differentially expressed transcripts can be reduced. Applying anomaly detection to 30 GEUVADIS and 16 Human Body Map samples, we detect 103 genes with potential unannotated isoforms. These genes tend to be longer than average, and contain a very long exon near 3′ end that the unannotated isoform excludes. Anomaly detection is a new approach for investigating the expression quantification problem that may find wider use in other areas of genomics.


2019 ◽  
Author(s):  
Avi Srivastava ◽  
Laraib Malik ◽  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Fatemeh Almodaresi ◽  
...  

AbstractBackgroundThe accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy.ResultsWe investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.ConclusionWe observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


2021 ◽  
Author(s):  
Wenbin Guo ◽  
Max Coulter ◽  
Robbie Waugh ◽  
Runxuan Zhang

High quality transcriptome assembly using short reads from RNA-seq data still heavily relies upon reference-based approaches, of which the primary step is to align RNA-seq reads to a single reference genome of haploid sequence. However, it is increasingly apparent that while different genotypes within a species share core genes, they also contain variable numbers of specific genes that are only present a subset of individuals. Using a common reference may thus lead to a loss of genotype-specific information in the assembled transcript dataset and the generation of erroneous, incomplete or misleading transcriptomics analysis results. With the recent development of pan-genome information in many species, it is important that we understand the limitations of single genotype references for transcriptomics analysis. In this study, we quantitively evaluated the advantages of using genotype-specific reference genomes for transcriptome assembly and analysis using cultivated barley as a model. We mapped barley cultivar Barke RNA-seq reads to the Barke genome and to the cultivar Morex genome (common barley genome reference) to construct a genotype specific Reference Transcript Dataset (sRTD) and a common Reference Transcript Datasets (cRTD), respectively. We compared the two RTDs according to their transcript diversity, transcript sequence and structure similarity and the accuracy they provided for transcript quantification and differential expression analysis. Our evaluation shows that the sRTD has a significantly higher diversity of transcripts and alternative splicing events. Despite using a high-quality reference genome for assembly of the cRTD, we miss ca. 40% transcripts present in the sRTD and cRTD only has ca. 70% true assemblies. We found that the sRTD is more accurate for transcript quantification as well as differential expression and differential alternative splicing analysis. However, gene level quantification and comparative expression analysis are less affected by the source RTD, which indicates that analysing transcriptomic data at the gene level may be a reasonable compromise when a high-quality genotype-specific reference is not available.


2022 ◽  
Author(s):  
Sagnik Banerjee ◽  
Carson Andorf

Advancement in technology has enabled sequencing machines to produce vast amounts of genetic data, causing an increase in storage demands. Most genomic software utilizes read alignments for several purposes including transcriptome assembly and gene count estimation. Herein we present, ABRIDGE, a state-of-the-art compressor for SAM alignment files offering users both lossless and lossy compression options. This reference-based file compressor achieves the best compression ratio among all compression software ensuring lower space demand and faster file transmission. Central to the software is a novel algorithm that retains non-redundant information. This new approach has allowed ABRIDGE to achieve a compression 16% higher than the second-best compressor for RNA-Seq reads and over 35% for DNA-Seq reads. ABRIDGE also offers users the option to randomly access location without having to decompress the entire file. ABRIDGE is distributed under MIT license and can be obtained from GitHub and docker hub. We anticipate that the user community will adopt ABRIDGE within their existing pipeline encouraging further research in this domain.


GigaScience ◽  
2019 ◽  
Vol 8 (12) ◽  
Author(s):  
Hong Zheng ◽  
Kevin Brennan ◽  
Mikel Hernaez ◽  
Olivier Gevaert

Abstract Background Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. Results In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. Conclusions Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.


2018 ◽  
Author(s):  
Ping-Han Hsieh ◽  
Yen-Jen Oyang ◽  
Chien-Yu Chen

AbstractBackgroundCorrect quantification of transcript expression is essential to understand the functional products of the genome in different physiological conditions and developmental stages. Recently, the development of high-throughput RNA sequencing (RNA-Seq) allows the researchers to perform transcriptome analysis for the organisms without the reference genome and transcriptome. For such projects, de novo transcriptome assembly must be carried out prior to quantification. However, a large number of erroneous contigs produced by the assemblers might result in unreliable estimation on the abundance of transcripts. In this regard, this study comprehensively investigates how assembly quality affects the performance of quantification for RNA-Seq analysis based on de novo transcriptome assembly.ResultsSeveral important factors that might seriously affect the accuracy of the RNA-Seq analysis were thoroughly discussed. First, we found that the assemblers perform comparatively well for the transcriptomes with lower biological complexity. Second, we examined the over-extended and incomplete contigs, and then demonstrated that assembly completeness has a strong impact on the estimation of contig abundance. Lastly, we investigated the behavior of the quantifiers with respect to sequence ambiguity which might be originally present in the transcriptome or accidentally produced by assemblers. The results suggest that the quantifiers often over-estimate the expression of family-collapse contigs and under-estimate the expression of duplicated contigs. For organisms without reference transcriptome, it remained challenging to detect the inaccurate abundance estimation on family-collapse contigs. On the contrary, we observed that the situation of under-estimation on duplicated contigs can be warned through analyzing the read distribution of the duplicated contigs.ConclusionsIn summary, we explicated the behavior of quantifiers when erroneous contigs are present and we outlined the potential problems that the assemblers might cause for the downstream analysis of RNA-Seq. We anticipate the analytic results conducted in this study provides valuable insights for future development of transcriptome assembly and quantification.Availabilitywe proposed an open-source Python based package QuantEval that builds connected components for the assembled contigs based on sequence similarity and evaluates the quantification results for each connected component. The package can be downloaded from https://github.com/dn070017/QuantEval.


2019 ◽  
Author(s):  
Cong Ma ◽  
Hongyu Zheng ◽  
Carl Kingsford

AbstractCurrent expression quantification methods suffer from a fundamental but under-characterized type of error: the most likely estimates for transcript abundances are not unique. Current quantification methods rely on probabilistic models, and the scenario where it admits multiple optimal solutions is called non-identifiability. This means multiple estimates of transcript abundances generate the observed RNA-seq reads equally likely, and the underlying true expression cannot be determined. The non-identifiability problem is further exacerbated when incompleteness of reference transcriptome and existence of unannotated transcripts are taken into consideration. State-of-the-art quantification methods usually output a single inferred set of abundances, and the accuracy of the single expression set is unknown compared to other equally optimal solutions. Obtaining the set of equally optimal solutions is necessary for evaluating and extending downstream analyses to take non-identifiability into account. We propose methods to compute the range of equally optimal estimates for the expression of each transcript, accounting for non-identifiability of the quantification model using several novel graph theoretical approaches. It works under two scenarios, one assuming the reference transcriptome is complete, another assuming incomplete reference and allowing for expression of unannotated transcripts. Our methods calculate a “confidence” range for each transcript, representing its possible abundance across equally optimal estimates. This range can be used for evaluating the reliability of detected differentially expressed (DE) transcripts, as a large overlap of confidence range between DE conditions indicates the prediction may be unreliable due to uncertainty. We observe that 5 out of 257 DE predictions are unreliable on an MCF10 cell line and 19 out of 3152 are unreliable on a CD8 T cell dataset. The source code can be found at https://github.com/Kingsford-Group/subgraphquant.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Avi Srivastava ◽  
Laraib Malik ◽  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Fatemeh Almodaresi ◽  
...  

Abstract Background The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


2014 ◽  
Author(s):  
Li Song ◽  
Sarven Sabunciyan ◽  
Liliana D Florea

Next generation sequencing of cellular RNA is making it possible to characterize genes and alternative splicing in unprecedented detail. However, designing bioinformatics tools to capture splicing variation accurately has proven difficult. Current programs find major isoforms of a gene but miss finer splicing variations, or are sensitive but highly imprecise. We present CLASS, a novel open source tool for accurate genome-guided transcriptome assembly from RNA-seq reads. CLASS employs a splice graph to represent a gene and its splice variants, combined with a linear program to determine an accurate set of exons and efficient splice graph-based transcript selection algorithms. When compared against reference programs, CLASS had the best overall accuracy and could detect up to twice as many splicing events with precision similar to the best reference program. Notably, it was the only tool that produced consistently reliable transcript models for a wide range of applications and sequencing strategies, including very large data sets and ribosomal RNA-depleted samples. Lightweight and multi-threaded, CLASS required <3GB RAM and less than one day to analyze a 350 million read set, and is an excellent choice for transcriptomics studies, from clinical RNA sequencing, to alternative splicing analyses, and to the annotation of new genomes.


Sign in / Sign up

Export Citation Format

Share Document