Towards selective-alignment: Bridging the accuracy gap between alignment-based and alignment-free transcript quantification

2017 ◽  
Author(s):  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Laraib Malik ◽  
Rob Patro

Abstract Motivation: We introduce an algorithm for selectively aligning high-throughput sequencing reads to a transcriptome, with the goal of improving transcript-level quantification. This algorithm attempts to bridge the gap between fast “mapping” algorithms and more traditional alignment procedures. Results: We adopt a hybrid approach that is able to increase mapping accuracy while still retaining much of the efficiency of fast mapping algorithms. To achieve this, we introduce a new approach that explores the candidate search space with high sensitivity as well as a collection of carefully-engineered heuristics to efficiently filter these candidates. Additionally, unlike the strategies adopted in most aligners which first align the ends of paired-end reads independently, we introduce a notion of co-mapping. This procedure exploits relevant information between the “hits” from the left and right ends of paired-end reads before full alignments or mappings for each are generated, which improves the efficiency of filtering likely-spurious alignments. Finally, we demonstrate the utility of selective alignment in improving the accuracy of efficient transcript-level quantification from RNA-seq reads. Specifically, we show that selective-alignment is able to resolve certain complex mapping scenarios that can confound existing fast mapping procedures, while simultaneously eliminating spurious alignments that fast mapping approaches can produce. Availability: Selective-alignment is implemented in C++11 as a part of Salmon, and is available as open source software, under GPL v3, at: https://github.com/COMBINE-lab/salmon/tree/[email protected]
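The co-mapping idea described above can be illustrated with a small sketch. This is not Salmon's implementation: the `Hit` record, the `co_map` function, the opposite-strand requirement, and the `max_fragment_len` threshold are all hypothetical names and assumptions chosen for illustration. The sketch only shows the general principle of combining left-end and right-end hits before any full alignment is attempted.

```python
from collections import namedtuple

# Hypothetical record for a candidate "hit" of one read end on a transcript.
Hit = namedtuple("Hit", ["transcript", "pos", "forward"])

def co_map(left_hits, right_hits, max_fragment_len=1000):
    """Keep only (left, right) hit pairs that agree with each other.

    Assumed compatibility rule (for illustration only): both ends hit the
    same transcript, on opposite strands, within max_fragment_len bases.
    """
    # Index right-end hits by transcript so each left hit is checked
    # only against right hits on the same transcript.
    right_by_tx = {}
    for hit in right_hits:
        right_by_tx.setdefault(hit.transcript, []).append(hit)

    pairs = []
    for lh in left_hits:
        for rh in right_by_tx.get(lh.transcript, []):
            opposite_strands = lh.forward != rh.forward
            close_enough = abs(rh.pos - lh.pos) <= max_fragment_len
            if opposite_strands and close_enough:
                pairs.append((lh, rh))
    return pairs

left = [Hit("t1", 100, True), Hit("t2", 50, True)]
right = [Hit("t1", 300, False), Hit("t3", 10, False)]
pairs = co_map(left, right)
```

In this toy input only transcript `t1` is supported by both ends, so the hits on `t2` and `t3` are discarded before any alignment cost would be paid for them.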

2021 ◽  
Author(s):  
Felix Gruenberger ◽  
Sebastien Ferreira-Cerca ◽  
Dina Grohmann

High-throughput sequencing dramatically changed our view of transcriptome architectures and allowed for ground-breaking discoveries in RNA biology. Recently, sequencing of full-length transcripts based on the single-molecule sequencing platform from Oxford Nanopore Technologies (ONT) was introduced and is widely employed to sequence eukaryotic and viral RNAs. However, experimental approaches implementing this technique for prokaryotic transcriptomes remain scarce. Here, we present an experimental and bioinformatic workflow for ONT RNA-seq in the bacterial model organism Escherichia coli, which can be applied to any microorganism. Our study highlights critical steps of library preparation and computational analysis and compares the results to gold standards in the field. Furthermore, we comprehensively evaluate the applicability and advantages of different ONT-based RNA sequencing protocols, including direct RNA, direct cDNA, and PCR-cDNA. We find that cDNA-seq offers improved yield and accuracy without bias in quantification compared to direct RNA sequencing. Notably, cDNA-seq can be readily used for simultaneous transcript quantification, accurate detection of transcript 5′ and 3′ boundaries, analysis of transcriptional units and transcriptional heterogeneity. In summary, we establish Nanopore RNA-seq to be a ready-to-use tool allowing rapid, cost-effective, and accurate annotation of multiple transcriptomic features thereby advancing it to become a standard method for RNA analysis in prokaryotes.


F1000Research ◽  
2016 ◽  
Vol 4 ◽  
pp. 1521 ◽  
Author(s):  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
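A minimal sketch of what summarizing transcript-level estimates to the gene level might look like is given below. It is written in Python for illustration and is not the tximport API. The function name `summarize_to_gene`, the input dictionaries, and the choice of an abundance-weighted average transcript length as the per-gene offset are assumptions.

```python
from collections import defaultdict

def summarize_to_gene(tx2gene, counts, tpm, lengths):
    """Aggregate transcript-level estimates to genes (illustrative only).

    Returns gene-level counts (sum over transcripts) and, as an assumed
    form of offset, the TPM-weighted average transcript length per gene.
    """
    gene_counts = defaultdict(float)
    gene_tpm = defaultdict(float)
    weighted_len = defaultdict(float)
    for tx, gene in tx2gene.items():
        gene_counts[gene] += counts[tx]
        gene_tpm[gene] += tpm[tx]
        weighted_len[gene] += tpm[tx] * lengths[tx]
    # Genes with zero total abundance have no defined average length.
    avg_len = {
        g: (weighted_len[g] / gene_tpm[g]) if gene_tpm[g] > 0 else None
        for g in gene_tpm
    }
    return dict(gene_counts), avg_len

tx2gene = {"tA1": "gA", "tA2": "gA", "tB1": "gB"}
counts = {"tA1": 10.0, "tA2": 30.0, "tB1": 5.0}
tpm = {"tA1": 1.0, "tA2": 3.0, "tB1": 2.0}
lengths = {"tA1": 1000, "tA2": 2000, "tB1": 500}
gene_counts, avg_len = summarize_to_gene(tx2gene, counts, tpm, lengths)
```

If isoform usage shifts between conditions, the average length of a gene changes even when its total abundance does not, which is why a length-derived offset can matter for count-based differential expression.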


2018 ◽  
Author(s):  
Jack M. Fu ◽  
Kai Kammers ◽  
Abhinav Nellore ◽  
Leonardo Collado-Torres ◽  
Jeffrey T. Leek ◽  
...  

Abstract More than 70,000 short-read RNA-sequencing samples are publicly available through the recount2 project, a curated database of summary coverage data. However, no current methods can be directly applied to the reduced-representation information stored in this database to estimate transcript-level abundances. Here we present a linear model taking as input summary coverage of junctions and subdivided exons to output estimated abundances and associated uncertainty. We evaluate the performance of our model on simulated and real data, and provide a procedure to construct confidence intervals for estimates.
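The general idea of a linear model from feature coverage to transcript abundances can be sketched as follows. This is an ordinary least-squares toy, not the authors' method: the design matrix, the function name `least_squares`, and the absence of constraints and uncertainty estimates are all simplifying assumptions. Each row of `X` is a feature (a subdivided exon or a junction) and each column a transcript, with a 1 where the transcript contains the feature.

```python
def least_squares(X, y):
    """Solve min ||X b - y||^2 via the normal equations (pure Python).

    Builds the augmented system [X^T X | X^T y] and reduces it with
    Gauss-Jordan elimination using partial pivoting.
    """
    m, n = len(X), len(X[0])
    A = []
    for i in range(n):
        row = [sum(X[k][i] * X[k][j] for k in range(m)) for j in range(n)]
        row.append(sum(X[k][i] * y[k] for k in range(m)))
        A.append(row)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("abundances are not identifiable from these features")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# Toy gene with two transcripts: a shared exon, an exon unique to
# transcript 1, and a junction unique to transcript 2.
X = [[1, 1],
     [1, 0],
     [0, 1]]
y = [10.0, 4.0, 6.0]  # observed summary coverage per feature
abundances = least_squares(X, y)
```

Features shared by several transcripts are what make the problem a joint estimation; features unique to one transcript are what make it identifiable.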


2019 ◽  
Author(s):  
Camille Sessegolo ◽  
Corinne Cruaud ◽  
Corinne Da Silva ◽  
Audric Cologne ◽  
Marion Dubarry ◽  
...  

Abstract Our vision of DNA transcription and splicing has changed dramatically with the introduction of short-read sequencing. These high-throughput sequencing technologies promised to unravel the complexity of any transcriptome. Generally gene expression levels are well-captured using these technologies, but there are still remaining caveats due to the limited read length and the fact that RNA molecules had to be reverse transcribed before sequencing. Oxford Nanopore Technologies has recently launched a portable sequencer which offers the possibility of sequencing long reads and most importantly RNA molecules. Here we generated a full mouse transcriptome from brain and liver using the Oxford Nanopore device. As a comparison, we sequenced RNA (RNA-Seq) and cDNA (cDNA-Seq) molecules using both long and short reads technologies and tested the TeloPrime preparation kit, dedicated to the enrichment of full-length transcripts. Using spike-in data, we confirmed that expression levels are efficiently captured by cDNA-Seq using short reads. More importantly, Oxford Nanopore RNA-Seq tends to be more efficient, while cDNA-Seq appears to be more biased. We further show that the cDNA library preparation of the Nanopore protocol induces read truncation for transcripts containing internal runs of T’s. This bias is marked for runs of at least 15 T’s, but is already detectable for runs of at least 9 T’s and therefore concerns more than 20% of expressed transcripts in mouse brain and liver. Finally, we outline that bioinformatics challenges remain ahead for quantifying at the transcript level, especially when reads are not full-length. Accurate quantification of repeat-associated genes such as processed pseudogenes also remains difficult, and we show that current mapping protocols which map reads to the genome largely over-estimate their expression, at the expense of their parent gene. The entire dataset is available from http://www.genoscope.cns.fr/externe/ONT_mouse_RNA.
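To make the run-length thresholds above concrete, the following sketch flags runs of T's in a transcript sequence. It is an illustration rather than the authors' pipeline: the function name `t_runs` is hypothetical, and the sketch reports every run in the sequence without trying to decide which runs count as "internal".

```python
import re

def t_runs(seq, min_len=9):
    """Return (start, length) for every run of at least min_len T's."""
    pattern = re.compile(r"T{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(seq.upper())]

# Toy sequence with one run of 9 T's and one run of 15 T's.
seq = "ACG" + "T" * 9 + "GCA" + "T" * 15 + "C"
detectable = t_runs(seq, min_len=9)   # threshold where the bias is detectable
marked = t_runs(seq, min_len=15)      # threshold where the bias is marked
```

Applying such a scan to an annotated transcriptome would be one way to list the transcripts that are potentially affected by truncation.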


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Camille Sessegolo ◽  
Corinne Cruaud ◽  
Corinne Da Silva ◽  
Audric Cologne ◽  
Marion Dubarry ◽  
...  

Abstract Our vision of DNA transcription and splicing has changed dramatically with the introduction of short-read sequencing. These high-throughput sequencing technologies promised to unravel the complexity of any transcriptome. Generally gene expression levels are well-captured using these technologies, but there are still remaining caveats due to the limited read length and the fact that RNA molecules had to be reverse transcribed before sequencing. Oxford Nanopore Technologies has recently launched a portable sequencer which offers the possibility of sequencing long reads and most importantly RNA molecules. Here we generated a full mouse transcriptome from brain and liver using the Oxford Nanopore device. As a comparison, we sequenced RNA (RNA-Seq) and cDNA (cDNA-Seq) molecules using both long and short reads technologies and tested the TeloPrime preparation kit, dedicated to the enrichment of full-length transcripts. Using spike-in data, we confirmed that expression levels are efficiently captured by cDNA-Seq using short reads. More importantly, Oxford Nanopore RNA-Seq tends to be more efficient, while cDNA-Seq appears to be more biased. We further show that the cDNA library preparation of the Nanopore protocol induces read truncation for transcripts containing internal runs of T’s. This bias is marked for runs of at least 15 T’s, but is already detectable for runs of at least 9 T’s and therefore concerns more than 20% of expressed transcripts in mouse brain and liver. Finally, we outline that bioinformatics challenges remain ahead for quantifying at the transcript level, especially when reads are not full-length. Accurate quantification of repeat-associated genes such as processed pseudogenes also remains difficult, and we show that current mapping protocols which map reads to the genome largely over-estimate their expression, at the expense of their parent gene.


F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 1521 ◽  
Author(s):  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Several different quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that while the presence of differential isoform usage can lead to inflated false discovery rates in differential expression analyses on simple count matrices and transcript-level abundance estimates improve the performance in simulated data, the difference is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.


Genes ◽  
2021 ◽  
Vol 12 (6) ◽  
pp. 794
Author(s):  
Cullen Horstmann ◽  
Victoria Davenport ◽  
Min Zhang ◽  
Alyse Peters ◽  
Kyoungtae Kim

Next-generation sequencing (NGS) technology has revolutionized sequence-based research. In recent years, high-throughput sequencing has become the method of choice in studying the toxicity of chemical agents through observing and measuring changes in transcript levels. Engineered nanomaterial (ENM)-toxicity has become a major field of research and has adopted microarray and newer RNA-Seq methods. Recently, nanotechnology has become a promising tool in the diagnosis and treatment of several diseases in humans. However, due to their high stability, ENMs are likely capable of remaining in the body and environment for long periods of time. Their mechanisms of toxicity and long-lasting effects on our health are still poorly understood. This review explores the effects of three ENMs including carbon nanotubes (CNTs), quantum dots (QDs), and Ag nanoparticles (AgNPs) by cross examining publications on transcriptomic changes induced by these nanomaterials.


Diagnostics ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 964
Author(s):  
Sarka Benesova ◽  
Mikael Kubista ◽  
Lukas Valihrach

MicroRNAs (miRNAs) are a class of small RNA molecules that have an important regulatory role in multiple physiological and pathological processes. Their disease-specific profiles and presence in biofluids are properties that enable miRNAs to be employed as non-invasive biomarkers. In the past decades, several methods have been developed for miRNA analysis, including small RNA sequencing (RNA-seq). Small RNA-seq enables genome-wide profiling and analysis of known, as well as novel, miRNA variants. Moreover, its high sensitivity allows for profiling of low input samples such as liquid biopsies, which have now found applications in diagnostics and prognostics. Still, due to technical bias and the limited ability to capture the true miRNA representation, its potential remains unfulfilled. The introduction of many new small RNA-seq approaches that tried to minimize this bias has led to the many small RNA-seq protocols seen today. Here, we review all current approaches to cDNA library construction used during the small RNA-seq workflow, with particular focus on their implementation in commercially available protocols. We provide an overview of each protocol and discuss their applicability. We also review recent benchmarking studies comparing each protocol’s performance and summarize the major conclusions that can be gathered from their usage. The result documents variable performance of the protocols and highlights their different applications in miRNA research. Taken together, our review provides a comprehensive overview of all the current small RNA-seq approaches, summarizes their strengths and weaknesses, and provides guidelines for their applications in miRNA research.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Rebecca A. Dagg ◽  
Gijs Zonderland ◽  
Emilia Puig Lombardi ◽  
Giacomo G. Rossetti ◽  
Florian J. Groelly ◽  
...  

Abstract BRCA1 or BRCA2 germline mutations predispose to breast, ovarian and other cancers. High-throughput sequencing of tumour genomes revealed that oncogene amplification and BRCA1/2 mutations are mutually exclusive in cancer; however, the molecular mechanism underlying this incompatibility remains unknown. Here, we report that activation of β-catenin, an oncogene of the WNT signalling pathway, inhibits proliferation of BRCA1/2-deficient cells. RNA-seq analyses revealed β-catenin-induced discrete transcriptome alterations in BRCA2-deficient cells, including suppression of CDKN1A gene encoding the CDK inhibitor p21. This accelerates G1/S transition, triggering illegitimate origin firing and DNA damage. In addition, β-catenin activation accelerates replication fork progression in BRCA2-deficient cells, which is critically dependent on p21 downregulation. Importantly, we find that upregulated p21 expression is essential for the survival of BRCA2-deficient cells and tumours. Thus, our work demonstrates that β-catenin toxicity in cancer cells with compromised BRCA1/2 function is driven by transcriptional alterations that cause aberrant replication and inflict DNA damage.

