CIDANE: Comprehensive isoform discovery and abundance estimation

2015 ◽  
Author(s):  
Stefan Canzar ◽  
Sandro Andreotti ◽  
David Weese ◽  
Knut Reinert ◽  
Gunnar W. Klau

We present CIDANE, a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. CIDANE assembles transcripts with significantly higher sensitivity and precision than existing tools, while competing in speed with the fastest methods. In addition to reconstructing transcripts ab initio, the algorithm can also make use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms. CIDANE supports the integrated analysis of RNA-seq and additional gene-boundary data and recovers splice junctions that are invisible to other methods. CIDANE is available at http://ccb.jhu.edu/software/cidane/.

2017 ◽  
Author(s):  
Christopher Wilks ◽  
Phani Gaddipati ◽  
Abhinav Nellore ◽  
Ben Langmead

Abstract As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.
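The idea of constraining a query to junctions in a region with a minimum level of sample support can be illustrated with a small in-memory sketch. This is not Snaptron's API or its indexing scheme: the `Junction` record, the `query_junctions` function, and its parameters are hypothetical names chosen for illustration, and a sorted list with binary search stands in for the R-tree, B-tree and inverted indexes named in the abstract.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    """Hypothetical junction record; not Snaptron's actual schema."""
    chrom: str
    start: int
    end: int
    samples: int  # number of samples in which the junction was observed


def query_junctions(junctions, chrom, region_start, region_end, min_samples=1):
    """Return junctions fully contained in the region with enough sample support."""
    # Sort the junctions on the requested chromosome by start coordinate.
    on_chrom = sorted((j for j in junctions if j.chrom == chrom),
                      key=lambda j: j.start)
    starts = [j.start for j in on_chrom]
    # Binary search for the slice whose start lies inside the region.
    lo = bisect_left(starts, region_start)
    hi = bisect_right(starts, region_end)
    # Apply the remaining constraints (end coordinate, sample support).
    return [j for j in on_chrom[lo:hi]
            if j.end <= region_end and j.samples >= min_samples]


example = [
    Junction("chr2", 100, 200, 5),
    Junction("chr2", 150, 900, 50),
    Junction("chr2", 300, 400, 1),
    Junction("chr1", 100, 200, 9),
]
hits = query_junctions(example, "chr2", 100, 500, min_samples=2)
```

Sorting on every call keeps the sketch short; a real service would build its index once and reuse it across queries.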


2014 ◽  
Author(s):  
Michael O Duff ◽  
Sara Olson ◽  
Xintao Wei ◽  
Ahmad Osman ◽  
Alex Plocik ◽  
...  

Recursive splicing is a process in which large introns are removed in multiple steps by resplicing at ratchet points - 5′ splice sites recreated after splicing. Recursive splicing was first identified in the Drosophila Ultrabithorax (Ubx) gene and only three additional Drosophila genes have since been experimentally shown to undergo recursive splicing. Here, we identify 196 zero nucleotide exon ratchet points in 130 introns of 115 Drosophila genes from total RNA sequencing data generated from developmental time points, dissected tissues, and cultured cells. Recursive splicing events were identified by splice junctions that map to annotated 5′ splice sites and unannotated intronic 3′ splice sites, the presence of the sequence AG/GT at the 3′ splice site, and a 5′ to 3′ gradient of decreasing RNA-Seq read density indicative of co-transcriptional splicing. The sequential nature of recursive splicing was confirmed by identification of lariat introns generated by splicing to and from the ratchet points. We also show that recursive splicing is a constitutive process, and that the sequence and function of ratchet points are evolutionarily conserved. Together these results indicate that recursive splicing is commonly used in Drosophila and provides insight into the mechanisms by which some introns are removed.
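Of the three criteria listed in the abstract, the AG/GT sequence requirement is the simplest to illustrate. The sketch below scans an intron sequence for the four-base motif AGGT and reports the boundary between AG and GT. The function name `find_ratchet_candidates` and the 0-based coordinate convention are assumptions made for this example; a motif match alone is only a candidate, since the study also required splice-junction reads and a read-density gradient.

```python
def find_ratchet_candidates(intron_seq: str) -> list[int]:
    """Return 0-based offsets of the AG/GT boundary for every AGGT motif.

    Each offset is the index of the 'G' that begins 'GT', i.e. the first
    base after the candidate 3' splice site 'AG'.
    """
    seq = intron_seq.upper()
    hits = []
    start = seq.find("AGGT")
    while start != -1:
        hits.append(start + 2)  # boundary sits two bases into the motif
        start = seq.find("AGGT", start + 1)
    return hits


candidates = find_ratchet_candidates("ccaggtaaAGGTtt")
# candidates == [4, 10]
```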


2021 ◽  
Author(s):  
Runxuan Zhang ◽  
Richard Kuo ◽  
Max Coulter ◽  
Cristiane P.G. Calixto ◽  
Juan Carlos Entizne ◽  
...  

Background Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single molecule long read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation or incomplete cDNA synthesis. Results We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 160k transcripts - twice that of the best current Arabidopsis transcriptome and including over 1,500 novel genes. 79% of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We developed novel methods to determine splice junctions and transcription start and end sites accurately. Mis-match profiles around splice junctions provided a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identified high confidence transcription start/end sites and removed fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provided higher resolution of transcript expression profiling and identified cold- and light-induced differential transcription start and polyadenylation site usage. Conclusions AtRTD3 is the most comprehensive Arabidopsis transcriptome currently available. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single molecule sequencing analysis from any species.


Blood ◽  
2017 ◽  
Vol 130 (Suppl_1) ◽  
pp. 931-931
Author(s):  
Marilyn Parra ◽  
Benjamin W Booth ◽  
Gene W Yeo ◽  
Richard Weiszmann ◽  
Susan E Celniker ◽  
...  

Abstract Proper expression of the MDS-disease gene, SF3B1, ensures appropriate pre-mRNA splicing in erythroid progenitors and during terminal erythropoiesis. We previously showed that the SF3B1 gene is post-transcriptionally regulated in a differentiation stage-specific manner by intron retention (IR), such that ~50% of its transcripts in mature erythroblasts retain intron 4. Based on new mechanistic studies, we propose a model in which mostly unannotated and noncoding exons within intron 4 function as splicing decoys; i.e., they promote retention of intron 4 by interacting with, and blocking splice sites of, the adjacent exons 4 and 5. A total of six putative decoy exons were revealed via RT-PCR and RNA-seq analysis of RNA from erythroblasts treated with inhibitors of nonsense-mediated decay. That decoy exons have IR-promoting activity is suggested by several criteria. First, the frequency of interaction between constitutive exons 4 and 5 and putative decoy exons within intron 4, measured by the abundance of splice junctions in RNA-seq read data, is temporally correlated with levels of intron 4 retention during terminal erythropoiesis. Both IR and decoy splice junctions were low in early stage erythroblasts and much higher in mature erythroblasts. Second, selected decoy exons exhibited IR-promoting activity in the context of minigene splicing reporters expressing the exon 3-6 region of SF3B1 in transfected K562 cells. The wild type minigene reproduced the intron-specific retention phenotype, since it was fully spliced at introns 3 and 5 but exhibited substantial retention of intron 4, whereas deletion of decoy exon 4e, or mutation of its splice sites, substantially decreased IR. Third, RBP (RNA binding protein) cross-linking data from K562 cells show that 3' splice site factors including U2AF1 and U2AF2 can bind specifically to 3' splice sites of intron 4's decoy exons. 
Finally, several experiments showed that IR-promoting activity of decoy exons is a more general phenomenon that likely governs IR in other erythroid genes. We observed not only that SF3B1 intron 4 decoy exons could promote IR in heterologous contexts, but also that predicted decoy exons from other erythroblast transcripts could promote IR in the SF3B1 minigene. Apart from this experimental data, comparative genomics revealed that the SF3B1 decoy exons are extremely conserved among vertebrate genomes, with two of the exons being essentially identical from fish to humans. Together this data supports the hypothesis that a subset of up-regulated IR events in late erythroblasts are controlled by decoy exons that block productive splicing at the flanking exons. We propose that regulated IR is an important post-transcriptional mechanism for adjusting cellular splicing capacity during terminal erythropoiesis by regulating expression of key splicing factors such as SF3B1. Disclosures No relevant conflicts of interest to declare.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Alexandre Souvorov ◽  
Richa Agarwala

Abstract Background Illumina is the dominant sequencing technology at this time. Short length, short insert size, some systematic biases, and low-level carryover contamination in Illumina reads continue to make assembly of repeated regions a challenging problem. Some applications also require finding multiple well supported variants for assembled regions. Results To facilitate assembly of repeat regions and to report multiple well supported variants when a user can provide target sequences to assist the assembly, we propose SAUTE and SAUTE_PROT assemblers. Both assemblers use de Bruijn graph on reads. Targets can be transcripts or proteins for RNA-seq reads and transcripts, proteins, or genomic regions for genomic reads. Target sequences are nucleotide and protein sequences for SAUTE and SAUTE_PROT, respectively. Conclusions For RNA-seq, comparisons with Trinity, rnaSPAdes, SPAligner, and SPAdes assembly of reads aligned to target proteins by DIAMOND show that SAUTE_PROT finds more coding sequences that translate to benchmark proteins. Using AMRFinderPlus calls, we find SAUTE has higher sensitivity and precision than SPAdes, plasmidSPAdes, SPAligner, and SPAdes assembly of reads aligned to target regions by HISAT2. It also has better sensitivity than SKESA but worse precision.
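For readers unfamiliar with the data structure, a de Bruijn graph on reads can be sketched in a few lines. This is the textbook construction, not SAUTE's implementation, which the abstract does not describe: the function name `build_de_bruijn` and the choice to represent the graph as a dictionary of sets are assumptions for illustration, and the sketch ignores reverse complements, k-mer counts, and error filtering.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return dict(graph)


graph = build_de_bruijn(["ACGT", "CGTA"], k=3)
# graph == {"AC": {"CG"}, "CG": {"GT"}, "GT": {"TA"}}
```

Overlapping reads share nodes, so the two example reads merge into a single path AC, CG, GT, TA that spells ACGTA.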


2017 ◽  
Vol 77 (23) ◽  
pp. 6538-6550 ◽  
Author(s):  
Dylan Z. Kelley ◽  
Emily L. Flam ◽  
Evgeny Izumchenko ◽  
Ludmila V. Danilova ◽  
Hildegard A. Wulf ◽  
...  

Author(s):  
Elena Espinosa ◽  
Macarena Arroyo ◽  
Rafael Larrosa ◽  
Manuel Manchado ◽  
M. Gonzalo Claros ◽  
...  

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3702 ◽  
Author(s):  
Santiago Montero-Mendieta ◽  
Manfred Grabherr ◽  
Henrik Lantz ◽  
Ignacio De la Riva ◽  
Jennifer A. Leonard ◽  
...  

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, as most amphibians, WGS is still excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome and the transcriptome must be assembled de-novo. We used RNA-seq to obtain the transcriptomic profile for Oreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome of O. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating a de-novo transcriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to build de-novo transcriptome assemblies using readily available software and is freely available at: https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.


2021 ◽  
Author(s):  
Taguchi Y-h. ◽  
Turki Turki

Abstract The integrated analysis of multiple gene expression profiles measured in distinct studies is always problematic. In particular, the lack of sample matching and of common labeling between distinct studies prevents their integration in a fully data-driven and unsupervised manner. In this study, we propose a strategy enabling the integration of multiple gene expression profiles among multiple independent studies without either labeling or sample matching, using tensor decomposition-based unsupervised feature extraction. As an example, we applied this strategy to Alzheimer’s disease (AD)-related gene expression profiles that lack exact correspondence among samples as well as AD single-cell RNA-seq (scRNA-seq) data. We found that we could select biologically reasonable genes with integrated analysis. Overall, integrated gene expression profiles can function analogously to prior learning and/or transfer learning strategies in other machine learning applications. For scRNA-seq, the proposed approach was able to drastically reduce the required computational memory.

