splice graph
Recently Published Documents


Total documents: 11 (five years: 6)
H-index: 2 (five years: 1)

2021 ◽  
Author(s):  
Qimin Zhang ◽  
Qian Shi ◽  
Mingfu Shao

Abstract Transcript assembly (i.e., reconstructing the full-length expressed transcripts from RNA-seq data) has been a critical yet unsolved step in RNA-seq analysis. Modern RNA-seq protocols can produce paired-/multiple-end RNA-seq reads, where information is available that two or more reads originate from the same transcript. The long-range constraints implied by these paired-/multiple-end reads can be highly beneficial in correctly phasing complicated spliced isoforms. However, there often exist gaps between individual ends, which may even contain junctions, making the efficient use of such constraints algorithmically challenging. Here we introduce Scallop2, a new reference-based transcript assembler optimized for multiple-end (including paired-end) RNA-seq data. Scallop2 uses an algorithmic framework that first represents reads from the same molecule as so-called multiple-end phasing paths in the context of a splice graph, then “bridges” each multiple-end phasing path into a long, single-end phasing path, and finally decomposes the splice graph into paths (i.e., transcripts) guided by the bridged phasing paths. An efficient bridging algorithm is designed to infer the true path connecting two consecutive ends, following a novel formulation that is robust to sequencing errors and transcript noise. Observing that failure to bridge two ends is mainly due to incomplete splice graphs, we propose a new method to determine false starting/ending vertices of the splice graphs, which has been shown to be effective in reducing the false positive rate. Evaluations on both (multiple-end) single-cell RNA-seq datasets from the Smart-seq3 protocol and Illumina paired-end RNA-seq samples demonstrate that Scallop2 vastly outperforms recent assemblers including StringTie2, Scallop, and CLASS2 in assembly accuracy.
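The bridging step described in this abstract can be illustrated with a small sketch. The abstract does not spell out the scoring formulation, so the code below assumes one plausible stand-in: among all paths in the splice graph that connect the last vertex of one end to the first vertex of the next, pick the path whose weakest edge has the highest read support (a maximum-bottleneck path). The function name `bridge`, the edge-weight dictionary, and the integer vertex numbering are all hypothetical and are not taken from Scallop2.

```python
from collections import defaultdict


def bridge(edges, source, target):
    """Return (bottleneck, path) for the widest path from source to target.

    ASSUMPTION: this max-bottleneck scoring is an illustrative stand-in,
    not the formulation used by Scallop2.

    edges: dict mapping (u, v) -> non-negative read support. Vertices are
    integers numbered in topological order (u < v for every edge), as when
    exons are sorted by genomic coordinate.
    Returns None when target is unreachable from source.
    """
    out = defaultdict(list)
    for (u, v), weight in edges.items():
        out[u].append((v, weight))

    best = {source: float("inf")}  # best bottleneck found for each vertex
    prev = {}                      # predecessor on that best path
    for u in range(source, target):
        if u not in best:
            continue  # u cannot be reached from source
        for v, weight in out[u]:
            if v > target:
                continue  # edges past the target cannot lead back to it
            candidate = min(best[u], weight)
            if candidate > best.get(v, -1):
                best[v] = candidate
                prev[v] = u

    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]


# Toy splice graph: the two ends of a read pair stop at vertex 0 and resume
# at vertex 3; the well-supported route 0-1-2-3 wins over the weak shortcuts.
toy_edges = {(0, 1): 10, (1, 2): 10, (0, 2): 2, (2, 3): 8, (1, 3): 1}
bridged = bridge(toy_edges, 0, 3)
```

Because vertices are visited in topological order, each edge is examined once, so the sketch runs in time linear in the size of the graph between the two ends.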


2021 ◽  
Author(s):  
Palash Sashittal ◽  
Chuanyi Zhang ◽  
Jian Peng ◽  
Mohammed El-Kebir

Abstract Genes in SARS-CoV-2 and other viruses in the order of Nidovirales are expressed by a process of discontinuous transcription mediated by the viral RNA-dependent RNA polymerase. This process is distinct from alternative splicing in eukaryotes and produces subgenomic RNAs that express different viral genes. Here, we introduce the DISCONTINUOUS TRANSCRIPT ASSEMBLY problem of finding transcripts T and their abundances c given an alignment R of paired-end short reads under a maximum likelihood model that accounts for varying transcript lengths. Underpinning our approach is the concept of a segment graph, a directed acyclic graph that, distinct from the splice graph used to characterize alternative splicing, has a unique Hamiltonian path. We provide a compact characterization of solutions as subsets of non-overlapping edges in this graph, enabling the formulation of an efficient progressive heuristic that uses a mixed integer linear program. We show using simulations that our method, JUMPER, drastically outperforms existing methods for classical transcript assembly. On short-read data of SARS-CoV-1, SARS-CoV-2 and MERS-CoV samples, we find that JUMPER not only identifies canonical transcripts that are part of the reference transcriptome, but also predicts expression of non-canonical transcripts that are well supported by direct evidence from long-read data, presence in multiple, independent samples or a conserved core sequence. Moreover, application of JUMPER to samples with and without treatment reveals viral drug response at the transcript level. As such, JUMPER enables detailed analyses of Nidovirales transcriptomes under varying conditions.
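The structural property that separates a segment graph from a general splice graph, a unique Hamiltonian path, can be checked with a short sketch. It relies on a standard fact about directed acyclic graphs: a Hamiltonian path exists exactly when the topological order is unique, and in that case the path is that order. The function name `hamiltonian_path` and the toy edge lists are hypothetical; nothing here is taken from the JUMPER code base.

```python
from collections import defaultdict, deque


def hamiltonian_path(num_nodes, edges):
    """Return the Hamiltonian path of a DAG as a list of nodes, or None.

    Uses Kahn's algorithm: if two nodes are ever ready at the same time,
    the topological order is not unique, so no Hamiltonian path exists.
    Nodes are the integers 0 .. num_nodes - 1.
    """
    out = defaultdict(set)
    indegree = [0] * num_nodes
    for u, v in set(edges):  # drop duplicate edges before counting
        out[u].add(v)
        indegree[v] += 1

    ready = deque(n for n in range(num_nodes) if indegree[n] == 0)
    order = []
    while ready:
        if len(ready) > 1:
            return None  # two incomparable nodes: no single path covers both
        u = ready.popleft()
        order.append(u)
        for v in out[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    # A shorter order means a cycle was present, so the input was not a DAG.
    return order if len(order) == num_nodes else None


# Segment-graph-like example: consecutive genome segments are linked
# (0-1-2-3), and extra edges model discontinuous jumps of the polymerase.
segment_edges = [(0, 1), (1, 2), (2, 3), (0, 2), (0, 3), (1, 3)]

# Splice-graph-like example: exons 1 and 2 are mutually exclusive, so no
# single path visits every vertex.
splice_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

segment_result = hamiltonian_path(4, segment_edges)
splice_result = hamiltonian_path(4, splice_edges)
```

In the first toy graph the check returns the backbone of consecutive segments; in the second it returns `None`, which is the distinction the abstract draws between the two graph types.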


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Cong Ma ◽  
Hongyu Zheng ◽  
Carl Kingsford

Abstract Background: The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (Bioinformatics 30:2447–55, 2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes the length of single-end reads or paired-end fragments is fixed. Results: We provide an improvement of this model to handle variable-length reads or fragments and incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-the-art transcript quantification method. Conclusion: We propose a new approach for graph quantification, which is useful for modeling scenarios where the reference transcriptome is incomplete or not available and can be further used in transcriptome assembly or alternative splicing analysis.
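To see why treating every full path in the splice graph as a possible transcript calls for a graph-level model rather than an explicit transcript list, it helps to count those paths. The sketch below is only an illustration of that counting argument, not the quantification method from the paper; the function names `count_full_paths` and `cassette_graph` and the toy graph layout are assumptions made for the example.

```python
from functools import lru_cache


def count_full_paths(successors, source, sink):
    """Count source-to-sink paths in a DAG given as {node: [next nodes]}."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == sink:
            return 1
        # Sum over outgoing edges; memoization keeps this linear in edges.
        return sum(paths_from(nxt) for nxt in successors.get(node, ()))

    return paths_from(source)


def cassette_graph(num_cassettes):
    """Hypothetical gene: a chain of cassette exons, each optionally skipped.

    Even nodes are constitutive exons; odd node 2*i + 1 is the cassette exon
    between constitutive exons 2*i and 2*i + 2.
    """
    successors = {}
    for i in range(num_cassettes):
        successors[2 * i] = [2 * i + 1, 2 * i + 2]  # include or skip
        successors[2 * i + 1] = [2 * i + 2]
    return successors


# Ten independent cassette exons give 2**10 candidate transcripts.
num_paths = count_full_paths(cassette_graph(10), 0, 20)
```

The graph in this toy case has 21 vertices and 30 edges but 1024 full paths, so the number of candidate transcripts grows exponentially while the number of junctions grows only linearly.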


2021 ◽  
Author(s):  
Palash Sashittal ◽  
Chuanyi Zhang ◽  
Jian Peng ◽  
Mohammed El-Kebir

Abstract Genes in SARS-CoV-2 and, more generally, in viruses in the order of Nidovirales are expressed by a process of discontinuous transcription mediated by the viral RNA-dependent RNA polymerase. This process is distinct from alternative splicing in eukaryotes, rendering current transcript assembly methods unsuitable for Nidovirales sequencing samples. Here, we introduce the Discontinuous Transcript Assembly problem of finding transcripts T and their abundances c given an alignment R under a maximum likelihood model that accounts for varying transcript lengths. Underpinning our approach is the concept of a segment graph, a directed acyclic graph that, distinct from the splice graph used to characterize alternative splicing, has a unique Hamiltonian path. We provide a compact characterization of solutions as subsets of non-overlapping edges in this graph, enabling the formulation of an efficient mixed integer linear program. We show using simulations that our method, Jumper, drastically outperforms existing methods for classical transcript assembly. On short-read data of SARS-CoV-1 and SARS-CoV-2 samples, we find that Jumper not only identifies canonical transcripts that are part of the reference transcriptome, but also predicts expression of non-canonical transcripts that are well supported by direct evidence from long-read data, presence in multiple, independent samples or a conserved core sequence. Jumper enables detailed analyses of Nidovirales transcriptomes. Code availability: Software is available at https://github.com/elkebir-group/Jumper


2020 ◽  
Vol 10 (16) ◽  
pp. 8880-8893
Author(s):  
Jorge Langa ◽  
Andone Estonba ◽  
Darrell Conklin

2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Li Song ◽  
Sarven Sabunciyan ◽  
Guangyu Yang ◽  
Liliana Florea

Abstract Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice-graph-based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves a significantly better sensitivity-precision tradeoff, and renders precision up to 2-3 fold higher than the StringTie system and Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples.


2017 ◽  
Vol 34 (10) ◽  
pp. 1697-1704 ◽  
Author(s):  
Hamza Khan ◽  
Hamid Mohamadi ◽  
Benjamin P Vandervalk ◽  
Rene L Warren ◽  
Justin Chu ◽  
...  

2014 ◽  
Author(s):  
Li Song ◽  
Sarven Sabunciyan ◽  
Liliana D Florea

Next generation sequencing of cellular RNA is making it possible to characterize genes and alternative splicing in unprecedented detail. However, designing bioinformatics tools to capture splicing variation accurately has proven difficult. Current programs find major isoforms of a gene but miss finer splicing variations, or are sensitive but highly imprecise. We present CLASS, a novel open source tool for accurate genome-guided transcriptome assembly from RNA-seq reads. CLASS employs a splice graph to represent a gene and its splice variants, combined with a linear program to determine an accurate set of exons and efficient splice graph-based transcript selection algorithms. When compared against reference programs, CLASS had the best overall accuracy and could detect up to twice as many splicing events with precision similar to the best reference program. Notably, it was the only tool that produced consistently reliable transcript models for a wide range of applications and sequencing strategies, including very large data sets and ribosomal RNA-depleted samples. Lightweight and multi-threaded, CLASS required <3GB RAM and less than one day to analyze a 350 million read set, and is an excellent choice for transcriptomics studies, from clinical RNA sequencing, to alternative splicing analyses, and to the annotation of new genomes.

