LIQA: long-read isoform quantification and analysis

2021
Vol 22 (1)
Author(s):  
Yu Hu ◽  
Li Fang ◽  
Xuelian Chen ◽  
Jiang F. Zhong ◽  
Mingyao Li ◽  
...  

Abstract Long-read RNA sequencing (RNA-seq) technologies can sequence full-length transcripts, facilitating the exploration of isoform-specific gene expression over short-read RNA-seq. We present LIQA to quantify isoform expression and detect differential alternative splicing (DAS) events using long-read direct mRNA sequencing or cDNA sequencing data. LIQA incorporates base pair quality score and isoform-specific read length information in a survival model to assign different weights across reads, and uses an expectation-maximization algorithm for parameter estimation. We apply LIQA to long-read RNA-seq data from the Universal Human Reference, acute myeloid leukemia, and esophageal squamous epithelial cells and demonstrate its high accuracy in profiling alternative splicing events.

2020
Author(s):  
Yu Hu ◽  
Li Fang ◽  
Xuelian Chen ◽  
Jiang F. Zhong ◽  
Mingyao Li ◽  
...  

Abstract Long-read RNA sequencing (RNA-seq) technologies have made it possible to sequence full-length transcripts, facilitating the exploration of isoform-specific gene expression over conventional short-read RNA-seq. However, long-read RNA-seq suffers from high per-base error rate, presence of chimeric reads and alternative alignments, and other biases, which require different analysis methods than short-read RNA-seq. Here we present LIQA (Long-read Isoform Quantification and Analysis), an Expectation-Maximization based statistical method to quantify isoform expression and detect differential alternative splicing (DAS) events using long-read RNA-seq data. Rather than summarizing isoform-specific read counts directly as done in short-read methods, LIQA incorporates base-pair quality score and isoform-specific read length information to assign different weights across reads, which reflects alignment confidence. Moreover, given isoform usage estimates, LIQA can detect DAS events between conditions. We evaluated LIQA’s performance on simulated data and demonstrated that it outperforms other approaches in rare isoform characterization and in detecting DAS events between two groups. We also generated one direct mRNA sequencing dataset and one cDNA sequencing dataset using the Oxford Nanopore long-read platform, both with paired short-read RNA-seq data and qPCR data on selected genes, and we demonstrated that LIQA performs well in isoform discovery and quantification. Finally, we evaluated LIQA on a PacBio dataset on esophageal squamous epithelial cells, and demonstrated that LIQA recovered DAS events on FGFR3 that failed to be detected in short-read data. In summary, LIQA leverages the power of long-read RNA-seq and achieves higher accuracy in estimating isoform abundance than existing approaches, especially for isoforms with low coverage and biased read distribution. LIQA is freely available for download at https://github.com/WGLab/LIQA.
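The idea of weighting reads inside an expectation-maximization loop can be illustrated with a small sketch. This is a generic weighted EM for a mixture over isoforms, not LIQA's actual likelihood: the survival-model treatment of read length is omitted, and the function name `weighted_em`, the `compat` read-to-isoform compatibility lists, and the per-read `weights` are all hypothetical inputs chosen for illustration.

```python
from typing import List, Sequence


def weighted_em(
    compat: Sequence[Sequence[int]],
    weights: Sequence[float],
    n_isoforms: int,
    n_iter: int = 200,
    tol: float = 1e-9,
) -> List[float]:
    """Estimate relative isoform abundances with a read-weighted EM.

    compat[r]  : indices of the isoforms that read r is compatible with
    weights[r] : confidence weight of read r (hypothetical; e.g. derived
                 from base quality), larger means more trusted
    """
    # Start from a uniform abundance estimate.
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read's weight across its compatible isoforms
        # in proportion to the current abundance estimates.
        for isoforms, w in zip(compat, weights):
            denom = sum(theta[k] for k in isoforms)
            if denom == 0.0:
                continue  # read carries no information under current theta
            for k in isoforms:
                counts[k] += w * theta[k] / denom
        total = sum(counts)
        if total == 0.0:
            break
        # M-step: renormalize the weighted expected counts.
        new_theta = [c / total for c in counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


if __name__ == "__main__":
    # Three reads unique to isoform 0, one unique to isoform 1, two ambiguous.
    reads = [[0], [0], [0], [1], [0, 1], [0, 1]]
    print(weighted_em(reads, [1.0] * len(reads), n_isoforms=2))
```

In this toy setup a read with a low weight contributes proportionally less to the expected counts, which is the sense in which weights can stand in for alignment confidence; how the weights themselves are derived is left open here.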


2021
Vol 12 (1)
Author(s):  
Mathys Grapotte ◽  
Manu Saraswat ◽  
Chloé Bessière ◽  
Christophe Menichelli ◽  
Jordan A. Ramilowski ◽  
...  

Abstract Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.


2021
Vol 3 (2)
Author(s):  
Xueyi Dong ◽  
Luyi Tian ◽  
Quentin Gouil ◽  
Hasaru Kariyawasam ◽  
Shian Su ◽  
...  

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decreases quantification accuracy and reduces power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 was analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.


2020
Vol 11 (1)
Author(s):  
Emily Berger ◽  
Deniz Yorukoglu ◽  
Lillian Zhang ◽  
Sarah K. Nyquist ◽  
Alex K. Shalek ◽  
...  

Abstract Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, which even enables the use of reads that cover only individual variants. We demonstrate HapTree-X’s feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X’s ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.


2017
Author(s):  
Tslil Gabrieli ◽  
Hila Sharim ◽  
Yael Michaeli ◽  
Yuval Ebenstein

Abstract Variations in the genetic code, from single point mutations to large structural or copy number alterations, influence susceptibility, onset, and progression of genetic diseases and tumor transformation. Next-generation sequencing analysis is unable to reliably capture aberrations larger than the typical sequencing read length of several hundred bases. Long-read, single-molecule sequencing methods such as SMRT and nanopore sequencing can address larger variations, but require costly whole genome analysis. Here we describe a method for isolation and enrichment of a large genomic region of interest for targeted analysis based on Cas9 excision of two sites flanking the target region and isolation of the excised DNA segment by pulsed field gel electrophoresis. The isolated target remains intact and is ideally suited for optical genome mapping and long-read sequencing at high coverage. In addition, analysis is performed directly on native genomic DNA that retains genetic and epigenetic composition without amplification bias. This method enables detection of mutations and structural variants as well as detailed analysis by generation of hybrid scaffolds composed of optical maps and sequencing data at a fraction of the cost of whole genome sequencing.


2018
Author(s):  
Koen Van Den Berge ◽  
Katharina Hembach ◽  
Charlotte Soneson ◽  
Simone Tiberi ◽  
Lieven Clement ◽  
...  

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.


Author(s):  
Ann McCartney ◽  
Elena Hilario ◽  
Seung-Sub Choi ◽  
Joseph Guhlin ◽  
Jessie Prebble ◽  
...  

We used long read sequencing data generated from Knightia excelsa R.Br., a nectar producing Proteaceae tree endemic to Aotearoa New Zealand, to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome construction. Establishing a high-quality genome for this species has specific cultural importance to Māori, the indigenous people, as well as commercial importance to honey producers in Aotearoa New Zealand. Assemblies were produced by five long read assemblers using data subsampled based on read lengths, two polishing strategies, and two Hi-C mapping methods. Our results from subsampling the data by read length showed that each assembler tested performed differently depending on the coverage and the read length of the data. Assemblies that used longer read lengths (>30 kb) and lower coverage were the most contiguous, k-mer and gene complete. The final genome assembly was constructed into pseudo-chromosomes using all available data assembled with FLYE, polished using Racon/Medaka/Pilon combined, scaffolded using SALSA2 and AllHiC, curated using Juicebox, and validated by synteny with Macadamia. We highlighted the importance of developing assembly workflows based on the volume and type of sequencing data and establishing a set of robust quality metrics for generating high quality assemblies. Scaffolding analyses highlighted that problems found in the initial assemblies could not be resolved accurately by utilizing Hi-C data and that scaffolded assemblies were more accurate when the underlying contig assembly was of higher accuracy. These findings provide insight into what is required for future high-quality de novo assemblies of non-model organisms.


2021
Author(s):  
Dongshunyi Li ◽  
Jun Ding ◽  
Ziv Bar-Joseph

One of the first steps in the analysis of single cell RNA-Sequencing data (scRNA-Seq) is the assignment of cell types. While a number of supervised methods have been developed for this, in most cases such assignment is performed by first clustering cells in low-dimensional space and then assigning cell types to different clusters. To overcome noise and to improve cell type assignments we developed UNIFAN, a neural network method that simultaneously clusters and annotates cells using known gene sets. UNIFAN combines both a low-dimensional representation of all genes and cell-specific gene set activity scores to determine the clustering. We applied UNIFAN to human and mouse scRNA-Seq datasets from several different organs. As we show, by using knowledge of gene sets, UNIFAN greatly outperforms prior methods developed for clustering scRNA-Seq data. The gene sets assigned by UNIFAN to different clusters provide strong evidence for the cell type represented by each cluster, making annotation easier.


2017
Author(s):  
Shaojun Tang ◽  
Subha Madhavan

Abstract Studies indicate that more than 90% of human genes are alternatively spliced, underscoring the complexity of transcriptome assembly and analysis. The splicing process is often disrupted, resulting in both functional and non-functional end-products (Sveen et al. 2016) in many cancers. Harnessing the immune system to fight against malignant cancers carrying aberrantly mutated or spliced products is becoming a promising approach to cancer therapy. Advances in immune checkpoint blockade have elicited adaptive immune responses with promising clinical responses to treatments against human malignancies (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017). Emerging data suggest that recognition of patient-specific mutation-associated cancer antigens (i.e. from alternative splicing isoforms) may allow scientists to dissect the immune response in the activity of clinical immunotherapies (Schumacher and Schreiber 2015). The advent of high-throughput sequencing technology has provided a comprehensive view of both splicing aberrations and somatic mutations across a range of human malignancies, allowing for a deeper understanding of the interplay of various disease mechanisms. Meanwhile, studies show that the number of transcript isoforms reported to date may be limited by short-read sequencing due to the inherent limitations of transcriptome reconstruction algorithms, whereas long-read sequencing is able to significantly improve the detection of alternative splicing variants since there is no need to assemble full-length transcripts from short reads.
The analysis of these high-throughput long-read sequencing data may permit a systematic view of tumor-specific peptide epitopes (also known as neoantigens) that could serve as targets for immunotherapy (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017). Currently, there is no software pipeline available that can efficiently produce mutation-associated cancer antigens from raw high-throughput sequencing data on patient tumor DNA (The Problem with Neoantigen Prediction 2017). To address this issue, we introduce an R package that enables the discovery of peptide epitope candidates, which are the tumor-specific peptide fragments containing potential functional neoantigens. These peptide epitopes consist of structural variants including insertions, deletions, alternative sequences, and peptides from nonsynonymous mutations. Analysis of these precursor candidates with widely used tools such as netMHC allows for the accurate in-silico prediction of neoantigens. The pipeline, named neoantigeR, is currently hosted at https://github.com/ICBI/neoantigeR.


Blood
2019
Vol 134 (Supplement_1)
pp. 457-457
Author(s):  
Govardhan Anande ◽  
Ashwin Unnikrishnan ◽  
Nandan Deshpande ◽  
Sylvain Mareschal ◽  
Aarif M. N. Batcha ◽  
...  

RNA splicing is a fundamental biological process that generates protein diversity from a finite set of genes. Recurrent somatic mutations of genes involved in RNA splicing are present at high frequency in Myelodysplasia (up to 70%) but less so in Acute Myeloid Leukemia (AML; less than 20%). To investigate whether there were aberrant and recurrent RNA splicing events in the AML transcriptome that were associated with poor prognosis in the absence of splicing factor mutations, we developed a bioinformatics pipeline to systematically annotate and quantify alternative splicing events from RNA-sequencing data (Fig A). We first analysed publicly available RNA-seq data from The Cancer Genome Atlas (TCGA, n=170). We focussed on non-M3 AML patients with no splicing factor mutations (based on reported genomic sequencing and verified by re-analysis of RNA-seq data from all patients) who had received intensive chemotherapy. We segregated these patients based on their European Leukaemia Net (ELN) risk classification and identified 1290 alternatively spliced events impacting 910 genes that were significantly different (FDR<0.05) between all ELNAdv (n=41) versus all ELNFav patients (n=21, Fig B). The majority were exon skipping events (716 events, 62%, Fig B-C), followed by intron retention (201 events, 15.6%, Fig B). We next used RNA-seq data from a second non-M3 AML patient cohort (ClinSeq-Sweden; ELNAdv, n=75 and ELNFav, n=47), detecting 2507 events mapping to 1566 genes. Comparing across the two cohorts, 222 shared genes were detected to be affected by alternative splicing (Fig D). Ingenuity pathway analysis associated these genes with pathways related to protein translation. In order to prioritise those alternatively spliced events most likely to have a deleterious function, we developed an analytical framework to predict their impact on protein structure (Fig E).
87 alternatively spliced events, 25.81% of the commonly shared splicing events, relating to 78 genes (35.13% of all genes) were predicted to directly alter highly conserved protein domains within the affected genes, leading to either a complete (~25%, Fig E) or a partial loss of a domain (20%, Fig E). These in silico predictions are likely to be an underestimate of the true impact, as splicing alterations mapping to poorly annotated domains or affecting the tertiary structure of proteins would be missed. A number of splicing factors themselves were differentially spliced, with the alternative splicing predicted to have functional consequences. This was exemplified by hnRNPA1, a factor with well-established roles in splicing, which is itself alternatively spliced in patients and predicted to be deleterious. Consistent with this, motif scanning analyses indicated that a number of mis-spliced transcripts had hnRNPA1 binding motifs (Fig F). To assess the impact of these alternatively spliced events (that were predicted to also disrupt highly conserved protein domains) on the transcriptome, we simultaneously quantified differential gene expression. IPA analysis of the 602 genes that were differentially expressed between ELNAdv and ELNFav patients and shared between both TCGA and ClinSeq cohorts indicated that they were associated with pathways (Fig G) that were distinct from those associated with aberrantly spliced genes (Fig D). A number of pathways related to inflammation were enriched amongst the genes observed to be upregulated in ELNAdv patients (Fig G). Network analyses integrating the alternatively spliced genes with differentially expressed genes revealed strong interactions (Fig H), indicating functional associations between these biological events. Given these strong network interactions, we investigated the potential prognostic significance of these alternatively spliced events.
To this end, we utilised machine-learning methods to derive a "splicing signature" of four mis-spliced genes with a predictive capacity equivalent to the ELN (Fig I). The splicing signature further refined existing risk prediction algorithms to improve the classification of patients (Fig J). Taken together, we report the presence of extensive deregulation of RNA splicing in AML patients even in the absence of splicing factor mutations. Many of these events were shared in patients with adverse outcomes and their impact on the AML transcriptome points towards vulnerabilities that could be targeted. Disclosures: Unnikrishnan: Celgene: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding. Lehmann: TEVA: Consultancy, Membership on an entity's Board of Directors or advisory committees; Pfizer: Membership on an entity's Board of Directors or advisory committees; AbbVie: Membership on an entity's Board of Directors or advisory committees. Pimanda: Celgene: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding.

