LongGF: computational algorithm and software tool for fast and accurate detection of gene fusions by long-read transcriptome sequencing

Abstract Background: Long-read RNA-Seq techniques can generate reads that encompass a large proportion of, or the entire, mRNA/cDNA molecule, so they are expected to address inherent limitations of short-read RNA-Seq techniques that typically generate < 150 bp reads. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data that take into account the high basecalling error rates and the presence of alignment errors. Results: In this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF detected known gene fusions better than existing computational tools. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset on acute myeloid leukemia and pinpointed the exact location of a translocation (previously known at cytogenetic resolution) at base resolution, which was further validated by Sanger sequencing. Conclusions: In summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-Seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.

Fusion detection and quantification by pseudoalignment

10.1101/166322 ◽

2017 ◽

Cited By ~ 10

Author(s):

Páll Melsted ◽

Shannon Hateley ◽

Isaac Charles Joseph ◽

Harold Pimentel ◽

Nicolas Bray ◽

...

Keyword(s):

De Novo ◽

Chromosomal Rearrangements ◽

Clinical Use ◽

Gene Fusions ◽

Fusion Genes ◽

Rna Seq ◽

Sequencing Data ◽

Transcript Quantification ◽

Novel Approach ◽

Fusion Detection

RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly
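
The core idea above, flagging read pairs whose two mates share no compatible transcript, can be illustrated with a minimal Python sketch. This is not pizzly's implementation: the input format (one set of compatible transcripts per mate), the transcript-to-gene map, and the function name `fusion_candidates` are all hypothetical, and real pseudoalignment derives these sets from k-mer matches rather than receiving them precomputed.

```python
from collections import Counter

def fusion_candidates(pairs, tx_to_gene):
    """Count gene pairs supported by read pairs with conflicting matches.

    pairs: iterable of (left_txs, right_txs), the sets of transcripts
           compatible with each mate (hypothetical input format).
    tx_to_gene: dict mapping transcript ID to gene name.
    """
    support = Counter()
    for left_txs, right_txs in pairs:
        # Skip pairs where one mate matched nothing at all.
        if not left_txs or not right_txs:
            continue
        # A shared transcript means the pair pseudoaligns normally.
        if left_txs & right_txs:
            continue
        left_genes = {tx_to_gene[t] for t in left_txs}
        right_genes = {tx_to_gene[t] for t in right_txs}
        # Count only unambiguous pairs that join two different genes.
        if len(left_genes) == 1 and len(right_genes) == 1 and left_genes != right_genes:
            key = (next(iter(left_genes)), next(iter(right_genes)))
            support[key] += 1
    return support

# Toy data with placeholder identifiers (not from any real dataset).
tx_to_gene = {"tA1": "GENE_A", "tA2": "GENE_A", "tB1": "GENE_B"}
pairs = [
    ({"tA1", "tA2"}, {"tA1"}),  # concordant: both mates fit tA1
    ({"tA1", "tA2"}, {"tB1"}),  # conflicting: GENE_A vs GENE_B
    ({"tA2"}, {"tB1"}),         # conflicting: GENE_A vs GENE_B
    (set(), {"tB1"}),           # one mate unmatched
]
print(fusion_candidates(pairs, tx_to_gene))
```

In this toy input only the two conflicting pairs contribute, so the GENE_A/GENE_B candidate receives a support count of 2. A real tool would additionally filter false positives and assemble the supporting reads, as the abstract describes.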

Comprehensive Multi-Omics Analysis of Gene Fusions in a Large Multiple Myeloma Cohort

Blood ◽

10.1182/blood-2018-99-117245 ◽

2018 ◽

Vol 132 (Supplement 1) ◽

pp. 1898-1898

Author(s):

Steven M. Foltz ◽

Qingsong Gao ◽

Christopher J. Yoon ◽

Amila Weerasinghe ◽

Hua Sun ◽

...

Keyword(s):

Multiple Myeloma ◽

Board Of Directors ◽

Research Funding ◽

Gene Fusion ◽

Gene Fusions ◽

Rna Seq ◽

Advisory Committees ◽

Time Points ◽

Detection Algorithms ◽

Fusion Detection

Abstract Introduction: Gene fusions are the result of genomic rearrangements that create hybrid protein products or bring the regulatory elements of one gene into close proximity of another. Fusions often dysregulate gene function or expression through oncogene overexpression or tumor suppressor underexpression (Gao, Liang, Foltz, et al. Cell Rep 2018). Some fusions such as EML4-ALK in lung adenocarcinoma are known druggable targets. Fusion detection algorithms utilize discordantly mapped RNA-seq reads. Careful consideration of detection and filtering procedures is vital for large-scale fusion detection because current methods are prone to reporting false positives and show poor concordance. Multiple myeloma (MM) is a blood cancer in which rapidly expanding clones of plasma cells spread in the bone marrow. Translocations that juxtapose the highly expressed IGH enhancer with potential oncogenes are associated with overexpression of partner genes, although they may not lead to a detectable gene fusion in RNA-seq data. Previous studies have explored the fusion landscape of multiple myeloma cohorts (Cleynen, et al. Nat Comm 2017; Nasser, et al. Blood 2017). In this study, we developed a novel gene fusion detection pipeline and post-processing strategy to analyze 742 patient samples at the primary time point and 64 samples at follow-up time points (806 total samples) from the Multiple Myeloma Research Foundation (MMRF) CoMMpass Study using RNA-seq, WGS, and clinical data.

Methods and Results: We overlapped five fusion detection algorithms (EricScript, FusionCatcher, INTEGRATE, PRADA, and STAR-Fusion) to report fusion events. Our filtered call set consisted of 2,817 fusions with a median of 3 fusions per sample (mean 3.8), similar to glioblastoma, breast, ovarian, and prostate cancers in TCGA. Major recurrent fusions involving immunoglobulin genes included IGH-WHSC1 (88 primary samples), IGL-BMI1 (29), and the upstream neighbor of MYC, PVT1, paired with IGH (6), IGK (3), and IGL (11). For each event, we used WGS data when available to determine if there was genomic support of the gene fusion (based on discordant WGS reads, SV event detection, and MMRF CoMMpass Seq-FISH WGS results) (Miller, et al. Blood 2016). WGS validation rates varied by the level of RNA-seq evidence supporting each fusion, with an overall rate of 24.1%, which is comparable to previously observed pan-cancer validation rates using low-pass WGS. We calculated the association between fusion status and gene expression and identified genes such as BCL2L11, CCND1/2, LTBR, and TXNDC5 that showed significant overexpression (t-test). We explored the clinical connections of fusion events through survival analysis and clinical data correlations, and by mining potentially druggable targets from our Database of Evidence for Precision Oncology (dinglab.wustl.edu/depo) (Sun, Mashl, Sengupta, et al. Bioinformatics 2018). Major examples of upregulated fusion kinases that could potentially be targeted with off-label drug use include FGFR3 and NTRK1. We examined the evolution of fusion events over multiple time points. In one MMRF patient with a t(8;14) translocation joining the IGH locus and transcription factor MAFA, we observed IGH fusions with TOP1MT (neighbor of MAFA) at all four time points with corresponding high expression of TOP1MT and MAFA. Using non-MMRF single-cell RNA data from different patients, we were able to track cell-type composition over time as well as detect subpopulations of cells harboring fusions at different time points with potential treatment implications.

Discussion: Gene fusions offer potential targets for alternative MM therapies. Careful implementation of gene fusion detection algorithms and post-processing are essential in large cohort studies to reduce false positives and enrich results for clinically relevant information. Clinical fusion detection from untargeted RNA-seq remains a challenge due to poor sensitivity, specificity, and usability. By combining MMRF CoMMpass data from multiple platforms, we have produced a comprehensive fusion profile of 742 MM patients. We have shown novel gene fusion associations with gene expression and clinical data, and we identified candidates for druggability studies.

Disclosures Vij: Bristol-Myers Squibb: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Celgene: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Jazz Pharmaceuticals: Honoraria, Membership on an entity's Board of Directors or advisory committees; Jansson: Honoraria, Membership on an entity's Board of Directors or advisory committees; Amgen: Honoraria, Membership on an entity's Board of Directors or advisory committees; Karyopharma: Honoraria, Membership on an entity's Board of Directors or advisory committees; Takeda: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding.
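
The caller-overlap step mentioned in the Methods and Results of this abstract can be sketched roughly as follows. This is an illustration only: the abstract does not state how many callers had to agree or how calls were matched, so the `min_callers` threshold, the gene-pair key, the function name `consensus_fusions`, and the toy calls (placeholder gene names) are all assumptions.

```python
from collections import defaultdict

def consensus_fusions(calls_by_caller, min_callers=2):
    """Return fusions reported by at least `min_callers` tools.

    calls_by_caller: dict mapping caller name to an iterable of
                     (gene5, gene3) tuples (hypothetical input format).
    min_callers: assumed agreement threshold, not taken from the study.
    """
    supporters = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for gene5, gene3 in calls:
            supporters[(gene5, gene3)].add(caller)
    # Keep each fusion with the sorted list of tools that reported it.
    return {
        fusion: sorted(callers)
        for fusion, callers in supporters.items()
        if len(callers) >= min_callers
    }

# Toy calls with placeholder gene names; not results from the study.
calls = {
    "EricScript": [("GENE_A", "GENE_B"), ("GENE_X", "GENE_Y")],
    "FusionCatcher": [("GENE_A", "GENE_B")],
    "STAR-Fusion": [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")],
    "INTEGRATE": [("GENE_C", "GENE_D")],
    "PRADA": [],
}
print(consensus_fusions(calls, min_callers=2))
```

Raising `min_callers` trades sensitivity for fewer false positives, which is the tension the abstract points to when it notes that current methods show poor concordance.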

MetaFusion: A high-confidence metacaller for filtering and prioritizing RNA-seq gene fusion candidates

10.1101/2020.09.17.302307 ◽

2020 ◽

Author(s):

Michael Apostolides ◽

Yue Jiang ◽

Mia Husić ◽

Robert Siddaway ◽

Cynthia Hawkins ◽

...

Keyword(s):

Gene Fusion ◽

Graph Clustering ◽

Gene Fusions ◽

Rna Seq ◽

High Confidence ◽

Link Type ◽

Final Output ◽

The Right ◽

And Function ◽

Fusion Detection

Abstract Motivation: Gene fusions are often associated with cancer, yet current fusion detection tools vary in their calling approaches, making selecting the right tool challenging. Ensemble fusion calling techniques appear promising; however, current options have limited accessibility and function. Results: MetaFusion is a flexible meta-calling tool that amalgamates the outputs from any number of fusion callers. Results from individual callers are converted into Common Fusion Format, a new file type that standardizes outputs from callers. Calls are then annotated, merged using graph clustering, filtered and ranked to provide a final output of high confidence candidates. MetaFusion consistently outperformed individual callers with respect to recall and precision on real and simulated datasets, achieving up to 100% precision. Thus, an ensemble calling approach is imperative for high confidence results. MetaFusion also labels fusions found in databases using the FusionAnnotator package, and is provided with a benchmarking toolkit to calibrate new callers. Availability: MetaFusion is freely available at https://github.com/ccmbioinfo/[email protected]
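
One way to picture merging calls by graph clustering is to treat each call as a node, link calls that appear to describe the same event, and report connected components. The sketch below does this with a small union-find structure. It is not MetaFusion's algorithm: the linking rules (identical gene pair, or both breakpoints within `window` bases), the record layout, and the name `cluster_calls` are assumptions made for illustration.

```python
def cluster_calls(calls, window=100):
    """Group fusion calls into connected components.

    calls: list of dicts with keys 'caller', 'genes' (a frozenset of the
           two gene names) and 'bp' (a pair of (chrom, pos) breakpoints,
           assumed to be listed in the same order by every caller).
    Two calls are linked when they share a gene pair or when both
    breakpoints lie within `window` bases on the same chromosomes.
    """
    parent = list(range(len(calls)))

    def find(i):
        # Find the component root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def near(a, b):
        return a[0] == b[0] and abs(a[1] - b[1]) <= window

    for i in range(len(calls)):
        for j in range(i + 1, len(calls)):
            same_genes = calls[i]["genes"] == calls[j]["genes"]
            same_bp = all(near(a, b) for a, b in zip(calls[i]["bp"], calls[j]["bp"]))
            if same_genes or same_bp:
                parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i, call in enumerate(calls):
        clusters.setdefault(find(i), []).append(call["caller"])
    return sorted(sorted(members) for members in clusters.values())

# Toy calls with placeholder names and coordinates.
calls = [
    {"caller": "caller1", "genes": frozenset({"GENE_A", "GENE_B"}),
     "bp": (("chr1", 1000), ("chr2", 5000))},
    {"caller": "caller2", "genes": frozenset({"GENE_A", "GENE_B"}),
     "bp": (("chr1", 1010), ("chr2", 5003))},
    # Same breakpoints, but one partner annotated under another name.
    {"caller": "caller3", "genes": frozenset({"GENE_A", "GENE_B2"}),
     "bp": (("chr1", 1005), ("chr2", 4990))},
    {"caller": "caller1", "genes": frozenset({"GENE_C", "GENE_D"}),
     "bp": (("chr3", 200), ("chr4", 900))},
]
print(cluster_calls(calls))
```

Linking on breakpoints as well as gene names lets the third call join the first cluster even though its annotation differs, which is one reason a graph-based merge can be more forgiving than exact matching on gene pairs.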

The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab028 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Xueyi Dong ◽

Luyi Tian ◽

Quentin Gouil ◽

Hasaru Kariyawasam ◽

Shian Su ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Transcriptomic Analysis ◽

Statistical Testing ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Sequencing Platform ◽

Long Read

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging because high sequence error rates and small library sizes decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) and a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.

Statistical assessment of gene fusion detection algorithms using RNA Sequencing Data

2012 IEEE Statistical Signal Processing Workshop (SSP) ◽

10.1109/ssp.2012.6319801 ◽

2012 ◽

Cited By ~ 1

Author(s):

Vinay Varadan ◽

Angel Janevski ◽

Sitharthan Kamalakaran ◽

Nilanjana Banerjee ◽

Nevenka Dimitrova ◽

...

Keyword(s):

Rna Sequencing ◽

Gene Fusion ◽

Sequencing Data ◽

Statistical Assessment ◽

Detection Algorithms ◽

Fusion Detection

Gene fusions evidence in a KIT/PDGFRA wild-type GIST without mutations in SDH units identified by a whole transcriptome study.

Journal of Clinical Oncology ◽

10.1200/jco.2013.31.15_suppl.e21523 ◽

2013 ◽

Vol 31 (15_suppl) ◽

pp. e21523-e21523

Author(s):

Milena Urbini ◽

Annalisa Astolfi ◽

Valentina Indio ◽

Maristella Saponara ◽

Margherita Nannini ◽

...

Keyword(s):

Gene Fusion ◽

Surgical Removal ◽

Erk Pathway ◽

Gene Fusions ◽

Rna Seq ◽

Wild Type ◽

Rna Pol Ii ◽

Molecular Events ◽

Frame Fusion ◽

Whole Transcriptome

e21523 Background: A subset of KIT/PDGFRA wild-type (WT) GISTs harbour mutations in SDH units. In the majority of the remaining cases of WT GIST, no other molecular events are identified. We performed RNA-seq on a WT GIST without mutations in SDH genes, using a next-generation sequencing approach, to discover molecular events in this GIST population. Methods: In 2003, a 63-year-old woman underwent surgery for an ileal GIST (size 6 cm, MI 6/50 HPF). After 6 years, she developed a recurrence with a single hepatic lesion. KIT and PDGFRA analysis of the lesion did not show mutations; therefore, she did not receive imatinib and instead underwent surgical removal. Analysis of all SDH units also did not show mutations, so paired-end RNA-seq (75X2) was performed on the Illumina HiScanSQ platform. After mapping the short reads to the human genome (HG19), SNVs and InDels were called by SNVMix2 with an accurate filtering procedure that included predictors of mutation effect at the protein level. Gene fusion discovery was based on agreement between the DeFuse, ChimeraScan and FusionMap tools, and fusions were validated by Sanger sequencing using primers spanning the mRNA breakpoints. Results: Four different gene fusions and 206 non-synonymous SNVs were discovered, of which 62 were called deleterious by at least one predictor, and they are undergoing further validation. The SPRED2-NELFCD gene fusion originated from an interchromosomal translocation-inversion between chr 20 and 2. The event involved exon 1 of SPRED2 and exon 11 of NELFCD, probably leading to inactivation of both genes. NELFCD encodes a component of the NELF complex that negatively regulates transcription elongation by RNA pol II, while SPRED2 is a member of the Sprouty/SPRED family that represses growth factor-induced activation of the MAPK/ERK pathway. The other three events were intrachromosomal aberrations: MARK2-PPFIA1 and PLA2G16-ATL3 on chr 11 and ASCC1-C10orf11 on chr 10. Only the first event led to an in-frame fusion (MARK2 ex1-PPFIA1 ex2), probably dysregulating the expression of the downstream gene. Conclusions: This is the first evidence of gene fusions in GIST. The oncogenetic role and the tumor frequency of these events deserve to be studied.

SimFuse: A Novel Fusion Simulator for RNA Sequencing (RNA-Seq) Data

BioMed Research International ◽

10.1155/2015/780519 ◽

2015 ◽

Vol 2015 ◽

pp. 1-5 ◽

Cited By ~ 2

Author(s):

Yuxiang Tan ◽

Yann Tambouret ◽

Stefano Monti

Keyword(s):

Sample Size ◽

Rna Sequencing ◽

High Throughput Sequencing ◽

Performance Metrics ◽

Simulated Data ◽

Real Data ◽

Rna Seq ◽

Sequencing Data ◽

Detection Algorithms ◽

Fusion Detection

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.
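
The accuracy measures named above, precision and recall, can be computed from a simulated truth set as in the minimal sketch below. It is not part of SimFuse: the function name `precision_recall`, the gene-pair representation, and the toy fusions are assumptions, and a real benchmark would repeat the calculation for each simulated dataset (for example, one per number of supporting reads).

```python
def precision_recall(called, truth):
    """Compute precision and recall for a set of fusion calls.

    called: iterable of fusion identifiers reported by a tool.
    truth:  iterable of fusion identifiers known to be simulated.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: simulated and reported
    fp = len(called - truth)   # false positives: reported, not simulated
    fn = len(truth - called)   # false negatives: simulated, not reported
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall

# Toy example with placeholder fusions: 4 simulated, 3 called, 2 correct.
truth = {("G1", "G2"), ("G3", "G4"), ("G5", "G6"), ("G7", "G8")}
called = {("G1", "G2"), ("G3", "G4"), ("G9", "G10")}
precision, recall = precision_recall(called, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the truth set is known exactly in simulated data, both measures can be estimated without manual validation, which is the advantage over real data that the abstract emphasizes.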

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies as well as agricultural and clinical applications. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

annoFuse: an R Package to annotate, prioritize, and interactively explore putative oncogenic RNA fusions

10.1101/839738 ◽

2019 ◽

Author(s):

Krutika S. Gaonkar ◽

Federico Marini ◽

Komal S. Rathi ◽

Payal Jain ◽

Yuankun Zhu ◽

...

Keyword(s):

Brain Tumor ◽

Web Application ◽

Gene Fusion ◽

Protein Domains ◽

Pediatric Brain Tumor ◽

R Package ◽

Gene Fusions ◽

Rna Seq ◽

Somatic Variation ◽

Pediatric Brain

Abstract Background: Gene fusion events are a significant source of somatic variation across adult and pediatric cancers and are some of the most clinically effective therapeutic targets, yet low consensus of RNA-Seq fusion prediction algorithms makes therapeutic prioritization difficult. In addition, events such as polymerase read-throughs, mis-mapping due to gene homology, and fusions occurring in healthy normal tissue require informed filtering, making it difficult for researchers and clinicians to rapidly discern gene fusions that might be true underlying oncogenic drivers of a tumor and, in some cases, appropriate targets for therapy. Results: We developed annoFuse, an R package, and shinyFuse, a companion web application, to annotate, prioritize, and explore biologically relevant expressed gene fusions, downstream of fusion calling. We validated annoFuse using a random cohort of TCGA RNA-Seq samples (N = 160) and achieved a 96% sensitivity for retention of high-confidence fusions (N = 603). annoFuse uses FusionAnnotator annotations to filter non-oncogenic and/or artifactual fusions. Then, fusions are prioritized if they were previously reported in TCGA and/or contain gene partners that are known oncogenes, tumor suppressor genes, COSMIC genes, and/or transcription factors. We applied annoFuse to fusion calls from pediatric brain tumor RNA-Seq samples (N = 1,028) provided as part of the Open Pediatric Brain Tumor Atlas (OpenPBTA) Project to determine recurrent fusions and recurrently fused genes within different brain tumor histologies. annoFuse annotates protein domains using the PFAM database, assesses reciprocality, and annotates gene partners for kinase domain retention. As a standard function, reportFuse enables generation of a reproducible R Markdown report to summarize filtered fusions, visualize breakpoints and protein domains by transcript, and plot recurrent fusions within cohorts. Finally, we created shinyFuse for algorithm-agnostic interactive exploration and plotting of gene fusions. Conclusions: annoFuse provides standardized filtering and annotation for gene fusion calls from STAR-Fusion and Arriba by merging, filtering, and prioritizing putative oncogenic fusions across large cancer datasets, as demonstrated here with data from the OpenPBTA project. We are expanding the package to be widely applicable to other fusion algorithms and expect annoFuse to provide researchers a method for rapidly evaluating, prioritizing, and translating fusion findings in patient tumors.

The landscape of gene fusions in hepatocellular carcinoma

10.1101/055376 ◽

2016 ◽

Author(s):

Chengpei Zhu ◽

Yanling Lv ◽

Liangcai Wu ◽

Jinxia Guan ◽

Xue Bai ◽

...

Keyword(s):

Hepatocellular Carcinoma ◽

Cancer Progression ◽

Gene Fusion ◽

Early Stage ◽

Pcr Analysis ◽

Gene Fusions ◽

Fusion Genes ◽

Rna Seq ◽

Rt Pcr ◽

Critical Function

Abstract Most hepatocellular carcinoma (HCC) patients are diagnosed at advanced stages and have limited treatment options. Challenges in early stage diagnosis may be due to the genetic complexity of HCC. Gene fusion plays a critical function in tumorigenesis and cancer progression in multiple cancers, yet the identities of fusion genes as potential diagnostic markers in HCC have not been investigated. Paired-end RNA sequencing was performed on noncancerous and cancerous lesions in two representative HBV-HCC patients. Potential fusion genes were identified by STAR-Fusion in STAR software and validated against four publicly available RNA-seq datasets. Fourteen pairs of frozen HBV-related HCC samples and adjacent non-tumor liver tissues were examined by RT-PCR analysis for gene fusion expression. We identified 2,354 different gene fusions in the two HBV-HCC patients. Validation analysis against the four RNA-seq datasets revealed only 1.8% (43/2,354) as recurrent fusions that were supported by public datasets. Comparison with four fusion databases demonstrated that three (HLA-DPB2-HLA-DRB1, CDH23-HLA-DPB1, and C15orf57-CBX3) out of 43 recurrent gene fusions were annotated as disease-related fusion events. Nineteen were novel recurrent fusions not previously annotated to diseases, including DCUN1D3-GSG1L and SERPINA5-SERPINA9. RT-PCR and Sanger sequencing of 14 pairs of HBV-related HCC samples confirmed expression of six of the new fusions, including RP11-476K15.1-CTD-2015H3.2. Our study provides new insights into gene fusions in HCC and could contribute to the development of anti-HCC therapy. RP11-476K15.1-CTD-2015H3.2 may serve as a new therapeutic biomarker in HCC.
