FDJD: RNA-Seq Based Fusion Transcript Detection Using Jaccard Distance


10.1101/2021.11.17.469019 ◽

2021 ◽

Author(s):

Hamid Reza Mohebbi ◽

Nurit Haspel

Keyword(s):

False Positive Rate ◽

Cancer Cell Line ◽

Fusion Transcript ◽

High Volume ◽

Detection Methods ◽

Data Sets ◽

Gene Fusions ◽

Rna Seq ◽

Jaccard Distance ◽

Fusion Detection

Gene fusion events, which are the result of two genes fused together to create a hybrid gene, were first described in cancer cells in the early 1980s. These events are relatively common in many cancers including prostate, lymphoid, soft tissue, and breast. Recent advances in next-generation sequencing (NGS) provide a high volume of genomic data, including cancer genomes. The detection of possible gene fusions requires fast and accurate methods. However, current methods suffer from inefficiency, lack of sufficient accuracy, and a high false-positive rate. We present an RNA-Seq fusion detection method that uses dimensionality reduction and parallel computing to speed up the computation. We convert the RNA categorical space into compact binary arrays called binary fingerprints, which enables us to reduce the memory usage and increase efficiency. The search and detection of fusion candidates are done using the Jaccard distance. The detection of candidates is followed by refinement. We benchmarked our fusion prediction accuracy using both simulated and genuine RNA-Seq datasets. Paired-end Illumina RNA-Seq genuine data were obtained from 60 publicly available cancer cell line data sets. The results are compared against state-of-the-art methods such as STAR-Fusion, InFusion, and TopHat-Fusion. Our results show that FDJD exhibits superior accuracy compared to popular alternative fusion detection methods. We achieved 90% accuracy on simulated fusion transcript inputs, which is the highest among the compared methods, while maintaining comparable run time.
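The fingerprint-and-Jaccard idea in this abstract can be illustrated with a small sketch. The abstract does not give FDJD's actual encoding, so everything here is an assumption made for illustration: the fingerprint sets one bit per distinct k-mer, and the names `fingerprint` and `jaccard_distance` are hypothetical.

```python
# Illustrative sketch only: NOT the FDJD implementation.
# Assumption: a fingerprint has one bit per possible k-mer (4**k bits),
# stored in a Python int used as a bit array.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def fingerprint(seq: str, k: int = 3) -> int:
    """Set bit i if the k-mer whose base-4 encoding is i occurs in seq."""
    bits = 0
    for start in range(len(seq) - k + 1):
        index = 0
        for base in seq[start:start + k]:
            index = index * 4 + CODE[base]
        bits |= 1 << index
    return bits


def jaccard_distance(a: int, b: int) -> float:
    """Return 1 - |A and B| / |A or B| for two bit-array fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    intersection = bin(a & b).count("1")
    return 1.0 - intersection / union


if __name__ == "__main__":
    read = fingerprint("ACGTACGT")
    reference = fingerprint("ACGTTTGA")
    print(jaccard_distance(read, reference))
```

Because the comparison reduces to bitwise AND/OR plus a population count, it is cheap and easy to run in parallel over many read and reference pairs, which matches the efficiency motivation stated in the abstract.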


STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq

10.1101/120295 ◽

2017 ◽

Cited By ~ 69

Author(s):

Brian J. Haas ◽

Alex Dobin ◽

Nicolas Stransky ◽

Bo Li ◽

Xiao Yang ◽

...

Keyword(s):

Fusion Transcript ◽

Superior Performance ◽

Detection Methods ◽

Molecular Evidence ◽

Detection Accuracy ◽

Fusion Genes ◽

Rna Seq ◽

Accurate Identification ◽

Large Numbers ◽

Fusion Detection

Abstract Motivation: Fusion genes created by genomic rearrangements can be potent drivers of tumorigenesis. However, accurate identification of functional fusion genes from genomic sequencing requires whole genome sequencing, since exonic sequencing alone is often insufficient. Transcriptome sequencing provides a direct, highly effective alternative for capturing molecular evidence of expressed fusions in the precision medicine pipeline, but current methods tend to be inefficient or insufficiently accurate, lacking in sensitivity or predicting large numbers of false positives. Here, we describe STAR-Fusion, a method that is both fast and accurate in identifying fusion transcripts from RNA-Seq data. Results: We benchmarked STAR-Fusion’s fusion detection accuracy using both simulated and genuine Illumina paired-end RNA-Seq data, and show that it has superior performance compared to popular alternative fusion detection methods. Availability and implementation: STAR-Fusion is implemented in Perl, freely available as open source software at http://star-fusion.github.io, and supported on [email protected]


Sensitive, reliable, and robust circRNA detection from RNA-seq with CirComPara2

10.1101/2021.02.18.431705 ◽

2021 ◽

Author(s):

Enrico Gaffo ◽

Alessia Buratin ◽

Anna Dal Molin ◽

Stefania Bortoluzzi

Keyword(s):

Real Data ◽

Detection Algorithm ◽

Detection Methods ◽

Circular Rnas ◽

Data Sets ◽

Rna Seq ◽

Bioinformatics Tool ◽

Diverse Data ◽

High Throughput Study ◽

Discovery Rates

Abstract Current methods for identifying circular RNAs (circRNAs) suffer from low discovery rates and inconsistent performance in diverse data sets. Therefore, the applied detection algorithm can bias high-throughput study findings by missing relevant circRNAs. Here, we show that our bioinformatics tool CirComPara2 (https://github.com/egaffo/CirComPara2), by combining multiple circRNA detection methods, consistently achieves high recall rates without loss of precision in simulated and different real-data sets.


Alignment-free filtering for cfNA fusion fragments

Bioinformatics ◽

10.1093/bioinformatics/btz346 ◽

2019 ◽

Vol 35 (14) ◽

pp. i225-i232 ◽

Cited By ~ 2

Author(s):

Xiao Yang ◽

Yasushi Saito ◽

Arjun Rao ◽

Hyunsung John Kim ◽

Pranav Singh ◽

...

Keyword(s):

Nucleic Acid ◽

Cell Line ◽

De Novo ◽

High Sensitivity ◽

Detection Methods ◽

Rna Seq ◽

Sequencing Data ◽

Alignment Free ◽

Fusion Detection ◽

High Depth

Abstract Motivation Cell-free nucleic acid (cfNA) sequencing data require improvements to existing fusion detection methods along multiple axes: high depth of sequencing, low allele fractions, short fragment lengths and specialized barcodes, such as unique molecular identifiers. Results AF4 was developed to address these challenges. It uses a novel alignment-free k-mer-based method to detect candidate fusion fragments with high sensitivity and orders of magnitude faster than existing tools. Candidate fragments are then filtered using a max-cover criterion that significantly reduces spurious matches while retaining authentic fusion fragments. This efficient first stage reduces the data sufficiently that commonly used criteria can process the remaining information, or sophisticated filtering policies that may not scale to the raw reads can be used. AF4 provides both targeted and de novo fusion detection modes. We demonstrate both modes in benchmark simulated and real RNA-seq data as well as clinical and cell-line cfNA data. Availability and implementation AF4 is open sourced, licensed under Apache License 2.0, and is available at: https://github.com/grailbio/bio/tree/master/fusion.
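A k-mer-based, alignment-free first pass of the kind described here might look roughly like the sketch below. It is not AF4's algorithm: the max-cover filter is not implemented, and the function names, the `min_hits` threshold, and the toy sequences are all assumptions for illustration. A fragment is flagged when its k-mers match at least two different genes.

```python
# Illustrative sketch only: NOT the AF4 implementation (no max-cover filtering).
from collections import Counter, defaultdict
from typing import Dict, List, Set


def build_kmer_index(genes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer in the reference sequences to the genes containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def matched_genes(fragment: str, index: Dict[str, Set[str]], k: int,
                  min_hits: int = 2) -> List[str]:
    """Return genes supported by at least min_hits gene-specific k-mers."""
    hits: Counter = Counter()
    for i in range(len(fragment) - k + 1):
        owners = index.get(fragment[i:i + k], set())
        if len(owners) == 1:  # skip k-mers shared by several genes
            hits[next(iter(owners))] += 1
    return sorted(gene for gene, n in hits.items() if n >= min_hits)


def is_fusion_candidate(fragment: str, index: Dict[str, Set[str]], k: int,
                        min_hits: int = 2) -> bool:
    """A fragment is a candidate if it matches two or more distinct genes."""
    return len(matched_genes(fragment, index, k, min_hits)) >= 2


if __name__ == "__main__":
    # Toy reference sequences, invented for this example.
    toy_genes = {"GENE_A": "ACGTACGGTCA", "GENE_B": "TTGCCATTGGA"}
    toy_index = build_kmer_index(toy_genes, k=5)
    print(is_fusion_candidate("ACGTACGCATTGGA", toy_index, k=5))
```

Looking k-mers up in a hash table avoids alignment altogether, so the expensive filtering only has to run on the small set of flagged fragments.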


Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

10.1101/020784 ◽

2015 ◽

Cited By ~ 7

Author(s):

David M Rocke ◽

Luyao Ruan ◽

Yilun Zhang ◽

J. Jared Gossett ◽

Blythe Durbin-Johnson ◽

...

Keyword(s):

Linear Model ◽

False Positive ◽

Negative Binomial ◽

False Positive Rate ◽

Real Data ◽

False Positives ◽

P Value ◽

Data Sets ◽

Rna Seq ◽

Positive Rate

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10^-4, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
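The arithmetic behind the "about 1 gene" statement is the number of tests times the cutoff, 10,000 × 10^-4 = 1, because p-values from a valid test are uniformly distributed when the null hypothesis is true. The short simulation below checks this under that idealized assumption. It draws uniform p-values directly rather than running any differential expression method, and the function names are hypothetical.

```python
# Illustrative sketch: expected false positives for a perfectly calibrated test.
import random


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha


def simulate_null_rejections(n_tests: int, alpha: float, n_repeats: int,
                             seed: int = 0) -> float:
    """Average count of p < alpha over repeated draws of uniform p-values."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_repeats):
        total += sum(1 for _ in range(n_tests) if rng.random() < alpha)
    return total / n_repeats


if __name__ == "__main__":
    print(expected_false_positives(10_000, 1e-4))
    print(simulate_null_rejections(10_000, 1e-4, n_repeats=200))
```

A method whose observed rejection count on null data is far above this expectation is, in the abstract's terms, producing excess false positives.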


Fusion detection and quantification by pseudoalignment

10.1101/166322 ◽

2017 ◽

Cited By ~ 10

Author(s):

Páll Melsted ◽

Shannon Hateley ◽

Isaac Charles Joseph ◽

Harold Pimentel ◽

Nicolas Bray ◽

...

Keyword(s):

De Novo ◽

Chromosomal Rearrangements ◽

Clinical Use ◽

Gene Fusions ◽

Fusion Genes ◽

Rna Seq ◽

Sequencing Data ◽

Transcript Quantification ◽

Novel Approach ◽

Fusion Detection

RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly
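The idea of flagging read pairs that "cannot be pseudoaligned due to conflicting matches" can be sketched as follows. This is not pizzly's code. It assumes that each mate already has a set of compatible transcripts (its equivalence class) and that a transcript-to-gene map is available; the function name and the toy identifiers are hypothetical, and split reads are not handled.

```python
# Illustrative sketch only: NOT the pizzly implementation.
from typing import Dict, Set


def is_conflicting_pair(compat_mate1: Set[str], compat_mate2: Set[str],
                        gene_of: Dict[str, str]) -> bool:
    """Return True when both mates match transcripts but no gene is shared.

    compat_mate1 / compat_mate2: transcripts compatible with each mate.
    gene_of: maps a transcript identifier to its gene.
    """
    if not compat_mate1 or not compat_mate2:
        return False  # an unmatched mate gives no evidence either way
    if compat_mate1 & compat_mate2:
        return False  # the pair pseudoaligns normally to a shared transcript
    genes1 = {gene_of[t] for t in compat_mate1}
    genes2 = {gene_of[t] for t in compat_mate2}
    return genes1.isdisjoint(genes2)


if __name__ == "__main__":
    # Toy transcript-to-gene map, invented for this example.
    toy_map = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}
    print(is_conflicting_pair({"tx1", "tx2"}, {"tx3"}, toy_map))
```

Mates that hit different transcripts of the same gene are not flagged, since that pattern is more plausibly explained by alternative isoforms than by a fusion.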


Fusion Transcript Detection from RNA-Seq using Jaccard Distance

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics ◽

10.1145/3388440.3415585 ◽

2020 ◽

Author(s):

Hamidreza Mohebbi ◽

Nurit Haspel ◽

Dan Simovici ◽

Joyce Quach

Keyword(s):

Fusion Transcript ◽

Rna Seq ◽

Jaccard Distance ◽

Transcript Detection


Gene Fusion Detection By RNA-Seq in Acute Myeloid Leukemia (AML)

Blood ◽

10.1182/blood-2019-125869 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 4655-4655

Author(s):

Paul Kerbs ◽

Aarif Mohamed Nazeer Batcha ◽

Sebastian Vosberg ◽

Dirk Metzler ◽

Tobias Herold ◽

...

Keyword(s):

Chromosomal Aberrations ◽

Gene Fusion ◽

Fusion Transcript ◽

Clinical Diagnostics ◽

Fusion Genes ◽

Rna Seq ◽

Fusion Event ◽

Routine Diagnostics ◽

Partner Gene ◽

Fusion Detection

Accurate and complete genetic classification of AML is crucial for the prediction of clinical outcome and treatment stratification. Deciphering the spectrum of genetic abnormalities by polymerase chain reaction (PCR), karyotyping and fluorescence in situ hybridization (FISH) in routine diagnostics is the current gold standard; however, fusion genes might potentially be missed by these assays. Recently, several methods have been developed to improve the detection of gene fusion transcripts based on RNA sequencing data, providing robust results. To test the detection power and assess the applicability of RNA-Seq based methods in clinical diagnostics we applied two different algorithms, namely FusionCatcher (Nicorici D et al., bioRxiv, 2014) and Arriba (Uhrig S et al., DKFZ, https://github.com/suhrig/arriba), to the transcriptomes of 895 well-characterized AML samples from three independently sequenced cohorts: AMLCG (Herold T et al., Haematologica, 2018, n=261), DKTK (Greif PA et al., Clin Cancer Res, 2018 and unpublished data, n=166), BeatAML (Tyner JW et al., Nature 2018, n=468) and publicly available healthy control samples (SRA studies: SRP018028, SRP047126, SRP050146, SRP105369, SRP115911, SRP133442, n=38). According to karyotyping, 31% (277/895) of samples harbored chromosomal aberrations putatively causing gene fusions (i.e. translocations, interstitial deletions, duplications, inversions, insertions). Analyses by FISH and/or PCR confirmed these rearrangements in 51.3% (142/277) of samples, whereas fusion detection by means of RNA-Seq showed evidence for fusion genes corresponding to these rearrangements in 60.3% (167/277) of samples. Chromosomal aberrations, identified by karyotyping, which are known to result in clinically relevant fusions (e.g. RUNX1-RUNX1T1, KMT2A fusions) were confirmed by FISH/PCR (AMLCG: n=27/27, DKTK: n=21/21, BeatAML: n=54/57) and RNA-Seq based methods (AMLCG: n=17/27, DKTK: n=21/21, BeatAML: n=56/57) in most of the cases.
Of note, the AMLCG cohort was sequenced using the SENSE mRNA Library Prep Kit from Lexogen, which appears to be suboptimal for fusion detection. Furthermore, 19 samples (AMLCG: n=12, DKTK: n=4, BeatAML: n=3) were found to harbor known pathogenic fusions, described in previous studies, which were not reported by routine diagnostics: NUP98-NSD1 (n=11); CBFB-MYH11, RUNX1-RUNX1T1 and DEK-NUP214 (n=2 each); RUNX1-CBFA2T2 and RUNX1-CBFA2T3 (n=1 each). Reanalysis of six of these samples by PCR confirmed three fusions which were initially missed by routine diagnostics. In general, the number of fusion events reported by RNA-Seq is high (on average 69 and 39 per sample as detected by FusionCatcher and Arriba respectively), even after applying the built-in filters, indicating a high false positive rate. To robustly identify putative novel fusions, we developed a filtering pipeline and incorporated two new filtering steps. The promiscuity score (PS) of a fusion measures the number of additional distinct fusion partners which were detected in the respective cohort for the 5' and 3' gene. The fusion transcript score (FTS) measures the relative abundance of a fusion transcript to its 5' and 3' partner gene. PS and FTS of known, clinically relevant fusions confirmed by FISH/PCR were used to define cut-offs. To further maximize specificity while maintaining sensitivity, we excluded fusion events which we detected in publicly available healthy samples and subsequently filtered for overlapping calls from FusionCatcher and Arriba (Fig. 1A). Additionally, we obtained further evidence for a fusion event by an elevated transcription of the 3' fusion partner. In case of a fusion event, the transcription of the 3' partner gene likely comes under the control of the promoter of the 5' partner gene. This results in an elevated transcription of genes which are otherwise transcribed at low levels (Fig. 1B-C).
Thus, we identified five putatively novel recurrent fusion genes which were detected in two cohorts independently: NRIP1-MIR99AHG, LATS2-ZMYM2, ATP11A-ING1, MBP-SLC66A2, PRDM16-SKI (Fig. 1D-F). Although these events were called with high evidence, we aim at independent validation by complementary methods. In our study, we have demonstrated that the application of RNA-Seq to the detection of fusion genes is not only a valuable complement to routine diagnostics but also has the potential to discover novel putatively pathogenic fusions. Disclosures No relevant conflicts of interest to declare.
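The abstract describes the promiscuity score (PS) and fusion transcript score (FTS) only in words, so the sketch below is one possible reading and not the authors' formulas. It assumes that PS counts the other distinct partners seen in the cohort for the 5' gene plus those seen for the 3' gene (same orientation only), and that FTS is reported as two ratios, fusion abundance over each partner gene's abundance. All function names and inputs are hypothetical.

```python
# Illustrative sketch only: the exact PS and FTS definitions are assumptions.
from typing import Iterable, Set, Tuple

Fusion = Tuple[str, str]  # (5' gene, 3' gene)


def promiscuity_score(fusion: Fusion, cohort_calls: Iterable[Fusion]) -> int:
    """Count other distinct partners of the 5' gene and of the 3' gene."""
    gene5, gene3 = fusion
    other_partners_of_5: Set[str] = set()
    other_partners_of_3: Set[str] = set()
    for five, three in cohort_calls:
        if five == gene5 and three != gene3:
            other_partners_of_5.add(three)
        if three == gene3 and five != gene5:
            other_partners_of_3.add(five)
    return len(other_partners_of_5) + len(other_partners_of_3)


def fusion_transcript_score(fusion_abundance: float, abundance_5: float,
                            abundance_3: float) -> Tuple[float, float]:
    """Fusion abundance relative to the 5' and the 3' partner gene."""
    ratio_5 = fusion_abundance / abundance_5 if abundance_5 > 0 else float("inf")
    ratio_3 = fusion_abundance / abundance_3 if abundance_3 > 0 else float("inf")
    return ratio_5, ratio_3


if __name__ == "__main__":
    # Toy cohort calls with placeholder gene names, invented for this example.
    calls = [("A", "B"), ("A", "C"), ("A", "D"), ("E", "B")]
    print(promiscuity_score(("A", "B"), calls))
    print(fusion_transcript_score(10.0, 100.0, 20.0))
```

Under this reading, a high PS marks genes that pair with many partners, a pattern typical of artifacts, while a low FTS marks fusions that are weakly expressed relative to their parent genes. The cut-offs themselves would come from known, validated fusions, as the abstract describes.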


LongGF: computational algorithm and software tool for fast and accurate detection of gene fusions by long-read transcriptome sequencing

BMC Genomics ◽

10.1186/s12864-020-07207-4 ◽

2020 ◽

Vol 21 (S11) ◽

Author(s):

Qian Liu ◽

Yu Hu ◽

Andres Stucky ◽

Li Fang ◽

Jiang F. Zhong ◽

...

Keyword(s):

Candidate Gene ◽

Gene Fusion ◽

Superior Performance ◽

Gene Fusions ◽

Rna Seq ◽

Cdna Sequencing ◽

Sequencing Data ◽

Mrna Sequencing ◽

Long Read ◽

Fusion Detection

Abstract Background Long-read RNA-Seq techniques can generate reads that encompass a large proportion or the entire mRNA/cDNA molecules, so they are expected to address inherent limitations of short-read RNA-Seq techniques that typically generate < 150 bp reads. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data that take into account the high basecalling error rates and the presence of alignment errors. Results In this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets, and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF performed better at detecting known gene fusions than existing computational tools. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset on acute myeloid leukemia, and pinpointed the exact location of a translocation (previously known in cytogenetic resolution) in base resolution, which was further validated by Sanger sequencing. Conclusions In summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-Seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.


Comprehensive Multi-Omics Analysis of Gene Fusions in a Large Multiple Myeloma Cohort

Blood ◽

10.1182/blood-2018-99-117245 ◽

2018 ◽

Vol 132 (Supplement 1) ◽

pp. 1898-1898

Author(s):

Steven M. Foltz ◽

Qingsong Gao ◽

Christopher J. Yoon ◽

Amila Weerasinghe ◽

Hua Sun ◽

...

Keyword(s):

Multiple Myeloma ◽

Board Of Directors ◽

Research Funding ◽

Gene Fusion ◽

Gene Fusions ◽

Rna Seq ◽

Advisory Committees ◽

Time Points ◽

Detection Algorithms ◽

Fusion Detection

Abstract Introduction: Gene fusions are the result of genomic rearrangements that create hybrid protein products or bring the regulatory elements of one gene into close proximity of another. Fusions often dysregulate gene function or expression through oncogene overexpression or tumor suppressor underexpression (Gao, Liang, Foltz, et al. Cell Rep 2018). Some fusions such as EML4-ALK in lung adenocarcinoma are known druggable targets. Fusion detection algorithms utilize discordantly mapped RNA-seq reads. Careful consideration of detection and filtering procedures is vital for large-scale fusion detection because current methods are prone to reporting false positives and show poor concordance. Multiple myeloma (MM) is a blood cancer in which rapidly expanding clones of plasma cells spread in the bone marrow. Translocations that juxtapose the highly-expressed IGH enhancer with potential oncogenes are associated with overexpression of partner genes, although they may not lead to a detectable gene fusion in RNA-seq data. Previous studies have explored the fusion landscape of multiple myeloma cohorts (Cleynen, et al. Nat Comm 2017; Nasser, et al. Blood 2017). In this study, we developed a novel gene fusion detection pipeline and post-processing strategy to analyze 742 patient samples at the primary time point and 64 samples at follow-up time points (806 total samples) from the Multiple Myeloma Research Foundation (MMRF) CoMMpass Study using RNA-seq, WGS, and clinical data. Methods and Results: We overlapped five fusion detection algorithms (EricScript, FusionCatcher, INTEGRATE, PRADA, and STAR-Fusion) to report fusion events. Our filtered call set consisted of 2,817 fusions with a median of 3 fusions per sample (mean 3.8), similar to glioblastoma, breast, ovarian, and prostate cancers in TCGA.
Major recurrent fusions involving immunoglobulin genes included IGH-WHSC1 (88 primary samples), IGL-BMI1 (29), and the upstream neighbor of MYC, PVT1, paired with IGH (6), IGK (3), and IGL (11). For each event, we used WGS data when available to determine if there was genomic support of the gene fusion (based on discordant WGS reads, SV event detection, and MMRF CoMMpass Seq-FISH WGS results) (Miller, et al. Blood 2016). WGS validation rates varied by the level of RNA-seq evidence supporting each fusion, with an overall rate of 24.1%, which is comparable to previously observed pan-cancer validation rates using low-pass WGS. We calculated the association between fusion status and gene expression and identified genes such as BCL2L11, CCND1/2, LTBR, and TXNDC5 that showed significant overexpression (t-test). We explored the clinical connections of fusion events through survival analysis and clinical data correlations, and by mining potentially druggable targets from our Database of Evidence for Precision Oncology (dinglab.wustl.edu/depo) (Sun, Mashl, Sengupta, et al. Bioinformatics 2018). Major examples of upregulated fusion kinases that could potentially be targeted with off-label drug use include FGFR3 and NTRK1. We examined the evolution of fusion events over multiple time points. In one MMRF patient with a t(8;14) translocation joining the IGH locus and transcription factor MAFA, we observed IGH fusions with TOP1MT (neighbor of MAFA) at all four time points with corresponding high expression of TOP1MT and MAFA. Using non-MMRF single-cell RNA data from different patients, we were able to track cell-type composition over time as well as detect subpopulations of cells harboring fusions at different time points with potential treatment implications. Discussion: Gene fusions offer potential targets for alternative MM therapies.
Careful implementation of gene fusion detection algorithms and post-processing are essential in large cohort studies to reduce false positives and enrich results for clinically relevant information. Clinical fusion detection from untargeted RNA-seq remains a challenge due to poor sensitivity, specificity, and usability. By combining MMRF CoMMpass data from multiple platforms, we have produced a comprehensive fusion profile of 742 MM patients. We have shown novel gene fusion associations with gene expression and clinical data, and we identified candidates for druggability studies. Disclosures Vij: Bristol-Myers Squibb: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Celgene: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Jazz Pharmaceuticals: Honoraria, Membership on an entity's Board of Directors or advisory committees; Jansson: Honoraria, Membership on an entity's Board of Directors or advisory committees; Amgen: Honoraria, Membership on an entity's Board of Directors or advisory committees; Karyopharma: Honoraria, Membership on an entity's Board of Directors or advisory committees; Takeda: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding.
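Overlapping the output of several fusion callers, as described in this abstract, can be sketched as a simple vote count. The abstract does not state how many callers had to agree or how calls were matched, so the `min_callers` threshold, the exact gene-pair matching, and the function name are assumptions. The calls in the usage example are toy inputs, not results from the study.

```python
# Illustrative sketch only: the study's actual overlap rules are not given.
from collections import Counter
from typing import Dict, Set, Tuple

Fusion = Tuple[str, str]  # (5' gene, 3' gene)


def consensus_calls(calls_by_tool: Dict[str, Set[Fusion]],
                    min_callers: int = 2) -> Dict[Fusion, int]:
    """Keep fusions reported by at least min_callers tools, with their counts."""
    support: Counter = Counter()
    for calls in calls_by_tool.values():
        support.update(set(calls))  # each tool votes once per fusion
    return {fusion: n for fusion, n in support.items() if n >= min_callers}


if __name__ == "__main__":
    # Toy calls keyed by the tool names listed in the abstract.
    toy_calls = {
        "EricScript": {("IGH", "WHSC1"), ("GENE_X", "GENE_Y")},
        "FusionCatcher": {("IGH", "WHSC1")},
        "INTEGRATE": {("IGH", "WHSC1"), ("IGL", "BMI1")},
        "PRADA": {("IGL", "BMI1")},
        "STAR-Fusion": {("IGH", "WHSC1")},
    }
    print(consensus_calls(toy_calls, min_callers=2))
```

Raising `min_callers` trades sensitivity for specificity, which is one way to address the false positives and poor concordance between tools that the abstract mentions.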


Targeted in silico characterization of fusion transcripts in tumor and normal tissues via FusionInspector

10.1101/2021.08.02.454639 ◽

2021 ◽

Author(s):

Brian Haas ◽

Alexander Dobin ◽

Mahmoud Ghandi ◽

Ann Van Arsdale ◽

Timothy L. Tickle ◽

...

Keyword(s):

Machine Learning ◽

In Silico ◽

Fusion Transcript ◽

Gene Fusions ◽

Rna Seq ◽

Multiple Sources ◽

Fusion Transcripts ◽

Normal Tissues ◽

Reconstruction Methods ◽

In Silico Characterization

Background Gene fusions play a key role as driving oncogenes in tumors, and their reliable discovery and detection is important for cancer research, diagnostics, prognostics and guiding personalized therapy. While discovering gene fusions from genome sequencing can be laborious and costly, the resulting 'fusion transcripts' can be recovered from RNA-seq data of tumor and normal samples. However, alleged and putative fusion transcripts can arise from multiple sources in addition to the chromosomal rearrangements yielding fusion genes, including cis- or trans-splicing events, experimental artifacts during RNA-seq or computational errors of transcriptome reconstruction methods. Understanding how to discern, interpret, categorize, and verify predicted fusion transcripts is essential for consideration in clinical settings and prioritization for further research. Here, we present FusionInspector for in silico characterization and interpretation of candidate fusion transcripts from RNA-seq, enabling exploration of sequence and expression characteristics of fusions and their partner genes. Results We applied FusionInspector to thousands of tumor and normal transcriptomes, and identified statistical and experimental features enriched among biologically impactful fusions. Through clustering and machine learning, we identified large collections of fusions potentially relevant to tumor and normal biological processes. We show that biologically relevant fusions are enriched for relatively high expression of the fusion transcript, imbalanced fusion allelic ratios, and canonical splicing patterns, and are deficient in sequence microhomologies detected between partner genes. Conclusion We demonstrate that FusionInspector accurately validates fusion transcripts in silico and helps identify numerous understudied fusions in tumor and normal tissue samples.
FusionInspector is freely available as open source for screening, characterization, and visualization of candidate fusions via RNA-seq. We believe that this work will continue driving the discipline of transparent explanation and interpretation of machine learning predictions and tracing the predictions to their experimental sources.
