ArtiFuse—computational validation of fusion gene detection tools without relying on simulated reads

Motivation: Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear because few known positive and negative events are available, especially for fusion genes in individual samples. Simulated reads are often used instead, but these cannot account for all technical biases in RNA-seq data generated from real samples. Results: Here, we present ArtiFuse, a novel approach that simulates fusion genes by modifying the sequence of the genomic reference and can therefore be applied to any RNA-seq dataset without the need for simulated reads. We demonstrate our approach on eight RNA-seq datasets for three fusion gene prediction tools: average recall values peak for all three tools between 0.4 and 0.56 for high-quality and high-coverage datasets. As ArtiFuse affords total control over the involved genes and the breakpoint position, we also assessed performance with regard to gene-related properties, showing a drop in recall for low-expressed genes in high-coverage samples and for genes with co-expressed paralogues. Overall tool performance assessed from ArtiFusions is lower than previously reported estimates based on simulated reads. Because it uses real RNA-seq datasets, we believe that ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings. Availability and implementation: ArtiFuse is implemented in Python. The source code and documentation are available at https://github.com/TRON-Bioinformatics/ArtiFusion. Supplementary information: Supplementary data are available at Bioinformatics online.
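The recall values quoted above are the fraction of known (here, artificially introduced) fusions that a tool reports. The following is a minimal sketch of that calculation, not ArtiFuse's own evaluation code; the function name `recall`, the gene names, and the choice to match fusions as ordered gene pairs are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# A fusion is represented here as an ordered (5' gene, 3' gene) pair.
# This matching rule is an assumption; a real benchmark may also compare breakpoints.
Fusion = Tuple[str, str]


def recall(known: Iterable[Fusion], predicted: Iterable[Fusion]) -> float:
    """Return TP / (TP + FN): the share of known fusions that were predicted."""
    known_set: Set[Fusion] = set(known)
    if not known_set:
        raise ValueError("need at least one known fusion")
    true_positives = len(known_set & set(predicted))
    return true_positives / len(known_set)


# Hypothetical gene names, used only to exercise the function.
known_fusions = [
    ("GENE_A", "GENE_B"),
    ("GENE_C", "GENE_D"),
    ("GENE_E", "GENE_F"),
    ("GENE_G", "GENE_H"),
]
predicted_fusions = [
    ("GENE_A", "GENE_B"),
    ("GENE_E", "GENE_F"),
    ("GENE_X", "GENE_Y"),  # not in the known set, so it does not affect recall
]

print(recall(known_fusions, predicted_fusions))  # 0.5
```

Predictions outside the known set lower precision but leave recall unchanged, which is why a benchmark with controlled positive events can report recall on its own.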


PHANOTATE: a novel approach to gene identification in phage genomes

Bioinformatics ◽

10.1093/bioinformatics/btz265 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4537-4542 ◽

Cited By ~ 24

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Elizabeth A Dinsdale ◽

Brian Souza ◽

Robert A Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Supplementary Information ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach

Motivation: Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, with adjacent genes that overlap and genes located completely inside other, longer genes. This non-delineated genome structure makes gene prediction difficult for the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths in which open reading frames are favorable, and overlaps and gaps are less favorable but still possible. We represent this network of connections as a weighted graph and use dynamic programming to find the optimal path. Results: We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank's non-redundant protein database and all of the metagenomes in the Sequence Read Archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation: https://github.com/deprekate/PHANOTATE Supplementary information: Supplementary data are available at Bioinformatics online.
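The idea of scoring open reading frames as favorable and gaps or overlaps as penalized connections can be illustrated with a small dynamic program. This is a toy sketch under stated assumptions, not PHANOTATE's implementation: the ORF scores, the linear penalty weights `gap_w` and `overlap_w`, and the names `transition_cost` and `best_path` are all hypothetical, and only one strand is considered.

```python
from typing import List, Tuple

# (start, end, score); a higher score means a more gene-like open reading frame.
ORF = Tuple[int, int, float]


def transition_cost(prev: ORF, nxt: ORF,
                    gap_w: float = 0.01, overlap_w: float = 0.05) -> float:
    """Penalty for following `prev` with `nxt`: linear in gap or overlap length."""
    distance = nxt[0] - prev[1]
    if distance >= 0:
        return gap_w * distance        # gap between the two ORFs
    return overlap_w * (-distance)     # ORFs overlap: allowed, but costlier


def best_path(orfs: List[ORF]) -> Tuple[float, List[ORF]]:
    """Choose the chain of ORFs (ordered by start) with the highest total score."""
    orfs = sorted(orfs)
    n = len(orfs)
    if n == 0:
        return 0.0, []
    best = [0.0] * n   # best[i]: best total score of a chain ending at ORF i
    back = [-1] * n    # back[i]: previous ORF in that chain, -1 if none
    for i in range(n):
        best[i] = orfs[i][2]
        for j in range(i):
            candidate = best[j] - transition_cost(orfs[j], orfs[i]) + orfs[i][2]
            if candidate > best[i]:
                best[i] = candidate
                back[i] = j
    end = max(range(n), key=best.__getitem__)
    total = best[end]
    path: List[ORF] = []
    while end != -1:
        path.append(orfs[end])
        end = back[end]
    path.reverse()
    return total, path


# Hypothetical ORFs: the second overlaps the first by 10 bases, the third is weak.
example = [(0, 100, 5.0), (90, 200, 4.0), (95, 150, 1.0), (300, 400, 3.0)]
total, path = best_path(example)
print(total, path)
```

In this example the overlapping pair is kept because the overlap penalty is smaller than the score gained, while the weak ORF at 95..150 is skipped, which mirrors the "less favorable, but still possible" treatment of overlaps and gaps described in the abstract.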


Clinker: visualising fusion genes detected in RNA-seq data

10.1101/218586 ◽

2017 ◽

Author(s):

Breon M Schmidt ◽

Nadia M Davidson ◽

Anthony DK Hawkins ◽

Ray Bartolo ◽

Ian J Majewski ◽

...

Keyword(s):

Acute Lymphoblastic Leukaemia ◽

B Cell ◽

Lymphoblastic Leukaemia ◽

Fusion Gene ◽

Therapeutic Targets ◽

Genomic Profiling ◽

Fusion Genes ◽

Rna Seq ◽

Bioinformatics Tool ◽

Rich Diversity

Genomic profiling efforts have revealed a rich diversity of oncogenic fusion genes, and many are emerging as important therapeutic targets. While there are many ways to identify fusion genes from RNA-seq data, visualising these transcripts and their supporting reads remains challenging. Clinker is a bioinformatics tool written in Python, R and Bpipe that leverages the superTranscript method to visualise fusion genes. We demonstrate the use of Clinker to obtain interpretable visualisations of the RNA-seq data that lead to fusion calls. In addition, we use Clinker to explore multiple fusion transcripts with novel breakpoints within the P2RY8-CRLF2 fusion gene in B-cell Acute Lymphoblastic Leukaemia (B-ALL). Availability and Implementation: Clinker is freely available from GitHub https://github.com/Oshlack/Clinker under an MIT license.


No one tool to rule them all: Prokaryotic gene prediction tool performance is highly dependent on the organism of study

10.1101/2021.05.21.445150 ◽

2021 ◽

Author(s):

Nicholas J. Dimonaco ◽

Wayne Aubrey ◽

Kim Kenobi ◽

Amanda Clare ◽

Christopher J. Creevey

Keyword(s):

Gene Prediction ◽

Evaluation Framework ◽

Model Organisms ◽

Prediction Tool ◽

Genomic Databases ◽

Reading Frame ◽

Prediction Tools ◽

Tool Performance ◽

Genome Annotations ◽

The Right

Motivation: The biases in Open Reading Frame (ORF) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any ORF prediction tool and to choose the right tool for their analysis. Results: We present an evaluation framework ("ORForise") based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of ORF prediction tools. This makes it possible to identify which tool performs better for specific use-cases. We use this to assess 15 ab initio and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.


Fusion detection and quantification by pseudoalignment

10.1101/166322 ◽

2017 ◽

Cited By ~ 10

Author(s):

Páll Melsted ◽

Shannon Hateley ◽

Isaac Charles Joseph ◽

Harold Pimentel ◽

Nicolas Bray ◽

...

Keyword(s):

De Novo ◽

Chromosomal Rearrangements ◽

Clinical Use ◽

Gene Fusions ◽

Fusion Genes ◽

Rna Seq ◽

Sequencing Data ◽

Transcript Quantification ◽

Novel Approach ◽

Fusion Detection

RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly
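The notion of a read pair that "cannot be pseudoaligned due to conflicting matches" can be pictured with sets of compatible transcripts. The sketch below is a deliberately simplified toy, not pizzly's algorithm: it assumes each mate already comes with a set of compatible transcript IDs, and the names `conflicting_pairs` and `tx2gene`, along with the transcript and gene labels, are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

ReadPair = Tuple[str, Set[str], Set[str]]  # (read id, left-mate transcripts, right-mate transcripts)
Candidate = Tuple[str, FrozenSet[str], FrozenSet[str]]


def conflicting_pairs(pairs: Iterable[ReadPair],
                      tx2gene: Dict[str, str]) -> List[Candidate]:
    """Return read pairs whose two mates are compatible with disjoint sets of genes."""
    candidates: List[Candidate] = []
    for read_id, left, right in pairs:
        if not left or not right:
            continue  # one mate has no compatible transcript: not informative here
        if left & right:
            continue  # a shared transcript exists, so the pair is consistent
        left_genes = {tx2gene[t] for t in left}
        right_genes = {tx2gene[t] for t in right}
        if left_genes.isdisjoint(right_genes):
            # The mates point to different genes: a candidate fusion-supporting pair.
            candidates.append((read_id, frozenset(left_genes), frozenset(right_genes)))
    return candidates


# Hypothetical transcripts tA1/tA2 (gene A) and tB1 (gene B).
tx2gene = {"tA1": "A", "tA2": "A", "tB1": "B"}
pairs = [
    ("r1", {"tA1", "tA2"}, {"tA1"}),  # consistent: both mates fit tA1
    ("r2", {"tA1"}, {"tB1"}),         # conflict across genes A and B
    ("r3", {"tA1"}, {"tA2"}),         # different isoforms of the same gene
    ("r4", set(), {"tB1"}),           # left mate unmatched
]
print(conflicting_pairs(pairs, tx2gene))
```

Only `r2` is reported in this example. A real tool would go on to filter such candidates and assemble the supporting reads, as the abstract describes; none of that is attempted here.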


GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution

Bioinformatics ◽

10.1093/bioinformatics/btz778 ◽

2019 ◽

Author(s):

Magdalena E Strauss ◽

Paul D W Kirk ◽

John E Reid ◽

Lorenz Wernisch

Keyword(s):

Single Cell ◽

Time Course ◽

Gene Clusters ◽

Supplementary Information ◽

Rna Seq ◽

Clustering Methods ◽

Novel Approach ◽

Broad Array ◽

Recent Method ◽

Cell Data

Motivation: Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. Results: The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient MCMC sampling, and novel subsampling strategies which aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. Availability: An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/GPseudoClust. Supplementary Information: Supplementary data are available at Bioinformatics online.


Improvement of detection performance of fusion genes from RNA-seq data by clustering short reads

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720019400080 ◽

2019 ◽

Vol 17 (03) ◽

pp. 1940008 ◽

Cited By ~ 1

Author(s):

Yoshiaki Sota ◽

Shigeto Seno ◽

Hironori Shigeta ◽

Naoki Osato ◽

Masafumi Shimoda ◽

...

Keyword(s):

Fusion Gene ◽

Original Data ◽

Read Length ◽

Fusion Genes ◽

Rna Seq ◽

Gene Detection ◽

Representative Sequence ◽

Multiple Loci ◽

Detection Tool ◽

Mcf 7

Fusion genes are involved in cancer, but their detection using RNA-Seq is limited by the relatively short read length. Therefore, we proposed a shifted short-read clustering (SSC) method, which focuses on overlapping reads from the same loci and extends them into a representative sequence. To verify its usefulness, we applied the SSC method to RNA-Seq data from four cell lines (BT-474, MCF-7, SKBR-3, and T-47D). As the slide width of the SSC method increased to one, two, five, or ten bases, the read length was extended from 201 bases to 217 (108%), 234 (116%), 282 (140%), or 317 (158%) bases, respectively. Furthermore, fusion genes were investigated using STAR-Fusion, a fusion gene detection tool, with and without the SSC method. When reads were shifted by one base with the SSC method, the proportion of reads mapped to multiple loci decreased from 9.7% to 4.6%, and the sensitivity of fusion gene detection improved from 47% to 54% on average (BT-474: from 48% to 57%, MCF-7: 49% to 53%, SKBR-3: 50% to 57%, and T-47D: 43% to 50%) compared with the original data. When the reads were shifted further, the positive predictive value also improved. The SSC method could be an effective method for fusion gene detection.
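The percentages in parentheses above are the extended read lengths expressed relative to the original 201 bases. A short check of that arithmetic, using only the figures quoted in the abstract (the names `ORIGINAL_LENGTH`, `EXTENDED_LENGTHS` and `relative_length` are introduced here for illustration):

```python
ORIGINAL_LENGTH = 201  # bases, before applying the SSC method

# Slide width (bases) -> extended read length (bases), as reported in the abstract.
EXTENDED_LENGTHS = {1: 217, 2: 234, 5: 282, 10: 317}


def relative_length(extended: int, original: int = ORIGINAL_LENGTH) -> int:
    """Extended length as a percentage of the original, rounded to the nearest integer."""
    return round(100 * extended / original)


for slide_width, length in EXTENDED_LENGTHS.items():
    print(f"slide width {slide_width:>2}: {length} bases = {relative_length(length)}%")
```

Running this reproduces the reported 108%, 116%, 140% and 158%.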


The Fusion Gene Landscape in Taiwanese Patients with Non-Small Cell Lung Cancer

Cancers ◽

10.3390/cancers13061343 ◽

2021 ◽

Vol 13 (6) ◽

pp. 1343

Author(s):

Ya-Sian Chang ◽

Siang-Jyun Tu ◽

Ju-Chen Yen ◽

Ya-Ting Lee ◽

Hsin-Yuan Fang ◽

...

Keyword(s):

Lung Cancer ◽

Small Cell Lung Cancer ◽

Fusion Gene ◽

Small Cell ◽

Gene Fusions ◽

Fusion Genes ◽

Rna Seq ◽

Small Cell Lung ◽

Driver Genes ◽

Cancer Driver

Background: Analyzing fusion gene transcripts may yield an effective approach for selecting cancer treatments. However, few comprehensive analyses of fusions in non-small cell lung cancer (NSCLC) patients have been performed. Methods: We enrolled 54 patients with NSCLC, and performed RNA-sequencing (RNA-Seq). STAR (Spliced Transcripts Alignment to a Reference)-Fusion was used to identify fusions. Results: Of the 218 fusions discovered, 24 had been reported and the rest were novel. Three fusions had the highest occurrence rates. After integrating our gene expression and fusion data, we found that samples harboring fusions containing ASXL1, CACNA1A, EEF1A1, and RET also exhibited increased expression of these genes. We then searched for mutations and fusions in cancer driver genes in each sample and found that nine patients carried both mutations and fusions in cancer driver genes. Furthermore, we found a trend for mutual exclusivity between gene fusions and mutations in the same gene, with the exception of DMD, and we found that EGFR mutations are associated with the number of fusion genes. Finally, we identified kinase gene fusions, and potentially druggable fusions, which may play roles in lung cancer therapy. Conclusion: The clinical use of RNA-Seq for detecting driver fusion genes may play an important role in the treatment of lung cancer.


Frequent IgH Fusion Transcripts with Clinical Impact in Multiple Myeloma

Blood ◽

10.1182/blood.v124.21.721.721 ◽

2014 ◽

Vol 124 (21) ◽

pp. 721-721 ◽

Cited By ~ 1

Author(s):

Alice Cleynen ◽

Raphael Szalat ◽

Mehmet Kemal Samur ◽

Naim Rashid ◽

Giovanni Parmigiani ◽

...

Keyword(s):

Survival Data ◽

Fusion Proteins ◽

Fusion Gene ◽

Expression Profiles ◽

Clinical Impact ◽

Chromosome 1 ◽

Fish Analysis ◽

Significant Heterogeneity ◽

Fusion Genes ◽

Rna Seq

Background: Significant heterogeneity has been described in Multiple Myeloma (MM), especially at the genomic level. Frequent gains and losses of DNA along with various mutations have been observed, and differential allelic expression is being characterized. Fusion proteins are common and may be associated with cell transformation, growth and lethality of tumor cells. However, unlike in leukemia, no fusion gene product has been consistently identified. As IgH-related translocations have an important role in myeloma, we have investigated fusion genes involving IgH, to understand their biology and explore a possible effect on survival in MM. Methods: We performed deep RNA-Seq on purified MM cells from 430 newly-diagnosed MM patients and analyzed gene expression profiles, isoform signatures and both novel and known fusion genes using two common algorithms: TopHat and MapSplice. We also correlated genomic data with patient data including cytogenetics and FISH, as well as survival. Results: We primarily focused on fusions involving the IGH gene (chromosome 14) and found that about one fourth of the patients (57 out of 430) presented an IGH fusion gene (97 patients according to TopHat, 303 according to MapSplice, 57 according to both). These included the well described t(4;14) fusion involving the MMSET gene (found in 47 patients by both algorithms, 49 by TopHat, 54 by MapSplice). Additionally we observed fusions involving chromosome 14 and chromosomes 1, 4, 11, 12, and 16. The counterpart genes involved in the IgH fusions included PDE3A (chromosome 12, 4 patients); HFM1 (chromosome 1, 2 patients); NFKB1, FGFR3, CIITA, WWOX and MRPL21 (chromosomes 4, 16, and 11, 1 patient). As RNA-Seq data allows the precise localization of the breakpoints, we were able to identify that out of the 47 t(4;14) patients, 62% were MB4-1, 9.5% were MB4-2 and 28.5% were MB4-3. Interestingly, we did not see fusion products involving IgH and other known partners on chromosomes 8 and 20. We studied event-free survival in a subset of 265 patients with available survival data and found that, as predicted, patients with an IgH-MMSET fusion had significantly lower survival than others. However, patients with a fusion gene involving IGH and any other partner have a significantly better prognosis as a group. Moreover, the poor prognosis of IgH-MMSET fusion appears to be driven by MB4-3 patients. Importantly, the fusions identified using RNA-seq were also validated by FISH analysis. All t(4;14) fusions were characterized by a very high MMSET expression (FPKM greater than 20) while patients with other fusions presented a lower MMSET expression (FPKM lower than 10). Conclusion: Our study suggests that IgH-related translocations in myeloma may impact tumor biology by a number of mechanisms, one of which is the generation of fusion proteins with functional consequences. It also highlights a possible clinical impact that requires validation in larger cohorts. Disclosures: No relevant conflicts of interest to declare.


KusakiDB v1.0: a novel approach for validation and completeness of protein orthologous groups

10.1101/2020.11.09.373753 ◽

2020 ◽

Author(s):

Andrea Ghelfi ◽

Yasukazu Nakamura ◽

Sachiko Isobe

Keyword(s):

Plant Species ◽

Agricultural Sector ◽

Gene Prediction ◽

Point Of View ◽

Supplementary Information ◽

Major Protein ◽

Management Tools ◽

Link Type ◽

Novel Approach ◽

Low Coverage

Summary: Plants have quite low coverage in the major protein databases despite their roughly 350,000 species. Moreover, the agricultural sector is one of the main categories in the bioeconomy. In order to manipulate and/or engineer plant-based products, it is important to understand the essential fabric of an organism, its proteins. Therefore, we created KusakiDB, a database of orthologous proteins in plants that correlates three major databases: OrthoDB, UniProt and RefSeq. KusakiDB has ortholog assessment and management tools for comparing orthologous groups, which can provide insights not only from an evolutionary point of view but also into the quality and completeness of structural gene prediction among plant species. KusakiDB could be a new approach to reduce error propagation of functional annotation in plant species. Additionally, this method could potentially bring to light some orthologs unique to a few species or families that could have evolved at a high evolutionary rate or could have been the result of a horizontal gene transfer. Availability and Implementation: The software is implemented in R. It is available at http://pgdbjsnp.kazusa.or.jp/app/kusakidb and at https://hub.docker.com/r/ghelfi/kusakidb under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.


Bioinformatic pipelines for whole transcriptome sequencing data exploitation in leukemia patients with complex structural variants

PeerJ ◽

10.7717/peerj.7071 ◽

2019 ◽

Vol 7 ◽

pp. e7071

Author(s):

Jakub Hynst ◽

Karla Plevova ◽

Lenka Radova ◽

Vojtech Bystry ◽

Karol Pal ◽

...

Keyword(s):

Gene Expression ◽

Differential Gene Expression ◽

Gene Expression Analysis ◽

De Novo ◽

Fusion Gene ◽

Fusion Genes ◽

Rna Seq ◽

Differential Gene Expression Analysis ◽

Total Rna ◽

Differential Gene

Background: Extensive genome rearrangements, known as chromothripsis, have been recently identified in several cancer types. Chromothripsis leads to complex structural variants (cSVs) causing aberrant gene expression and the formation of de novo fusion genes, which can trigger cancer development or worsen its clinical course. The functional impact of cSVs can be studied at the RNA level using whole transcriptome sequencing (total RNA-Seq). It represents a powerful tool for discovering, profiling, and quantifying changes of gene expression in the overall genomic context. However, bioinformatic analysis of transcriptomic data, especially in cases with cSVs, is a complex and challenging task, and the development of proper bioinformatic tools for transcriptome studies is necessary. Methods: We designed a bioinformatic workflow for the analysis of total RNA-Seq data consisting of two separate parts (pipelines). The first pipeline incorporates a statistical solution for differential gene expression analysis in a biologically heterogeneous sample set. We utilized results from transcriptomic arrays which were carried out in parallel to increase the precision of the analysis. The second pipeline is used for the identification of de novo fusion genes. Special attention was given to the filtering of false positives (FPs), which was achieved through consensus fusion calling with several fusion gene callers. We applied the workflow to the data obtained from ten patients with chronic lymphocytic leukemia (CLL) to describe the consequences of their cSVs in detail. The fusion genes identified by our pipeline were correlated with genomic breakpoints detected by genomic arrays. Results: We set up a novel solution for differential gene expression analysis of individual samples and de novo fusion gene detection from total RNA-Seq data. The results of the differential gene expression analysis were concordant with results obtained by transcriptomic arrays, which demonstrates the analytical capabilities of our method. We also showed that the consensus fusion gene detection approach was able to identify true positives (TPs) efficiently. Detected coordinates of fusion gene junctions were in concordance with genomic breakpoints assessed using genomic arrays. Discussion: By applying our methods to real clinical samples, we proved that our approach for total RNA-Seq data analysis generates results consistent with other genomic analytical techniques. The data obtained by our analyses provided clues for the study of the biological consequences of cSVs with far-reaching implications for clinical outcome and management of cancer patients. The bioinformatic workflow is also widely applicable for addressing other research questions in different contexts, for which transcriptomic data are generated.
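Consensus fusion calling, as used above to filter false positives, can be sketched as keeping only those fusions reported by a minimum number of callers. The example below is a hedged illustration rather than the authors' pipeline: the caller names, gene names, the `min_callers` threshold and the function `consensus_fusions` are assumptions, and fusions are matched by ordered gene pair only.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Fusion = Tuple[str, str]  # (5' gene, 3' gene); breakpoints are ignored in this sketch


def consensus_fusions(calls: Dict[str, Set[Fusion]],
                      min_callers: int = 2) -> Set[Fusion]:
    """Keep fusions reported by at least `min_callers` of the given callers."""
    # Each caller contributes a set, so a fusion is counted once per caller.
    support = Counter(fusion for fusions in calls.values() for fusion in fusions)
    return {fusion for fusion, n in support.items() if n >= min_callers}


# Hypothetical caller outputs for one sample.
calls = {
    "caller_1": {("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")},
    "caller_2": {("GENE_A", "GENE_B"), ("GENE_E", "GENE_F")},
    "caller_3": {("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_G", "GENE_H")},
}

print(sorted(consensus_fusions(calls, min_callers=2)))
# [('GENE_A', 'GENE_B'), ('GENE_C', 'GENE_D')]
```

Raising `min_callers` trades sensitivity for fewer false positives; the right threshold depends on the callers used and is not specified in the abstract.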
