DEEPrior: a deep learning tool for the prioritization of gene fusions

Marta Lovino; Maria Serena Ciaburri; Gianvito Urgese; Santa Di Cataldo; Elisa Ficarra

doi:10.1093/bioinformatics/btaa069


Bioinformatics ◽

10.1093/bioinformatics/btaa069 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3248-3250

Author(s):

Marta Lovino ◽

Maria Serena Ciaburri ◽

Gianvito Urgese ◽

Santa Di Cataldo ◽

Elisa Ficarra

Keyword(s):

Deep Learning ◽

Amino Acid ◽

Gene Fusion ◽

Supplementary Information ◽

Gene Fusions ◽

Supplementary Data ◽

Learning Tool ◽

Cancer Driver ◽

Open Issue ◽

Passenger Mutation

Abstract. Summary: In the last decade, increasing attention has been paid to the study of gene fusions. However, the problem of determining whether a gene fusion is a cancer driver or just a passenger mutation is still an open issue. Here we present DEEPrior, an inherently flexible deep learning tool with two modes (Inference and Retraining). Inference mode predicts the probability of a gene fusion being involved in an oncogenic process by directly exploiting the amino acid sequence of the fused protein. Retraining mode allows users to obtain a custom prediction model that includes new data they provide. Availability and implementation: Both DEEPrior and the protein fusions dataset are freely available from GitHub at https://github.com/bioinformatics-polito/DEEPrior. The tool was designed to operate in Python 3.7, with minimal additional libraries. Supplementary information: Supplementary data are available at Bioinformatics online.


KEC: unique sequence search by K-mer exclusion

Bioinformatics ◽

10.1093/bioinformatics/btab196 ◽

2021 ◽

Author(s):

Pavel Beran ◽

Dagmar Stehlíková ◽

Stephen P Cohen ◽

Vladislav Čurn

Keyword(s):

Amino Acid ◽

Nucleic Acid ◽

Source Code ◽

Unique Sequence ◽

Supplementary Information ◽

Supplementary Data ◽

Laptop Computers ◽

Sequence Search ◽

Target Sequences ◽

Cross Reference

Abstract. Summary: Searching for amino acid or nucleic acid sequences unique to one organism may be challenging depending on the size of the available datasets. K-mer elimination by cross-reference (KEC) allows users to quickly and easily find unique sequences by providing target and non-target sequences. Due to its speed, it can be used for datasets of genomic size and can be run on desktop or laptop computers with modest specifications. Availability and implementation: KEC is freely available for non-commercial purposes. Source code and executable binary files compiled for Linux, Mac and Windows can be downloaded from https://github.com/berybox/KEC. Supplementary information: Supplementary data are available at Bioinformatics online.
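
The k-mer exclusion idea in this abstract can be illustrated with a minimal Python sketch. It is not KEC's actual implementation, which is distributed at the URL given in the abstract. The function names `kmers` and `unique_regions`, the choice of k, and the merging of adjacent surviving k-mers into longer regions are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_regions(target, non_targets, k):
    """Return stretches of `target` built only from k-mers that occur
    in none of the `non_targets` sequences (hypothetical helper)."""
    # Collect every k-mer seen in any non-target sequence.
    excluded = set()
    for seq in non_targets:
        excluded |= kmers(seq, k)

    regions = []
    start = None  # start index of the current run of surviving k-mers
    end = 0       # end index (exclusive) of the current run
    for i in range(len(target) - k + 1):
        if target[i:i + k] not in excluded:
            if start is None:
                start = i
            end = i + k
        elif start is not None:
            # An excluded k-mer closes the current run.
            regions.append(target[start:end])
            start = None
    if start is not None:
        regions.append(target[start:end])
    return regions


if __name__ == "__main__":
    # Toy sequences, not real genomic data.
    print(unique_regions("AAACCCGGGTTT", ["AAACCC", "GGTTT"], 3))
```

Holding the non-target k-mers in a hash set keeps each lookup at roughly constant cost, which is one plausible reason an approach of this kind can scale to genome-sized inputs on modest hardware.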


Structured crowdsourcing enables convolutional segmentation of histology images

Bioinformatics ◽

10.1093/bioinformatics/btz083 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3461-3467 ◽

Cited By ~ 12

Author(s):

Mohamed Amgad ◽

Habiba Elfandy ◽

Hagar Hussein ◽

Lamees A Atteya ◽

Mai A T Elsebaie ◽

...

Keyword(s):

Breast Cancer ◽

Deep Learning ◽

Classification Accuracy ◽

Supplementary Information ◽

Supplementary Data ◽

Digital Slide ◽

Convolutional Networks ◽

Fully Convolutional Networks ◽

Annotation Data ◽

Whole Slide Images

Abstract. Motivation: While deep-learning algorithms have demonstrated outstanding performance in semantic image segmentation tasks, large annotation datasets are needed to create accurate models. Annotation of histology images is challenging due to the effort and experience required to carefully delineate tissue structures, and difficulties related to sharing and markup of whole-slide images. Results: We recruited 25 participants, ranging in experience from senior pathologists to medical students, to delineate tissue regions in 151 breast cancer slides using the Digital Slide Archive. Inter-participant discordance was systematically evaluated, revealing low discordance for tumor and stroma, and higher discordance for more subjectively defined or rare tissue classes. Feedback provided by senior participants enabled the generation and curation of 20 000+ annotated tissue regions. Fully convolutional networks trained using these annotations were highly accurate (mean AUC=0.945), and the scale of annotation data provided notable improvements in image classification accuracy. Availability and implementation: The dataset is freely available at https://goo.gl/cNM4EL. Supplementary information: Supplementary data are available at Bioinformatics online.


DeepMSPeptide: peptide detectability prediction using deep learning

Bioinformatics ◽

10.1093/bioinformatics/btz708 ◽

2019 ◽

Author(s):

Guillermo Serrano ◽

Elizabeth Guruceaga ◽

Victor Segura

Keyword(s):

Deep Learning ◽

Protein Detection ◽

Amino Acid Sequences ◽

Supplementary Information ◽

Learning Method ◽

Supplementary Data ◽

Stochastic Nature ◽

Bioinformatic Tool ◽

Peptide Detectability ◽

Detection And Quantification

Abstract. Summary: Protein detection and quantification using high-throughput proteomic technologies is still challenging due to the stochastic nature of peptide selection in the mass spectrometer, the difficulties in the statistical analysis of the results and the presence of degenerate peptides. However, considering in the analysis only those peptides that could be detected by mass spectrometry, also called proteotypic peptides, increases the accuracy of the results. Several approaches have been applied to predict peptide detectability based on the physicochemical properties of the peptides. In this manuscript, we present DeepMSPeptide, a bioinformatic tool that uses a deep learning method to predict proteotypic peptides based exclusively on the peptide amino acid sequences. Availability and implementation: DeepMSPeptide is available at https://github.com/vsegurar/DeepMSPeptide. Supplementary information: Supplementary data are available at Bioinformatics online.


CATHER: a novel threading algorithm with predicted contacts

Bioinformatics ◽

10.1093/bioinformatics/btz876 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2119-2125 ◽

Cited By ~ 1

Author(s):

Zongyang Du ◽

Shuo Pan ◽

Qi Wu ◽

Zhenling Peng ◽

Jianyi Yang

Keyword(s):

Deep Learning ◽

Protein Structure ◽

Structure Prediction ◽

Supplementary Information ◽

Supplementary Data ◽

Contact Map ◽

Test Set ◽

Benchmark Tests ◽

Independent Test ◽

Push Forward

Abstract. Motivation: Threading is one of the most effective methods for protein structure prediction. In recent years, the increasing accuracy of protein contact map prediction has opened a new avenue for improving the performance of threading algorithms. Several preliminary studies suggest that with predicted contacts, the performance of threading algorithms can be improved greatly. There is still much room to explore how to make better use of predicted contacts. Results: We have developed a new contact-assisted threading algorithm named CATHER that uses both conventional sequential profiles and contact maps predicted by a deep learning-based algorithm. Benchmark tests on an independent test set and the CASP12 targets demonstrated that CATHER made a significant improvement over other methods that use only either the sequential profile or the predicted contact map. Our method was ranked in the top 10 among all 39 participating server groups on the 32 free modeling targets in the blind tests of the CASP13 experiment. These data suggest that it is promising to push forward threading algorithms by using predicted contacts. Availability and implementation: http://yanglab.nankai.edu.cn/CATHER/. Supplementary information: Supplementary data are available at Bioinformatics online.


A Deep Learning Approach to the Screening of Oncogenic Gene Fusions in Humans

International Journal of Molecular Sciences ◽

10.3390/ijms20071645 ◽

2019 ◽

Vol 20 (7) ◽

pp. 1645 ◽

Cited By ~ 3

Author(s):

Marta Lovino ◽

Gianvito Urgese ◽

Enrico Macii ◽

Santa Di Cataldo ◽

Elisa Ficarra

Keyword(s):

Deep Learning ◽

Gene Fusion ◽

Research Problem ◽

Protein Domain ◽

Gene Fusions ◽

Protein Fusion ◽

Large Dataset ◽

Oncogenic Potential ◽

Fusion Transcripts ◽

Oncogenic Gene

Gene fusions have a very important role in the study of cancer development. In this regard, predicting the probability of protein fusion transcripts developing into a cancer is a very challenging and not yet fully explored research problem. To date, all the available approaches in the literature try to explain the oncogenic potential of gene fusions based on protein domain analysis, which is cancer-specific and not easy to adapt to newly available information. In our work, we choose the raw protein sequences as the input baseline, and propose the use of deep learning, and more specifically Convolutional Neural Networks, to infer the oncogenicity probability score of gene fusion transcripts and to group them into a number of categories (e.g., oncogenic/not oncogenic). This is an inherently flexible methodology that, unlike previous approaches, can be re-trained with very little effort on newly available data (for example, from a different cancer). Based on experimental results on a large dataset of pre-annotated gene fusions, our method is able to predict the oncogenic potential of gene fusion transcripts with an accuracy of about 72%, which increases to 86% if we consider only the instances that are classified with a high confidence level.
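
The two accuracy figures at the end of this abstract reflect a trade-off between accuracy and coverage: scoring only high-confidence predictions usually raises accuracy while covering fewer instances. The sketch below shows one way such a figure could be computed. The function name `selective_accuracy`, the confidence definition max(p, 1 - p), the 0.5 decision threshold and the toy data are all assumptions for illustration and are not taken from the paper.

```python
def selective_accuracy(probs, labels, min_confidence=0.5):
    """Accuracy over predictions whose confidence max(p, 1 - p) is at
    least `min_confidence`. Returns (accuracy, coverage); accuracy is
    None when no prediction passes the filter."""
    kept = [(p, y) for p, y in zip(probs, labels)
            if max(p, 1 - p) >= min_confidence]
    if not kept:
        return None, 0.0
    # A prediction is "oncogenic" (1) when the score is at least 0.5.
    correct = sum((p >= 0.5) == bool(y) for p, y in kept)
    return correct / len(kept), len(kept) / len(probs)


if __name__ == "__main__":
    # Toy scores and labels, unrelated to the paper's dataset.
    probs = [0.95, 0.90, 0.60, 0.55, 0.10, 0.45]
    labels = [1, 1, 0, 1, 0, 1]
    print(selective_accuracy(probs, labels))        # all instances
    print(selective_accuracy(probs, labels, 0.8))   # confident ones only
```

Reporting coverage next to accuracy makes it clear how many instances the higher figure applies to.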


Protein residues determining interaction specificity in paralogous families

Bioinformatics ◽

10.1093/bioinformatics/btaa934 ◽

2020 ◽

Author(s):

Borja Pitarch ◽

Juan A G Ranea ◽

Florencio Pazos

Keyword(s):

Amino Acid ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Information ◽

Large Set ◽

Supplementary Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Protein Residues

Abstract. Motivation: Predicting the residues controlling a protein’s interaction specificity is important not only to better understand its interactions but also to design mutations aimed at fine-tuning or swapping them. Results: In this work, we present a methodology that combines sequence information (in the form of multiple sequence alignments) with interactome information to detect such residues in paralogous families of proteins. The interactome is used to define pairwise similarities of interaction contexts for the proteins in the alignment. The method looks for alignment positions with patterns of amino-acid changes reflecting the similarities/differences in the interaction neighborhoods of the corresponding proteins. We tested this new methodology on a large set of human paralogous families with structurally characterized interactions, and discuss in detail the results for the RasH family. We show that this approach is a better predictor of interfacial residues than both sequence conservation and an equivalent ‘unsupervised’ method that does not use interactome information. Availability and implementation: http://csbg.cnb.csic.es/pazos/Xdet/. Supplementary information: Supplementary data are available at Bioinformatics online.


keras_dna: a wrapper for fast implementation of deep learning models in genomics

Bioinformatics ◽

10.1093/bioinformatics/btaa929 ◽

2020 ◽

Author(s):

Etienne Routhier ◽

Ayman Bin Kamruddin ◽

Julien Mozziconacci

Keyword(s):

Deep Learning ◽

Dna Sequences ◽

Supplementary Information ◽

Supplementary Data ◽

Learning Models ◽

Multiple Targets ◽

Fast Implementation ◽

Model Training ◽

High Level

Abstract. Summary: Prediction of genomic annotations from DNA sequences using deep learning is becoming a flourishing field with many applications. Nevertheless, there are still difficulties in handling data in order to conveniently build and train models dedicated to specific end-user tasks. keras_dna is designed for easy implementation of Keras models (the TensorFlow high-level API) for genomics. It can handle standard bioinformatics file formats as inputs, such as bigwig, gff, bed, wig, bedGraph or fasta, and returns standardized inputs for model training. keras_dna is designed to implement existing models but also to facilitate the development of new models that can have single or multiple targets or inputs. Availability and implementation: Freely available under an MIT License using pip install keras_dna or by cloning the GitHub repo at https://github.com/etirouthier/keras_dna.git. Supplementary information: Supplementary data are available at Bioinformatics online.


Single cell gene fusion detection by scFusion

10.1101/2020.12.27.424506 ◽

2020 ◽

Author(s):

Zijie Jin ◽

Wenjian Huang ◽

Ning Shen ◽

Juan Li ◽

Xiaochen Wang ◽

...

Keyword(s):

Multiple Myeloma ◽

Deep Learning ◽

Single Cell ◽

Gene Fusion ◽

Gene Fusions ◽

Bulk Data ◽

Single Cell Rna Sequencing ◽

False Discoveries ◽

Deep Learning Model ◽

Fusion Detection

Abstract. Gene fusions are widespread in tumor cells and can play important roles in tumor initiation and progression. Using full-length single cell RNA sequencing (scRNA-seq), gene fusions can now be detected at the single cell level by analyzing chimeric reads in scRNA-seq. However, scRNA-seq data has a high noise level and contains various technical artefacts. Direct application of fusion detection tools developed for bulk data can lead to spurious fusion discoveries and leave some true fusions undetected. In this paper, we present a computational tool, scFusion, for gene fusion detection based on scRNA-seq. scFusion is composed of a statistical model and a deep learning model, both of which are designed to control for potential false discoveries. The statistical model treats the background noise as zero-inflated negative binomial and uses a statistical testing procedure to control for false positives. The deep learning model is trained to recognize technical chimeric artefacts and filter false fusion candidates generated by these artefacts. We compared scFusion with bulk fusion detection methods using simulation data created based on real scRNA-seq data and found that scFusion had superior performance. Applied to a T cell dataset, scFusion successfully detected the invariant TCR gene recombinations in mucosal-associated invariant T cells that many bulk methods failed to detect. In a multiple myeloma dataset, scFusion detected the known recurrent fusion IgH-WHSC1, which was associated with overexpression of the WHSC1 oncogene. Significance: A critical challenge for fusion detection based on full-length single cell RNA sequencing (scRNA-seq) is to identify the needles, or the true fusions, in a large haystack of false positives. We developed scFusion, a fusion detection tool for scRNA-seq. scFusion is computationally more efficient and has far fewer false discoveries while achieving detection power similar to that of fusion detection tools developed for bulk data. Application of scFusion to a multiple myeloma dataset identified subclones with the fusion IgH-WHSC1 and revealed that overexpression of the oncogene WHSC1 was strongly associated with the fusion. The models developed in this work may also be generalized for other single cell analyses such as structural variation detection and alternative splicing analysis.
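
This abstract says the background noise is modelled with a zero-inflated negative binomial distribution and tested statistically. The sketch below shows that distribution and a simple upper-tail probability in plain Python. The parameterisation (`pi` for the extra zero mass, `r` and `p` for the negative binomial), the function names and the way the tail is computed are assumptions for illustration and do not describe scFusion's actual model or test.

```python
import math


def nb_pmf(k, r, p):
    """Negative binomial probability of k failures before r successes,
    with success probability p in (0, 1); r may be non-integer."""
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coeff + r * math.log(p) + k * math.log1p(-p))


def zinb_pmf(k, pi, r, p):
    """Zero-inflated negative binomial: with probability pi the count
    is a structural zero, otherwise it is drawn from NB(r, p)."""
    base = (1.0 - pi) * nb_pmf(k, r, p)
    return pi + base if k == 0 else base


def upper_tail(x, pi, r, p):
    """P(X >= x) under the zero-inflated model, usable as a one-sided
    p-value for an observed supporting-read count x."""
    return 1.0 - sum(zinb_pmf(k, pi, r, p) for k in range(x))


if __name__ == "__main__":
    # Illustrative parameters only, not fitted to any data.
    print(zinb_pmf(0, 0.3, 2, 0.5))
    print(upper_tail(5, 0.3, 2, 0.5))
```

The zero-inflation term accounts for candidates with no chimeric reads at all, which a plain negative binomial would tend to underestimate in sparse single cell data.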


Differentially conserved amino acid positions may reflect differences in SARS-CoV-2 and SARS-CoV behaviour

Bioinformatics ◽

10.1093/bioinformatics/btab094 ◽

2021 ◽

Author(s):

Denisa Bojkova ◽

Jake E McGreig ◽

Katie-May McLaughlin ◽

Stuart G Masterson ◽

Magdalena Antczak ◽

...

Keyword(s):

Cell Culture ◽

Amino Acid ◽

In Silico ◽

Virus Entry ◽

Genomic Variation ◽

Supplementary Information ◽

Cell Tropism ◽

Supplementary Data ◽

Biological Behaviour ◽

Novel Coronavirus

Abstract. Motivation: SARS-CoV-2 is a novel coronavirus currently causing a pandemic. Here, we performed a combined in-silico and cell culture comparison of SARS-CoV-2 and the closely related SARS-CoV. Results: Many amino acid positions are differentially conserved between SARS-CoV-2 and SARS-CoV, which reflects the discrepancies in virus behaviour, i.e. more effective human-to-human transmission of SARS-CoV-2 and higher mortality associated with SARS-CoV. Variations in the S protein (mediates virus entry) were associated with differences in its interaction with ACE2 (cellular S receptor) and sensitivity to TMPRSS2 (enables virus entry via S cleavage) inhibition. Anti-ACE2 antibodies more strongly inhibited SARS-CoV than SARS-CoV-2 infection, probably due to a stronger SARS-CoV-2 S-ACE2 affinity relative to SARS-CoV S. Moreover, SARS-CoV-2 and SARS-CoV displayed differences in cell tropism. Cellular ACE2 and TMPRSS2 levels did not indicate susceptibility to SARS-CoV-2. In conclusion, we identified genomic variation between SARS-CoV-2 and SARS-CoV that may reflect the differences in their clinical and biological behaviour. Supplementary information: Supplementary data are available at Bioinformatics online.


ccNetViz: a WebGL-based JavaScript library for visualization of large networks

Bioinformatics ◽

10.1093/bioinformatics/btaa559 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4527-4529

Author(s):

Ales Saska ◽

David Tichy ◽

Robert Moore ◽

Achilles Rasquinha ◽

Caner Akdas ◽

...

Keyword(s):

Systems Biology ◽

Complex Networks ◽

Open Source ◽

High Speed ◽

A Priori ◽

Supplementary Information ◽

Network Visualization ◽

Supplementary Data ◽

Web Based ◽

Flow Of Information

Abstract. Summary: Visualizing a network provides a concise and practical understanding of the information it represents. Open-source web-based libraries help accelerate the creation of biologically based networks and their use. ccNetViz is an open-source, high-speed and lightweight JavaScript library for visualization of large and complex networks. It implements customization and analytical features for easy network interpretation. These features include edge and node animations, which illustrate the flow of information through a network, as well as node statistics. Properties can be defined a priori or dynamically imported from models and simulations. ccNetViz is thus a network visualization library particularly suited for systems biology. Availability and implementation: The ccNetViz library, demos and documentation are freely available at http://helikarlab.github.io/ccNetViz/. Supplementary information: Supplementary data are available at Bioinformatics online.
