Mapping the glycosyltransferase fold landscape using interpretable deep learning

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Rahil Taujale ◽  
Zhongliang Zhou ◽  
Wayland Yeung ◽  
Kelley W. Moremen ◽  
Sheng Li ◽  
...  

Abstract: Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.
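To make the modeling setup concrete, the following is a minimal sketch (in PyTorch, not the authors' code) of a CNN-with-attention classifier over 3-state secondary structure tokens; the alphabet, layer sizes and number of fold classes are illustrative assumptions.

```python
# Minimal sketch of a CNN-attention fold classifier over secondary structure strings.
# Assumptions: a 3-state alphabet (H/E/C) plus padding, and arbitrary layer sizes.
import torch
import torch.nn as nn

class GTFoldCNNAttention(nn.Module):
    def __init__(self, vocab_size=4, embed_dim=32, n_filters=64, n_folds=4, pad_idx=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=7, padding=3)
        self.attn_score = nn.Linear(n_filters, 1)      # position-wise attention weights
        self.classifier = nn.Linear(n_filters, n_folds)

    def forward(self, tokens):                          # tokens: (batch, seq_len) integer codes
        x = self.embed(tokens).transpose(1, 2)          # (batch, embed_dim, seq_len)
        h = torch.relu(self.conv(x)).transpose(1, 2)    # (batch, seq_len, n_filters)
        w = torch.softmax(self.attn_score(h), dim=1)    # attention over sequence positions
        pooled = (w * h).sum(dim=1)                     # attention-weighted pooling
        return self.classifier(pooled), w.squeeze(-1)   # fold logits + attention weights

# Toy usage with secondary-structure codes (0 = pad, 1 = H, 2 = E, 3 = C)
model = GTFoldCNNAttention()
logits, attn = model(torch.randint(1, 4, (2, 300)))
```

The per-position attention weights returned alongside the logits are what make this style of model inspectable: high-weight regions can be mapped back to secondary structure elements characteristic of a fold.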

2021 ◽  
Author(s):  
Rahil Taujale ◽  
Zhongliang Zhou ◽  
Wayland Yeung ◽  
Kelley W Moremen ◽  
Sheng Li ◽  
...  

Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learned distinguishing features free of primary sequence alignment constraints and, unlike other models, is highly interpretable and helped identify common secondary structural features shared by divergent families. The model delineated sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies identify targets for future structural studies and expand the GT fold landscape.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Melvyn Yap ◽  
Rebecca L. Johnston ◽  
Helena Foley ◽  
Samual MacDonald ◽  
Olga Kondrashova ◽  
...  

Abstract: For complex machine learning (ML) algorithms to gain widespread acceptance in decision making, we must be able to identify the features driving the predictions. Explainability models allow transparency of ML algorithms; however, their reliability within high-dimensional data is unclear. To test the reliability of the explainability model SHapley Additive exPlanations (SHAP), we developed a convolutional neural network to predict tissue classification from Genotype-Tissue Expression (GTEx) RNA-seq data representing 16,651 samples from 47 tissues. Our classifier achieved an average F1 score of 96.1% on held-out GTEx samples. Using SHAP values, we identified the 2423 most discriminatory genes, of which 98.6% were also identified by differential expression analysis across all tissues. The SHAP genes reflected expected biological processes involved in tissue differentiation and function. Moreover, SHAP genes clustered tissue types with superior performance when compared to all genes, genes detected by differential expression analysis, or random genes. We demonstrate the utility and reliability of SHAP to explain a deep learning model and highlight the strengths of applying ML to transcriptome data.
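As a rough illustration of the SHAP workflow described here (not the GTEx pipeline itself), the sketch below attributes a toy expression classifier's predictions to input genes with shap.GradientExplainer; the gene count, model architecture and background set are placeholders.

```python
# Hedged sketch: attribute a toy expression classifier's predictions to genes with SHAP.
# The model, n_genes and the random "expression" data are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn
import shap

n_genes, n_tissues = 5000, 47
model = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, n_tissues))
model.eval()

background = torch.randn(100, n_genes)   # reference samples for the explainer
samples = torch.randn(10, n_genes)       # samples whose predictions we explain

explainer = shap.GradientExplainer(model, background)
# In the shap versions this sketch assumes, shap_values is a list with one
# (samples x genes) array per tissue class.
shap_values = explainer.shap_values(samples)

# Rank genes by mean absolute SHAP value across samples and classes
importance = np.mean([np.abs(v) for v in shap_values], axis=(0, 1))
top_genes = np.argsort(importance)[::-1][:100]
```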


2021 ◽  
Author(s):  
Nadav Brandes ◽  
Dan Ofer ◽  
Yam Peleg ◽  
Nadav Rappoport ◽  
Michal Linial

Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed for biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme consists of masked language modeling combined with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible for very large sequence lengths. The architecture of ProteinBERT combines local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.
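The pretraining scheme can be pictured as two heads trained jointly: a local head recovering masked residues and a global head predicting multi-label GO annotations. The sketch below is a schematic stand-in, not the ProteinBERT implementation; the representation dimensions and GO vocabulary size are assumptions.

```python
# Schematic sketch of a dual-objective pretraining head: masked-residue recovery on a
# per-residue (local) track plus multi-label GO prediction on a per-sequence (global) track.
# All sizes are illustrative assumptions, not ProteinBERT's actual configuration.
import torch
import torch.nn as nn

class DualTrackHead(nn.Module):
    def __init__(self, d_local=128, d_global=512, n_amino=26, n_go_terms=8000):
        super().__init__()
        self.mlm_head = nn.Linear(d_local, n_amino)     # local: predict masked residues
        self.go_head = nn.Linear(d_global, n_go_terms)  # global: predict GO annotations

    def forward(self, local_repr, global_repr):
        return self.mlm_head(local_repr), self.go_head(global_repr)

def pretraining_loss(mlm_logits, aa_targets, go_logits, go_targets, mask):
    # Cross-entropy only on masked positions; BCE over the GO multi-label vector.
    mlm = nn.functional.cross_entropy(mlm_logits[mask], aa_targets[mask])
    go = nn.functional.binary_cross_entropy_with_logits(go_logits, go_targets)
    return mlm + go

# Toy shapes: batch of 2 sequences of length 50
local, glob = torch.randn(2, 50, 128), torch.randn(2, 512)
mlm_logits, go_logits = DualTrackHead()(local, glob)
```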


2018 ◽  
Author(s):  
Claudio Mirabello ◽  
Björn Wallner

Abstract: In the last few decades, huge efforts have been made in the bioinformatics community to develop machine learning-based methods for the prediction of structural features of proteins, in the hope of answering fundamental questions about the way proteins function and about their involvement in several illnesses. The recent advent of Deep Learning has renewed interest in neural networks, with dozens of methods being developed to take advantage of these new architectures. On the other hand, most methods are still based on heavy pre-processing of the input data, as well as the extraction and integration of multiple hand-picked, manually designed features. Since Multiple Sequence Alignments (MSA) are almost always the main source of information in de novo prediction methods, it should be possible to develop Deep Networks to automatically refine the data and extract useful features from it. In this work, we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing: amino acid sequences are mapped into an adaptively learned continuous space. This allows the whole MSA to be input into a Deep Network, thus rendering sequence profiles and other pre-calculated features obsolete. We developed rawMSA in three different flavors to predict secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and have determined that it outperforms classical methods based on position-specific scoring matrices (PSSM) when predicting secondary structure and solvent accessibility, while performing on a par with the top ranked CASP12 methods in the inter-residue contact map prediction category. We believe that rawMSA represents a promising, more powerful approach to protein structure prediction that could replace older methods based on protein profiles in the coming years.
Availability: Datasets, dataset generation code, evaluation code and models are available at https://bitbucket.org/clami66/rawmsa
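A minimal sketch of the core idea as described above: embed every residue of every aligned sequence into a learned continuous space and let convolutions over the whole MSA produce per-column features, e.g. for secondary structure. The alphabet size, depth pooling and layer widths below are illustrative assumptions, not the published rawMSA settings.

```python
# Sketch of an MSA-as-input network: learned residue embeddings, 2D convolution over
# (alignment depth x sequence length), and a per-column secondary structure head.
import torch
import torch.nn as nn

class RawMSAStyleEncoder(nn.Module):
    def __init__(self, alphabet=21, embed_dim=16, n_filters=64, n_ss_states=3):
        super().__init__()
        self.embed = nn.Embedding(alphabet, embed_dim)
        self.conv = nn.Conv2d(embed_dim, n_filters, kernel_size=5, padding=2)
        self.ss_head = nn.Linear(n_filters, n_ss_states)   # e.g. 3-state secondary structure

    def forward(self, msa):                        # msa: (batch, depth, length) integer codes
        x = self.embed(msa).permute(0, 3, 1, 2)    # (batch, embed_dim, depth, length)
        h = torch.relu(self.conv(x))               # (batch, n_filters, depth, length)
        per_column = h.max(dim=2).values           # pool over MSA depth -> per-residue features
        return self.ss_head(per_column.transpose(1, 2))   # (batch, length, n_ss_states)

# Toy MSA: 100 aligned sequences, 250 columns, amino acids coded 0..20
msa = torch.randint(0, 21, (1, 100, 250))
ss_logits = RawMSAStyleEncoder()(msa)
```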


2021 ◽  
Vol 8 ◽  
Author(s):  
Haining Wang ◽  
Xiaoxue Fu ◽  
Chengqian Zhao ◽  
Zhendong Luan ◽  
Chaolun Li

Characterizing habitats and species distribution is important for understanding the structure and function of cold seep ecosystems. This paper develops a deep learning model for the fast and accurate recognition and classification of substrates and the dominant associated species in cold seeps. Because the dominant associated species are densely distributed and overlap produces many small objects in cold seep imagery, a feature pyramid network (FPN) was embedded into the Faster Region-based Convolutional Neural Network (Faster R-CNN) to handle large scale variation and detect small objects that would otherwise be missed, without increasing the computational cost. We applied three classifiers (Faster R-CNN + FPN for mussel beds, lobster clusters and biological mixing; a CNN for shell debris and exposed authigenic carbonates; and VGG16 for reduced sediments and muddy bottom) to improve the recognition accuracy of substrates. The model's results were manually verified using images obtained in the Formosa cold seep during a 2016 cruise. The recognition accuracy for the two dominant species, Gigantidas platifrons and Munidopsidae, was 70.85% and 56.16%, respectively. Seven subcategories of substrates were also classified, with a mean accuracy of 74.87%. The developed model is a promising tool for the fast and accurate characterization of substrates and epifauna in cold seeps, which is crucial for large-scale quantitative analyses.
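For readers who want a starting point, torchvision ships a Faster R-CNN with a ResNet-50 FPN backbone in the same detector family used here; the sketch below swaps in a new box-predictor head for a small set of cold-seep classes. The class count and weights are assumptions, not the authors' trained model.

```python
# Hedged sketch: a Faster R-CNN + FPN detector from torchvision, with its box-predictor
# head replaced for a few cold-seep classes. Not the authors' model or training setup.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# weights=None keeps this runnable offline; pass weights="DEFAULT" (torchvision >= 0.13)
# to start from COCO-pretrained weights instead.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)

# Replace the box predictor for background + 2 illustrative species classes
num_classes = 3
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

model.eval()
with torch.no_grad():
    # One dummy RGB seafloor image; real use would load a cruise image instead
    detections = model([torch.rand(3, 600, 800)])
print(detections[0]["boxes"].shape, detections[0]["labels"])
```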


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Angran Li ◽  
Amir Barati Farimani ◽  
Yongjie Jessica Zhang

Abstract: Neurons exhibit complex geometry in their branched networks of neurites, which is essential to the function of individual neurons but also makes it challenging to transport the wide variety of essential materials needed for survival and function throughout the neurite network. While numerical methods such as isogeometric analysis (IGA) have been used to model the material transport process by solving partial differential equations (PDEs), they require long computation times and large computational resources to ensure accurate geometry representation and solution, which limits their biomedical application. Here we present a graph neural network (GNN)-based deep learning model that learns the IGA-based material transport simulation and provides fast material concentration prediction within neurite networks of any topology. Given input boundary conditions and geometry configurations, the well-trained model can predict the dynamic concentration change during the transport process with an average error of less than 10%, while running 120–330 times faster than IGA simulations. The effectiveness of the proposed model is demonstrated on several complex neurite networks.
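A graph-network surrogate of this kind can be prototyped with PyTorch Geometric; the sketch below maps per-node geometry and boundary-condition features on a toy neurite tree to a predicted concentration per node. The feature set and two-layer GCN are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of a GNN surrogate: nodes are points along the neurite tree, edges follow
# the branching topology, and the network predicts a concentration value per node.
import torch
from torch_geometric.nn import GCNConv

class TransportSurrogate(torch.nn.Module):
    def __init__(self, in_features=4, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, 1)   # concentration at each node

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        return self.readout(h).squeeze(-1)

# Toy neurite graph: 5 nodes on a Y-shaped branch;
# hypothetical features = (radius, segment length, boundary condition, source flag)
x = torch.rand(5, 4)
edge_index = torch.tensor([[0, 1, 2, 2], [1, 2, 3, 4]])   # parent -> child edges
concentration = TransportSurrogate()(x, edge_index)
```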

