scholarly journals Mapping the glycosyltransferase fold landscape using deep learning

2021 ◽  
Author(s):  
Rahil Taujale ◽  
Zhongliang Zhou ◽  
Wayland Yeung ◽  
Kelley W Moremen ◽  
Sheng Li ◽  
...  

Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through 10 the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small 11 molecule substrates. The extensive structural and functional diversification of GTs presents a 12 major challenge in mapping the relationships connecting sequence, structure, fold and function 13 using traditional bioinformatics approaches. Here, we present a convolutional neural network 14 with attention (CNN-attention) based deep learning model that leverages simple secondary 15 structure representations generated from primary sequences to provide GT fold prediction with 16 high accuracy. The model learned distinguishing features free of primary sequence alignment 17 constraints and, unlike other models, is highly interpretable and helped identify common 18 secondary structural features shared by divergent families. The model delineated sequence and 19 structural features characteristic of individual fold types, while classifying them into distinct 20 clusters that group evolutionarily divergent families based on shared secondary structural 21 features. We further extend our model to classify GT families of unknown folds and variants of 22 known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and 23 GT97, our studies identify targets for future structural studies and expand the GT fold landscape.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Rahil Taujale ◽  
Zhongliang Zhou ◽  
Wayland Yeung ◽  
Kelley W. Moremen ◽  
Sheng Li ◽  
...  

AbstractGlycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Melvyn Yap ◽  
Rebecca L. Johnston ◽  
Helena Foley ◽  
Samual MacDonald ◽  
Olga Kondrashova ◽  
...  

AbstractFor complex machine learning (ML) algorithms to gain widespread acceptance in decision making, we must be able to identify the features driving the predictions. Explainability models allow transparency of ML algorithms, however their reliability within high-dimensional data is unclear. To test the reliability of the explainability model SHapley Additive exPlanations (SHAP), we developed a convolutional neural network to predict tissue classification from Genotype-Tissue Expression (GTEx) RNA-seq data representing 16,651 samples from 47 tissues. Our classifier achieved an average F1 score of 96.1% on held-out GTEx samples. Using SHAP values, we identified the 2423 most discriminatory genes, of which 98.6% were also identified by differential expression analysis across all tissues. The SHAP genes reflected expected biological processes involved in tissue differentiation and function. Moreover, SHAP genes clustered tissue types with superior performance when compared to all genes, genes detected by differential expression analysis, or random genes. We demonstrate the utility and reliability of SHAP to explain a deep learning model and highlight the strengths of applying ML to transcriptome data.


2021 ◽  
Author(s):  
Nadav Brandes ◽  
Dan Ofer ◽  
Yam Peleg ◽  
Nadav Rappoport ◽  
Michal Linial

Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme consists of masked language modeling combined with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to very large sequence lengths. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post translational modifications and biophysical attributes), despite using a far smaller model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.


2021 ◽  
Vol 8 ◽  
Author(s):  
Haining Wang ◽  
Xiaoxue Fu ◽  
Chengqian Zhao ◽  
Zhendong Luan ◽  
Chaolun Li

Characterizing habitats and species distribution is important to understand the structure and function of cold seep ecosystems. This paper develops a deep learning model for the fast and accurate recognition and classification of substrates and the dominant associated species in cold seeps. Considering the dense distribution of the dominant associated species and small objects caused by overlap in cold seeps, the feature pyramid network (FPN) embed into the faster region-convolutional neural network (R-CNN) was used to detect large-scale changes and small missing objects without increasing the number of calculations. We applied three classifiers (Faster R-CNN + FPN for mussel beds, lobster clusters and biological mixing, CNN for shell debris and exposed authigenic carbonates, and VGG16 for reduced sediments and muddy bottom) to improve the recognition accuracy of substrates. The model’s results were manually verified using images obtained in the Formosa cold seep during a 2016 cruise. The recognition accuracy of the two dominant species, e.g., Gigantidas platifrons and Munidopsidae could be 70.85 and 56.16%, respectively. Seven subcategories of substrates were also classified with a mean accuracy of 74.87%. The developed model is a promising tool for the fast and accurate characterization of substrates and epifauna in cold seeps, which is crucial for large-scale quantitative analyses.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Angran Li ◽  
Amir Barati Farimani ◽  
Yongjie Jessica Zhang

AbstractNeurons exhibit complex geometry in their branched networks of neurites which is essential to the function of individual neuron but also brings challenges to transport a wide variety of essential materials throughout their neurite networks for their survival and function. While numerical methods like isogeometric analysis (IGA) have been used for modeling the material transport process via solving partial differential equations (PDEs), they require long computation time and huge computation resources to ensure accurate geometry representation and solution, thus limit their biomedical application. Here we present a graph neural network (GNN)-based deep learning model to learn the IGA-based material transport simulation and provide fast material concentration prediction within neurite networks of any topology. Given input boundary conditions and geometry configurations, the well-trained model can predict the dynamical concentration change during the transport process with an average error less than 10% and $$120 \sim 330$$ 120 ∼ 330 times faster compared to IGA simulations. The effectiveness of the proposed model is demonstrated within several complex neurite networks.


2020 ◽  
Vol 13 (4) ◽  
pp. 627-640 ◽  
Author(s):  
Avinash Chandra Pandey ◽  
Dharmveer Singh Rajpoot

Background: Sentiment analysis is a contextual mining of text which determines viewpoint of users with respect to some sentimental topics commonly present at social networking websites. Twitter is one of the social sites where people express their opinion about any topic in the form of tweets. These tweets can be examined using various sentiment classification methods to find the opinion of users. Traditional sentiment analysis methods use manually extracted features for opinion classification. The manual feature extraction process is a complicated task since it requires predefined sentiment lexicons. On the other hand, deep learning methods automatically extract relevant features from data hence; they provide better performance and richer representation competency than the traditional methods. Objective: The main aim of this paper is to enhance the sentiment classification accuracy and to reduce the computational cost. Method: To achieve the objective, a hybrid deep learning model, based on convolution neural network and bi-directional long-short term memory neural network has been introduced. Results: The proposed sentiment classification method achieves the highest accuracy for the most of the datasets. Further, from the statistical analysis efficacy of the proposed method has been validated. Conclusion: Sentiment classification accuracy can be improved by creating veracious hybrid models. Moreover, performance can also be enhanced by tuning the hyper parameters of deep leaning models.


2019 ◽  
Vol 9 (22) ◽  
pp. 4871 ◽  
Author(s):  
Quan Liu ◽  
Chen Feng ◽  
Zida Song ◽  
Joseph Louis ◽  
Jian Zhou

Earthmoving is an integral civil engineering operation of significance, and tracking its productivity requires the statistics of loads moved by dump trucks. Since current truck loads’ statistics methods are laborious, costly, and limited in application, this paper presents the framework of a novel, automated, non-contact field earthmoving quantity statistics (FEQS) for projects with large earthmoving demands that use uniform and uncovered trucks. The proposed FEQS framework utilizes field surveillance systems and adopts vision-based deep learning for full/empty-load truck classification as the core work. Since convolutional neural network (CNN) and its transfer learning (TL) forms are popular vision-based deep learning models and numerous in type, a comparison study is conducted to test the framework’s core work feasibility and evaluate the performance of different deep learning models in implementation. The comparison study involved 12 CNN or CNN-TL models in full/empty-load truck classification, and the results revealed that while several provided satisfactory performance, the VGG16-FineTune provided the optimal performance. This proved the core work feasibility of the proposed FEQS framework. Further discussion provides model choice suggestions that CNN-TL models are more feasible than CNN prototypes, and models that adopt different TL methods have advantages in either working accuracy or speed for different tasks.


Sign in / Sign up

Export Citation Format

Share Document