Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep): Protein Engineering as A Case Study

In recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far, deep learning of proteins has mostly utilized protein primary sequence information, while the vast amount of protein tertiary structural information remains unused. In this study, we devised a self-supervised representation learning framework to extract the fundamental features of unlabeled protein tertiary structures (PtsRep), and the embedded representations were transferred to two commonly recognized protein engineering tasks, protein stability and GFP fluorescence prediction. On both tasks, PtsRep significantly outperformed the two benchmark methods (UniRep and TAPE-BERT), which are based on protein primary sequences. Protein clustering analyses demonstrated that PtsRep can capture the structural signals in proteins. PtsRep reveals an avenue for general protein structural representation learning, and for exploring protein structural space for protein engineering and drug design.

FLIP: Benchmark tasks in fitness landscape inference for proteins

10.1101/2021.11.09.467890 ◽ 2021

Author(s): Christian Dallago ◽ Jody Mou ◽ Kadina E Johnston ◽ Bruce Wittmann ◽ Nicholas Bhattacharya ◽ ...

Keyword(s): Machine Learning ◽ Protein Engineering ◽ Fitness Landscape ◽ Representation Learning ◽ Industrial Applications ◽ Ease Of Use ◽ Protein Domain ◽ Standard Format ◽ Model Generalization ◽ And Function

Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. Critical to its use in designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, e.g. low-resource and extrapolative. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. In order to enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties/home.
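The "low-resource" and "extrapolative" settings mentioned above can be illustrated with a small sketch. The abstract does not say how FLIP builds its splits, so everything here is an assumption made for illustration: the function names, the rule of training on variants with few mutations and testing on variants with more, and the toy `(sequence, fitness)` pair format are all hypothetical and are not taken from the FLIP scripts.

```python
import random
from typing import List, Tuple

Variant = Tuple[str, float]  # (sequence, measured fitness); hypothetical format


def count_mutations(wild_type: str, variant: str) -> int:
    """Count substituted positions between two equal-length sequences."""
    if len(wild_type) != len(variant):
        raise ValueError("this sketch only handles substitutions (equal lengths)")
    return sum(a != b for a, b in zip(wild_type, variant))


def extrapolative_split(
    wild_type: str, variants: List[Variant], max_train_mutations: int
) -> Tuple[List[Variant], List[Variant]]:
    """Train on variants close to wild type, test on more heavily mutated ones."""
    train: List[Variant] = []
    test: List[Variant] = []
    for seq, fitness in variants:
        if count_mutations(wild_type, seq) <= max_train_mutations:
            train.append((seq, fitness))
        else:
            test.append((seq, fitness))
    return train, test


def low_resource_subset(
    train: List[Variant], fraction: float, seed: int = 0
) -> List[Variant]:
    """Keep a reproducible random fraction of the training set (at least one item)."""
    rng = random.Random(seed)
    k = max(1, int(len(train) * fraction))
    return rng.sample(train, k)
```

A model that scores well on the held-out, more heavily mutated variants is extrapolating beyond the mutational distance it was trained on, which is the kind of generalization the benchmark is described as probing.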

Automatic Detection and Neurotransmitter Prediction of Synapses in Electron Microscopy

10.1101/2021.11.02.467022 ◽ 2021

Author(s): Angela Zhang ◽ S Shailja ◽ Cezar Borba ◽ Yishen Miao ◽ Michael Goebel ◽ ...

Keyword(s): Neural Network ◽ Electron Microscopy ◽ Deep Learning ◽ Structural Information ◽ Structure And Function ◽ Neuron Type ◽ Cell Type ◽ Domain Expertise ◽ And Function ◽ Synapse Detection

This paper presents a deep-learning-based workflow to detect synapses and predict their neurotransmitter type in electron microscopy (EM) images of the primitive chordate Ciona intestinalis (Ciona). Identifying synapses from EM images to build a full map of connections between neurons is a labor-intensive process and requires significant domain expertise. Automation of synapse detection and classification would hasten the generation and analysis of connectomes. Furthermore, inferences concerning neuron type and function from synapse features are in many cases difficult to make. Finding the connection between synapse structure and function is an important step in fully understanding a connectome. Activation maps derived from the convolutional neural network provide insights on important features of synapses based on cell type and function. The main contribution of this work is in the differentiation of synapses by neurotransmitter type through the structural information in their EM images. This enables prediction of neurotransmitter types for neurons in Ciona which were previously unknown. The prediction model with code is available on GitHub.

Accurate classification of membrane protein types based on sequence and evolutionary information using deep learning

BMC Bioinformatics ◽ 10.1186/s12859-019-3275-6 ◽ 2019 ◽ Vol 20 (S25) ◽ Cited By ~ 1

Author(s): Lei Guo ◽ Shunfang Wang ◽ Mingyuan Li ◽ Zicheng Cao

Keyword(s): Deep Learning ◽ Membrane Proteins ◽ Membrane Protein ◽ Success Rate ◽ Evolutionary Information ◽ Sequence Information ◽ Learning Models ◽ Representation Method ◽ The One ◽ And Function

Background: Membrane proteins play an important role in the life activities of organisms. Knowing membrane protein types provides clues for understanding the structure and function of proteins. Though various computational methods for predicting membrane protein types have been developed, the results still do not meet the expectations of researchers. Results: We propose two deep learning models to process sequence information and evolutionary information, respectively. Both models obtained better results than traditional machine learning models. Furthermore, to improve the performance of the sequence information model, we also provide a new vector representation method to replace the one-hot encoding, which improved the overall success rate by 3.81% and 6.55% on two datasets. Finally, a more effective model is obtained by fusing the above two models, whose overall success rate reached 95.68% and 92.98% on two datasets. Conclusion: The final experimental results show that our method is more effective than existing methods for predicting membrane protein types, which can help laboratory researchers to identify the type of novel membrane proteins.
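For readers unfamiliar with the baseline that the new vector representation replaces, here is a minimal sketch of one-hot encoding for a protein sequence. It shows only the generic technique; the alphabet ordering and the choice to map unknown residues to an all-zero row are assumptions of this sketch, and the authors' replacement representation is not described in the abstract and is not reproduced here.

```python
from typing import Dict, List

# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX: Dict[str, int] = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Return one 20-dimensional 0/1 row per residue.

    Residues outside the standard alphabet (e.g. 'X') become all-zero rows,
    which is an assumption made here, not something stated in the source.
    """
    encoded: List[List[int]] = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        encoded.append(row)
    return encoded
```

Each row carries no notion of similarity between amino acids (every pair of distinct residues is equally far apart), which is a common motivation for replacing one-hot vectors with learned or property-based representations.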

Fully Automated Colorimetric Analysis of the Optic Nerve Aided by Deep Learning and Its Association with Perimetry and OCT for the Study of Glaucoma

Journal of Clinical Medicine ◽ 10.3390/jcm10153231 ◽ 2021 ◽ Vol 10 (15) ◽ pp. 3231

Author(s): Marta Gonzalez-Hernandez ◽ Daniel Gonzalez-Hernandez ◽ Daniel Perez-Barbudo ◽ Paloma Rodriguez-Esteve ◽ Nisamar Betancor-Caro ◽ ...

Keyword(s): Distribution Function ◽ Deep Learning ◽ Optic Nerve ◽ ROC Analysis ◽ Operating Characteristic ◽ Learning Models ◽ Colorimetric Analysis ◽ Cirrus OCT ◽ And Function ◽ Usual Pattern

Background: Laguna-ONhE is an application for the colorimetric analysis of optic nerve images, which topographically assesses the cup and the presence of haemoglobin. Its latest version has been fully automated with five deep learning models. In this paper, perimetry in combination with Laguna-ONhE or Cirrus-OCT was evaluated. Methods: The morphology and perfusion estimated by Laguna-ONhE were compiled into a “Globin Distribution Function” (GDF). Visual field irregularity was measured with the usual pattern standard deviation (PSD) and the threshold coefficient of variation (TCV), which analyses its harmony without taking into account age-corrected values. In total, 477 normal eyes, 235 confirmed, and 98 suspected glaucoma cases were examined with Cirrus-OCT and different fundus cameras and perimeters. Results: The best Receiver Operating Characteristic (ROC) analysis results for confirmed and suspected glaucoma were obtained with the combination of GDF and TCV (AUC: 0.995 and 0.935, respectively; sensitivities: 94.5% and 45.9%, respectively, at 99% specificity). The best combination of OCT and perimetry was obtained with the vertical cup/disc ratio and PSD (AUC: 0.988 and 0.847, respectively; sensitivities: 84.7% and 18.4%, respectively, at 99% specificity). Conclusion: Using Laguna-ONhE, morphology, perfusion, and function can be mutually enhanced with the methods described for the purpose of glaucoma assessment, providing early sensitivity.
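The results above report sensitivity at a fixed 99% specificity. As a hedged illustration of what that metric means, the sketch below picks the score threshold that keeps the false-positive rate among healthy eyes within the target and then measures the fraction of diseased eyes above it. The function name, the "higher score means more abnormal" convention, and the strict-inequality threshold rule are assumptions of this sketch, not the authors' procedure.

```python
import math
from typing import Sequence, Tuple


def sensitivity_at_specificity(
    healthy_scores: Sequence[float],
    disease_scores: Sequence[float],
    specificity: float = 0.99,
) -> Tuple[float, float]:
    """Return (sensitivity, threshold) with specificity of at least the target.

    A case is called positive when its score is strictly greater than the
    threshold; higher scores are assumed to indicate disease.
    """
    if not healthy_scores or not disease_scores:
        raise ValueError("both groups must be non-empty")
    ordered = sorted(healthy_scores)
    n = len(ordered)
    # Largest number of healthy cases that may be misclassified as positive.
    # The small epsilon guards against floating-point error in 1 - specificity.
    allowed_fp = math.floor(n * (1.0 - specificity) + 1e-9)
    # At most `allowed_fp` healthy scores lie strictly above this value.
    threshold = ordered[n - allowed_fp - 1]
    detected = sum(score > threshold for score in disease_scores)
    return detected / len(disease_scores), threshold
```

Fixing specificity at a high value and comparing sensitivities, as the abstract does, makes two diagnostic combinations comparable at the same false-alarm rate rather than at thresholds that happen to favour one of them.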

A Novel Method to Predict Drug-Target Interactions Based on Large-Scale Graph Representation Learning

Cancers ◽ 10.3390/cancers13092111 ◽ 2021 ◽ Vol 13 (9) ◽ pp. 2111

Author(s): Bo-Wei Zhao ◽ Zhu-Hong You ◽ Lun Hu ◽ Zhen-Hao Guo ◽ Lei Wang ◽ ...

Keyword(s): Drug Target ◽ Large Scale ◽ Computational Models ◽ Structural Information ◽ Characteristic Curve ◽ Representation Learning ◽ Graph Representation ◽ Convolutional Network ◽ Novel Method

Identification of drug-target interactions (DTIs) is a significant step in the drug discovery or repositioning process. Compared with time-consuming and labor-intensive in vivo experimental methods, computational models can provide high-quality DTI candidates quickly. In this study, we propose a novel method called LGDTI to predict DTIs based on large-scale graph representation learning. LGDTI can capture the local and global structural information of the graph. Specifically, the first-order neighbor information of nodes can be aggregated by the graph convolutional network (GCN); on the other hand, the high-order neighbor information of nodes can be learned by the graph embedding method called DeepWalk. Finally, the two kinds of features are fed into the random forest classifier to train and predict potential DTIs. The results show that our method obtained an area under the receiver operating characteristic curve (AUROC) of 0.9455 and an area under the precision-recall curve (AUPR) of 0.9491 under 5-fold cross-validation. Moreover, we compare the presented method with some existing state-of-the-art methods. These results imply that LGDTI can efficiently and robustly capture undiscovered DTIs. Moreover, the proposed model is expected to bring new inspiration and provide novel perspectives to relevant researchers.
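DeepWalk, the graph embedding method named above, learns high-order neighborhood information by first sampling truncated random walks over the graph and then training a skip-gram model on those walks as if they were sentences. The sketch below covers only the walk-sampling step, using the standard library; the adjacency-list format, parameter names, and toy node labels are assumptions for illustration, and none of it is taken from the LGDTI code.

```python
import random
from typing import Dict, List


def random_walks(
    graph: Dict[str, List[str]],
    walk_length: int,
    walks_per_node: int,
    seed: int = 0,
) -> List[List[str]]:
    """Sample uniform truncated random walks starting from every node.

    `graph` maps each node to a list of its neighbors. A walk stops early
    only if it reaches a node with no neighbors.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks: List[List[str]] = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy bipartite drug-target graph with hypothetical node names.
toy_graph = {
    "drug_A": ["target_1"],
    "drug_B": ["target_1", "target_2"],
    "target_1": ["drug_A", "drug_B"],
    "target_2": ["drug_B"],
}
toy_walks = random_walks(toy_graph, walk_length=5, walks_per_node=2)
```

Nodes that co-occur within a few steps on many walks end up with similar embeddings after the skip-gram step, which is how information from beyond the first-order neighborhood reaches the final features.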

Representation Learning for Fine-Grained Change Detection

Sensors ◽ 10.3390/s21134486 ◽ 2021 ◽ Vol 21 (13) ◽ pp. 4486

Author(s): Niall O’Mahony ◽ Sean Campbell ◽ Lenka Krpalkova ◽ Anderson Carvalho ◽ Joseph Walsh ◽ ...

Keyword(s): Deep Learning ◽ Change Detection ◽ Model Calibration ◽ State Of The Art ◽ Representation Learning ◽ Machine Intelligence ◽ The State ◽ Sensor Data ◽ Fine Grained ◽ Learning Techniques

Fine-grained change detection in sensor data is very challenging for artificial intelligence though it is critically important in practice. It is the process of identifying differences in the state of an object or phenomenon where the differences are class-specific and are difficult to generalise. As a result, many recent technologies that leverage big data and deep learning struggle with this task. This review focuses on the state-of-the-art methods, applications, and challenges of representation learning for fine-grained change detection. Our research focuses on methods of harnessing the latent metric space of representation learning techniques as an interim output for hybrid human-machine intelligence. We review methods for transforming and projecting embedding space such that significant changes can be communicated more effectively and a more comprehensive interpretation of underlying relationships in sensor data is facilitated. We conduct this research in our work towards developing a method for aligning the axes of latent embedding space with meaningful real-world metrics so that the reasoning behind the detection of change in relation to past observations may be revealed and adjusted. This is an important topic in many fields concerned with producing more meaningful and explainable outputs from deep learning and also for providing means for knowledge injection and model calibration in order to maintain user confidence.

Structural Characterization of Complex Bacterial Glycolipids by Fourier Transform Mass Spectrometry

European Journal of Mass Spectrometry ◽ 10.1255/ejms.721 ◽ 2005 ◽ Vol 11 (5) ◽ pp. 535-546 ◽ Cited By ~ 39

Author(s): Anna Kondakov ◽ Buko Lindner

Keyword(s): Mass Spectrometry ◽ Fourier Transform ◽ Structural Information ◽ Adaptive Immune System ◽ Ion Cyclotron Resonance ◽ Bacterial Membranes ◽ Multiphoton Dissociation ◽ The One ◽ And Function

Bacterial glycolipids are complex amphiphilic molecules which are, on the one hand, of utmost importance for the organization and function of bacterial membranes and which, on the other hand, play a major role in the activation of cells of the innate and adaptive immune system of the host. Even small alterations to their chemical structure may influence the biological activity tremendously. Due to their intrinsic biological heterogeneity [number and type of fatty acids, saccharide structures and substitution with, for example, phosphate (P), 2-aminoethyl-(pyro)phosphate groups (P-Etn) or 4-amino-4-deoxyarabinose (Ara4N)], separation of the different components is a prerequisite for unequivocal chemical and nuclear magnetic resonance structural analyses. In this contribution, the structural information which can be obtained from heterogeneous samples of glycolipids by Fourier transform (FT) ion cyclotron resonance mass spectrometric methods is described. By means of recently analysed complex biological samples, the possibilities of high-resolution electrospray ionization FT-MS are demonstrated. Capillary skimmer dissociation, as well as tandem mass spectrometry (MS/MS) analysis utilizing collision-induced dissociation and infrared multiphoton dissociation, are compared and their advantages in providing structural information of diagnostic importance are discussed.

Predicting Chromosome Flexibility from the Genomic Sequence Based on Deep Learning Neural Networks

Current Bioinformatics ◽ 10.2174/1574893616666210827095829 ◽ 2021 ◽ Vol 16

Author(s): Jinghao Peng ◽ Jiajie Peng ◽ Haiyin Piao ◽ Zhang Luo ◽ Kelin Xia ◽ ...

Keyword(s): Deep Learning ◽ High Performance ◽ Genomic Sequence ◽ Sequence Data ◽ Function Analysis ◽ Double Helix ◽ GM12878 Cell ◽ Genomic Sequence Analysis ◽ And Function ◽ Nuclear Processes

Background: The open and accessible regions of the chromosome are more likely to be bound by transcription factors, which are important for nuclear processes and biological functions. Studying changes in chromosome flexibility can help to discover and analyze disease markers and improve the efficiency of clinical diagnosis. Current methods for predicting chromosome flexibility based on Hi-C data include the flexibility-rigidity index (FRI) and the Gaussian network model (GNM). However, these methods require chromosome structure data from 3D biological experiments, which are time-consuming and expensive to obtain. Objective: Generally, the folding and curling of the double helix sequence of DNA have a great impact on chromosome flexibility and function. Motivated by the success of genomic sequence analysis in biomolecular function analysis, we aim to propose a method to predict chromosome flexibility based only on genomic sequence data. Method: We propose a new method (named "DeepCFP") using deep learning models to predict chromosome flexibility based only on genomic sequence features. The model has been tested in the GM12878 cell line. Results: The maximum accuracy of our model has reached 91%. The performance of DeepCFP is close to that of FRI and GNM. Conclusion: DeepCFP can achieve high performance based only on genomic sequence.
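To make the structure-based baseline concrete, here is a rough sketch of a flexibility-rigidity index computed from 3D coordinates. It assumes the commonly used form in which each site's rigidity is a sum of distance-decaying kernel values to the other sites and its flexibility is the reciprocal of that sum. The generalized exponential kernel, the parameter values, and the exclusion of the self term are assumptions of this sketch; the abstract does not specify how FRI is applied to Hi-C-derived structures.

```python
import math
from typing import List, Sequence


def flexibility_index(
    coords: Sequence[Sequence[float]], eta: float = 1.0, kappa: float = 1.0
) -> List[float]:
    """Per-site flexibility as the reciprocal of a kernel-summed rigidity.

    rigidity_i = sum over j != i of exp(-(d_ij / eta) ** kappa)
    flexibility_i = 1 / rigidity_i
    `eta` sets the length scale and `kappa` the sharpness of the decay.
    """
    flexibilities: List[float] = []
    for i, ri in enumerate(coords):
        rigidity = 0.0
        for j, rj in enumerate(coords):
            if i == j:
                continue  # self term excluded in this sketch (an assumption)
            rigidity += math.exp(-((math.dist(ri, rj) / eta) ** kappa))
        flexibilities.append(1.0 / rigidity if rigidity > 0 else math.inf)
    return flexibilities
```

Sites packed among many close neighbors accumulate high rigidity and therefore low flexibility, which is why this kind of index needs 3D structure as input, the requirement a sequence-only predictor is designed to avoid.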

MUFFIN: multi-scale feature fusion for drug–drug interaction prediction

Bioinformatics ◽ 10.1093/bioinformatics/btab169 ◽ 2021

Author(s): Yujie Chen ◽ Tengfei Ma ◽ Xixi Yang ◽ Jianmin Wang ◽ Bosheng Song ◽ ...

Keyword(s): Molecular Structure ◽ Deep Learning ◽ Medical Information ◽ Feature Fusion ◽ Molecular Graph ◽ Knowledge Graph ◽ Sequence Information ◽ Learning Models ◽ Scale Feature ◽ Multi Scale

Motivation: Adverse drug–drug interactions (DDIs) are crucial for drug research and mainly cause morbidity and mortality. Thus, the identification of potential DDIs is essential for doctors, patients and society. Existing traditional machine learning models rely heavily on handcrafted features and lack generalization. Recently, deep learning approaches that can automatically learn drug features from the molecular graph or drug-related network have improved the ability of computational models to predict unknown DDIs. However, previous works utilized large labeled data and merely considered the structure or sequence information of drugs without considering the relations or topological information between drugs and other biomedical objects (e.g. gene, disease and pathway), or considered the knowledge graph (KG) without considering the information from the drug molecular structure. Results: Accordingly, to effectively explore the joint effect of drug molecular structure and semantic information of drugs in the knowledge graph for DDI prediction, we propose a multi-scale feature fusion deep learning model named MUFFIN. MUFFIN can jointly learn the drug representation based on both the drug-self structure information and the KG with rich bio-medical information. In MUFFIN, we designed a bi-level cross strategy that includes cross- and scalar-level components to fuse multi-modal features well. MUFFIN can alleviate the restriction of limited labeled data on deep learning models by crossing the features learned from the large-scale KG and the drug molecular graph. We evaluated our approach on three datasets and three different tasks including binary-class, multi-class and multi-label DDI prediction tasks. The results showed that MUFFIN outperformed other state-of-the-art baselines. Availability and implementation: The source code and data are available at https://github.com/xzenglab/MUFFIN.

Mapping the glycosyltransferase fold landscape using interpretable deep learning

Nature Communications ◽ 10.1038/s41467-021-25975-9 ◽ 2021 ◽ Vol 12 (1)

Author(s): Rahil Taujale ◽ Zhongliang Zhou ◽ Wayland Yeung ◽ Kelley W. Moremen ◽ Sheng Li ◽ ...

Keyword(s): Deep Learning ◽ Secondary Structure ◽ Structural Features ◽ Functional Diversification ◽ Sequence Structure ◽ Cellular Processes ◽ And Function ◽ Deep Learning Model ◽ Fold Prediction ◽ Primary Sequence Alignment

Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.
