Neural networks to learn protein sequence-function relationships from deep mutational scanning data

ABSTRACT: The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Our software is available from https://github.com/gitter-lab/nn4dms.

Neural networks to learn protein sequence–function relationships from deep mutational scanning data

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2104878118 ◽

2021 ◽

Vol 118 (48) ◽

pp. e2104878118

Author(s):

Sam Gelman ◽

Sarah A. Fahlberg ◽

Pete Heinzelman ◽

Philip A. Romero ◽

Anthony Gitter

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Internal Representation ◽

Superior Performance ◽

Network Architectures ◽

Convolutional Network ◽

Learning Framework ◽

And Function ◽

Multiple Neural Network ◽

Function Mapping

The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence–function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence–function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models’ ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.
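As a rough illustration of the ideas in this abstract (not the authors' implementation), the sketch below one-hot encodes a sequence variant and applies a single 1D convolution with a ReLU, so the same filter weights are reused at every sequence position and the output is a nonlinear function of the input. The toy sequence, the mutation string format such as `K2G`, and all function names are assumptions made for this example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def apply_mutations(wild_type, mutations):
    """Apply substitutions written as <wild-type residue><1-based position><new residue>."""
    seq = list(wild_type)
    for mut in mutations:
        wt_aa, pos, new_aa = mut[0], int(mut[1:-1]) - 1, mut[-1]
        if seq[pos] != wt_aa:
            raise ValueError(f"{mut}: found {seq[pos]} at position {pos + 1}")
        seq[pos] = new_aa
    return "".join(seq)


def one_hot(seq):
    """Encode a sequence as a (length, 20) matrix with one 1.0 per row."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AA_INDEX[aa] for aa in seq]] = 1.0
    return x


def conv1d_relu(x, kernel, bias):
    """Valid 1D convolution followed by ReLU.

    kernel has shape (width, in_channels, out_channels); the same kernel
    is applied to every window, which is the parameter sharing across
    sequence positions.
    """
    width = kernel.shape[0]
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, kernel.shape[2]))
    for i in range(out_len):
        window = x[i:i + width]
        out[i] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)


# Toy usage: a made-up 8-residue sequence, one substitution, 4 random filters.
toy_wild_type = "MKTAYIAK"
variant = apply_mutations(toy_wild_type, ["K2G"])
rng = np.random.default_rng(0)
features = conv1d_relu(one_hot(variant), rng.normal(size=(3, 20, 4)), np.zeros(4))
```

In a trained model the filter weights would be fit to the measured functional scores; here they are random and serve only to show the shapes involved.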

Multimodal deep representation learning for protein interaction identification and protein family classification

BMC Bioinformatics ◽

10.1186/s12859-019-3084-y ◽

2019 ◽

Vol 20 (S16) ◽

Cited By ~ 4

Author(s):

Da Zhang ◽

Mansur Kabuka

Keyword(s):

Protein Interactions ◽

Protein Sequence ◽

Representation Learning ◽

Superior Performance ◽

Sequence Information ◽

Protein Protein Interactions ◽

Learning Framework ◽

Topological Features ◽

Ppi Networks ◽

Ppi Prediction

Abstract. Background: Protein-protein interactions (PPIs) are involved in dynamic pathological and biological processes throughout life. A thorough understanding of PPIs is therefore crucial for explaining disease occurrence, achieving the optimal drug-target therapeutic effect, and describing protein complex structures. However, compared to the protein sequences obtainable from various species and organisms, the number of known protein-protein interactions is relatively limited. To address this gap, many research efforts have sought to facilitate the discovery of novel PPIs. Among these methods, PPI prediction techniques that rely only on protein sequence data are more widespread than methods that require extensive biological domain knowledge. Results: In this paper, we propose a multi-modal deep representation learning structure that combines protein physicochemical features with graph topological features from the PPI networks. Specifically, our method considers not only the protein sequence information but also the topological representation of each protein node in the PPI networks. We construct a stacked auto-encoder architecture together with a continuous bag-of-words (CBOW) model based on generated metapaths to study PPI prediction. We then use supervised deep neural networks to identify the PPIs and classify the protein families. The PPI prediction accuracy for eight species ranged from 96.76% to 99.77%, which indicates that our multi-modal deep representation learning framework achieves superior performance compared to other computational methods. Conclusion: To the best of our knowledge, this is the first multi-modal deep representation learning framework for examining the PPI networks.
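To make the CBOW step mentioned in this abstract more concrete, here is a minimal sketch of how (context, target) training pairs can be drawn from a single path through a network. It shows only the generic CBOW pairing, not the authors' pipeline; the node names, the window size, and the function name are hypothetical.

```python
def cbow_pairs(path, window=2):
    """Return (context_nodes, target_node) pairs for one path.

    For each node in the path, the context is the up-to-`window` nodes on
    each side of it; a CBOW model is trained to predict the target node
    from its context.
    """
    pairs = []
    for i, target in enumerate(path):
        left = path[max(0, i - window):i]
        right = path[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip paths that contain a single node
            pairs.append((context, target))
    return pairs


# Hypothetical path alternating protein nodes and annotation nodes.
example_path = ["protein_A", "annotation_1", "protein_B", "annotation_2", "protein_C"]
training_pairs = cbow_pairs(example_path, window=1)
```

The embeddings learned from such pairs would then serve as the topological features of each protein node.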

A New Ensemble Learning Framework for 3D Biomedical Image Segmentation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015909 ◽

2019 ◽

Vol 33 ◽

pp. 5909-5916 ◽

Cited By ~ 6

Author(s):

Hao Zheng ◽

Yizhe Zhang ◽

Lin Yang ◽

Peixian Liang ◽

Zhuo Zhao ◽

...

Keyword(s):

Image Segmentation ◽

Ensemble Learning ◽

State Of The Art ◽

3D Models ◽

Superior Performance ◽

3D Image ◽

Convolutional Network ◽

Biomedical Image ◽

Learning Framework ◽

2D And 3D

3D image segmentation plays an important role in biomedical image analysis. Many 2D and 3D deep learning models have achieved state-of-the-art segmentation performance on 3D biomedical image datasets. Yet, 2D and 3D models have their own strengths and weaknesses, and by unifying them, one may be able to achieve more accurate results. In this paper, we propose a new ensemble learning framework for 3D biomedical image segmentation that combines the merits of 2D and 3D models. First, we develop a fully convolutional network-based meta-learner to learn how to improve the results from 2D and 3D models (base-learners). Then, to minimize over-fitting for our sophisticated meta-learner, we devise a new training method that uses the results of the base-learners as multiple versions of “ground truths”. Furthermore, since our new meta-learner training scheme does not depend on manual annotation, it can utilize abundant unlabeled 3D image data to further improve the model. Extensive experiments on two public datasets (the HVSMR 2016 Challenge dataset and the mouse piriform cortex dataset) show that our approach is effective under fully-supervised, semi-supervised, and transductive settings, and attains superior performance over state-of-the-art image segmentation methods.

Inferring protein sequence-function relationships with large-scale positive-unlabeled learning

10.1101/2020.08.19.257642 ◽

2020 ◽

Cited By ~ 2

Author(s):

Hyebin Song ◽

Bennett J. Bremer ◽

Emily C. Hinds ◽

Garvesh Raskutti ◽

Philip A. Romero

Keyword(s):

Protein Sequence ◽

Large Scale ◽

Sampling Error ◽

Predictive Performance ◽

Data Sets ◽

Learning Framework ◽

Estimated Parameters ◽

Pu Learning ◽

And Function ◽

Key Residues

Summary: Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It’s challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Importantly, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function data sets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.

Protein Sequence Coevolution, Energy Landscapes and their Connections to Protein Structure, Folding and Function

Biophysical Journal ◽

10.1016/j.bpj.2017.11.2151 ◽

2018 ◽

Vol 114 (3) ◽

pp. 389a ◽

Cited By ~ 1

Author(s):

Jose N. Onuchic ◽

Faruck Morcos

Keyword(s):

Protein Structure ◽

Protein Sequence ◽

Energy Landscapes ◽

And Function

Dual Policy Distillation

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/435 ◽

2020 ◽

Author(s):

Kwei-Herng Lai ◽

Daochen Zha ◽

Yuening Li ◽

Xia Hu

Keyword(s):

Reinforcement Learning ◽

Superior Performance ◽

Great Success ◽

Continuous Control ◽

Learning Framework ◽

Teacher Student ◽

Trained Teacher ◽

Challenging Tasks ◽

And Function ◽

Teacher Model

Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning. This teacher-student framework requires a well-trained teacher model, which is computationally expensive. Moreover, the performance of the student model could be limited by the teacher model if the teacher model is not optimal. In light of collaborative learning, we study the feasibility of involving joint intellectual efforts from diverse perspectives of student models. In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment and extract knowledge from each other to enhance their learning. The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms, since it is unclear whether the knowledge distilled from an imperfect and noisy peer learner would be helpful. To address the challenge, we theoretically justify that distilling knowledge from a peer learner will lead to policy improvement and propose a disadvantageous distillation strategy based on the theoretical results. Experiments on several continuous control tasks show that the proposed framework achieves superior performance with a learning-based agent and function approximation without the use of expensive teacher models.

Faculty Opinions recommendation of A minimal sequence code for switching protein structure and function.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1367956.838071 ◽

2010 ◽

Author(s):

H Jane Dyson ◽

Gira Bhabha

Keyword(s):

Protein Structure ◽

Structure And Function ◽

Protein Structure And Function ◽

Sequence Code ◽

Minimal Sequence ◽

And Function

Faculty Opinions recommendation of A minimal sequence code for switching protein structure and function.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1367956.1868054 ◽

2010 ◽

Author(s):

Nick Grishin

Keyword(s):

Protein Structure ◽

Structure And Function ◽

Protein Structure And Function ◽

Sequence Code ◽

Minimal Sequence ◽

And Function

Faculty Opinions recommendation of Small-molecule structure correctors target abnormal protein structure and function: structure corrector rescue of apolipoprotein E4-associated neuropathology.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.717959324.793462720 ◽

2012 ◽

Author(s):

John Lowe

Keyword(s):

Protein Structure ◽

Small Molecule ◽

Structure And Function ◽

Function Structure ◽

Protein Structure And Function ◽

Abnormal Protein ◽

Apolipoprotein E4 ◽

Molecule Structure ◽

And Function

Protein Structural Class Prediction Based on Distance-related Statistical Features from Graphical Representation of Predicted Secondary Structure

Letters in Organic Chemistry ◽

10.2174/1570178615666180914110451 ◽

2019 ◽

Vol 16 (4) ◽

pp. 317-324

Author(s):

Liang Kong ◽

Lichao Zhang ◽

Xiaodong Han ◽

Jinfeng Lv

Keyword(s):

Feature Extraction ◽

Secondary Structure ◽

Protein Sequence ◽

Function Analysis ◽

Superior Performance ◽

Support Vector ◽

Chaos Game Representation ◽

Class Prediction ◽

Structural Class ◽

Protein Structural Class

Protein structural class prediction is beneficial to protein structure and function analysis. Exploring good feature representation is a key step for this prediction task. Prior works have demonstrated the effectiveness of secondary-structure-based feature extraction methods, especially for low-similarity protein sequences. However, prediction accuracies remain limited. To explore the potential of secondary structure information, a novel feature extraction method based on a generalized chaos game representation of predicted secondary structure is proposed. Each protein sequence is converted into a 20-dimensional distance-related statistical feature vector to characterize the distribution of secondary structure elements and segments. The feature vectors are then fed into a support vector machine classifier to predict the protein structural class. Our experiments on three widely used low-similarity benchmark datasets (25PDB, 1189 and 640) show that the proposed method achieves superior performance to the state-of-the-art methods. It is anticipated that our method could be extended to other graphical representations of protein sequence and be helpful in future protein research.
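For readers unfamiliar with the chaos game representation named in this abstract, the following is a minimal sketch of the standard construction applied to a three-state secondary structure string. It is not the paper's generalized variant and does not compute its 20-dimensional feature vector; the H/E/C alphabet, the triangle vertex coordinates, the centroid starting point, and the function name are assumptions chosen for illustration.

```python
import math

# Assumed mapping of the three secondary structure states (helix, strand,
# coil) to the corners of an equilateral triangle with unit side length.
VERTICES = {
    "H": (0.0, 0.0),
    "E": (1.0, 0.0),
    "C": (0.5, math.sqrt(3) / 2),
}


def chaos_game(secondary_structure, ratio=0.5):
    """Map a string over H/E/C to a list of 2D points.

    Starting from the triangle's centroid, each symbol moves the current
    point a fraction `ratio` of the way toward that symbol's vertex.
    """
    x, y = 0.5, math.sqrt(3) / 6
    points = []
    for state in secondary_structure:
        vx, vy = VERTICES[state]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        points.append((x, y))
    return points


# Toy usage on a made-up predicted secondary structure string.
trajectory = chaos_game("CCHHHHEEEC")
```

Distance-based statistics computed over such a point set are one way a sequence of secondary structure states can be summarized as a fixed-length vector for a classifier.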
