A structural homology approach for computational protein design with flexible backbone

David Simoncini; Kam Y J Zhang; Thomas Schiex; Sophie Barbe

doi:10.1093/bioinformatics/bty975

Positive multistate protein design

Bioinformatics ◽

10.1093/bioinformatics/btz497 ◽

2019 ◽

Vol 36 (1) ◽

pp. 122-130

Author(s):

Jelena Vucinic ◽

David Simoncini ◽

Manon Ruffini ◽

Sophie Barbe ◽

Thomas Schiex

Keyword(s):

Protein Design ◽

Critical Role ◽

Average Energy ◽

Computational Design ◽

Amino Acid Sequences ◽

Supplementary Information ◽

Backbone Flexibility ◽

Identify Amino Acid ◽

Design Software ◽

Protein Redesign

Abstract. Motivation: Structure-based computational protein design (CPD) plays a critical role in advancing the field of protein engineering. Using an all-atom energy function, CPD tries to identify amino acid sequences that fold into a target structure and ultimately perform a desired function. The usual approach considers a single rigid backbone as a target, which ignores backbone flexibility. Multistate design (MSD) instead considers several backbone states simultaneously, which defines challenging computational problems. Results: We introduce efficient reductions of positive MSD problems to Cost Function Networks with two different fitness definitions and implement them in the Pompd (Positive Multistate Protein design) software. Pompd is able to identify guaranteed optimal sequences of positive multistate full protein redesign problems and exhaustively enumerate suboptimal sequences close to the MSD optimum. On nuclear magnetic resonance and back-rubbed X-ray structures, we observe that the average energy fitness provides the best sequence recovery. Our method outperforms state-of-the-art guaranteed computational design approaches by orders of magnitude and can solve MSD problems with sizes previously unreachable with guaranteed algorithms. Availability and implementation: https://forgemia.inra.fr/thomas.schiex/pompd as documented Open Source. Supplementary information: Supplementary data are available at Bioinformatics online.
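To make the average-energy fitness concrete, the following is a minimal sketch under stated assumptions, not the Pompd implementation. It scores each sequence by the mean of its energies over all backbone states and lists every sequence within a threshold of the optimum. The two-letter alphabet, the energy tables, and the function names are hypothetical, and brute-force enumeration stands in for the Cost Function Network reduction described in the abstract.

```python
from itertools import product

ALPHABET = "AV"  # hypothetical two-letter alphabet, for illustration only

# Each backbone state is (unary, pair): per-position energies and
# pairwise energies between positions. All values are invented toy numbers.
STATE_1 = (
    [{"A": 0, "V": 2}, {"A": 1, "V": 0}],
    {(0, 1): {("A", "A"): 0, ("A", "V"): -1, ("V", "A"): 0, ("V", "V"): 1}},
)
STATE_2 = (
    [{"A": 1, "V": 0}, {"A": 0, "V": 1}],
    {(0, 1): {("A", "A"): 0, ("A", "V"): 0, ("V", "A"): -3, ("V", "V"): 0}},
)
STATES = [STATE_1, STATE_2]


def state_energy(seq, state):
    """Energy of one sequence on one backbone state (unary + pairwise terms)."""
    unary, pair = state
    energy = sum(unary[i][aa] for i, aa in enumerate(seq))
    energy += sum(table[(seq[i], seq[j])] for (i, j), table in pair.items())
    return energy


def average_energy(seq, states):
    """Average-energy fitness: mean of the per-state energies."""
    return sum(state_energy(seq, s) for s in states) / len(states)


def enumerate_within(states, n_positions, delta):
    """Brute force: all sequences whose fitness is within delta of the optimum."""
    scored = sorted(
        (average_energy(seq, states), "".join(seq))
        for seq in product(ALPHABET, repeat=n_positions)
    )
    best = scored[0][0]
    return [(e, s) for e, s in scored if e <= best + delta]


print(enumerate_within(STATES, 2, 0.5))  # [(0.0, 'VA'), (0.5, 'AV')]
```

In this toy instance the best single-state sequences differ (AV for the first state, VA for the second), and averaging picks VA as the compromise. Exhaustive enumeration is only feasible for tiny problems, which is why a guaranteed solver is needed at realistic sizes.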

Challenges in the computational design of proteins

Journal of The Royal Society Interface ◽

10.1098/rsif.2008.0508.focus ◽

2009 ◽

Vol 6 (suppl_4) ◽

Cited By ~ 38

Author(s):

María Suárez ◽

Alfonso Jaramillo

Keyword(s):

Amino Acid ◽

Structural Biology ◽

Protein Design ◽

Current Knowledge ◽

Computational Design ◽

Amino Acid Sequences ◽

Computational Protein Design ◽

Energy Functions ◽

Physical Description ◽

Atomic Interactions

Protein design has many applications not only in biotechnology but also in basic science. It uses our current knowledge in structural biology to predict, by computer simulations, an amino acid sequence that would produce a protein with targeted properties. As in other examples of synthetic biology, this approach allows the testing of many hypotheses in biology. The recent development of automated computational methods to design proteins has enabled proteins to be designed that are very different from any known ones. Moreover, some of those methods rely mostly on a physical description of atomic interactions, which allows the designed sequences not to be biased towards known proteins. In this paper, we will describe the use of energy functions in computational protein design, the use of atomic models to evaluate the free energy in the unfolded and folded states, the exploration and optimization of amino acid sequences, the problem of negative design and the design of biomolecular function. We will also consider its use together with experimental techniques such as directed evolution. We will end by discussing the challenges ahead in computational protein design and some of its future applications.

DenseCPD: Improving the Accuracy of Neural-Network-Based Computational Protein Sequence Design with DenseNet

10.26434/chemrxiv.11626098 ◽

2020 ◽

Author(s):

Yifei Qi ◽

John Z.H. Zhang

Keyword(s):

Neural Network ◽

Protein Design ◽

Protein Sequence ◽

Protein Structures ◽

Three Dimensional ◽

Search Space ◽

Computational Protein Design ◽

Data Sets ◽

Protein Backbone ◽

Natural Amino Acids

Computational protein design remains a challenging task despite its remarkable success in the past few decades. With the rapid progress of deep-learning techniques and the accumulation of three-dimensional protein structures, using deep neural networks to learn the relationship between protein sequences and structures and then automatically design a protein sequence for a given protein backbone structure is becoming increasingly feasible. In this study, we developed a deep neural network named DenseCPD that considers the three-dimensional density distribution of protein backbone atoms and predicts the probability of 20 natural amino acids for each residue in a protein. The accuracy of DenseCPD was 51.56±0.20% in a 5-fold cross validation on the training set and 54.45% and 50.06% on two independent test sets, which is more than 10% higher than those of previous state-of-the-art methods. Two approaches for using DenseCPD predictions in computational protein design were analyzed. The approach using the cutoff of accumulative probability had a smaller sequence search space compared to that of the approach that simply uses the top-k predictions and therefore enables higher sequence identity in redesigning three proteins with Rosetta. The network and the data sets are available on a web server at http://protein.org.cn/densecpd.html. The results of this study may benefit the further development of computational protein design methods.
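The difference between the two selection strategies can be sketched as follows. This is an illustration under assumptions, not DenseCPD code: the per-residue probabilities are invented, the function names are hypothetical, and the cutoff rule (keep the most probable amino acids until their summed probability reaches the cutoff) is one plausible reading of "cutoff of accumulative probability".

```python
from math import prod

# Invented per-residue probabilities over a reduced alphabet (toy values).
CONFIDENT = {"L": 0.90, "I": 0.05, "V": 0.03, "M": 0.02}
UNCERTAIN = {"L": 0.30, "I": 0.28, "V": 0.22, "M": 0.20}


def top_k(probs, k):
    """Keep the k most probable amino acids, regardless of confidence."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]


def cumulative_cutoff(probs, cutoff):
    """Keep the most probable amino acids until their total reaches cutoff."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen, total = [], 0.0
    for aa in ranked:
        chosen.append(aa)
        total += probs[aa]
        if total >= cutoff:
            break
    return chosen


def search_space(choices_per_residue):
    """Number of candidate sequences: product of per-residue choice counts."""
    return prod(len(choices) for choices in choices_per_residue)


residues = [CONFIDENT, UNCERTAIN]
print(search_space([top_k(p, 3) for p in residues]))                # 9
print(search_space([cumulative_cutoff(p, 0.75) for p in residues]))  # 3
```

With these toy numbers the cutoff keeps a single amino acid at the confidently predicted residue and three at the uncertain one, so the search space shrinks where the prediction is sharp and stays wide where it is not.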

De novo protein design by deep network hallucination

10.1101/2020.07.22.211482 ◽

2020 ◽

Cited By ~ 2

Author(s):

Ivan Anishchenko ◽

Tamuka M. Chidyausiku ◽

Sergey Ovchinnikov ◽

Samuel J. Pellock ◽

David Baker

Keyword(s):

Amino Acid ◽

Protein Design ◽

Structure Prediction ◽

De Novo ◽

Protein Structures ◽

Monte Carlo Sampling ◽

Amino Acid Sequences ◽

Wide Range ◽

Physically Based ◽

Folded Proteins

Abstract. There has been considerable recent progress in protein structure prediction using deep neural networks to infer distance constraints from amino acid residue co-evolution [1–3]. We investigated whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generated random amino acid sequences, and input them into the trRosetta structure prediction network to predict starting distance maps, which as expected are quite featureless. We then carried out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (KL-divergence) between the distance distributions predicted by the network and the background distribution. Optimization from different random starting points resulted in a wide range of proteins with diverse sequences and all alpha, all beta sheet, and mixed alpha-beta structures. We obtained synthetic genes encoding 129 of these network hallucinated sequences, expressed and purified the proteins in E. coli, and found that 27 folded to monomeric stable structures with circular dichroism spectra consistent with the hallucinated structures. Thus deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute, alongside traditional physically based models, to the de novo design of proteins with new functions.
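The sampling loop described in this abstract can be sketched in a few lines. This is a toy illustration under assumptions, not the authors' code: `toy_predict` is a hypothetical stand-in for the trRosetta network, the distribution has two bins instead of a full distance map, and the Metropolis acceptance rule and temperature are assumed details that the abstract does not specify.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_predict(seq):
    """Hypothetical stand-in for a structure network: a 2-bin distribution
    that sharpens as the fraction of alanine grows (purely illustrative)."""
    frac = seq.count("A") / len(seq)
    return [0.5 + 0.45 * frac, 0.5 - 0.45 * frac]


def hallucinate(predict, background, length, steps, temperature=0.1, seed=0):
    """Monte Carlo search in sequence space, maximizing KL to the background."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    score = kl_divergence(predict(seq), background)
    for _ in range(steps):
        cand = list(seq)
        cand[rng.randrange(length)] = rng.choice(AMINO_ACIDS)  # point mutation
        cand_score = kl_divergence(predict(cand), background)
        # Metropolis rule (assumed): always accept improvements, sometimes
        # accept a worse candidate to escape local optima.
        accept = cand_score >= score or rng.random() < math.exp(
            (cand_score - score) / temperature
        )
        if accept:
            seq, score = cand, cand_score
    return "".join(seq), score


sequence, contrast = hallucinate(toy_predict, [0.5, 0.5], length=8, steps=200)
print(sequence, round(contrast, 3))
```

In the real setting the predictor would return distance distributions for every residue pair and the contrast would be aggregated over all pairs; the loop structure stays the same.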

Predicting secondary structures, contact numbers, and residue-wise contact orders of native protein structures from amino acid sequences using critical random networks

BIOPHYSICS ◽

10.2142/biophysics.1.67 ◽

2005 ◽

Vol 1 ◽

pp. 67-74 ◽

Cited By ~ 14

Author(s):

Akira R. Kinjo ◽

Ken Nishikawa

Keyword(s):

Amino Acid ◽

Protein Structures ◽

Secondary Structures ◽

Amino Acid Sequences ◽

Random Networks ◽

Native Protein

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

In-silico prediction and modeling of the Entamoeba histolytica proteins: Serine-rich Entamoeba histolytica protein and 29 kDa Cysteine-rich protease

PeerJ ◽

10.7717/peerj.3160 ◽

2017 ◽

Vol 5 ◽

pp. e3160 ◽

Cited By ~ 5

Author(s):

Kumar Manochitra ◽

Subhash Chandra Parija

Keyword(s):

Amino Acid ◽

Structure Prediction ◽

Tertiary Structure ◽

Protein Structures ◽

Amino Acid Sequences ◽

Treatment Modalities ◽

Bioinformatic Tools ◽

Complex Protein ◽

A Cell ◽

Quaternary Structures

Background: Amoebiasis is the third most common parasitic cause of morbidity and mortality, particularly in countries with poor hygienic settings. There exists an ambiguity in the diagnosis of amoebiasis, and hence there arises a necessity for a better diagnostic approach. Serine-rich Entamoeba histolytica protein (SREHP), peroxiredoxin and Gal/GalNAc lectin are pivotal in E. histolytica virulence and are extensively studied as diagnostic and vaccine targets. For elucidating the cellular function of these proteins, details regarding their respective quaternary structures are essential. However, studies in this aspect are scant. Hence, this study was carried out to predict the structure of these target proteins and characterize them structurally as well as functionally using appropriate in-silico methods. Methods: The amino acid sequences of the proteins were retrieved from National Centre for Biotechnology Information database and aligned using ClustalW. Bioinformatic tools were employed in the secondary structure and tertiary structure prediction. The predicted structure was validated, and final refinement was carried out. Results: The protein structures predicted by I-TASSER were found to be more accurate than Phyre2 based on the validation using SAVES server. The prediction suggests SREHP to be an extracellular protein, peroxiredoxin a peripheral membrane protein while Gal/GalNAc lectin was found to be a cell-wall protein. Signal peptides were found in the amino-acid sequences of SREHP and Gal/GalNAc lectin, whereas they were not present in the peroxiredoxin sequence. Gal/GalNAc lectin showed better antigenicity than the other two proteins studied. All three proteins exhibited similarity in their structures and were mostly composed of loops. Discussion: The structures of SREHP and peroxiredoxin were predicted successfully, while the structure of Gal/GalNAc lectin could not be predicted as it was a complex protein composed of sub-units. Also, this protein showed less similarity with the available structural homologs. The quaternary structures of SREHP and peroxiredoxin predicted from this study would provide better structural and functional insights into these proteins and may aid in development of newer diagnostic assays or enhancement of the available treatment modalities.

FIND: Identifying Functionally and Structurally Important Features in Protein Sequences with Deep Neural Networks

10.1101/592808 ◽

2019 ◽

Author(s):

Ranjani Murali ◽

James Hemp ◽

Victoria Orphan ◽

Yonatan Bisk

Keyword(s):

Neural Networks ◽

Amino Acid ◽

Hidden Markov Models ◽

Markov Models ◽

Genomic Sequence ◽

Hidden Markov ◽

Amino Acid Sequences ◽

Homologous Proteins ◽

Biological Studies ◽

Insight Into

Abstract. The ability to correctly predict the functional role of proteins from their amino acid sequences would significantly advance biological studies at the molecular level by improving our ability to understand the biochemical capability of biological organisms from their genomic sequence. Existing methods that are geared towards protein function prediction or annotation mostly use alignment-based approaches and probabilistic models such as Hidden-Markov Models. In this work we introduce a deep learning architecture (Function Identification with Neural Descriptions, or FIND) which performs protein annotation from primary sequence. The accuracy of our methods matches state-of-the-art techniques, such as protein classifiers based on Hidden Markov Models. Further, our approach allows for model introspection via a neural attention mechanism, which weights parts of the amino acid sequence proportionally to their relevance for functional assignment. In this way, the attention weights automatically uncover structurally and functionally relevant features of the classified protein and find novel functional motifs in previously uncharacterized proteins. While this model is applicable to any database of proteins, we chose to apply this model to superfamilies of homologous proteins, with the aim of extracting features inherent to divergent protein families within a larger superfamily. This provided insight into the functional diversification of an enzyme superfamily and its adaptation to different physiological contexts. We tested our approach on three families (nitrogenases, cytochrome bd-type oxygen reductases and heme-copper oxygen reductases) and present a detailed analysis of the sequence characteristics identified in previously characterized proteins in the heme-copper oxygen reductase (HCO) superfamily. These are correlated with their catalytic relevance and evolutionary history. FIND was then applied to discover features in previously uncharacterized members of the HCO superfamily, providing insight into their unique sequence features. This modeling approach demonstrates the power of neural networks to recognize patterns in large datasets and can be utilized to discover biochemically and structurally important features in proteins from their amino acid sequences.

Computational Protein Design Quantifies Structural Constraints on Amino Acid Covariation

PLoS Computational Biology ◽

10.1371/journal.pcbi.1003313 ◽

2013 ◽

Vol 9 (11) ◽

pp. e1003313 ◽

Cited By ~ 24

Author(s):

Noah Ollikainen ◽

Tanja Kortemme

Keyword(s):

Amino Acid ◽

Protein Design ◽

Computational Protein Design ◽

Structural Constraints
