Complex evolutionary footprints revealed in an analysis of reused protein segments of diverse lengths

Abstract

Protein design is a powerful tool for elucidating mechanisms of function and engineering new therapeutics and nanotechnologies. While soluble protein design has advanced, membrane protein design remains challenging due to difficulties in modeling the lipid bilayer. In this work, we developed an implicit approach that captures the anisotropic structure, shape of water-filled pores, and nanoscale dimensions of membranes with different lipid compositions. The model improves performance in computational benchmarks against experimental targets including prediction of protein orientations in the bilayer, ΔΔG calculations, native structure discrimination, and native sequence recovery. When applied to de novo protein design, this approach designs sequences with an amino acid distribution near the native amino acid distribution in membrane proteins, overcoming a critical flaw in previous membrane models that were prone to generating leucine-rich designs. Further, the proteins designed in the new membrane model exhibit native-like features including interfacial aromatic side chains, hydrophobic lengths compatible with bilayer thickness, and polar pores. Our method advances high-resolution membrane protein structure prediction and design toward tackling key biological questions and engineering challenges.

Significance Statement

Membrane proteins participate in many life processes including transport, signaling, and catalysis. They constitute over 30% of all proteins and are targets for over 60% of pharmaceuticals. Computational design tools for membrane proteins will transform the interrogation of basic science questions such as membrane protein thermodynamics and the pipeline for engineering new therapeutics and nanotechnologies. Existing tools are either too expensive to compute or rely on manual design strategies. In this work, we developed a fast and accurate method for membrane protein design. The tool is available to the public and will accelerate the experimental design pipeline for membrane proteins.

Fragger: a protein fragment picker for structural queries

F1000Research ◽

10.12688/f1000research.12486.2 ◽

2018 ◽

Vol 6 ◽

pp. 1722 ◽

Cited By ~ 1

Author(s):

Francois Berenger ◽

David Simoncini ◽

Arnout Voet ◽

Rojan Shrestha ◽

Kam Y.J. Zhang

Keyword(s):

Amino Acid ◽

Protein Design ◽

Data Bank ◽

Amino Acid Sequences ◽

Structural Fragment ◽

Protein Fragment ◽

Distance Threshold ◽

Protein Fragments ◽

Specific Subset ◽

Design Activities

Protein modeling and design activities often require querying the Protein Data Bank (PDB) with a structural fragment, possibly containing gaps. For some applications, it is preferable to work on a specific subset of the PDB or with unpublished structures. These requirements, along with specific user needs, motivated the creation of new software to manage and query 3D protein fragments. Fragger is a protein fragment picker that allows protein fragment databases to be created and queried. All fragment lengths are supported and any set of PDB files can be used to create a database. Fragger can efficiently search a fragment database with a query fragment and a distance threshold. Matching fragments are ranked by distance to the query. The query fragment can have structural gaps, and the amino acid sequences allowed to match a query can be constrained via a regular expression of one-letter amino acid codes. Fragger also incorporates a tool to compute the backbone RMSD of one versus many fragments in high throughput. Fragger should be useful for protein design, loop grafting and related structural bioinformatics tasks.
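
The query workflow described in this abstract (distance threshold, ranking by distance, gapped queries, regular-expression sequence constraint) can be illustrated with a small sketch. This is not Fragger's code or command-line interface: the function name `rank_fragments`, the data layout, and the distance measure (a plain coordinate RMSD without superposition, assuming all fragments already share a reference frame) are assumptions made for illustration only.

```python
import math
import re


def rank_fragments(query, database, max_dist, seq_pattern=None):
    """Return (distance, name) pairs within max_dist, nearest first.

    query:    list of (x, y, z) tuples; None marks a gap position.
    database: iterable of (name, sequence, coords) tuples (hypothetical layout).
    """
    regex = re.compile(seq_pattern) if seq_pattern else None
    hits = []
    for name, seq, coords in database:
        if len(coords) != len(query):
            continue  # only fragments of the query length are compared
        if regex and not regex.fullmatch(seq):
            continue  # sequence constraint on one-letter codes
        # Compare only positions where the query is defined (skip gaps).
        pairs = [(q, c) for q, c in zip(query, coords) if q is not None]
        if not pairs:
            continue
        mean_sq = sum(math.dist(q, c) ** 2 for q, c in pairs) / len(pairs)
        dist = math.sqrt(mean_sq)
        if dist <= max_dist:
            hits.append((dist, name))
    return sorted(hits)


# Toy data: three 3-residue fragments, query with a gap in the middle.
query = [(0, 0, 0), None, (2, 0, 0)]
database = [
    ("a", "GAG", [(0, 0, 0), (1, 5, 0), (2, 0, 0)]),
    ("b", "GPG", [(0, 1, 0), (1, 0, 0), (2, 1, 0)]),
    ("c", "GAG", [(0, 3, 0), (1, 0, 0), (2, 3, 0)]),
]
hits = rank_fragments(query, database, max_dist=1.5, seq_pattern="G[AP]G")
```

In the toy data, fragment "a" differs from the query only at the gap position, so the gap is ignored and its distance is zero; fragment "c" matches the sequence pattern but lies beyond the threshold.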

ff19SB: Amino-Acid Specific Protein Backbone Parameters Trained Against Quantum Mechanics Energy Surfaces in Solution

10.26434/chemrxiv.8279681 ◽

2019 ◽

Cited By ~ 3

Author(s):

Chuan Tian ◽

Koushik Kasavajhala ◽

Kellon Belfon ◽

Lauren Raguette ◽

He Huang ◽

...

Keyword(s):

Amino Acids ◽

Quantum Mechanics ◽

Amino Acid ◽

Protein Design ◽

Md Simulations ◽

Data Bank ◽

Specific Protein ◽

Water Model ◽

Solvent Polarization ◽

Energy Surfaces

Molecular dynamics (MD) simulations have become increasingly popular in studying the motions and functions of biomolecules. The accuracy of a simulation, however, depends strongly on the molecular mechanics (MM) force field (FF), a set of functions with adjustable parameters to compute the potential energies from atomic positions. The overall quality of an FF, such as our previously published ff99SB and ff14SB, can be limited by assumptions that were made years ago. In the updated model presented here (ff19SB), we have significantly improved the backbone profiles for all 20 amino acids. We fit coupled ϕ/ψ parameters using 2D ϕ/ψ conformational scans for multiple amino acids, using as reference data the entire 2D quantum mechanics (QM) energy surface. We address the polarization inconsistency during dihedral parameter fitting by using both QM and MM in solution. Finally, we examine possible dependency of the backbone fitting on side chain rotamer. To extensively validate ff19SB parameters, we have performed a total of ~5 milliseconds of MD simulations in explicit solvent. Our results show that after amino-acid specific training against QM data with solvent polarization, ff19SB not only reproduces the differences in amino acid specific Protein Data Bank (PDB) Ramachandran maps better, but also shows significantly improved capability to differentiate amino acid dependent properties such as helical propensities. We also conclude that an inherent underestimation of helicity is present in ff14SB, which is (inexactly) compensated by an increase in helical content driven by the TIP3P bias toward overly compact structures. In summary, ff19SB, when combined with a more accurate water model such as OPC, should have better predictive power for modeling sequence-specific behavior, protein mutations, and also rational protein design.

RamaNet: Computational de novo helical protein backbone design using a long short-term memory generative neural network

10.1101/671552 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sari Sabban ◽

Mikhail Markovsky

Keyword(s):

Neural Network ◽

Protein Data Bank ◽

Protein Design ◽

Short Term Memory ◽

De Novo ◽

Protein Structures ◽

Data Bank ◽

Protein Backbone ◽

Helical Protein ◽

Long Short Term Memory

Abstract

The ability to perform de novo protein design will allow researchers to expand the variety of available proteins. By designing synthetic structures computationally, they can utilise more structures than those available in the Protein Data Bank, design structures that are not found in nature, or direct the design of proteins to acquire a specific desired structure. While some researchers attempt to design proteins from first physical and thermodynamic principles, we decided to test whether it is possible to perform de novo helical protein design of just the backbone statistically using machine learning, by building a model that uses a long short-term memory (LSTM) architecture. The LSTM model used only the ϕ and ψ angles of each residue from an augmented dataset of only helical protein structures. Though the network’s generated backbone structures were not perfect, they were idealised and evaluated post generation, where the non-ideal structures were filtered out and the adequate structures kept. The results were successful in developing a logical, rigid, compact, helical protein backbone topology. This paper is a proof of concept that shows it is possible to generate a novel helical backbone topology using an LSTM neural network architecture with only the ϕ and ψ angles as features. The next step is to attempt to use these backbone topologies and sequence design them to form complete protein structures.

Author summary

This research project stemmed from the desire to expand the pool of protein structures that can be used as scaffolds in computational vaccine development, since the number of structures available from the Protein Data Bank was not sufficient to allow for great diversity and increase the probability of grafting a target motif onto a protein scaffold. Since a protein structure’s backbone can be defined by the ϕ and ψ angles of each amino acid in the polypeptide and can effectively translate a protein’s 3D structure into a table of numbers, and since protein structures are not random, this numerical representation of protein structures can be used to train a neural network to mathematically generalise what a protein structure is, and therefore generate a new protein backbone. Instead of using all proteins in the Protein Data Bank, a curated dataset was used encompassing protein structures with specific characteristics that will, theoretically, allow them to be evaluated computationally. This paper details how a trained neural network was able to successfully generate helical protein backbones.
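
The ϕ and ψ features mentioned here are backbone dihedral angles: ϕ is defined by the atoms C(i-1), N(i), CA(i), C(i) and ψ by N(i), CA(i), C(i), N(i+1). As a rough sketch of how such a "table of numbers" can be derived from coordinates, the function below computes one dihedral angle from four points. It is a generic geometry routine written for illustration, not code from RamaNet, and the helper names are assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# phi for residue i would be dihedral(C_prev, N, CA, C),
# psi for residue i would be dihedral(N, CA, C, N_next).
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Applying the function to every residue of a structure gives one (ϕ, ψ) row per residue, which is the kind of numerical representation the summary describes as network input.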

De novo protein design by deep network hallucination

10.1101/2020.07.22.211482 ◽

2020 ◽

Cited By ~ 2

Author(s):

Ivan Anishchenko ◽

Tamuka M. Chidyausiku ◽

Sergey Ovchinnikov ◽

Samuel J. Pellock ◽

David Baker

Keyword(s):

Amino Acid ◽

Protein Design ◽

Structure Prediction ◽

De Novo ◽

Protein Structures ◽

Monte Carlo Sampling ◽

Amino Acid Sequences ◽

Wide Range ◽

Physically Based ◽

Folded Proteins

Abstract

There has been considerable recent progress in protein structure prediction using deep neural networks to infer distance constraints from amino acid residue co-evolution [1–3]. We investigated whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generated random amino acid sequences and input them into the trRosetta structure prediction network to predict starting distance maps, which, as expected, are quite featureless. We then carried out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (KL-divergence) between the distance distributions predicted by the network and the background distribution. Optimization from different random starting points resulted in a wide range of proteins with diverse sequences and all alpha, all beta sheet, and mixed alpha-beta structures. We obtained synthetic genes encoding 129 of these network hallucinated sequences, expressed and purified the proteins in E. coli, and found that 27 folded to monomeric stable structures with circular dichroism spectra consistent with the hallucinated structures. Thus deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute, alongside traditional physically based models, to the de novo design of proteins with new functions.
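
The optimization loop in this abstract, Monte Carlo sampling over sequences that maximizes a KL-divergence score, can be sketched in a few lines. The sketch below does not call trRosetta: `toy_score` is a hypothetical stand-in that measures how far a sequence's amino acid composition is from uniform, used only so the loop has something to optimize. The Metropolis acceptance rule, the temperature, and all function names are assumptions for illustration, not the authors' settings.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_score(seq):
    """Stand-in for the network contrast: composition vs. a uniform background."""
    p = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    q = [1 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    return kl_divergence(p, q)


def metropolis_search(score, length, steps, temperature=0.1, seed=0):
    """Maximize score(seq) by single-residue mutations with Metropolis acceptance."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    current = score(seq)
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        proposed = score(seq)
        delta = proposed - current
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposed
        else:
            seq[pos] = old  # reject: restore the previous residue
    return "".join(seq), current


best_seq, best_score = metropolis_search(toy_score, length=30, steps=500)
```

In the setting the abstract describes, the score would instead come from comparing network-predicted distance distributions with a background distribution; only the scoring function would change, not the sampling loop.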

Principal component analysis of alpha-helix deformations in transmembrane proteins

PLoS ONE ◽

10.1371/journal.pone.0257318 ◽

2021 ◽

Vol 16 (9) ◽

pp. e0257318

Author(s):

Alexander Bevacqua ◽

Sachit Bakshi ◽

Yu Xia

Keyword(s):

Principal Component Analysis ◽

Protein Design ◽

De Novo ◽

Alpha Helix ◽

Principal Component ◽

Data Bank ◽

Component Analysis ◽

Structural Components ◽

α Helix ◽

Deformation Modes

α-helices are deformable secondary structural components regularly observed in protein folds. The overall flexibility of an α-helix can be resolved into constituent physical deformations such as bending in two orthogonal planes and twisting along the principal axis. We used Principal Component Analysis to identify and quantify the contribution of each of these dominant deformation modes in transmembrane α-helices, extramembrane α-helices, and α-helices in soluble proteins. Using three α-helical samples from Protein Data Bank entries spanning these three cellular contexts, we determined that the relative contributions of these modes towards total deformation are independent of the α-helix’s surroundings. This conclusion is supported by the observation that the identities of the top three deformation modes, the scaling behaviours of mode eigenvalues as a function of α-helix length, and the percentage contribution of individual modes on total variance were comparable across all three α-helical samples. These findings highlight that α-helical deformations are independent of cellular location and will prove to be valuable in furthering the development of flexible templates in de novo protein design.
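
The "percentage contribution of individual modes on total variance" used in this comparison is the share of each principal-component eigenvalue in the sum of all eigenvalues. As a minimal sketch, the function below computes those shares for two-dimensional data using the closed-form eigenvalues of a symmetric 2x2 covariance matrix. The two-descriptor setup and the function name are assumptions for illustration; the study's helix descriptors and sample data are not reproduced here.

```python
import math


def mode_contributions(samples):
    """Percent of total variance carried by each principal component of 2-D data."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in samples) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in samples) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in samples) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    total = sum(eigenvalues)
    return [100 * ev / total for ev in eigenvalues]


# Two hypothetical deformation descriptors measured on a few helices.
correlated = mode_contributions([(1, 1), (2, 2), (3, 3)])
isotropic = mode_contributions([(1, 0), (-1, 0), (0, 1), (0, -1)])
```

Perfectly correlated descriptors put all variance in one mode, while isotropic scatter splits it evenly; comparing such percentage profiles between samples is the kind of check the abstract reports across the three helix classes.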

ff19SB: Amino-Acid Specific Protein Backbone Parameters Trained Against Quantum Mechanics Energy Surfaces in Solution

10.26434/chemrxiv.8279681.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Chuan Tian ◽

Koushik Kasavajhala ◽

Kellon Belfon ◽

Lauren Raguette ◽

He Huang ◽

...

Keyword(s):

Amino Acids ◽

Quantum Mechanics ◽

Amino Acid ◽

Protein Design ◽

Md Simulations ◽

Data Bank ◽

Specific Protein ◽

Water Model ◽

Solvent Polarization ◽

Energy Surfaces

Molecular dynamics (MD) simulations have become increasingly popular in studying the motions and functions of biomolecules. The accuracy of a simulation, however, depends strongly on the molecular mechanics (MM) force field (FF), a set of functions with adjustable parameters to compute the potential energies from atomic positions. The overall quality of an FF, such as our previously published ff99SB and ff14SB, can be limited by assumptions that were made years ago. In the updated model presented here (ff19SB), we have significantly improved the backbone profiles for all 20 amino acids. We fit coupled ϕ/ψ parameters using 2D ϕ/ψ conformational scans for multiple amino acids, using as reference data the entire 2D quantum mechanics (QM) energy surface. We address the polarization inconsistency during dihedral parameter fitting by using both QM and MM in solution. Finally, we examine possible dependency of the backbone fitting on side chain rotamer. To extensively validate ff19SB parameters, we have performed a total of ~5 milliseconds of MD simulations in explicit solvent. Our results show that after amino-acid specific training against QM data with solvent polarization, ff19SB not only reproduces the differences in amino acid specific Protein Data Bank (PDB) Ramachandran maps better, but also shows significantly improved capability to differentiate amino acid dependent properties such as helical propensities. We also conclude that an inherent underestimation of helicity is present in ff14SB, which is (inexactly) compensated by an increase in helical content driven by the TIP3P bias toward overly compact structures. In summary, ff19SB, when combined with a more accurate water model such as OPC, should have better predictive power for modeling sequence-specific behavior, protein mutations, and also rational protein design.

Protein sequence design by explicit energy landscape optimization

10.1101/2020.07.23.218917 ◽

2020 ◽

Cited By ~ 1

Author(s):

Christoffer Norn ◽

Basile I. M. Wicky ◽

David Juergens ◽

Sirui Liu ◽

David Kim ◽

...

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Protein Design ◽

Structure Prediction ◽

De Novo ◽

Large Fraction ◽

Amino Acid Sequences ◽

Alternative States ◽

Point Energy ◽

Sequence Design

Abstract

The protein design problem is to identify an amino acid sequence which folds to a desired structure. Given Anfinsen’s thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the lowest energy conformation is that structure. As this calculation involves not only all possible amino acid sequences but also all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest energy amino acid sequence for the desired structure. They often check by protein structure prediction in a second step that the desired structure is indeed the lowest energy conformation for the designed sequence, and discard the often large fraction of designed sequences for which this is not the case. Here we show that by backpropagating gradients through the trRosetta structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures, and in one calculation explicitly design amino acid sequences predicted to fold into the desired structure and not any other. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by landscape optimization to the standard fixed backbone sequence design methodology in Rosetta, and show that the results of the former, but not the latter, are sensitive to the presence of competing low-lying states. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low resolution trRosetta model serves to disfavor alternative states, and the high resolution Rosetta model, to create a deep energy minimum at the design target structure.

Significance

Computational protein design has primarily focused on finding sequences which have very low energy in the target designed structure. However, what is most relevant during folding is not the absolute energy of the folded state, but the energy difference between the folded state and the lowest lying alternative states. We describe a deep learning approach which captures the entire folding landscape, and show that it can enhance current protein design methods.
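
The point that the energy gap matters more than the absolute energy of the folded state can be made concrete with Boltzmann weights. The sketch below treats each state as a single energy level, which ignores conformational entropy, and uses kT ≈ 0.593 kcal/mol (roughly room temperature). The function name and the example energies are illustrative assumptions, not values from the paper.

```python
import math


def target_population(e_target, e_alternatives, kT=0.593):
    """Boltzmann population of the target state among a set of discrete states.

    Energies are in kcal/mol; kT defaults to roughly 298 K.
    """
    energies = [e_target, *e_alternatives]
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    return weights[0] / sum(weights)


# Same absolute target energy, different competing states.
wide_gap = target_population(-10.0, [-6.0, -5.5])
narrow_gap = target_population(-10.0, [-9.8, -9.5])
degenerate = target_population(-10.0, [-10.0])
```

Both designs have the same target energy, yet the one with low-lying competitors is populated far less often, which is why a method that also disfavors alternative states can improve on single-structure energy minimization.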

New families in the classification of glycosyl hydrolases based on amino acid sequence similarities

Biochemical Journal ◽

10.1042/bj2930781 ◽

1993 ◽

Vol 293 (3) ◽

pp. 781-788 ◽

Cited By ~ 1335

Author(s):

B Henrissat ◽

A Bairoch

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Protein Sequence ◽

Sequence Data ◽

Data Bank ◽

The Other ◽

Glycosyl Hydrolases ◽

Protein Sequence Data ◽

Sequence Similarities

301 glycosyl hydrolases and related enzymes corresponding to 39 EC entries of the I.U.B. classification system have been classified into 35 families on the basis of amino-acid-sequence similarities [Henrissat (1991) Biochem. J. 280, 309-316]. Approximately half of the families were found to be monospecific (containing only one EC number), whereas the other half were found to be polyspecific (containing at least two EC numbers). A >60% increase in sequence data for glycosyl hydrolases (181 additional enzyme or enzyme domain sequences have since become available) allowed us to update the classification not only by the addition of more members to already identified families, but also by the finding of ten new families. On the basis of a comparison of 482 sequences corresponding to 52 EC entries, 45 families, out of which 22 are polyspecific, can now be defined. This classification has been implemented in the SWISS-PROT protein sequence data bank.
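
The counts quoted in this abstract are mutually consistent, which a short arithmetic check makes explicit. The variable names below are only labels for the numbers stated in the text.

```python
# Counts as stated in the abstract.
original_sequences = 301
added_sequences = 181
original_families = 35
new_families = 10
polyspecific_families = 22

total_sequences = original_sequences + added_sequences   # matches the 482 compared
total_families = original_families + new_families        # matches the 45 defined
increase_pct = 100 * added_sequences / original_sequences  # just over 60%
polyspecific_share = polyspecific_families / total_families  # close to one half
```

The 181 added sequences are about 60.1% of the original 301, in line with the ">60% increase", and 22 polyspecific families out of 45 preserves the roughly even monospecific/polyspecific split reported for the earlier classification.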

Fragger: a protein fragment picker for structural queries

F1000Research ◽

10.12688/f1000research.12486.1 ◽

2017 ◽

Vol 6 ◽

pp. 1722

Author(s):

Francois Berenger ◽

David Simoncini ◽

Arnout Voet ◽

Rojan Shrestha ◽

Kam Y.J. Zhang

Keyword(s):

Amino Acid ◽

Protein Design ◽

Data Bank ◽

Amino Acid Sequences ◽

Structural Fragment ◽

Protein Fragment ◽

Distance Threshold ◽

Protein Fragments ◽

Specific Subset ◽

Design Activities

Protein modeling and design activities often require querying the Protein Data Bank (PDB) with a structural fragment, possibly containing gaps. For some applications, it is preferable to work on a specific subset of the PDB or with unpublished structures. These requirements, along with specific user needs, motivated the creation of new software to manage and query 3D protein fragments. Fragger is a protein fragment picker that allows protein fragment databases to be created and queried. All fragment lengths are supported and any set of PDB files can be used to create a database. Fragger can efficiently search a fragment database with a query fragment and a distance threshold. Matching fragments are ranked by distance to the query. The query fragment can have structural gaps, and the amino acid sequences allowed to match a query can be constrained via a regular expression of one-letter amino acid codes. Fragger also incorporates a tool to compute the backbone RMSD of one versus many fragments in high throughput. Fragger should be useful for protein design, loop grafting and related structural bioinformatics tasks.