Fragger: a protein fragment picker for structural queries

Protein modeling and design activities often require querying the Protein Data Bank (PDB) with a structural fragment, possibly containing gaps. For some applications, it is preferable to work on a specific subset of the PDB or with unpublished structures. These requirements, along with specific user needs, motivated the creation of new software to manage and query 3D protein fragments. Fragger is a protein fragment picker that allows protein fragment databases to be created and queried. All fragment lengths are supported, and any set of PDB files can be used to create a database. Fragger can efficiently search a fragment database with a query fragment and a distance threshold. Matching fragments are ranked by distance to the query. The query fragment can have structural gaps, and the amino acid sequences allowed to match a query can be constrained via a regular expression of one-letter amino acid codes. Fragger also incorporates a tool to compute the backbone RMSD of one versus many fragments in high throughput. Fragger should be useful for protein design, loop grafting and related structural bioinformatics tasks.
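The query described in this abstract (a distance threshold, ranking by distance, and a regular expression over one-letter codes) can be illustrated with a small sketch. This is not Fragger's interface or implementation: the `Fragment` record, the `query_fragments` function and the precomputed `distance` field are hypothetical names, and the distances are invented numbers standing in for a value such as a backbone RMSD computed elsewhere.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical record for one database fragment."""
    source: str      # illustrative identifier of the fragment
    sequence: str    # one-letter amino acid codes
    distance: float  # assumed precomputed distance to the query


def query_fragments(fragments, max_distance, sequence_pattern=None):
    """Keep fragments within max_distance whose whole sequence matches the
    optional regular expression, then rank them by increasing distance."""
    regex = re.compile(sequence_pattern) if sequence_pattern else None
    hits = [
        f for f in fragments
        if f.distance <= max_distance
        and (regex is None or regex.fullmatch(f.sequence))
    ]
    return sorted(hits, key=lambda f: f.distance)


# Toy database with invented values, for illustration only.
database = [
    Fragment("frag_a", "GSPGA", 0.8),
    Fragment("frag_b", "GAPGA", 0.4),
    Fragment("frag_c", "KLPEA", 0.3),
    Fragment("frag_d", "GTPGA", 2.5),
]

# Glycine first, any residue second, proline third, then any two residues.
ranked = query_fragments(database, max_distance=1.0, sequence_pattern=r"G.P..")
ranked_sources = [f.source for f in ranked]
```

In this toy run, `frag_c` is closest to the query but is rejected by the sequence pattern, and `frag_d` matches the pattern but exceeds the distance threshold, so only `frag_b` and `frag_a` are returned, in that order.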


A structural homology approach for computational protein design with flexible backbone

Bioinformatics ◽

10.1093/bioinformatics/bty975 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2418-2426 ◽

Cited By ~ 2

Author(s):

David Simoncini ◽

Kam Y J Zhang ◽

Thomas Schiex ◽

Sophie Barbe

Keyword(s):

Amino Acid ◽

Protein Design ◽

Protein Sequence ◽

Critical Role ◽

Protein Structures ◽

Amino Acid Sequences ◽

Computational Protein Design ◽

Supplementary Information ◽

Structural Homology ◽

Homologous Proteins

Motivation: Structure-based computational protein design (CPD) plays a critical role in advancing the field of protein engineering. Using an all-atom energy function, CPD tries to identify amino acid sequences that fold into a target structure and ultimately perform a desired function. Energy functions, however, remain imperfect, and injecting relevant information from known structures into the design process should lead to improved designs. Results: We introduce Shades, a data-driven CPD method that exploits local structural environments in known protein structures together with energy to guide sequence design, while sampling side-chain and backbone conformations to accommodate mutations. Shades (Structural Homology Algorithm for protein DESign) is based on customized libraries of non-contiguous in-contact amino acid residue motifs. We have tested Shades on a public benchmark of 40 proteins selected from different protein families. When excluding homologous proteins, Shades achieved a protein sequence recovery of 30% and a protein sequence similarity of 46% on average, compared with the PFAM protein family of the target protein. When homologous structures were added, the wild-type sequence recovery rate reached 93%. Availability and implementation: Shades source code is available at https://bitbucket.org/satsumaimo/shades as a patch for Rosetta 3.8 with a curated protein structure database and ITEM library creation software. Supplementary information: Supplementary data are available at Bioinformatics online.
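The abstract reports sequence recovery without defining it. A common convention, assumed here and not confirmed by the source, is the fraction of aligned positions at which the designed residue is identical to the reference residue. The function name `sequence_recovery` and the two toy sequences are illustrative only.

```python
def sequence_recovery(designed: str, reference: str) -> float:
    """Fraction of positions where the designed residue equals the reference.

    Assumes both sequences are already aligned, gap-free and of equal length.
    """
    if len(designed) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(d == r for d, r in zip(designed, reference))
    return matches / len(reference)


# Invented 10-residue example: 7 of 10 positions are identical.
native = "MKTAYIAKQR"
design = "MKSAYLAKQK"
recovery = sequence_recovery(design, native)
```

Sequence similarity, also reported in the abstract, would additionally need a rule for which non-identical residues count as similar (for example a substitution matrix threshold), which the abstract does not specify.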


Challenges in the computational design of proteins

Journal of The Royal Society Interface ◽

10.1098/rsif.2008.0508.focus ◽

2009 ◽

Vol 6 (suppl_4) ◽

Cited By ~ 38

Author(s):

María Suárez ◽

Alfonso Jaramillo

Keyword(s):

Amino Acid ◽

Structural Biology ◽

Protein Design ◽

Current Knowledge ◽

Computational Design ◽

Amino Acid Sequences ◽

Computational Protein Design ◽

Energy Functions ◽

Physical Description ◽

Atomic Interactions

Protein design has many applications not only in biotechnology but also in basic science. It uses our current knowledge in structural biology to predict, by computer simulations, an amino acid sequence that would produce a protein with targeted properties. As in other examples of synthetic biology, this approach allows the testing of many hypotheses in biology. The recent development of automated computational methods to design proteins has enabled proteins to be designed that are very different from any known ones. Moreover, some of those methods mostly rely on a physical description of atomic interactions, which allows the designed sequences not to be biased towards known proteins. In this paper, we will describe the use of energy functions in computational protein design, the use of atomic models to evaluate the free energy in the unfolded and folded states, the exploration and optimization of amino acid sequences, the problem of negative design and the design of biomolecular function. We will also consider its use together with the experimental techniques such as directed evolution. We will end by discussing the challenges ahead in computational protein design and some of their future applications.


Complex evolutionary footprints revealed in an analysis of reused protein segments of diverse lengths

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1707642114 ◽

2017 ◽

Vol 114 (44) ◽

pp. 11703-11708 ◽

Cited By ~ 27

Author(s):

Sergey Nepomnyachiy ◽

Nir Ben-Tal ◽

Rachel Kolodny

Keyword(s):

Amino Acid ◽

Protein Design ◽

De Novo ◽

Data Bank ◽

Similar Sequence ◽

Design Data ◽

Structural Domains ◽

Evolutionary Advantage ◽

Protein Alignments

Proteins share similar segments with one another. Such “reused parts”—which have been successfully incorporated into other proteins—are likely to offer an evolutionary advantage over de novo evolved segments, as most of the latter will not even have the capacity to fold. To systematically explore the evolutionary traces of segment “reuse” across proteins, we developed an automated methodology that identifies reused segments from protein alignments. We search for “themes”—segments of at least 35 residues of similar sequence and structure—reused within representative sets of 15,016 domains [Evolutionary Classification of Protein Domains (ECOD) database] or 20,398 chains [Protein Data Bank (PDB)]. We observe that theme reuse is highly prevalent and that reuse is more extensive when the length threshold for identifying a theme is lower. Structural domains, the best characterized form of reuse in proteins, are just one of many complex and intertwined evolutionary traces. Others include long themes shared among a few proteins, which encompass and overlap with shorter themes that recur in numerous proteins. The observed complexity is consistent with evolution by duplication and divergence, and some of the themes might include descendants of ancestral segments. The observed recursive footprints, where the same amino acid can simultaneously participate in several intertwined themes, could be a useful concept for protein design. Data are available at http://trachel-srv.cs.haifa.ac.il/rachel/ppi/themes/.


ff19SB: Amino-Acid Specific Protein Backbone Parameters Trained Against Quantum Mechanics Energy Surfaces in Solution

10.26434/chemrxiv.8279681 ◽

2019 ◽

Cited By ~ 3

Author(s):

Chuan Tian ◽

Koushik Kasavajhala ◽

Kellon Belfon ◽

Lauren Raguette ◽

He Huang ◽

...

Keyword(s):

Amino Acids ◽

Quantum Mechanics ◽

Amino Acid ◽

Protein Design ◽

MD Simulations ◽

Data Bank ◽

Specific Protein ◽

Water Model ◽

Solvent Polarization ◽

Energy Surfaces

Molecular dynamics (MD) simulations have become increasingly popular in studying the motions and functions of biomolecules. The accuracy of the simulation, however, is highly determined by the molecular mechanics (MM) force field (FF), a set of functions with adjustable parameters to compute the potential energies from atomic positions. However, the overall quality of the FF, such as our previously published ff99SB and ff14SB, can be limited by assumptions that were made years ago. In the updated model presented here (ff19SB), we have significantly improved the backbone profiles for all 20 amino acids. We fit coupled ϕ/ψ parameters using 2D ϕ/ψ conformational scans for multiple amino acids, using as reference data the entire 2D quantum mechanics (QM) energy surface. We address the polarization inconsistency during dihedral parameter fitting by using both QM and MM in solution. Finally, we examine possible dependency of the backbone fitting on side chain rotamer. To extensively validate ff19SB parameters, we have performed a total of ~5 milliseconds MD simulations in explicit solvent. Our results show that after amino-acid specific training against QM data with solvent polarization, ff19SB not only reproduces the differences in amino acid specific Protein Data Bank (PDB) Ramachandran maps better, but also shows significantly improved capability to differentiate amino acid dependent properties such as helical propensities. We also conclude that an inherent underestimation of helicity is present in ff14SB, which is (inexactly) compensated by an increase in helical content driven by the TIP3P bias toward overly compact structures. In summary, ff19SB, when combined with a more accurate water model such as OPC, should have better predictive power for modeling sequence-specific behavior, protein mutations, and also rational protein design.


Atypical Structural Tendencies Among Low-Complexity Domains in the Protein Data Bank Proteome

10.1101/807438 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sean M. Cascarina ◽

Mikaela R. Elder ◽

Eric D. Ross

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Secondary Structure ◽

Physical Properties ◽

Protein Data Bank ◽

Data Bank ◽

Low Complexity ◽

Amino Acid Sequences ◽

Single Amino Acid ◽

Intrinsically Disordered

A variety of studies have suggested that low-complexity domains (LCDs) tend to be intrinsically disordered and are relatively rare within structured proteins in the Protein Data Bank (PDB). Although LCDs are often treated as a single class, we previously found that LCDs enriched in different amino acids can exhibit substantial differences in protein metabolism and function. Therefore, we wondered whether the structural conformations of LCDs are likewise dependent on which specific amino acids are enriched within each LCD. Here, we directly examined relationships between enrichment of individual amino acids and secondary structure preferences across the entire PDB proteome. Secondary structure preferences varied as a function of the identity of the amino acid enriched and its degree of enrichment. Furthermore, divergence in secondary structure profiles often occurred for LCDs enriched in physicochemically similar amino acids (e.g. valine vs. leucine), indicating that LCDs composed of related amino acids can have distinct secondary structure preferences. Comparison of LCD secondary structure preferences with numerous pre-existing secondary structure propensity scales resulted in relatively poor correlations for certain types of LCDs, indicating that these scales may not capture secondary structure preferences as sequence complexity decreases. Collectively, these observations provide a highly resolved view of structural preferences among LCDs parsed by the nature and magnitude of single amino acid enrichment. Author Summary: The structures that proteins adopt are directly related to their amino acid sequences. Low-complexity domains (LCDs) in protein sequences are unusual regions made up of only a few different types of amino acids. Although this is the key feature that classifies sequences as LCDs, the physical properties of LCDs will differ based on the types of amino acids that are found in each domain. For example, the sequences “AAAAAAAAAA”, “EEEEEEEEEE”, and “EEKRKEEEKE” will have very different properties, even though they would all be classified as LCDs by traditional methods. In a previous study, we developed a new method to further divide LCDs into categories that more closely reflect the differences in their physical properties. In this study, we apply that approach to examine the structures of LCDs when sorted into different categories based on their amino acids. This allowed us to define relationships between the types of amino acids in the LCDs and their corresponding structures. Since protein structure is closely related to protein function, this has important implications for understanding the basic functions and properties of LCDs in a variety of proteins.
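The three example sequences in the author summary can be used to illustrate why a single complexity score does not separate them. The sketch below uses Shannon entropy of the residue composition, which is one common complexity measure and is assumed here for illustration; it is not claimed to be the authors' classification method. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits) of the residue composition of a sequence."""
    n = len(sequence)
    counts = Counter(sequence)
    # p * log2(1/p) summed over observed residue types.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def composition(sequence: str) -> dict:
    """Residue fractions, most enriched residue first."""
    n = len(sequence)
    return {aa: c / n for aa, c in Counter(sequence).most_common()}


examples = ["AAAAAAAAAA", "EEEEEEEEEE", "EEKRKEEEKE"]
entropies = {seq: shannon_entropy(seq) for seq in examples}
most_enriched = {seq: next(iter(composition(seq))) for seq in examples}
```

All three sequences score far below the maximum of log2(20), about 4.32 bits, for a uniform 20-letter composition, and the two homopolymers both score exactly zero. Only the composition distinguishes the alanine-rich sequence from the glutamate-rich ones, which is the distinction the abstract argues matters.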


De novo protein design by deep network hallucination

10.1101/2020.07.22.211482 ◽

2020 ◽

Cited By ~ 2

Author(s):

Ivan Anishchenko ◽

Tamuka M. Chidyausiku ◽

Sergey Ovchinnikov ◽

Samuel J. Pellock ◽

David Baker

Keyword(s):

Amino Acid ◽

Protein Design ◽

Structure Prediction ◽

De Novo ◽

Protein Structures ◽

Monte Carlo Sampling ◽

Amino Acid Sequences ◽

Wide Range ◽

Physically Based ◽

Folded Proteins

There has been considerable recent progress in protein structure prediction using deep neural networks to infer distance constraints from amino acid residue co-evolution. We investigated whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generated random amino acid sequences and input them into the trRosetta structure prediction network to predict starting distance maps, which as expected are quite featureless. We then carried out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (KL-divergence) between the distance distributions predicted by the network and the background distribution. Optimization from different random starting points resulted in a wide range of proteins with diverse sequences and all alpha, all beta sheet, and mixed alpha-beta structures. We obtained synthetic genes encoding 129 of these network hallucinated sequences, expressed and purified the proteins in E. coli, and found that 27 folded to monomeric stable structures with circular dichroism spectra consistent with the hallucinated structures. Thus deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute, alongside traditional physically based models, to the de novo design of proteins with new functions.
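The optimization loop described in this abstract (Monte Carlo moves in sequence space that increase the KL divergence between a predicted distribution and a background distribution) can be sketched in a few lines. Everything below is a toy under stated assumptions: `toy_predictor` is a placeholder that simply returns the residue composition of the sequence and has nothing to do with trRosetta, the uniform `BACKGROUND` is invented, and the Metropolis acceptance rule, move set and temperature are assumptions not taken from the abstract.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = [1.0 / len(ALPHABET)] * len(ALPHABET)  # assumed uniform background


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_predictor(sequence):
    """Placeholder for a prediction network: returns residue composition."""
    return [sequence.count(aa) / len(sequence) for aa in ALPHABET]


def monte_carlo_design(predict, background, length, steps, temperature, seed=0):
    """Single-residue mutation moves accepted with a Metropolis criterion
    that favors higher KL divergence from the background."""
    rng = random.Random(seed)
    sequence = [rng.choice(ALPHABET) for _ in range(length)]
    score = kl_divergence(predict(sequence), background)
    for _ in range(steps):
        candidate = list(sequence)
        candidate[rng.randrange(length)] = rng.choice(ALPHABET)
        new_score = kl_divergence(predict(candidate), background)
        delta = new_score - score
        # Always accept improvements; accept a worse score with
        # probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            sequence, score = candidate, new_score
    return "".join(sequence), score


designed, final_score = monte_carlo_design(
    toy_predictor, BACKGROUND, length=30, steps=500, temperature=0.05, seed=1
)
```

With this placeholder predictor the contrast is maximized by a homopolymer, which is biologically meaningless; the point is only the structure of the loop, in which a real structure-prediction network would replace `toy_predictor` and distance distributions would replace residue composition.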


Massively parallel interrogation of protein fragment secretability using SECRiFY reveals features influencing secretory system transit

10.1101/241349 ◽

2018 ◽

Author(s):

M. Boone ◽

P. Ramasamy ◽

J. Zuallaert ◽

R. Bouwmeester ◽

B. Van Moer ◽

...

Keyword(s):

Protein Design ◽

De Novo ◽

Recombinant Protein Expression ◽

Surface Display ◽

Protein Fragment ◽

Chain Flexibility ◽

Protein Fragments ◽

Protein Biogenesis ◽

Secretory System ◽

Fragment Libraries

While transcriptome- and proteome-wide technologies to assess processes in protein biogenesis are now widely available, we still lack global approaches to assay post-ribosomal biogenesis events, in particular those occurring in the eukaryotic secretory system. We here developed a method, SECRiFY, to simultaneously assess the secretability of >10^5 protein fragments by two yeast species, S. cerevisiae and P. pastoris, using custom fragment libraries, surface display and a sequencing-based readout. Screening human proteome fragments with a median size of 50-100 amino acids, we generated datasets that enable data mining into protein features underlying secretability, revealing a striking role for intrinsic disorder and chain flexibility. SECRiFY is the first methodology that generates sufficient amounts of annotated data for advanced machine learning methods to deduce secretability predictors. The finding that secretability is indeed a learnable feature of protein sequences is of significant impact in the broad area of recombinant protein expression and de novo protein design.


Size and structure of the sequence space of repeat proteins

10.1101/635581 ◽

2019 ◽

Author(s):

Jacopo Marchi ◽

Ezequiel A. Galpern ◽

Rocio Espada ◽

Diego U. Ferreiro ◽

Aleksandra M. Walczak ◽

...

Keyword(s):

Amino Acid ◽

Protein Design ◽

Amino Acid Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Repeat Proteins ◽

The Impact ◽

New Strategies ◽

Amino Acid Conservation

The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family (the total number of sequences in that family) can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ∼30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glass in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
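The link between entropy and the size of a coding space can be made concrete in the simplest case, where every alignment position is treated as independent. The sketch below covers only that independent-site approximation, computed from raw column frequencies without pseudocounts; it is not the authors' model, which also accounts for correlations between positions. The function names and the toy alignment are illustrative assumptions.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Shannon entropy (nats) of each column of an equal-length alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    n = len(alignment)
    entropies = []
    for column in zip(*alignment):
        counts = Counter(column)
        entropies.append(sum((c / n) * math.log(n / c) for c in counts.values()))
    return entropies


def effective_family_size(alignment):
    """exp(total entropy): effective number of sequences when each
    position is modeled independently of the others."""
    return math.exp(sum(site_entropies(alignment)))


# Invented alignment: position 1 is fully conserved, positions 2 and 3
# each use two residues equally often.
toy_alignment = ["ACD", "ACE", "AGD", "AGE"]
per_site = site_entropies(toy_alignment)
size = effective_family_size(toy_alignment)
random_size = 20 ** len(toy_alignment[0])  # completely random sequences
```

Here conservation alone shrinks the space from 20^3 = 8000 random sequences to an effective size of about 4. In a maximum entropy model that matches the same single-site frequencies, adding constraints on correlations between positions cannot increase the entropy, so the independent-site value acts as an upper estimate.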


ff19SB: Amino-Acid Specific Protein Backbone Parameters Trained Against Quantum Mechanics Energy Surfaces in Solution

10.26434/chemrxiv.8279681.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Chuan Tian ◽

Koushik Kasavajhala ◽

Kellon Belfon ◽

Lauren Raguette ◽

He Huang ◽

...

Keyword(s):

Amino Acids ◽

Quantum Mechanics ◽

Amino Acid ◽

Protein Design ◽

MD Simulations ◽

Data Bank ◽

Specific Protein ◽

Water Model ◽

Solvent Polarization ◽

Energy Surfaces

Molecular dynamics (MD) simulations have become increasingly popular in studying the motions and functions of biomolecules. The accuracy of the simulation, however, is highly determined by the molecular mechanics (MM) force field (FF), a set of functions with adjustable parameters to compute the potential energies from atomic positions. However, the overall quality of the FF, such as our previously published ff99SB and ff14SB, can be limited by assumptions that were made years ago. In the updated model presented here (ff19SB), we have significantly improved the backbone profiles for all 20 amino acids. We fit coupled ϕ/ψ parameters using 2D ϕ/ψ conformational scans for multiple amino acids, using as reference data the entire 2D quantum mechanics (QM) energy surface. We address the polarization inconsistency during dihedral parameter fitting by using both QM and MM in solution. Finally, we examine possible dependency of the backbone fitting on side chain rotamer. To extensively validate ff19SB parameters, we have performed a total of ~5 milliseconds MD simulations in explicit solvent. Our results show that after amino-acid specific training against QM data with solvent polarization, ff19SB not only reproduces the differences in amino acid specific Protein Data Bank (PDB) Ramachandran maps better, but also shows significantly improved capability to differentiate amino acid dependent properties such as helical propensities. We also conclude that an inherent underestimation of helicity is present in ff14SB, which is (inexactly) compensated by an increase in helical content driven by the TIP3P bias toward overly compact structures. In summary, ff19SB, when combined with a more accurate water model such as OPC, should have better predictive power for modeling sequence-specific behavior, protein mutations, and also rational protein design.
