Multi-Scale Structural Analysis of Proteins by Deep Semantic Segmentation

Mapping Intimacies ◽

10.1101/474627 ◽

2018 ◽

Author(s):

Raphael R. Eguchi ◽

Po-Ssu Huang

Keyword(s):

Image Classification ◽

Protein Design ◽

Large Scale ◽

De Novo ◽

Protein Structures ◽

Semantic Segmentation ◽

Amino Acid Sequences ◽

Structural Quality ◽

Small Subset ◽

Structural Prediction

AbstractRecent advancements in computational methods have facilitated large-scale sampling of protein structures, leading to breakthroughs in protein structural prediction and enabling de novo protein design. Establishing methods to identify candidate structures that can lead to native folds or designable structures remains a challenge, since few existing metrics capture high-level structural features such as architectures, folds, and conformity to conserved structural motifs. Convolutional Neural Networks (CNNs) have been successfully used in semantic segmentation — a subfield of image classification in which a class label is predicted for every pixel. Here, we apply semantic segmentation to protein structures as a novel strategy for fold identification and structural quality assessment. We represent protein structures as 2D α-carbon distance matrices (“contact maps”), and train a CNN that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database. Our model performs exceptionally well, achieving a per-residue accuracy of 90.8% on the test set (95.0% average accuracy over all classes; 87.8% average within-structure accuracy). The unique aspect of our classifier is that it encodes sequence agnostic residue environments from the PDB and can assess structural quality as quantitative probabilities. We demonstrate that individual class probabilities can be used as a metric that indicates the degree to which a randomly generated structure assumes a specific fold, as well as a metric that highlights non-conformative regions of a protein belonging to a known class. These capabilities yield a powerful tool for guiding structural sampling for both structural prediction and design.SignificanceRecent computational advances have allowed researchers to predict the structure of many proteins from their amino acid sequences, as well as designing new sequences that fold into predefined structures. However, these tasks are often challenging because they require selection of a small subset of promising structural models from a large pool of stochastically generated ones. Here, we describe a novel approach to protein model selection that uses 2D image classification techniques to evaluate 3D protein models. Our method can be used to select structures based on the fold that they adopt, and can also be used to identify regions of low structural quality. These capabilities yield a powerful tool for both protein design and structure prediction.

Download Full-text

Multi-scale structural analysis of proteins by deep semantic segmentation

Bioinformatics ◽

10.1093/bioinformatics/btz650 ◽

2019 ◽

Vol 36 (6) ◽

pp. 1740-1749 ◽

Cited By ~ 2

Author(s):

Raphael R Eguchi ◽

Po-Ssu Huang

Keyword(s):

Protein Design ◽

Large Scale ◽

De Novo ◽

Protein Structures ◽

Semantic Segmentation ◽

Structural Features ◽

Supplementary Information ◽

Structural Prediction ◽

Structure Accuracy ◽

High Level

Abstract Motivation Recent advances in computational methods have facilitated large-scale sampling of protein structures, leading to breakthroughs in protein structural prediction and enabling de novo protein design. Establishing methods to identify candidate structures that can lead to native folds or designable structures remains a challenge, since few existing metrics capture high-level structural features such as architectures, folds and conformity to conserved structural motifs. Convolutional Neural Networks (CNNs) have been successfully used in semantic segmentation—a subfield of image classification in which a class label is predicted for every pixel. Here, we apply semantic segmentation to protein structures as a novel strategy for fold identification and structure quality assessment. Results We train a CNN that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database. Our model achieves a high per-residue accuracy of 90.8% on the test set (95.0% average per-class accuracy; 87.8% average per-structure accuracy). We demonstrate that individual class probabilities can be used as a metric that indicates the degree to which a randomly generated structure assumes a specific fold, as well as a metric that highlights non-conformative regions of a protein belonging to a known class. These capabilities yield a powerful tool for guiding structural sampling for both structural prediction and design. Availability and implementation The trained classifier network, parser network, and entropy calculation scripts are available for download at https://git.io/fp6bd, with detailed usage instructions provided at the download page. A step-by-step tutorial for setup is provided at https://goo.gl/e8GB2S. All Rosetta commands, RosettaRemodel blueprints, and predictions for all datasets used in the study are available in the Supplementary Information. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

De novo protein design, a retrospective

Quarterly Reviews of Biophysics ◽

10.1017/s0033583519000131 ◽

2020 ◽

Vol 53 ◽

Cited By ~ 20

Author(s):

Ivan V. Korendovych ◽

William F. DeGrado

Keyword(s):

Membrane Protein ◽

Protein Design ◽

First Principles ◽

Rational Design ◽

De Novo ◽

Dynamic Properties ◽

Protein Structures ◽

Amino Acid Sequences ◽

Physical Models ◽

And Function

Abstract Proteins are molecular machines whose function depends on their ability to achieve complex folds with precisely defined structural and dynamic properties. The rational design of proteins from first-principles, or de novo, was once considered to be impossible, but today proteins with a variety of folds and functions have been realized. We review the evolution of the field from its earliest days, placing particular emphasis on how this endeavor has illuminated our understanding of the principles underlying the folding and function of natural proteins, and is informing the design of macromolecules with unprecedented structures and properties. An initial set of milestones in de novo protein design focused on the construction of sequences that folded in water and membranes to adopt folded conformations. The first proteins were designed from first-principles using very simple physical models. As computers became more powerful, the use of the rotamer approximation allowed one to discover amino acid sequences that stabilize the desired fold. As the crystallographic database of protein structures expanded in subsequent years, it became possible to construct proteins by assembling short backbone fragments that frequently recur in Nature. The second set of milestones in de novo design involves the discovery of complex functions. Proteins have been designed to bind a variety of metals, porphyrins, and other cofactors. The design of proteins that catalyze hydrolysis and oxygen-dependent reactions has progressed significantly. However, de novo design of catalysts for energetically demanding reactions, or even proteins that bind with high affinity and specificity to highly functionalized complex polar molecules remains an importnant challenge that is now being achieved. Finally, the protein design contributed significantly to our understanding of membrane protein folding and transport of ions across membranes. The area of membrane protein design, or more generally of biomimetic polymers that function in mixed or non-aqueous environments, is now becoming increasingly possible.

Download Full-text

De novo protein design by deep network hallucination

10.1101/2020.07.22.211482 ◽

2020 ◽

Cited By ~ 2

Author(s):

Ivan Anishchenko ◽

Tamuka M. Chidyausiku ◽

Sergey Ovchinnikov ◽

Samuel J. Pellock ◽

David Baker

Keyword(s):

Amino Acid ◽

Protein Design ◽

Structure Prediction ◽

De Novo ◽

Protein Structures ◽

Monte Carlo Sampling ◽

Amino Acid Sequences ◽

Wide Range ◽

Physically Based ◽

Folded Proteins

AbstractThere has been considerable recent progress in protein structure prediction using deep neural networks to infer distance constraints from amino acid residue co-evolution1–3. We investigated whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occuring proteins used in training the models. We generated random amino acid sequences, and input them into the trRosetta structure prediction network to predict starting distance maps, which as expected are quite featureless. We then carried out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (KL-divergence) between the distance distributions predicted by the network and the background distribution. Optimization from different random starting points resulted in a wide range of proteins with diverse sequences and all alpha, all beta sheet, and mixed alpha-beta structures. We obtained synthetic genes encoding 129 of these network hallucinated sequences, expressed and purified the proteins in E coli, and found that 27 folded to monomeric stable structures with circular dichroism spectra consistent with the hallucinated structures. Thus deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute, alongside traditional physically based models, to the de novo design of proteins with new functions.

Download Full-text

Dissecting the stability determinants of a challenging de novo protein fold using massively parallel design and experimentation

10.1101/2021.12.17.472837 ◽

2021 ◽

Author(s):

Tae-Eun Kim ◽

Kotaro Tsuboyama ◽

Scott Houliston ◽

Cydney M. Martell ◽

Claire M. Phoumyvong ◽

...

Keyword(s):

Protein Design ◽

Large Scale ◽

De Novo ◽

Protein Structures ◽

Design Problems ◽

Hydrogen Deuterium ◽

The Stability ◽

Universal Stability ◽

Scale Design ◽

New Protein

Designing entirely new protein structures remains challenging because we do not fully understand the biophysical determinants of folding stability. Yet some protein folds are easier to design than others. Previous work identified the 43-residue αββ&#945 fold as especially challenging: the best designs had only a 2% success rate, compared to 39-87% success for other simple folds (1). This suggested the αββ&#945 fold would be a useful model system for gaining a deeper understanding of folding stability determinants and for testing new protein design methods. Here, we designed over ten thousand new αββ&#945 proteins and found over three thousand of them to fold into stable structures using a high-throughput protease-based assay. Nuclear magnetic resonance, hydrogen-deuterium exchange, circular dichroism, deep mutational scanning, and scrambled sequence control experiments indicated that our stable designs fold into their designed αββ&#945 structures with exceptional stability for their small size. Our large dataset enabled us to quantify the influence of universal stability determinants including nonpolar burial, helix capping, and buried unsatisfied polar atoms, as well as stability determinants unique to the αββ&#945 topology. Our work demonstrates how large-scale design and test cycles can solve challenging design problems while illuminating the biophysical determinants of folding.

Download Full-text

Protein designer David Baker: I like doing things that seem like magic

National Science Review ◽

10.1093/nsr/nwaa071 ◽

2020 ◽

Vol 7 (8) ◽

pp. 1410-1412

Author(s):

Weijie Zhao ◽

Chu Wang

Keyword(s):

Protein Design ◽

De Novo ◽

Protein Structures ◽

Computational Prediction ◽

Biological Functions ◽

Personal Experiences ◽

De Novo Protein Design ◽

And Function ◽

The University ◽

Opening Up

Abstract Search ‘de novo protein design’ on Google and you will find the name David Baker in all results of the first page. Professor David Baker at the University of Washington and other scientists are opening up a new world of fantastic proteins. Protein is the direct executor of most biological functions and its structure and function are fully determined by its primary sequence. Baker's group developed the Rosetta software suite that enabled the computational prediction and design of protein structures. Being able to design proteins from scratch means being able to design executors for diverse purposes and benefit society in multiple ways. Recently, NSR interviewed Prof. Baker on this fast-developing field and his personal experiences.

Download Full-text

De novo protein design: how do we expand into the universe of possible protein structures?

Current Opinion in Structural Biology ◽

10.1016/j.sbi.2015.05.009 ◽

2015 ◽

Vol 33 ◽

pp. 16-26 ◽

Cited By ~ 110

Author(s):

Derek N Woolfson ◽

Gail J Bartlett ◽

Antony J Burton ◽

Jack W Heal ◽

Ai Niitsu ◽

...

Keyword(s):

Protein Design ◽

De Novo ◽

Protein Structures ◽

De Novo Protein Design ◽

The Universe

Download Full-text

Computational protein design with backbone plasticity

Biochemical Society Transactions ◽

10.1042/bst20160155 ◽

2016 ◽

Vol 44 (5) ◽

pp. 1523-1529 ◽

Cited By ~ 13

Author(s):

James T. MacDonald ◽

Paul S. Freemont

Keyword(s):

Protein Design ◽

De Novo ◽

Protein Structures ◽

Search Space ◽

Computational Protein Design ◽

Artificial Enzymes ◽

Backbone Flexibility ◽

Artificial Proteins ◽

Naturally Occurring ◽

Backbone Structure

The computational algorithms used in the design of artificial proteins have become increasingly sophisticated in recent years, producing a series of remarkable successes. The most dramatic of these is the de novo design of artificial enzymes. The majority of these designs have reused naturally occurring protein structures as ‘scaffolds’ onto which novel functionality can be grafted without having to redesign the backbone structure. The incorporation of backbone flexibility into protein design is a much more computationally challenging problem due to the greatly increased search space, but promises to remove the limitations of reusing natural protein scaffolds. In this review, we outline the principles of computational protein design methods and discuss recent efforts to consider backbone plasticity in the design process.

Download Full-text

Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution

10.1101/2021.04.13.439703 ◽

2021 ◽

Author(s):

Chris Papadopoulos ◽

Isabelle Callebaut ◽

Jean-Christophe Gelly ◽

Isabelle Hatin ◽

Olivier Namy ◽

...

Keyword(s):

Protein Structure ◽

Amino Acid ◽

De Novo ◽

Protein Structures ◽

Structural Diversity ◽

Building Blocks ◽

Amino Acid Sequences ◽

Novel Genes ◽

Noncoding Sequences ◽

De Novo Gene

The noncoding genome plays an important role in de novo gene birth and in the emergence of genetic novelty. Nevertheless, how noncoding sequences' properties could promote the birth of novel genes and shape the evolution and the structural diversity of proteins remains unclear. Therefore, by combining different bioinformatic approaches, we characterized the fold potential diversity of the amino acid sequences encoded by all intergenic ORFs (Open Reading Frames) of S. cerevisiae with the aim of (i) exploring whether the large structural diversity observed in proteomes is already present in noncoding sequences, and (ii) estimating the potential of the noncoding genome to produce novel protein bricks that can either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Moreover, they encompass the large structural diversity of canonical proteins with strikingly the majority predicted as foldable. Then, we investigated the early stages of de novo gene birth by identifying intergenic ORFs with a strong translation signal in ribosome profiling experiments and by reconstructing the ancestral sequences of 70 yeast de novo genes. This enabled us to highlight sequence and structural factors determining de novo gene emergence. Finally, we showed a strong correlation between the fold potential of de novo proteins and the one of their ancestral amino acid sequences, reflecting the relationship between the noncoding genome and the protein structure universe.

Download Full-text

A structural homology approach for computational protein design with flexible backbone

Bioinformatics ◽

10.1093/bioinformatics/bty975 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2418-2426 ◽

Cited By ~ 2

Author(s):

David Simoncini ◽

Kam Y J Zhang ◽

Thomas Schiex ◽

Sophie Barbe

Keyword(s):

Amino Acid ◽

Protein Design ◽

Protein Sequence ◽

Critical Role ◽

Protein Structures ◽

Amino Acid Sequences ◽

Computational Protein Design ◽

Supplementary Information ◽

Structural Homology ◽

Homologous Proteins

Abstract Motivation Structure-based Computational Protein design (CPD) plays a critical role in advancing the field of protein engineering. Using an all-atom energy function, CPD tries to identify amino acid sequences that fold into a target structure and ultimately perform a desired function. Energy functions remain however imperfect and injecting relevant information from known structures in the design process should lead to improved designs. Results We introduce Shades, a data-driven CPD method that exploits local structural environments in known protein structures together with energy to guide sequence design, while sampling side-chain and backbone conformations to accommodate mutations. Shades (Structural Homology Algorithm for protein DESign), is based on customized libraries of non-contiguous in-contact amino acid residue motifs. We have tested Shades on a public benchmark of 40 proteins selected from different protein families. When excluding homologous proteins, Shades achieved a protein sequence recovery of 30% and a protein sequence similarity of 46% on average, compared with the PFAM protein family of the target protein. When homologous structures were added, the wild-type sequence recovery rate achieved 93%. Availability and implementation Shades source code is available at https://bitbucket.org/satsumaimo/shades as a patch for Rosetta 3.8 with a curated protein structure database and ITEM library creation software. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Computational design of closely related proteins that adopt two well-defined but structurally divergent folds

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1914808117 ◽

2020 ◽

Vol 117 (13) ◽

pp. 7208-7215 ◽

Cited By ~ 5

Author(s):

Kathy Y. Wei ◽

Danai Moschidi ◽

Matthew J. Bick ◽

Santrupti Nerli ◽

Andrew C. McShan ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Conformational Transition ◽

Protein Structures ◽

Computational Design ◽

Spectroscopic Characterization ◽

De Novo Design ◽

Viral Fusion ◽

Naturally Occurring ◽

Related Proteins

The plasticity of naturally occurring protein structures, which can change shape considerably in response to changes in environmental conditions, is critical to biological function. While computational methods have been used for de novo design of proteins that fold to a single state with a deep free-energy minimum [P.-S. Huang, S. E. Boyken, D. Baker, Nature 537, 320–327 (2016)], and to reengineer natural proteins to alter their dynamics [J. A. Davey, A. M. Damry, N. K. Goto, R. A. Chica, Nat. Chem. Biol. 13, 1280–1285 (2017)] or fold [P. A. Alexander, Y. He, Y. Chen, J. Orban, P. N. Bryan, Proc. Natl. Acad. Sci. U.S.A. 106, 21149–21154 (2009)], the de novo design of closely related sequences which adopt well-defined but structurally divergent structures remains an outstanding challenge. We designed closely related sequences (over 94% identity) that can adopt two very different homotrimeric helical bundle conformations—one short (∼66 Å height) and the other long (∼100 Å height)—reminiscent of the conformational transition of viral fusion proteins. Crystallographic and NMR spectroscopic characterization shows that both the short- and long-state sequences fold as designed. We sought to design bistable sequences for which both states are accessible, and obtained a single designed protein sequence that populates either the short state or the long state depending on the measurement conditions. The design of sequences which are poised to adopt two very different conformations sets the stage for creating large-scale conformational switches between structurally divergent forms.

Download Full-text