Multi-scale structural analysis of proteins by deep semantic segmentation

Abstract Motivation Recent advances in computational methods have facilitated large-scale sampling of protein structures, leading to breakthroughs in protein structural prediction and enabling de novo protein design. Establishing methods to identify candidate structures that can lead to native folds or designable structures remains a challenge, since few existing metrics capture high-level structural features such as architectures, folds and conformity to conserved structural motifs. Convolutional Neural Networks (CNNs) have been successfully used in semantic segmentation—a subfield of image classification in which a class label is predicted for every pixel. Here, we apply semantic segmentation to protein structures as a novel strategy for fold identification and structure quality assessment. Results We train a CNN that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database. Our model achieves a high per-residue accuracy of 90.8% on the test set (95.0% average per-class accuracy; 87.8% average per-structure accuracy). We demonstrate that individual class probabilities can be used as a metric that indicates the degree to which a randomly generated structure assumes a specific fold, as well as a metric that highlights non-conformative regions of a protein belonging to a known class. These capabilities yield a powerful tool for guiding structural sampling for both structural prediction and design. Availability and implementation The trained classifier network, parser network, and entropy calculation scripts are available for download at https://git.io/fp6bd, with detailed usage instructions provided at the download page. A step-by-step tutorial for setup is provided at https://goo.gl/e8GB2S. All Rosetta commands, RosettaRemodel blueprints, and predictions for all datasets used in the study are available in the Supplementary Information. 
Supplementary information Supplementary data are available at Bioinformatics online.
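The three accuracy figures quoted in the abstract average over different units (residues, classes, structures). The sketch below shows one plausible way to compute each; the exact definitions used by the authors are not given here, so the function name `accuracy_summary`, its input layout, and the averaging conventions are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_summary(structures):
    """structures: list of (true_labels, predicted_labels) pairs, one per protein.

    Hypothetical helper; the averaging conventions are assumptions.
    """
    total = correct = 0
    per_class = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    per_structure = []
    for true, pred in structures:
        hits = 0
        for t, p in zip(true, pred):
            per_class[t][1] += 1
            if t == p:
                per_class[t][0] += 1
                hits += 1
        total += len(true)
        correct += hits
        per_structure.append(hits / len(true))
    return {
        # fraction of all residues labelled correctly
        "per_residue": correct / total,
        # mean over classes of the fraction correct within each true class
        "per_class": sum(c / n for c, n in per_class.values()) / len(per_class),
        # mean over structures of the fraction correct within each structure
        "per_structure": sum(per_structure) / len(per_structure),
    }
```

Because the three numbers weight residues differently, they can diverge, as they do in the abstract: large structures dominate the per-residue figure, while rare classes count equally in the per-class figure.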

Multi-Scale Structural Analysis of Proteins by Deep Semantic Segmentation

10.1101/474627 ◽

2018 ◽

Author(s):

Raphael R. Eguchi ◽

Po-Ssu Huang

Keyword(s):

Image Classification ◽

Protein Design ◽

Large Scale ◽

De Novo ◽

Protein Structures ◽

Semantic Segmentation ◽

Amino Acid Sequences ◽

Structural Quality ◽

Small Subset ◽

Structural Prediction

Abstract Recent advancements in computational methods have facilitated large-scale sampling of protein structures, leading to breakthroughs in protein structural prediction and enabling de novo protein design. Establishing methods to identify candidate structures that can lead to native folds or designable structures remains a challenge, since few existing metrics capture high-level structural features such as architectures, folds, and conformity to conserved structural motifs. Convolutional Neural Networks (CNNs) have been successfully used in semantic segmentation, a subfield of image classification in which a class label is predicted for every pixel. Here, we apply semantic segmentation to protein structures as a novel strategy for fold identification and structural quality assessment. We represent protein structures as 2D α-carbon distance matrices (“contact maps”), and train a CNN that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database. Our model performs exceptionally well, achieving a per-residue accuracy of 90.8% on the test set (95.0% average accuracy over all classes; 87.8% average within-structure accuracy). The unique aspect of our classifier is that it encodes sequence-agnostic residue environments from the PDB and can assess structural quality as quantitative probabilities. We demonstrate that individual class probabilities can be used as a metric that indicates the degree to which a randomly generated structure assumes a specific fold, as well as a metric that highlights non-conformative regions of a protein belonging to a known class. These capabilities yield a powerful tool for guiding structural sampling for both structural prediction and design.

Significance Recent computational advances have allowed researchers to predict the structure of many proteins from their amino acid sequences, and to design new sequences that fold into predefined structures. However, these tasks are often challenging because they require selection of a small subset of promising structural models from a large pool of stochastically generated ones. Here, we describe a novel approach to protein model selection that uses 2D image classification techniques to evaluate 3D protein models. Our method can be used to select structures based on the fold that they adopt, and can also be used to identify regions of low structural quality. These capabilities yield a powerful tool for both protein design and structure prediction.
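The 2D α-carbon distance matrix mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `ca_distance_matrix` and the input format (a list of x, y, z tuples, one per residue) are assumptions.

```python
import math

def ca_distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise Euclidean distances
    between alpha-carbon coordinates (one (x, y, z) tuple per residue)."""
    n = len(ca_coords)
    return [
        [math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
        for i in range(n)
    ]
```

Entry (i, j) holds the distance between residues i and j, so the matrix has a zero diagonal and is unchanged by rotating or translating the structure, which is one reason such a representation suits image-style models.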

Dissecting the stability determinants of a challenging de novo protein fold using massively parallel design and experimentation

10.1101/2021.12.17.472837 ◽

2021 ◽

Author(s):

Tae-Eun Kim ◽

Kotaro Tsuboyama ◽

Scott Houliston ◽

Cydney M. Martell ◽

Claire M. Phoumyvong ◽

...

Keyword(s):

Protein Design ◽

Large Scale ◽

De Novo ◽

Protein Structures ◽

Design Problems ◽

Hydrogen Deuterium ◽

The Stability ◽

Universal Stability ◽

Scale Design ◽

New Protein

Designing entirely new protein structures remains challenging because we do not fully understand the biophysical determinants of folding stability. Yet some protein folds are easier to design than others. Previous work identified the 43-residue αββα fold as especially challenging: the best designs had only a 2% success rate, compared to 39-87% success for other simple folds (1). This suggested the αββα fold would be a useful model system for gaining a deeper understanding of folding stability determinants and for testing new protein design methods. Here, we designed over ten thousand new αββα proteins and found over three thousand of them to fold into stable structures using a high-throughput protease-based assay. Nuclear magnetic resonance, hydrogen-deuterium exchange, circular dichroism, deep mutational scanning, and scrambled sequence control experiments indicated that our stable designs fold into their designed αββα structures with exceptional stability for their small size. Our large dataset enabled us to quantify the influence of universal stability determinants including nonpolar burial, helix capping, and buried unsatisfied polar atoms, as well as stability determinants unique to the αββα topology. Our work demonstrates how large-scale design and test cycles can solve challenging design problems while illuminating the biophysical determinants of folding.

Quick and efficient approach to develop genomic resources in orphan species: Application in Lavandula angustifolia

PLoS ONE ◽

10.1371/journal.pone.0243853 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243853

Author(s):

Berline Fopa Fomeju ◽

Dominique Brunel ◽

Aurélie Bérard ◽

Jean-Baptiste Rivoal ◽

Philippe Gallois ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Rapid Development ◽

Genetic Distances ◽

Lavandula Angustifolia ◽

Distance Analysis ◽

Alternative Medicines ◽

Dna And Rna ◽

Snp Development ◽

High Level

Next-Generation Sequencing (NGS) technologies, by reducing the cost and increasing the throughput of sequencing, have opened the door to generating genomic data in a range of previously poorly studied species. In this study, we propose a method for the rapid development of large-scale molecular resources for orphan species. We studied as an example the true lavender (Lavandula angustifolia Mill.), a perennial sub-shrub plant native to the Mediterranean region whose essential oils have numerous applications in cosmetics, pharmaceuticals, and alternative medicines. The heterozygous clone “Maillette” was used as a reference for DNA and RNA sequencing. We first built a reference Unigene, composed of coding sequences, using de novo RNA-seq assembly. Then, we reconstructed the complete gene sequences (with introns and exons) using a Unigene-guided DNA-seq assembly approach. This aimed to maximize the possibilities of finding polymorphism between genetically close individuals despite the lack of a reference genome. Finally, we used these resources for SNP mining within a collection of 16 commercial lavender clones and tested the SNPs within the scope of a genetic distance analysis. We obtained a cleaned reference of 8,030 functionally in silico annotated genes. We found 359K polymorphic sites and observed a high SNP frequency (mean of 1 SNP per 90 bp) and a high level of heterozygosity (more than 60% heterozygous SNPs per genotype). Overall, we found similar genetic distances between pairs of clones, which is probably related to the out-crossing nature of the species and the restricted area of cultivation. The proposed method is transferable to other orphan species, requires few bioinformatics resources and can be completed within a year. This is also the first reported large-scale SNP development on Lavandula angustifolia. All the genomics resources developed herein are publicly available and provide a rich pool of molecular resources to explore and exploit lavender genetic diversity in breeding programs.

Comparative genomics and pangenome-oriented studies reveal high homogeneity of the agronomically relevant enterobacterial plant pathogen Dickeya solani

10.21203/rs.3.rs-20034/v3 ◽

2020 ◽

Author(s):

Agata Motyka-Pomagruk ◽

Sabina Zoledowska ◽

Agnieszka Emilia Misztak ◽

Wojciech Sledz ◽

Alessio Mengoni ◽

...

Keyword(s):

Comparative Genomics ◽

Large Scale ◽

De Novo ◽

Genetic Material ◽

Soft Rot ◽

Potato Production ◽

Core Gene ◽

Ecological Niches ◽

Dickeya Solani ◽

High Level

Abstract Background: Dickeya solani is an important plant pathogenic bacterium causing severe losses in European potato production. This species draws a lot of attention due to its remarkable virulence, great devastating potential and easier spread in contrast to other Dickeya spp. In view of a high need for extensive studies on economically important soft rot Pectobacteriaceae, we performed a comparative genomics analysis on D. solani strains to search for genetic foundations that would explain the differences in the observed virulence levels within the D. solani population. Results: High quality assemblies of 8 de novo sequenced D. solani genomes have been obtained. Whole-sequence comparison, ANIb, ANIm, Tetra and pangenome-oriented analyses performed on these genomes and the sequences of 14 additional strains revealed an exceptionally high level of homogeneity among the studied genetic material of D. solani strains. With the use of 22 genomes, the pangenome of D. solani, comprising 84.7% core, 7.2% accessory and 8.1% unique genes, has been almost completely determined, suggesting the presence of a nearly closed pangenome structure. Attribution of the genes included in the D. solani pangenome fractions to functional COG categories showed that higher percentages of accessory and unique pangenome parts in contrast to the core section are encountered in phage/mobile elements- and transcription-associated groups with the genome of RNS 05.1.2A strain having the most significant impact. Also, the first D. solani large-scale genome-wide phylogeny computed on concatenated core gene alignments is herein reported. Conclusions: The almost closed status of D. solani pangenome achieved in this work points to the fact that the unique gene pool of this species should no longer expand.
Such a feature is characteristic of taxa whose representatives either occupy isolated ecological niches or lack efficient mechanisms for gene exchange and recombination, which seems rational concerning a strictly pathogenic species with clonal population structure. Finally, no obvious correlations between the geographical origin of D. solani strains and their phylogeny were found, which might reflect the specificity of the international seed potato market.
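The core, accessory and unique percentages reported above partition gene families by how many strains carry them. The sketch below illustrates that partition under commonly used definitions, which are assumed here rather than taken from the study: core means present in every strain, unique means present in exactly one, and accessory covers everything in between. The function name `pangenome_fractions` and the input layout are hypothetical.

```python
def pangenome_fractions(presence):
    """presence: dict mapping gene family -> set of strain names carrying it.

    Assumed definitions: core = all strains, unique = exactly one strain,
    accessory = at least two strains but not all.
    """
    strains = set().union(*presence.values())
    n = len(strains)
    counts = {"core": 0, "accessory": 0, "unique": 0}
    for carriers in presence.values():
        if len(carriers) == n:
            counts["core"] += 1
        elif len(carriers) == 1:
            counts["unique"] += 1
        else:
            counts["accessory"] += 1
    total = len(presence)
    return {name: count / total for name, count in counts.items()}
```

A pangenome is usually described as nearly closed when adding further genomes contributes few new unique families, so the core fraction stays high as the strain set grows.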

Protein designer David Baker: I like doing things that seem like magic

National Science Review ◽

10.1093/nsr/nwaa071 ◽

2020 ◽

Vol 7 (8) ◽

pp. 1410-1412

Author(s):

Weijie Zhao ◽

Chu Wang

Keyword(s):

Protein Design ◽

De Novo ◽

Protein Structures ◽

Computational Prediction ◽

Biological Functions ◽

Personal Experiences ◽

De Novo Protein Design ◽

And Function ◽

The University ◽

Opening Up

Abstract Search ‘de novo protein design’ on Google and you will find the name David Baker in all results of the first page. Professor David Baker at the University of Washington and other scientists are opening up a new world of fantastic proteins. Protein is the direct executor of most biological functions and its structure and function are fully determined by its primary sequence. Baker's group developed the Rosetta software suite that enabled the computational prediction and design of protein structures. Being able to design proteins from scratch means being able to design executors for diverse purposes and benefit society in multiple ways. Recently, NSR interviewed Prof. Baker on this fast-developing field and his personal experiences.

A parallel computational framework for ultra-large-scale sequence clustering analysis

Bioinformatics ◽

10.1093/bioinformatics/bty617 ◽

2018 ◽

Vol 35 (3) ◽

pp. 380-388 ◽

Cited By ~ 2

Author(s):

Wei Zheng ◽

Qi Mao ◽

Robert J Genco ◽

Jean Wactawski-Wende ◽

Michael Buck ◽

...

Keyword(s):

Parallel Computing ◽

High Performance ◽

Large Scale ◽

De Novo ◽

Rapid Development ◽

Operational Taxonomic Unit ◽

Supplementary Information ◽

Computational Framework ◽

Speed Up ◽

Scale Sequence

Abstract Motivation The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desired that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing. Results In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods and meanwhile maintains the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method. Availability and implementation Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SLAD.html. Supplementary information Supplementary data are available at Bioinformatics online.

De novo protein design: how do we expand into the universe of possible protein structures?

Current Opinion in Structural Biology ◽

10.1016/j.sbi.2015.05.009 ◽

2015 ◽

Vol 33 ◽

pp. 16-26 ◽

Cited By ~ 110

Author(s):

Derek N Woolfson ◽

Gail J Bartlett ◽

Antony J Burton ◽

Jack W Heal ◽

Ai Niitsu ◽

...

Keyword(s):

Protein Design ◽

De Novo ◽

Protein Structures ◽

De Novo Protein Design ◽

The Universe

Computational protein design with backbone plasticity

Biochemical Society Transactions ◽

10.1042/bst20160155 ◽

2016 ◽

Vol 44 (5) ◽

pp. 1523-1529 ◽

Cited By ~ 13

Author(s):

James T. MacDonald ◽

Paul S. Freemont

Keyword(s):

Protein Design ◽

De Novo ◽

Protein Structures ◽

Search Space ◽

Computational Protein Design ◽

Artificial Enzymes ◽

Backbone Flexibility ◽

Artificial Proteins ◽

Naturally Occurring ◽

Backbone Structure

The computational algorithms used in the design of artificial proteins have become increasingly sophisticated in recent years, producing a series of remarkable successes. The most dramatic of these is the de novo design of artificial enzymes. The majority of these designs have reused naturally occurring protein structures as ‘scaffolds’ onto which novel functionality can be grafted without having to redesign the backbone structure. The incorporation of backbone flexibility into protein design is a much more computationally challenging problem due to the greatly increased search space, but promises to remove the limitations of reusing natural protein scaffolds. In this review, we outline the principles of computational protein design methods and discuss recent efforts to consider backbone plasticity in the design process.

De Novo Protein Design for Novel Folds using Guided Conditional Wasserstein Generative Adversarial Networks (gcWGAN)

10.1101/769919 ◽

2019 ◽

Cited By ~ 4

Author(s):

Mostafa Karimi ◽

Shaowen Zhu ◽

Yue Cao ◽

Yang Shen

Keyword(s):

Protein Design ◽

Sequence Space ◽

De Novo ◽

Sequence Data ◽

Generative Models ◽

Current Data ◽

Data Driven ◽

Supplementary Information ◽

Generative Adversarial Networks ◽

Sequence Structure

Abstract Motivation Facing data quickly accumulating on protein sequence and structure, this study addresses the following question: to what extent could current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structure folds? Results We have developed novel deep generative models, constructed a low-dimensional and generalizable representation of fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor as an oracle providing feedback. The resulting semi-supervised gcWGAN is assessed with the oracle over 100 novel folds not in the training set and found to generate more yields and cover 3.6 times more target folds compared to a competing data-driven method (cVAE). Assessed with a structure predictor over representative novel folds (including one not even part of basis folds), gcWGAN designs are found to have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data. The ultra-fast data-driven model can be a powerful addition to principle-driven design methods through generating seed designs or tailoring sequence space. Availability Data and source codes will be available upon [email protected] information Supplementary data are available at Bioinformatics online.

A structural homology approach for computational protein design with flexible backbone

Bioinformatics ◽

10.1093/bioinformatics/bty975 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2418-2426 ◽

Cited By ~ 2

Author(s):

David Simoncini ◽

Kam Y J Zhang ◽

Thomas Schiex ◽

Sophie Barbe

Keyword(s):

Amino Acid ◽

Protein Design ◽

Protein Sequence ◽

Critical Role ◽

Protein Structures ◽

Amino Acid Sequences ◽

Computational Protein Design ◽

Supplementary Information ◽

Structural Homology ◽

Homologous Proteins

Abstract Motivation Structure-based Computational Protein Design (CPD) plays a critical role in advancing the field of protein engineering. Using an all-atom energy function, CPD tries to identify amino acid sequences that fold into a target structure and ultimately perform a desired function. Energy functions remain imperfect, however, and injecting relevant information from known structures in the design process should lead to improved designs. Results We introduce Shades, a data-driven CPD method that exploits local structural environments in known protein structures together with energy to guide sequence design, while sampling side-chain and backbone conformations to accommodate mutations. Shades (Structural Homology Algorithm for protein DESign) is based on customized libraries of non-contiguous in-contact amino acid residue motifs. We have tested Shades on a public benchmark of 40 proteins selected from different protein families. When excluding homologous proteins, Shades achieved a protein sequence recovery of 30% and a protein sequence similarity of 46% on average, compared with the PFAM protein family of the target protein. When homologous structures were added, the wild-type sequence recovery rate reached 93%. Availability and implementation Shades source code is available at https://bitbucket.org/satsumaimo/shades as a patch for Rosetta 3.8 with a curated protein structure database and ITEM library creation software. Supplementary information Supplementary data are available at Bioinformatics online.
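Sequence recovery, as quoted above, is commonly understood as the fraction of positions at which a designed sequence carries the same amino acid as the reference sequence. The sketch below assumes that definition and two pre-aligned, equal-length sequences; it is not taken from the Shades code, and the function name `sequence_recovery` is hypothetical. Sequence similarity, which additionally needs a substitution matrix, is not covered.

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)
```

For example, a six-residue design that differs from the native sequence at one position has a recovery of 5/6.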
