Rosetta design with co-evolutionary information retains protein function

2021 ◽  
Vol 17 (1) ◽  
pp. e1008568
Author(s):  
Samuel Schmitz ◽  
Moritz Ertelt ◽  
Rainer Merkl ◽  
Jens Meiler

Computational protein design has the ambitious goal of crafting novel proteins that address challenges in biology and medicine. To overcome these challenges, the computational protein modeling suite Rosetta has been tailored to address various protein design tasks. Recently, statistical methods have been developed that identify correlated mutations between residues in a multiple sequence alignment of homologous proteins. These subtle inter-dependencies in the occupancy of residue positions throughout evolution are crucial for protein function, but we found that three current Rosetta design approaches fail to recover these co-evolutionary couplings. Thus, we developed the Rosetta method ResCue (residue-coupling enhanced) that leverages co-evolutionary information to favor sequences which recapitulate correlated mutations, as observed in nature. To assess the protocols via recapitulation designs, we compiled a benchmark of ten proteins, each represented by two structurally diverse states. We demonstrated that ResCue designed sequences with an average sequence recovery rate of 70%, whereas three other protocols reached no more than 50% on average. Our approach also had higher recovery rates for functionally important residues, which were studied in detail. This improvement has only a minor negative effect on the fitness of the designed sequences as assessed by Rosetta energy. In conclusion, our findings support the idea that informing protocols with co-evolutionary signals helps to design stable and native-like proteins that are compatible with the different conformational states required for a complex function.
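
The sequence recovery rate quoted in this abstract can be illustrated with a minimal sketch. It assumes recovery is simply the fraction of positions at which a designed sequence carries the native residue. The function name and the toy sequences are hypothetical and are not taken from the ResCue benchmark.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumes both sequences are already aligned position by position
    (same length, no gaps), as in a fixed-backbone recapitulation design.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(native, designed) if a == b)
    return matches / len(native)


# Toy example (made-up sequences): 8 of 10 positions match.
native = "MKTAYIAKQR"
designed = "MKTAYLAKQK"
print(f"recovery = {sequence_recovery(native, designed):.0%}")
```

A benchmark average would then be the mean of this value over all designed sequences and target states.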

2017 ◽  
Author(s):  
Diego Javier Zea ◽  
Alexander Miguel Monzon ◽  
Gustavo Parisi ◽  
Cristina Marino-Buslje

Conservation and covariation measures, like other evolutionary analyses, require a large number of distant homologous sequences; therefore, considerable structural divergence can be expected in such divergent alignments. However, most works linking evolutionary and structural information use a single structure, ignoring the structural variability within a protein family. That common practice seems unrealistic in light of this work. In this work we studied how structural divergence affects conservation and covariation estimations. We found that, within a protein family, ~51% of multiple sequence alignment columns change their exposed/buried status between structures. Also, ~53% of residue pairs that are in contact in one structure are not in contact in another structure from the same family. We found that residue conservation is not directly related to the relative solvent accessible surface area of a single protein structure. Using information from all the available structures rather than from a single representative structure gives more confidence in the structural interpretation of the evolutionary signals. That is particularly important for diverse multiple sequence alignments, where structures can differ drastically. High covariation scores tend to indicate residue contacts that are conserved in the family and are therefore not suitable for finding protein- or conformer-specific contacts. Our results suggest that structural divergence should be considered for a better understanding of protein function, to transfer annotation by homology and to model protein evolution.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Jung-Eun Shin ◽  
Adam J. Riesselman ◽  
Aaron W. Kollasch ◽  
Conor McMahon ◽  
Elana Simon ◽  
...  

The ability to design functional sequences and predict effects of variation is central to protein engineering and biotherapeutics. State-of-the-art computational methods rely on models that leverage evolutionary information but are inadequate for important applications where multiple sequence alignments are not robust. Such applications include the prediction of variant effects of indels, disordered proteins, and the design of proteins such as antibodies due to the highly variable complementarity determining regions. We introduce a deep generative model adapted from natural language processing for prediction and design of diverse functional sequences without the need for alignments. The model performs state-of-the-art prediction of missense and indel effects and we successfully design and test a diverse 10^5-nanobody library that shows better expression than a 1000-fold larger synthetic library. Our results demonstrate the power of the alignment-free autoregressive model in generalizing to regions of sequence space traditionally considered beyond the reach of prediction and design.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Maria Littmann ◽  
Michael Heinzinger ◽  
Christian Dallago ◽  
Tobias Olenyi ◽  
Burkhard Rost

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
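
Annotation transfer by proximity in embedding space, as described in this abstract, can be sketched as a nearest-neighbor lookup. The sketch below is illustrative only: the three-dimensional vectors stand in for real SeqVec embeddings, the use of Euclidean distance is an assumption, and the protein identifiers and their GO term assignments are toy data.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def transfer_annotation(query_embedding, lookup):
    """Return (protein_id, go_terms, distance) of the closest annotated protein.

    `lookup` maps a protein id to a tuple (embedding, set_of_go_terms).
    """
    best_id = min(lookup, key=lambda pid: euclidean(query_embedding, lookup[pid][0]))
    embedding, go_terms = lookup[best_id]
    return best_id, go_terms, euclidean(query_embedding, embedding)


# Toy lookup set: hypothetical proteins with made-up embeddings.
lookup = {
    "protA": ([0.1, 0.9, 0.3], {"GO:0003824"}),
    "protB": ([0.8, 0.2, 0.5], {"GO:0005515"}),
}
query = [0.2, 0.8, 0.3]

hit_id, hit_terms, hit_distance = transfer_annotation(query, lookup)
print(hit_id, sorted(hit_terms), round(hit_distance, 3))
```

In practice the distance to the nearest hit could also serve as a rough reliability score, with larger distances suggesting a less trustworthy transfer.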


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Michael Bernhofer ◽  
...  

The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (LMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we explored how to benefit from learned protein LM representations (embeddings) to predict SAV effects. Although we have failed so far to predict SAV effects directly from embeddings, this input alone predicted residue conservation almost as accurately from single sequences as using multiple sequence alignments (MSAs) with a two-state per-residue accuracy (conserved/not) of Q2=80% (embeddings) vs. 81% (ConSeq). Considering all SAVs at all residue positions predicted as conserved to affect function reached 68.6% (Q2: effect/neutral; for PMD) without optimization, compared to an expert solution such as SNAP2 (Q2=69.8%). Combining predicted conservation with BLOSUM62 to obtain variant-specific binary predictions, DMS experiments of four human proteins were predicted better than by SNAP2, and better than by applying the same simplistic approach to conservation taken from ConSeq. Thus, embedding methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. This allowed prediction of SAV effects for the entire human proteome (~20k proteins) within 17 minutes on a single GPU.


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Kyra Erckert ◽  
...  

The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. 
Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
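
The two-state Matthews Correlation Coefficient used above to compare conservation predictions can be computed from a confusion matrix. The following is a minimal sketch of the standard MCC formula on toy labels (1 = conserved, 0 = not conserved). The function name and the labels are illustrative and have no connection to the VESPA data sets.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy per-residue labels: 3 TP, 3 TN, 1 FP, 1 FN.
observed = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(mcc(observed, predicted))
```

Unlike plain two-state accuracy, MCC stays informative when conserved and non-conserved residues are unevenly represented.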


2018 ◽  
Vol 16 (02) ◽  
pp. 1840005 ◽  
Author(s):  
Dmitry Suplatov ◽  
Yana Sharapova ◽  
Daria Timonina ◽  
Kirill Kopylov ◽  
Vytas Švedas

The visualCMAT web-server was designed to assist experimental research in the fields of protein/enzyme biochemistry, protein engineering, and drug discovery by providing an intuitive and easy-to-use interface to the analysis of correlated mutations/co-evolving residues. Sequence and structural information describing homologous proteins is used to predict correlated substitutions by the mutual information-based CMAT approach, to classify them into spatially close co-evolving pairs (which either form a direct physical contact or interact with the same ligand, e.g. a substrate or a crystallographic water molecule) and long-range correlations, and to annotate and rank binding sites on the protein surface by the presence of statistically significant co-evolving positions. The results of the visualCMAT are organized for a convenient visual analysis and can be downloaded to a local computer as a content-rich all-in-one PyMol session file with multiple layers of annotation corresponding to bioinformatic, statistical and structural analyses of the predicted co-evolution, or further studied online using the built-in interactive analysis tools. The online interactivity is implemented in HTML5 and therefore neither plugins nor Java are required. The visualCMAT web-server is integrated with the Mustguseal web-server capable of constructing large structure-guided sequence alignments of protein families and superfamilies using all available information about their structures and sequences in public databases. The visualCMAT web-server can be used to understand the relationship between structure and function in proteins, to select hotspots and compensatory mutations for rational design and directed evolution experiments that produce novel enzymes with improved properties, and to study the mechanisms of selective ligand binding and allosteric communication between topologically independent sites in protein structures. 
The web-server is freely available at https://biokinet.belozersky.msu.ru/visualcmat and there are no login requirements.
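
Mutual information between alignment columns is the basic quantity behind approaches of this kind. The sketch below computes plain mutual information (in bits) between two columns of a toy alignment. It is not the CMAT algorithm itself: it applies no gap handling, sequence weighting, or statistical significance testing, and the function names and sequences are hypothetical.

```python
import math
from collections import Counter


def column(msa, index):
    """Extract one alignment column as a list of residues."""
    return [sequence[index] for sequence in msa]


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two equally long alignment columns."""
    n = len(col_i)
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    counts_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 1 covary perfectly, column 2 is invariant.
msa = ["AKL", "AKL", "GEL", "GEL"]
print(mutual_information(column(msa, 0), column(msa, 1)))  # covarying pair
print(mutual_information(column(msa, 0), column(msa, 2)))  # no signal
```

A fully conserved column carries no covariation signal, which is why the second pair scores zero even though both positions might matter functionally.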


2022 ◽  
pp. 325-353
Author(s):  
María Carmen Carnero ◽  
Javier Cárcel-Carrasco

The number of studies that assess the level of maintenance in a country is still very small, despite the contribution of this area to national competitiveness. The literature analyses asset management based on key performance indicators, but not via a multicriteria model. This chapter describes a multicriteria model, constructed by means of the fuzzy analytic hierarchy process (FAHP). The weightings are converted into utility functions, allowing the final utility of an alternative to be calculated via a multi-attribute utility function. Data on the state of asset management in Spain, in 2005 and 2010, are used to produce discrete probability distributions. Finally, a Monte Carlo simulation is applied to estimate the uncertainty of a complex function. In this way, the level of excellence of asset management in small businesses in Spain, before and after the recession, could be determined. The results show that the economic crisis experienced in Spain since 2008 has had a negative effect on the level of asset management in most sectors.
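
The final step described in this abstract, a Monte Carlo simulation over discrete probability distributions feeding a multi-attribute utility function, can be sketched as follows. The sketch assumes an additive utility (a weighted sum of per-criterion utilities). The criteria names, weights, and distributions are invented for illustration and do not come from the chapter or from the Spanish survey data.

```python
import random
import statistics


def simulate_utility(weights, distributions, n_draws=10_000, seed=0):
    """Sample the additive utility sum(w_k * u_k) n_draws times.

    `weights` maps criterion -> weight (assumed to sum to 1).
    `distributions` maps criterion -> (utility_values, probabilities).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    totals = []
    for _ in range(n_draws):
        total = 0.0
        for criterion, weight in weights.items():
            values, probabilities = distributions[criterion]
            total += weight * rng.choices(values, weights=probabilities, k=1)[0]
        totals.append(total)
    return totals


# Hypothetical criteria with made-up weights and discrete utility distributions.
weights = {"planning": 0.5, "training": 0.3, "it_support": 0.2}
distributions = {
    "planning": ([0.2, 0.6, 1.0], [0.3, 0.5, 0.2]),
    "training": ([0.0, 0.5, 1.0], [0.2, 0.5, 0.3]),
    "it_support": ([0.4, 0.8], [0.6, 0.4]),
}

samples = simulate_utility(weights, distributions)
print(round(statistics.mean(samples), 3), round(statistics.stdev(samples), 3))
```

The spread of the sampled utilities gives a simple estimate of the uncertainty in the overall score, which is the quantity the simulation is meant to expose.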


Entropy ◽  
2019 ◽  
Vol 21 (8) ◽  
pp. 764 ◽  
Author(s):  
Eshel Faraggi ◽  
A. Keith Dunker ◽  
Robert L. Jernigan ◽  
Andrzej Kloczkowski

Entropy should directly reflect the extent of disorder in proteins. By clustering structurally related proteins and studying the multiple sequence alignments (MSAs) of the sequences in these clusters, we were able to link sequence, structure, and disorder information. We introduced several parameters as measures of fluctuations at a given MSA site and used these as representative of the sequence and structure entropy at that site. In general, we found a tendency for negative correlations between disorder and structure, and significant positive correlations between disorder and the fluctuations in the system. We also found evidence for residue-type conservation for those residues proximate to potentially disordered sites. Mutations at the disordered site itself appear to be allowed. In addition, we found a positive correlation between disorder and accessible surface area, validating that disordered residues occur in exposed regions of proteins. Finally, we also found that fluctuations in the dihedral angles at the original mutated residue and disorder are positively correlated, while dihedral angle fluctuations in spatially proximal residues are negatively correlated with disorder. Our results seem to indicate permissible variability in the disordered site, but greater rigidity in the parts of the protein with which the disordered site interacts. This is another indication that disordered residues are involved in protein function.
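
One common way to quantify sequence variability at an MSA site is the Shannon entropy of the residue distribution in that column. The sketch below shows that standard measure on toy columns. It is offered as an illustration of the general idea and is not claimed to be one of the specific fluctuation parameters introduced in the paper.

```python
import math
from collections import Counter


def column_entropy(residues):
    """Shannon entropy (bits) of the residue distribution in one MSA column."""
    n = len(residues)
    counts = Counter(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy columns: fully conserved, two residues in equal share, four distinct residues.
for col in ("AAAA", "AAGG", "ACGT"):
    print(col, column_entropy(col))
```

Low-entropy columns correspond to conserved sites, while high-entropy columns are the variable sites where, according to the abstract, mutations at disordered positions tend to be tolerated.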


2020 ◽  
Vol 48 (W1) ◽  
pp. W72-W76 ◽  
Author(s):  
Vadim M Gumerov ◽  
Igor B Zhulin

Key steps in a computational study of protein function involve analysis of (i) relationships between homologous proteins, (ii) protein domain architecture and (iii) the gene neighborhoods in which the corresponding proteins are encoded. Each of these steps requires a separate computational task and set of tools. Currently, in order to relate protein features and gene neighborhood information to phylogeny, researchers need to prepare all the necessary data and combine them by hand, which is time-consuming and error-prone. Here, we present a new platform, TREND (tree-based exploration of neighborhoods and domains), which can perform all the necessary steps in an automated fashion and put the derived information into phylogenomic context, thus making evolution-based protein function analysis more efficient. A rich set of adjustable components allows users to run the computational steps specific to their task. TREND is freely available at http://trend.zhulinlab.org.


Author(s):  
Carlos Eduardo Sequeiros-Borja ◽  
Bartłomiej Surpeta ◽  
Jan Brezovsky

Progress in technology and algorithms throughout the past decade has transformed the field of protein design and engineering. Computational approaches have become well ingrained in the processes of tailoring proteins for various biotechnological applications. Many tools and methods are developed and upgraded each year to satisfy the increasing demands and challenges of protein engineering. To help protein engineers and bioinformaticians navigate this emerging wave of dedicated software, we have critically evaluated recent additions to the toolbox regarding their application for semi-rational and rational protein engineering. These newly developed tools identify and prioritize hotspots and analyze the effects of mutations for a variety of properties, comprising ligand binding, protein–protein and protein–nucleic acid interactions, and electrostatic potential. We also discuss notable progress to target elusive protein dynamics and associated properties like ligand-transport processes and allosteric communication. Finally, we discuss several challenges these tools face and provide our perspectives on the further development of readily applicable methods to guide protein engineering efforts.

