Proteochemometric Models Using Multiple Sequence Alignments and a Subword Segmented Masked Language Model

2021 ◽  
Author(s):  
Héléna Alexandra Gaspar ◽  
Mohamed Ahmed ◽  
Thomas Edlich ◽  
Benedek Fabian ◽  
Zsolt Varszegi ◽  
...  

Proteochemometric (PCM) models of protein-ligand activity combine information from both the ligands and the proteins to which they bind. Several methods inspired by the field of natural language processing (NLP) have been proposed to represent protein sequences.

Here, we present PCM benchmark results on three multi-protein datasets: protein kinases, rhodopsin-like GPCRs (ChEMBL binding and functional assays), and cytochrome P450 enzymes. Keeping ligand descriptors fixed, we evaluate our own protein embeddings, based on subword-segmented language models trained on mammalian sequences, against pre-existing NLP-based descriptors, protein-protein similarity matrices derived from multiple sequence alignments (MSA), dummy protein one-hot encodings, and a combination of NLP-based and MSA-based descriptors. Our results show that performance gains over one-hot encodings are small and that combining NLP-based and MSA-based descriptors increases predictive performance consistently across different splitting strategies. This work was presented at the 3rd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry meeting in September 2020.
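
As a sketch of the benchmarking setup described above (not the authors' code), the snippet below keeps hypothetical ligand descriptors fixed, swaps in different protein descriptors (one-hot, NLP-embedding, and MSA-similarity stand-ins), and scores a random-forest PCM model under a protein-grouped split; all arrays, sizes, and the choice of regressor are placeholder assumptions.

```python
# Minimal PCM benchmarking sketch; every array below is a random stand-in.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_pairs = 200

ligand_desc = rng.random((n_pairs, 128))          # fixed ligand fingerprints
protein_ids = rng.integers(0, 10, size=n_pairs)   # 10 hypothetical proteins
one_hot = np.eye(10)[protein_ids]                 # dummy one-hot baseline
nlp_embed = rng.random((10, 64))[protein_ids]     # stand-in for LM embeddings
msa_sim = rng.random((10, 10))[protein_ids]       # rows of an MSA similarity matrix
activity = rng.random(n_pairs)                    # pChEMBL-like labels

def evaluate(protein_desc):
    """Concatenate ligand + protein descriptors, score under a grouped split."""
    X = np.hstack([ligand_desc, protein_desc])
    scores = []
    for train, test in GroupKFold(n_splits=5).split(X, activity, groups=protein_ids):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[train], activity[train])
        scores.append(model.score(X[test], activity[test]))  # R^2 per fold
    return np.mean(scores)

for name, desc in [("one-hot", one_hot), ("NLP", nlp_embed),
                   ("MSA", msa_sim), ("NLP+MSA", np.hstack([nlp_embed, msa_sim]))]:
    print(name, evaluate(desc))
```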


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Michael Bernhofer ◽  
...  

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, interpreting their results remains challenging. Protein Language Models (LMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we explored how to benefit from learned protein LM representations (embeddings) to predict SAV effects. Although we have failed so far to predict SAV effects directly from embeddings, this input alone predicted residue conservation almost as accurately from single sequences as methods using multiple sequence alignments (MSAs), with a two-state per-residue accuracy (conserved/not) of Q2 = 80% (embeddings) vs. 81% (ConSeq). Considering all SAVs at all residue positions predicted as conserved to affect function reached Q2 = 68.6% (effect/neutral; for PMD) without optimization, compared to an expert solution such as SNAP2 (Q2 = 69.8%). Combining predicted conservation with BLOSUM62 to obtain variant-specific binary predictions outperformed SNAP2 on DMS experiments of four human proteins, and also outperformed applying the same simplistic approach to conservation taken from ConSeq. Thus, embedding methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. This allowed prediction of SAV effects for the entire human proteome (~20k proteins) within 17 minutes on a single GPU.
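
A minimal sketch of the simplistic binary rule outlined above, assuming Biopython for the BLOSUM62 matrix; the cutoff of 0 and the exact combination are illustrative guesses, not the authors' published procedure.

```python
# Rule sketch: a SAV affects function if its position is predicted conserved,
# refined by the BLOSUM62 score of the substitution.
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")

def sav_effect(predicted_conserved, wt_aa, mut_aa, blosum_cutoff=0):
    """Return True if the variant is predicted to affect function."""
    if not predicted_conserved:          # non-conserved position: predicted neutral
        return False
    # At conserved positions, conservative substitutions (high BLOSUM62 score)
    # may still be tolerated; the cutoff here is a hypothetical choice.
    return blosum62[wt_aa, mut_aa] < blosum_cutoff

print(sav_effect(True, "W", "G"))   # drastic change at conserved site -> effect
print(sav_effect(True, "I", "L"))   # conservative change -> predicted neutral
print(sav_effect(False, "W", "G"))  # non-conserved site -> neutral
```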


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Kyra Erckert ◽  
...  

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, interpreting their results remains challenging. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 minutes on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
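
A hedged sketch of the logistic-regression ensemble idea named above; VESPA itself is released at the GitHub link, whereas this stand-in uses random placeholder features and a single LR rather than an ensemble.

```python
# LR sketch over the three per-variant inputs named in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_savs = 500

conservation = rng.random(n_savs)        # predicted conservation of the position
blosum62 = rng.integers(-4, 5, n_savs)   # substitution score wt -> mutant
mask_logprob = rng.normal(size=n_savs)   # pLM log-probability of the mutant residue
# Placeholder labels loosely tied to conservation so the model has signal.
labels = (conservation + rng.normal(scale=0.3, size=n_savs) > 0.5).astype(int)

X = np.column_stack([conservation, blosum62, mask_logprob])
clf = LogisticRegression().fit(X, labels)
effect_score = clf.predict_proba(X)[:, 1]   # continuous SAV effect score
print(effect_score[:5])
```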


2021 ◽  
Author(s):  
Jaspreet Singh ◽  
Kuldip Paliwal ◽  
Jaswinder Singh ◽  
Yaoqi Zhou

Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction tasks such as biophysical, structural, and functional properties. Here we show that a combination of traditional one-hot encoding with the embeddings from two different language models (ProtTrans and ESM-1b) allows a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility, and contact numbers. This large improvement leads to an accuracy comparable to or better than the current state-of-the-art techniques for predicting these 1D structural properties based on sequence profiles generated from multiple sequence alignments. The high-accuracy prediction of both secondary and tertiary structural properties indicates that it is possible to make highly accurate predictions of protein structure without homologous sequences, the remaining obstacle in the post-AlphaFold2 era.
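
The input construction is straightforward to illustrate; below is a sketch where random arrays stand in for real ProtTrans (ProtT5) and ESM-1b per-residue embeddings. The 1024 and 1280 dimensions are the commonly reported sizes for those models, but are assumptions here.

```python
# Per-residue feature construction: one-hot encoding + two pLM embeddings.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
seq = "MKTAYIAKQR"
L = len(seq)

one_hot = np.zeros((L, 20))
one_hot[np.arange(L), [AA.index(a) for a in seq]] = 1.0

prottrans = np.random.random((L, 1024))   # ProtT5-like per-residue embedding
esm1b = np.random.random((L, 1280))       # ESM-1b-like per-residue embedding

features = np.concatenate([one_hot, prottrans, esm1b], axis=1)  # (L, 2324)
# 'features' would feed a per-residue network predicting secondary structure,
# backbone torsion angles, solvent accessibility, and contact numbers.
print(features.shape)
```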


2021 ◽  
Author(s):  
Konstantin Weissenow ◽  
Michael Heinzinger ◽  
Burkhard Rost

All state-of-the-art (SOTA) protein structure predictions rely on evolutionary information captured in multiple sequence alignments (MSAs), primarily on evolutionary couplings (co-evolution). Such information is not available for all proteins and is computationally expensive to generate. Prediction models based on Artificial Intelligence (AI) that use only single sequences as input are easier and cheaper but perform so poorly that speed becomes irrelevant. Here, we describe the first competitive AI solution that exclusively inputs embeddings extracted from single sequences by pre-trained protein Language Models (pLMs), namely the transformer pLM ProtT5, into a relatively shallow (few free parameters) convolutional neural network (CNN) trained on inter-residue distances, i.e. protein structure in 2D. The major advance originated from processing the attention heads learned by ProtT5. Although these models never required any MSA, they matched the performance of methods relying on co-evolution. While not reaching the very top, our lean approach came close at substantially lower cost, thereby speeding up development and each future prediction. By generating protein-specific rather than family-averaged predictions, these new solutions could distinguish structural features that differentiate members of the same protein family, which all other top methods predict with near-identical structures.
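
A lean sketch of the 2D idea under stated assumptions: per-residue features (standing in for ProtT5 embeddings or flattened attention heads) are paired by outer concatenation and passed through a shallow CNN that outputs inter-residue distance-bin logits. Layer sizes and the 42-bin distogram are guesses, not the published architecture.

```python
# Pairwise 2D feature map + shallow CNN for inter-residue distance prediction.
import torch
import torch.nn as nn

L, d = 64, 128                        # sequence length, reduced feature size
x = torch.randn(L, d)                 # stand-in for ProtT5-derived features

# Outer concatenation: pair (i, j) gets features of residue i and residue j.
pair = torch.cat([x.unsqueeze(1).expand(L, L, d),
                  x.unsqueeze(0).expand(L, L, d)], dim=-1)   # (L, L, 2d)
pair = pair.permute(2, 0, 1).unsqueeze(0)                    # (1, 2d, L, L)

cnn = nn.Sequential(                  # shallow: few free parameters
    nn.Conv2d(2 * d, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(64, 42, kernel_size=5, padding=2),    # 42 distance bins (assumed)
)
dist_logits = cnn(pair)               # (1, 42, L, L) distogram logits
print(dist_logits.shape)
```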


2021 ◽  
Author(s):  
Allan Costa ◽  
Manvitha Ponnapati ◽  
Joseph M Jacobson ◽  
Pranam Chatterjee

Determining the structure of proteins has been a long-standing goal in biology. Language models have recently been deployed to capture the evolutionary semantics of protein sequences. Enriched with multiple sequence alignments (MSAs), these models can encode protein tertiary structure. In this work, we introduce an attention-based graph architecture that exploits MSA Transformer embeddings to directly produce three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction.
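
As a toy illustration only: one way an attention-based module can map MSA Transformer embeddings to coordinates is to let residues attend to one another and regress per-residue 3D positions. The dimensions and the single attention layer here are assumptions, far simpler than the paper's graph architecture.

```python
# Toy attention-to-coordinates sketch (untrained stand-in modules).
import torch
import torch.nn as nn

L, d = 64, 768                                  # length, assumed embedding dim
emb = torch.randn(1, L, d)                      # stand-in MSA Transformer embeddings

attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
to_xyz = nn.Linear(d, 3)                        # per-residue C-alpha coordinates

h, _ = attn(emb, emb, emb)                      # residues attend to each other
coords = to_xyz(h)                              # (1, L, 3) predicted positions
print(coords.shape)
```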


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Jung-Eun Shin ◽  
Adam J. Riesselman ◽  
Aaron W. Kollasch ◽  
Conor McMahon ◽  
Elana Simon ◽  
...  

Abstract The ability to design functional sequences and predict effects of variation is central to protein engineering and biotherapeutics. State-of-the-art computational methods rely on models that leverage evolutionary information but are inadequate for important applications where multiple sequence alignments are not robust. Such applications include the prediction of variant effects of indels, disordered proteins, and the design of proteins such as antibodies due to the highly variable complementarity-determining regions. We introduce a deep generative model adapted from natural language processing for prediction and design of diverse functional sequences without the need for alignments. The model performs state-of-the-art prediction of missense and indel effects, and we successfully design and test a diverse 10^5-nanobody library that shows better expression than a 1000-fold larger synthetic library. Our results demonstrate the power of the alignment-free autoregressive model in generalizing to regions of sequence space traditionally considered beyond the reach of prediction and design.
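
A sketch of alignment-free autoregressive scoring under broad assumptions: a variant's effect can be proxied by the difference in sequence log-likelihood under the model, which handles indels naturally because no alignment columns are required. The untrained LSTM below stands in for the paper's generative model.

```python
# Autoregressive log-likelihood scoring of a variant vs. its wild type.
import torch
import torch.nn as nn

vocab = 21                                       # 20 amino acids + 1 reserved token
model = nn.LSTM(input_size=vocab, hidden_size=64, batch_first=True)
head = nn.Linear(64, vocab)

def log_likelihood(tokens):
    """Sum of log p(x_t | x_<t) over the sequence."""
    x = torch.nn.functional.one_hot(tokens[:-1], vocab).float().unsqueeze(0)
    h, _ = model(x)
    logp = torch.log_softmax(head(h), dim=-1)    # (1, T-1, vocab)
    return logp[0, torch.arange(len(tokens) - 1), tokens[1:]].sum()

wild_type = torch.randint(1, vocab, (50,))
variant = torch.cat([wild_type[:20], wild_type[23:]])       # a 3-residue deletion
print(log_likelihood(variant) - log_likelihood(wild_type))  # effect score proxy
```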


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Michael Bernhofer ◽  
...  

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, interpreting their results remains challenging. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Lastly, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 minutes on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
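
For concreteness, the two-state MCC quoted above can be computed as below; the label arrays are illustrative placeholders, not the study's data.

```python
# Two-state MCC between reference and predicted per-residue conservation labels.
import numpy as np
from sklearn.metrics import matthews_corrcoef

reference = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # 1 = conserved (e.g. from MSAs)
predicted = np.array([1, 0, 0, 0, 1, 0, 1, 1])   # from an embedding-based classifier
print(matthews_corrcoef(reference, predicted))
```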


2021 ◽  
Author(s):  
Ratul Chowdhury ◽  
Nazim Bouatta ◽  
Surojit Biswas ◽  
Charlotte Rochereau ◽  
George M Church ◽  
...  

AlphaFold2 and related systems use deep learning to predict protein structure from co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite dramatic recent increases in accuracy, three challenges remain: (i) prediction of orphan and rapidly evolving proteins for which an MSA cannot be generated, (ii) rapid exploration of designed structures, and (iii) understanding the rules governing spontaneous polypeptide folding in solution. Here we report the development of an end-to-end differentiable recurrent geometric network (RGN) able to predict protein structure from single protein sequences without the use of MSAs. This deep learning system has two novel elements: a protein language model (AminoBERT) that uses a Transformer to learn latent structural information from millions of unaligned proteins, and a geometric module that compactly represents Cα backbone geometry. RGN2 outperforms AlphaFold2 and RoseTTAFold (as well as trRosetta) on orphan proteins and is competitive on designed sequences, while achieving up to a billion-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models relative to MSAs in structure prediction.
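
A broad-strokes sketch of the masked-language-model signal behind a pLM such as AminoBERT: mask residues in unaligned sequences and train a Transformer encoder to reconstruct them. Sizes, masking rate, and token ids below are assumptions, not the published configuration.

```python
# Masked-LM training step on one unaligned protein sequence.
import torch
import torch.nn as nn

vocab, d, L = 22, 128, 100                     # 20 AAs + pad + mask (assumed ids)
embed = nn.Embedding(vocab, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(d, vocab)

seq = torch.randint(0, 20, (1, L))             # one unaligned protein sequence
mask = torch.rand(1, L) < 0.15                 # mask ~15% of positions
inp = seq.masked_fill(mask, 21)                # 21 = mask token id (assumed)

logits = head(encoder(embed(inp)))             # (1, L, vocab)
loss = nn.functional.cross_entropy(logits[mask], seq[mask])
loss.backward()                                # standard masked-LM update signal
print(loss.item())
```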

