SPOT-1D-LM: Reaching Alignment-profile-based Accuracy in Predicting Protein Secondary and Tertiary Structural Properties without Alignment.

Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction tasks such as biophysical, structural, and functional properties. Here we show that a combination of traditional one-hot encoding with the embeddings from two different language models (ProtTrans and ESM-1b) allows a leap in accuracy over single-sequence based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers. This large improvement leads to an accuracy comparable to or better than the current state-of-the-art techniques for predicting these 1D structural properties based on sequence profiles generated from multiple sequence alignments. The high-accuracy prediction in both secondary and tertiary structural properties indicates that it is possible to make highly accurate prediction of protein structures without homologous sequences, the remaining obstacle in the post AlphaFold2 era.

Download Full-text

Distillation of MSA Embeddings to Folded Protein Structures with Graph Transformers

10.1101/2021.06.02.446809 ◽

2021 ◽

Author(s):

Allan Costa ◽

Manvitha Ponnapati ◽

Joseph M Jacobson ◽

Pranam Chatterjee

Keyword(s):

Structure Prediction ◽

Tertiary Structure ◽

Protein Structures ◽

Three Dimensional ◽

Protein Sequences ◽

Language Models ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Folded Structures

Determining the structure of proteins has been a long-standing goal in biology. Language models have been recently deployed to capture the evolutionary semantics of protein sequences. Enriched with multiple sequence alignments (MSA), these models can encode protein tertiary structure. In this work, we introduce an attention-based graph architecture that exploits MSA Transformer embeddings to directly produce three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction.

Download Full-text

Characterization of Non-Trivial Neighborhood Fold Constraints from Protein Sequences using Generalized Topohydrophobicity.

Bioinformatics and Biology Insights ◽

10.4137/bbi.s426 ◽

2008 ◽

Vol 2 ◽

pp. BBI.S426 ◽

Cited By ~ 2

Author(s):

Guillaume Fourty ◽

Isabelle Callebaut ◽

Jean-Paul Mornon

Keyword(s):

Secondary Structure ◽

Solvent Accessibility ◽

Protein Structures ◽

Comparative Modeling ◽

Local Geometry ◽

Large Set ◽

Structural Constraints ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Prediction of key features of protein structures, such as secondary structure, solvent accessibility and number of contacts between residues, provides useful structural constraints for comparative modeling, fold recognition, ab-initio fold prediction and detection of remote relationships. In this study, we aim at characterizing the number of non-trivial close neighbors, or long-range contacts of a residue, as a function of its “topohydrophobic” index deduced from multiple sequence alignments and of the secondary structure in which it is embedded. The “topohydrophobic” index is calculated using a two-class distribution of amino acids, based on their mean atom depths. From a large set of structural alignments processed from the FSSP database, we selected 1485 structural sub-families including at least 8 members, with accurate alignments and limited redundancy. We show that residues within helices, even when deeply buried, have few non-trivial neighbors (0–2), whereas β-strand residues clearly exhibit a multimodal behavior, dominated by the local geometry of the tetrahedron (3 non-trivial close neighbors associated with one tetrahedron; 6 with two tetrahedra). This observed behavior allows the distinction, from sequence profiles, between edge and central β-strands within β-sheets. Useful topological constraints on the immediate neighborhood of an amino acid, but also on its correlated solvent accessibility, can thus be derived using this approach, from the simple knowledge of multiple sequence alignments.

Download Full-text

Embeddings from protein language models predict conservation and variant effects

10.21203/rs.3.rs-584804/v1 ◽

2021 ◽

Author(s):

Céline Marquet ◽

Michael Heinzinger ◽

Tobias Olenyi ◽

Christian Dallago ◽

Michael Bernhofer ◽

...

Keyword(s):

Protein Function ◽

Language Models ◽

Single Amino Acid ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Human Proteins ◽

Entire Sequence ◽

Embedding Methods ◽

Better Than

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (LMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or marked amino acids from the context of entire sequence regions. Here, we explored how to benefit from learned protein LM representations (embeddings) to predict SAV effects. Although we have failed so far to predict SAV effects directly from embeddings, this input alone predicted residue conservation almost as accurately from single sequences as using multiple sequence alignments (MSAs) with a two-state per-residue accuracy (conserved/not) of Q2=80% (embeddings) vs. 81% (ConSeq). Considering all SAVs at all residue positions predicted as conserved to affect function reached 68.6% (Q2: effect/neutral; for PMD) without optimization, compared to an expert solution such as SNAP2 (Q2=69.8). Combining predicted conservation with BLOSUM62 to obtain variant-specific binary predictions, DMS experiments of four human proteins were predicted better than by SNAP2, and better than by applying the same simplistic approach to conservation taken from ConSeq. Thus, embedding methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. This allowed prediction of SAV effects for the entire human proteome (~20k proteins) within 17 minutes on a single GPU.

Download Full-text

Embeddings from protein language models predict conservation and variant effects

Human Genetics ◽

10.1007/s00439-021-02411-y ◽

2021 ◽

Author(s):

Céline Marquet ◽

Michael Heinzinger ◽

Tobias Olenyi ◽

Christian Dallago ◽

Kyra Erckert ◽

...

Keyword(s):

Protein Function ◽

Pearson Correlation ◽

Performance Measure ◽

Language Models ◽

Single Amino Acid ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Human Proteins

AbstractThe emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient—MCC—for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.

Download Full-text

Fold and flexibility: what can proteins' mechanical properties tell us about their folding nucleus?

Journal of The Royal Society Interface ◽

10.1098/rsif.2015.0876 ◽

2015 ◽

Vol 12 (112) ◽

pp. 20150876 ◽

Cited By ~ 15

Author(s):

Sophie Sacquin-Mora

Keyword(s):

Mechanical Properties ◽

Brownian Dynamics ◽

Protein Structures ◽

Interaction Network ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Folding Nucleus ◽

Dynamics Simulations ◽

Brownian Dynamics Simulations

The determination of a protein's folding nucleus, i.e. a set of native contacts playing an important role during its folding process, remains an elusive yet essential problem in biochemistry. In this work, we investigate the mechanical properties of 70 protein structures belonging to 14 protein families presenting various folds using coarse-grain Brownian dynamics simulations. The resulting rigidity profiles combined with multiple sequence alignments show that a limited set of rigid residues, which we call the consensus nucleus, occupy conserved positions along the protein sequence. These residues' side chains form a tight interaction network within the protein's core, thus making our consensus nuclei potential folding nuclei. A review of experimental and theoretical literature shows that most (above 80%) of these residues were indeed identified as folding nucleus member in earlier studies.

Download Full-text

Proteochemometric Models Using Multiple Sequence Alignments and a Subword Segmented Masked Language Model

10.26434/chemrxiv.14604720 ◽

2021 ◽

Author(s):

Héléna Alexandra Gaspar ◽

Mohamed Ahmed ◽

Thomas Edlich ◽

Benedek Fabian ◽

Zsolt Varszegi ◽

...

Keyword(s):

Language Processing ◽

Language Model ◽

Predictive Performance ◽

Language Models ◽

Cytochrome P450 Enzymes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Combine Information ◽

Protein Datasets

<div>Proteochemometric (PCM) models of protein-ligand activity combine information from both the ligands and the proteins to which they bind. Several methods inspired by the field of natural language processing (NLP) have been proposed to represent protein sequences. </div><div>Here, we present PCM benchmark results on three multi-protein datasets: protein kinases, rhodopsin-like GPCRs (ChEMBL binding and functional assays), and cytochrome P450 enzymes. Keeping ligand descriptors fixed, we evaluate our own protein embeddings based on subword-segmented language models trained on mammalian sequences against pre-existing NLP-based descriptors, protein-protein similarity matrices derived from multiple sequence alignments (MSA), dummy protein one-hot encodings, and a combination of NLP-based and MSA-based descriptors. Our results show that performance gains over one-hot encodings are small and combining NLP-based and MSA-based descriptors increases predictive performance consistently across different splitting strategies. This work has been presented at the 3rd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry in September 2020.</div>

Download Full-text

Structural bioinformatics predicts that the Retinitis Pigmentosa-28 protein of unknown function FAM161A is a homologue of the microtubule nucleation factor Tpx2

F1000Research ◽

10.12688/f1000research.25870.1 ◽

2020 ◽

Vol 9 ◽

pp. 1052

Author(s):

Timothy P. Levine

Keyword(s):

Retinitis Pigmentosa ◽

Sequence Alignments ◽

Multiple Sequence ◽

Centrosomal Protein ◽

Multiple Sequence Alignments ◽

Identify Amino Acid ◽

Search Tool ◽

The One ◽

Sequence Profiles ◽

Small Elements

Background: FAM161A is a microtubule-associated protein conserved widely across eukaryotes, which is mutated in the inherited blinding disease Retinitis Pigmentosa-28. FAM161A is also a centrosomal protein, being a core component of a complex that forms an internal skeleton of centrioles. Despite these observations about the importance of FAM161A, current techniques used to examine its sequence reveal no homologies to other proteins. Methods: Sequence profiles derived from multiple sequence alignments of FAM161A homologues were constructed by PSI-BLAST and HHblits, and then used by the profile-profile search tool HHsearch, implemented online as HHpred, to identify homologues. These in turn were used to create profiles for reverse searches and pair-wise searches. Multiple sequence alignments were also used to identify amino acid usage in functional elements. Results: FAM161A has a single homologue: the targeting protein for Xenopus kinesin-like protein-2 (Tpx2), which is a strong hit across more than 200 residues. Tpx2 is also a microtubule-associated protein, and it has been shown previously by a cryo-EM molecular structure to nucleate microtubules through two small elements: an extended loop and a short helix. The homology between FAM161A and Tpx2 includes these elements, as FAM161A has three copies of the loop, and one helix that has many, but not all, properties of the one in Tpx2. Conclusions: FAM161A and its homologues are predicted to be a previously unknown variant of Tpx2, and hence bind microtubules in the same way. This prediction allows precise, testable molecular models to be made of FAM161A-microtubule complexes.

Download Full-text

Embeddings from protein language models predict conservation and variant effects

10.21203/rs.3.rs-584804/v3 ◽

2021 ◽

Author(s):

Céline Marquet ◽

Michael Heinzinger ◽

Tobias Olenyi ◽

Christian Dallago ◽

Michael Bernhofer ◽

...

Keyword(s):

Protein Function ◽

Pearson Correlation ◽

Performance Measure ◽

Language Models ◽

Single Amino Acid ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Human Proteins

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient – MCC - for ProtT5 embeddings of 0.596±0.006 vs. 0.608±0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Lastly, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 minutes on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.

Download Full-text

Refined template selection and combination algorithm significantly improves template-based modeling accuracy

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720019500069 ◽

2019 ◽

Vol 17 (02) ◽

pp. 1950006 ◽

Cited By ~ 4

Author(s):

Ashish Runthala ◽

Shibasish Chowdhury

Keyword(s):

Structural Information ◽

Protein Structures ◽

Comparative Modeling ◽

Target Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Modeling Accuracy ◽

Model Protein ◽

Rank And Select

In contrast to ab-initio protein modeling methodologies, comparative modeling is considered as the most popular and reliable algorithm to model protein structure. However, the selection of the best set of templates is still a major challenge. An effective template-ranking algorithm is developed to efficiently select only the reliable hits for predicting the protein structures. The algorithm employs the pairwise as well as multiple sequence alignments of template hits to rank and select the best possible set of templates. It captures several key sequences and structural information of template hits and converts into scores to effectively rank them. This selected set of templates is used to model a target. Modeling accuracy of the algorithm is tested and evaluated on TBM-HA domain containing CASP8, CASP9 and CASP10 targets. On an average, this template ranking and selection algorithm improves GDT-TS, GDT-HA and TM_Score by 3.531, 4.814 and 0.022, respectively. Further, it has been shown that the inclusion of structurally similar templates with ample conformational diversity is crucial for the modeling algorithm to maximally as well as reliably span the target sequence and construct its near-native model. The optimal model sampling also holds the key to predict the best possible target structure.

Download Full-text

Protein residues determining interaction specificity in paralogous families

Bioinformatics ◽

10.1093/bioinformatics/btaa934 ◽

2020 ◽

Author(s):

Borja Pitarch ◽

Juan A G Ranea ◽

Florencio Pazos

Keyword(s):

Amino Acid ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Information ◽

Large Set ◽

Supplementary Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Protein Residues

Abstract Motivation Predicting the residues controlling a protein’s interaction specificity is important not only to better understand its interactions but also to design mutations aimed at fine-tuning or swapping them as well. Results In this work, we present a methodology that combines sequence information (in the form of multiple sequence alignments) with interactome information to detect that kind of residues in paralogous families of proteins. The interactome is used to define pairwise similarities of interaction contexts for the proteins in the alignment. The method looks for alignment positions with patterns of amino-acid changes reflecting the similarities/differences in the interaction neighborhoods of the corresponding proteins. We tested this new methodology in a large set of human paralogous families with structurally characterized interactions, and discuss in detail the results for the RasH family. We show that this approach is a better predictor of interfacial residues than both, sequence conservation and an equivalent ‘unsupervised’ method that does not use interactome information. Availability and implementation http://csbg.cnb.csic.es/pazos/Xdet/. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text