Protein loop modeling and refinement using deep learning models

Loops in proteins play essential roles in protein functions and interactions. The structural characterization of loops is challenging because of their conformational flexibility and relatively poor conservation in multiple sequence alignments. Many experimental and computational approaches have been carried out during the last few decades for loop modeling. Although the latest AlphaFold2 achieved remarkable performance in protein structure predictions, the accuracy of loop regions for many proteins still needs to be improved for downstream applications such as protein function prediction and structure based drug design. In this paper, we proposed two novel deep learning architectures for loop modeling: one uses a combined convolutional neural network (CNN)-recursive neural network (RNN) structure (DeepMUSICS) and the other is based on refinement of histograms using a 2D CNN architecture (DeepHisto). In each of the methods, two types of models, conformation sampling model and energy scoring model, were trained and applied in the loop folding process. Both methods achieved promising results and worth further investigations. Since multiple sequence alignments (MSA) were not used in our architecture, the energy scoring models have less bias from MSA. We believe the methods may serve as good complements for refining AlphaFold2 predicted structures.

Download Full-text

DNCON2_Inter: predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning

Scientific Reports ◽

10.1038/s41598-021-91827-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Farhan Quadir ◽

Raj S. Roy ◽

Randal Halfmann ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Tertiary Structure ◽

Protein Complexes ◽

Complex Structure ◽

Great Success ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Residue Contacts ◽

Evolutionary Features

AbstractDeep learning methods that achieved great success in predicting intrachain residue-residue contacts have been applied to predict interchain contacts between proteins. However, these methods require multiple sequence alignments (MSAs) of a pair of interacting proteins (dimers) as input, which are often difficult to obtain because there are not many known protein complexes available to generate MSAs of sufficient depth for a pair of proteins. In recognizing that multiple sequence alignments of a monomer that forms homomultimers contain the co-evolutionary signals of both intrachain and interchain residue pairs in contact, we applied DNCON2 (a deep learning-based protein intrachain residue-residue contact predictor) to predict both intrachain and interchain contacts for homomultimers using multiple sequence alignment (MSA) and other co-evolutionary features of a single monomer followed by discrimination of interchain and intrachain contacts according to the tertiary structure of the monomer. We name this tool DNCON2_Inter. Allowing true-positive predictions within two residue shifts, the best average precision was obtained for the Top-L/10 predictions of 22.9% for homodimers and 17.0% for higher-order homomultimers. In some instances, especially where interchain contact densities are high, DNCON2_Inter predicted interchain contacts with 100% precision. We also developed Con_Complex, a complex structure reconstruction tool that uses predicted contacts to produce the structure of the complex. Using Con_Complex, we show that the predicted contacts can be used to accurately construct the structure of some complexes. Our experiment demonstrates that monomeric multiple sequence alignments can be used with deep learning to predict interchain contacts of homomeric proteins.

Download Full-text

Embeddings from protein language models predict conservation and variant effects

10.21203/rs.3.rs-584804/v1 ◽

2021 ◽

Author(s):

Céline Marquet ◽

Michael Heinzinger ◽

Tobias Olenyi ◽

Christian Dallago ◽

Michael Bernhofer ◽

...

Keyword(s):

Protein Function ◽

Language Models ◽

Single Amino Acid ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Human Proteins ◽

Entire Sequence ◽

Embedding Methods ◽

Better Than

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (LMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or marked amino acids from the context of entire sequence regions. Here, we explored how to benefit from learned protein LM representations (embeddings) to predict SAV effects. Although we have failed so far to predict SAV effects directly from embeddings, this input alone predicted residue conservation almost as accurately from single sequences as using multiple sequence alignments (MSAs) with a two-state per-residue accuracy (conserved/not) of Q2=80% (embeddings) vs. 81% (ConSeq). Considering all SAVs at all residue positions predicted as conserved to affect function reached 68.6% (Q2: effect/neutral; for PMD) without optimization, compared to an expert solution such as SNAP2 (Q2=69.8). Combining predicted conservation with BLOSUM62 to obtain variant-specific binary predictions, DMS experiments of four human proteins were predicted better than by SNAP2, and better than by applying the same simplistic approach to conservation taken from ConSeq. Thus, embedding methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. This allowed prediction of SAV effects for the entire human proteome (~20k proteins) within 17 minutes on a single GPU.

Download Full-text

Embeddings from protein language models predict conservation and variant effects

Human Genetics ◽

10.1007/s00439-021-02411-y ◽

2021 ◽

Author(s):

Céline Marquet ◽

Michael Heinzinger ◽

Tobias Olenyi ◽

Christian Dallago ◽

Kyra Erckert ◽

...

Keyword(s):

Protein Function ◽

Pearson Correlation ◽

Performance Measure ◽

Language Models ◽

Single Amino Acid ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Human Proteins

AbstractThe emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient—MCC—for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.

Download Full-text

Accurate inference of tree topologies from multiple sequence alignments using deep learning

10.1101/559054 ◽

2019 ◽

Cited By ~ 2

Author(s):

Anton Suvorov ◽

Joshua Hochuli ◽

Daniel R. Schrider

Keyword(s):

Deep Learning ◽

Parameter Space ◽

Phylogenetic Trees ◽

Strong Support ◽

Biological Research ◽

Learning Approaches ◽

Sequence Alignments ◽

Traditional Methods ◽

Multiple Sequence ◽

Multiple Sequence Alignments

AbstractReconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. Here we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. While numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.

Download Full-text

AttentiveDist: Protein Inter-Residue Distance Prediction Using Deep Learning with Attention on Quadruple Multiple Sequence Alignments

10.1101/2020.11.24.396770 ◽

2020 ◽

Author(s):

Aashish Jain ◽

Genki Terashi ◽

Yuki Kagaya ◽

Sai Raghavendra Maddhuri Venkata Subramaniya ◽

Charles Christoffer ◽

...

Keyword(s):

Deep Learning ◽

Structure Prediction ◽

Prediction Models ◽

3D Structure ◽

Evolutionary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Multiple Sequence Alignments ◽

Distance Prediction

ABSTRACTProtein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. The model is trained in a multi-task fashion to also predict backbone and orientation angles further improving the inter-residue distance prediction. We show that AttentiveDist outperforms the top methods for contact prediction in the CASP13 structure prediction competition. To aid in structure modeling we also developed two new deep learning-based sidechain center distance and peptide-bond nitrogen-oxygen distance prediction models. Together these led to a 12% increase in TM-score from the best server method in CASP13 for structure prediction.

Download Full-text

Faculty Opinions recommendation of Accurate inference of tree topologies from multiple sequence alignments using deep learning.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.736580131.793565148 ◽

2019 ◽

Author(s):

Tandy Warnow

Keyword(s):

Deep Learning ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Tree Topologies

Download Full-text

Accurate Inference of Tree Topologies from Multiple Sequence Alignments Using Deep Learning

Systematic Biology ◽

10.1093/sysbio/syz060 ◽

2019 ◽

Cited By ~ 2

Author(s):

Anton Suvorov ◽

Joshua Hochuli ◽

Daniel R Schrider

Keyword(s):

Deep Learning ◽

Parameter Space ◽

Phylogenetic Trees ◽

Strong Support ◽

Biological Research ◽

Learning Approaches ◽

Sequence Alignments ◽

Traditional Methods ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that the deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.

Download Full-text

Analyzing effect of quadruple multiple sequence alignments on deep learning based protein inter-residue distance prediction

Scientific Reports ◽

10.1038/s41598-021-87204-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Aashish Jain ◽

Genki Terashi ◽

Yuki Kagaya ◽

Sai Raghavendra Maddhuri Venkata Subramaniya ◽

Charles Christoffer ◽

...

Keyword(s):

Deep Learning ◽

Structure Prediction ◽

Tertiary Structure ◽

3D Structure ◽

Evolutionary Information ◽

Learning Approaches ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Novel Approach

AbstractProtein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. We show that combining four MSAs of different E-value cutoffs improved the model prediction performance as compared to single E-value MSA features. A further improvement was observed when an attention layer was used and even more when additional prediction tasks of bond angle predictions were added. The improvement of distance predictions were successfully transferred to achieve better protein tertiary structure modeling.

Download Full-text

Embeddings from protein language models predict conservation and variant effects

10.21203/rs.3.rs-584804/v3 ◽

2021 ◽

Author(s):

Céline Marquet ◽

Michael Heinzinger ◽

Tobias Olenyi ◽

Christian Dallago ◽

Michael Bernhofer ◽

...

Keyword(s):

Protein Function ◽

Pearson Correlation ◽

Performance Measure ◽

Language Models ◽

Single Amino Acid ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Human Proteins

Abstract The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient – MCC - for ProtT5 embeddings of 0.596±0.006 vs. 0.608±0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Lastly, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 minutes on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.

Download Full-text

Predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning

10.1101/2020.11.09.373878 ◽

2020 ◽

Author(s):

Farhan Quadir ◽

Raj Roy ◽

Randal Halfmann ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Tertiary Structure ◽

Protein Complexes ◽

Great Success ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Residue Contact ◽

Residue Contacts ◽

Evolutionary Features

AbstractDeep learning methods that achieved great success in predicting intrachain residue-residue contacts have been applied to predict interchain contacts between proteins. However, these methods require multiple sequence alignments (MSAs) of a pair of interacting proteins (dimers) as input, which are often difficult to obtain because there are not many known protein complexes available to generate MSAs of sufficient depth for a pair of proteins. In recognizing that multiple sequence alignments of a monomer that forms homomultimers contain the co-evolutionary signals of both intrachain and interchain residue pairs in contact, we applied DNCON2 (a deep learning-based protein intrachain residue-residue contact predictor) to predict both intrachain and interchain contacts for homomultimers using multiple sequence alignment (MSA) and other co-evolutionary features of a single monomer followed by discrimination of interchain and intrachain contacts according to the tertiary structure of the monomer. Allowing true-positive predictions within two residue shifts, the best average precision was obtained for the Top-L/10 predictions of DNCON2: 22.9% for homodimers, and 17.0% for higher order homomultimers. In some instances, especially where interchain contact densities are high, the approach predicted interchain contacts with 100% precision. We show that the predicted contacts can be used to accurately construct the structure of some complexes. Our experiment demonstrates that monomeric multiple sequence alignments can be used with deep learning to predict interchain contacts of homomeric proteins.

Download Full-text