rawMSA: End-to-end Deep Learning Makes Protein Sequence Profiles and Feature Extraction obsolete

Abstract: In the last few decades, huge efforts have been made in the bioinformatics community to develop machine learning-based methods for the prediction of structural features of proteins in the hope of answering fundamental questions about the way proteins function and about their involvement in several illnesses. The recent advent of Deep Learning has renewed the interest in neural networks, with dozens of methods being developed in the hope of taking advantage of these new architectures. On the other hand, most methods are still based on heavy pre-processing of the input data, as well as the extraction and integration of multiple hand-picked, manually designed features. Since Multiple Sequence Alignments (MSA) are almost always the main source of information in de novo prediction methods, it should be possible to develop Deep Networks to automatically refine the data and extract useful features from it. In this work, we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing to map amino acid sequences into an adaptively learned continuous space. This allows the whole MSA to be input into a Deep Network, thus rendering sequence profiles and other pre-calculated features obsolete. We developed rawMSA in three different flavors to predict secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and have determined that it outperforms classical methods based on position-specific scoring matrices (PSSM) when predicting secondary structure and solvent accessibility, while performing on a par with the top ranked CASP12 methods in the inter-residue contact map prediction category. We believe that rawMSA represents a promising, more powerful approach to protein structure prediction that could replace older methods based on protein profiles in the coming years.
Availability: datasets, dataset generation code, evaluation code and models are available at: https://bitbucket.org/clami66/rawmsa
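
The embedding idea described in the abstract can be illustrated with a small sketch. This is not the rawMSA implementation: the alphabet, the token ids, the embedding dimension and the toy alignment are all assumptions made for illustration, and the table is randomly initialised, whereas in a trained network its weights would be learned.

```python
import random

# Assumed alphabet: 20 standard amino acids plus a gap symbol.
# Index 0 is reserved for padding or unknown characters.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
TOKEN_ID = {aa: i + 1 for i, aa in enumerate(ALPHABET)}


def encode_msa(msa):
    """Map each aligned sequence to a list of integer token ids."""
    return [[TOKEN_ID.get(ch, 0) for ch in seq.upper()] for seq in msa]


def make_embedding_table(vocab_size, dim, seed=0):
    """Randomly initialised table; a real network would learn these weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(encoded_msa, table):
    """Look up one vector per residue: result is (n_seqs, n_cols, dim)."""
    return [[table[token] for token in row] for row in encoded_msa]


msa = ["MKV-LA", "MRVQLA", "MKI-LS"]  # hypothetical toy alignment
encoded = encode_msa(msa)
table = make_embedding_table(vocab_size=len(ALPHABET) + 1, dim=4)
embedded = embed(encoded, table)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

The point of the sketch is that the whole alignment, not a profile summarising it, becomes the tensor fed to the network; each residue symbol is replaced by a small dense vector.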

Characterization of Non-Trivial Neighborhood Fold Constraints from Protein Sequences using Generalized Topohydrophobicity.

Bioinformatics and Biology Insights ◽

10.4137/bbi.s426 ◽

2008 ◽

Vol 2 ◽

pp. BBI.S426 ◽

Cited By ~ 2

Author(s):

Guillaume Fourty ◽

Isabelle Callebaut ◽

Jean-Paul Mornon

Keyword(s):

Secondary Structure ◽

Solvent Accessibility ◽

Protein Structures ◽

Comparative Modeling ◽

Local Geometry ◽

Large Set ◽

Structural Constraints ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Prediction of key features of protein structures, such as secondary structure, solvent accessibility and number of contacts between residues, provides useful structural constraints for comparative modeling, fold recognition, ab-initio fold prediction and detection of remote relationships. In this study, we aim at characterizing the number of non-trivial close neighbors, or long-range contacts of a residue, as a function of its “topohydrophobic” index deduced from multiple sequence alignments and of the secondary structure in which it is embedded. The “topohydrophobic” index is calculated using a two-class distribution of amino acids, based on their mean atom depths. From a large set of structural alignments processed from the FSSP database, we selected 1485 structural sub-families including at least 8 members, with accurate alignments and limited redundancy. We show that residues within helices, even when deeply buried, have few non-trivial neighbors (0–2), whereas β-strand residues clearly exhibit a multimodal behavior, dominated by the local geometry of the tetrahedron (3 non-trivial close neighbors associated with one tetrahedron; 6 with two tetrahedra). This observed behavior allows the distinction, from sequence profiles, between edge and central β-strands within β-sheets. Useful topological constraints on the immediate neighborhood of an amino acid, but also on its correlated solvent accessibility, can thus be derived using this approach, from the simple knowledge of multiple sequence alignments.

SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity

Bioinformatics ◽

10.1093/bioinformatics/btu352 ◽

2014 ◽

Vol 30 (18) ◽

pp. 2592-2597 ◽

Cited By ~ 188

Author(s):

C. N. Magnan ◽

P. Baldi

Keyword(s):

Machine Learning ◽

Secondary Structure ◽

Solvent Accessibility ◽

Protein Secondary Structure ◽

Structural Similarity ◽

Relative Solvent Accessibility

Mapping the glycosyltransferase fold landscape using interpretable deep learning

Nature Communications ◽

10.1038/s41467-021-25975-9 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Rahil Taujale ◽

Zhongliang Zhou ◽

Wayland Yeung ◽

Kelley W. Moremen ◽

Sheng Li ◽

...

Keyword(s):

Deep Learning ◽

Secondary Structure ◽

Structural Features ◽

Functional Diversification ◽

Sequence Structure ◽

Cellular Processes ◽

And Function ◽

Deep Learning Model ◽

Fold Prediction ◽

Primary Sequence Alignment

Abstract: Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.

Brewery: deep learning and deeper profiles for the prediction of 1D protein structure annotations

Bioinformatics ◽

10.1093/bioinformatics/btaa204 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3897-3898

Author(s):

Mirko Torrisi ◽

Gianluca Pollastri

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Solvent Accessibility ◽

Protein Structures ◽

Evolutionary Information ◽

Relative Solvent Accessibility ◽

Multiple Sources ◽

Contact Density ◽

And Training ◽

The Web

Abstract: Motivation. Protein structural annotations (PSAs) are essential abstractions to deal with the prediction of protein structures. Many increasingly sophisticated PSAs have been devised in the last few decades. However, the need for annotations that are easy to compute, process and predict has not diminished. This is especially true for protein structures that are hardest to predict, such as novel folds. Results. We propose Brewery, a suite of ab initio predictors of 1D PSAs. Brewery uses multiple sources of evolutionary information to achieve state-of-the-art predictions of secondary structure, structural motifs, relative solvent accessibility and contact density. Availability and implementation. The web server, standalone program, Docker image and training sets of Brewery are available at http://distilldeep.ucd.ie/brewery/.

BiRDS - Binding Residue Detection from Protein Sequences using Deep ResNets

10.33774/chemrxiv-2021-013gn-v2 ◽

2021 ◽

Author(s):

Vineeth Chelur ◽

U. Deva Priyakumar

Keyword(s):

Binding Site ◽

Binding Sites ◽

Tertiary Structure ◽

Solvent Accessibility ◽

Three Dimensional ◽

Dimensional Structure ◽

Relative Solvent Accessibility ◽

Single Chain ◽

Sequence Alignments ◽

Multiple Sequence

Protein-drug interactions play important roles in many biological processes and therapeutics. Prediction of the active binding site of a protein helps discover and optimise these interactions leading to the design of better ligand molecules. The tertiary structure of a protein determines the binding sites available to the drug molecule. A quick and accurate prediction of the binding site from sequence alone without utilising the three-dimensional structure is challenging. Deep Learning has been used in a variety of biochemical tasks and has been hugely successful. In this paper, a Residual Neural Network (leveraging skip connections) is implemented to predict a protein's most active binding site. An Annotated Database of Druggable Binding Sites from the Protein DataBank, sc-PDB, is used for training the network. Features extracted from the Multiple Sequence Alignments (MSAs) of the protein generated using DeepMSA, such as Position-Specific Scoring Matrix (PSSM), Secondary Structure (SS3), and Relative Solvent Accessibility (RSA), are provided as input to the network. A weighted binary cross-entropy loss function is used to counter the substantial imbalance in the two classes of binding and non-binding residues. The network performs very well on single-chain proteins, providing a pocket that has good interactions with a ligand.
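
The weighted binary cross-entropy mentioned above can be sketched as follows. This is a minimal illustration, not the BiRDS code: the function name, the toy labels and probabilities, and the choice of weighting the positive class by the negative-to-positive ratio are assumptions, since the abstract does not state the weights used.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive class up-weighted."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical per-residue labels: 1 = binding, 0 = non-binding.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.1, 0.4, 0.6]

# Assumed weighting scheme: ratio of negatives to positives.
n_pos = sum(labels)
pos_weight = (len(labels) - n_pos) / n_pos

print(weighted_bce(labels, probs, pos_weight))
print(weighted_bce(labels, probs, 1.0))  # unweighted, for comparison
```

With a weight above 1, errors on the rare binding residues contribute more to the loss than errors on the abundant non-binding residues, which discourages the trivial solution of predicting "non-binding" everywhere.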

Improved computational methods of protein sequence alignment, model selection and tertiary structure prediction

10.32469/10355/46126 ◽

2013 ◽

Author(s):

Xin Deng

Keyword(s):

Protein Structure ◽

Secondary Structure ◽

Model Selection ◽

Sequence Alignment ◽

Protein Sequence ◽

Structure Prediction ◽

Tertiary Structure ◽

Solvent Accessibility ◽

Relative Solvent Accessibility ◽

Tertiary Structure Prediction

Protein sequence and profile alignment is used in most bioinformatics tasks, such as protein structure modeling, function prediction, and phylogenetic analysis. We designed a new algorithm, MSACompro, to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into multiple protein sequence alignment. Our experiments showed that it improved multiple sequence alignment accuracy over most existing methods that do not use structural information, and performed comparably to the method using structural features and additional homologous sequences, with only slightly lower scores. We also developed HHpacom, a new profile-profile pairwise alignment method that integrates secondary structure, solvent accessibility, torsion angle and inferred residue pair coupling information. The evaluation showed that the secondary structure, relative solvent accessibility and torsion angle information significantly improved the alignment accuracy in comparison with the state-of-the-art methods HHsearch and HHsuite. The evolutionary constraint information did help in some cases, especially in alignments of short proteins, typically 100 to 500 residues long. Protein model selection is also a key step in protein tertiary structure prediction. We developed two SVM model quality assessment methods that take the query-template alignment as input. The assessment results illustrated that this could help improve model selection, protein structure prediction and many other bioinformatics problems. Moreover, we also developed a protein tertiary structure prediction pipeline, many components of which were built into our group's MULTICOM system. MULTICOM performed well in the CASP10 (Critical Assessment of Techniques for Protein Structure Prediction) competition.

SPOT-1D-LM: Reaching Alignment-profile-based Accuracy in Predicting Protein Secondary and Tertiary Structural Properties without Alignment.

10.1101/2021.10.16.464622 ◽

2021 ◽

Author(s):

Jaspreet Singh ◽

Kuldip Paliwal ◽

Jaswinder Singh ◽

Yaoqi Zhou

Keyword(s):

Structural Properties ◽

Solvent Accessibility ◽

Protein Structures ◽

Language Models ◽

Sequence Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Structural And Functional Properties ◽

Sequence Profiles

Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction tasks such as biophysical, structural, and functional properties. Here we show that a combination of traditional one-hot encoding with the embeddings from two different language models (ProtTrans and ESM-1b) allows a leap in accuracy over single-sequence based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers. This large improvement leads to an accuracy comparable to or better than the current state-of-the-art techniques for predicting these 1D structural properties based on sequence profiles generated from multiple sequence alignments. The high-accuracy prediction of both secondary and tertiary structural properties indicates that it is possible to make highly accurate predictions of protein structures without homologous sequences, the remaining obstacle in the post-AlphaFold2 era.
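
The feature combination described above amounts to concatenating, for every residue, a one-hot vector with one embedding vector per language model. The sketch below shows only that concatenation step. It does not call ProtTrans or ESM-1b: `placeholder_embedding` is a hypothetical stand-in that returns constant vectors, and the tiny dimensions are chosen for readability, not taken from either model.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """One 20-dimensional indicator vector per residue (all zeros if unknown)."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in sequence]


def placeholder_embedding(sequence, dim, fill):
    """Hypothetical stand-in for a language-model embedding; NOT a real model call."""
    return [[fill] * dim for _ in sequence]


def concat_features(*per_residue_features):
    """Concatenate several per-residue feature matrices along the feature axis."""
    lengths = {len(f) for f in per_residue_features}
    if len(lengths) != 1:
        raise ValueError("all feature matrices must cover the same residues")
    return [sum((list(row) for row in rows), []) for rows in zip(*per_residue_features)]


seq = "MKVLA"  # hypothetical toy sequence
features = concat_features(
    one_hot(seq),
    placeholder_embedding(seq, dim=3, fill=0.5),   # stands in for model A
    placeholder_embedding(seq, dim=2, fill=-0.5),  # stands in for model B
)
print(len(features), len(features[0]))
```

Because every feature source is indexed by residue position, no alignment is needed to line them up; the only requirement is that each matrix has one row per residue of the query sequence.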

Logistic regression models to predict solvent accessible residues using sequence- and homology-based qualitative and quantitative descriptors applied to a domain-complete X-ray structure learning set

Journal of Applied Crystallography ◽

10.1107/s1600576715018531 ◽

2015 ◽

Vol 48 (6) ◽

pp. 1976-1984 ◽

Cited By ~ 2

Author(s):

Reecha Nepal ◽

Joanna Spencer ◽

Guneet Bhogal ◽

Amulya Nedunuri ◽

Thomas Poelman ◽

...

Keyword(s):

Logistic Regression ◽

Regression Models ◽

Structure Learning ◽

Solvent Accessibility ◽

Amino Acid Type ◽

Relative Solvent Accessibility ◽

Sequence Alignments ◽

Logistic Regression Models ◽

Learning Set ◽

Computationally Intensive

A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models with various qualitative descriptors that include amino acid type and quantitative descriptors that include 20- and six-term sequence entropy have been built and validated. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating buried or solvent-accessible residues, with binary classifications obtained from the RSA values. The fitted models determine binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods using the standard RSA threshold criteria of 20 and 25% for solvent accessibility. When an additional non-homology descriptor describing Lobanov–Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved with 25% threshold accuracies of 76.12 and 74.79% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.
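
Two of the ingredients named above, a sequence entropy descriptor and a dichotomous response derived from an RSA threshold, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' software: the 20-term entropy is taken here to be the Shannon entropy over amino acid types in an alignment column with gaps ignored, and a residue is treated as accessible when its RSA is greater than or equal to the threshold. The six-term entropy is omitted because the abstract does not define the residue grouping.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column, ignoring gaps."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def is_accessible(rsa_percent, threshold=25.0):
    """Dichotomise an RSA value (assumed convention: accessible if >= threshold)."""
    return rsa_percent >= threshold


# Hypothetical alignment columns and RSA values.
print(column_entropy("AAAA"))   # fully conserved column
print(column_entropy("AALL"))   # two residue types, equally frequent
print(is_accessible(30.0), is_accessible(22.0, threshold=20.0))
```

A conserved column gives zero entropy and a variable column gives a higher value, so the descriptor offers the regression a simple numeric measure of how tolerant a position is to substitution.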

Quad-PRE: A Hybrid Method to Predict Protein Quaternary Structure Attributes

Computational and Mathematical Methods in Medicine ◽

10.1155/2014/715494 ◽

2014 ◽

Vol 2014 ◽

pp. 1-9 ◽

Cited By ~ 2

Author(s):

Yajun Sheng ◽

Xingye Qiu ◽

Chen Zhang ◽

Jun Xu ◽

Yanping Zhang ◽

...

Keyword(s):

Amino Acid ◽

Secondary Structure ◽

Hybrid Method ◽

Quaternary Structure ◽

Biological Process ◽

Solvent Accessibility ◽

Empirical Evaluation ◽

Relative Solvent Accessibility ◽

Independent Dataset ◽

Scoring Matrix

Protein quaternary structure is very important to biological processes. Predicting its attributes is an essential task in computational biology for the advancement of proteomics. However, existing methods do not consider a sufficient range of amino acid properties. To this end, we propose a hybrid method, Quad-PRE, to predict protein quaternary structure attributes using the properties of amino acids, predicted secondary structure, predicted relative solvent accessibility, and position-specific scoring matrix profiles and motifs. Empirical evaluation on an independent dataset shows that Quad-PRE achieves a higher overall accuracy of 81.7%, and notably higher accuracies of 92.8%, 93.3%, and 90.6% in discriminating trimers, hexamers, and octamers, respectively. Our model also reveals that all six feature sets are important to the prediction, and that a hybrid method is currently the best strategy. The results indicate that the proposed method can classify protein quaternary structure attributes effectively.

Porter, PaleAle 4.0: high-accuracy prediction of protein secondary structure and relative solvent accessibility

Bioinformatics ◽

10.1093/bioinformatics/btt344 ◽

2013 ◽

Vol 29 (16) ◽

pp. 2056-2058 ◽

Cited By ~ 61

Author(s):

C. Mirabello ◽

G. Pollastri

Keyword(s):

Secondary Structure ◽

Solvent Accessibility ◽

Protein Secondary Structure ◽

High Accuracy ◽

Relative Solvent Accessibility
