Combining statistical and neural network approaches to derive energy functions for completely flexible protein backbone design

Abstract A designable protein backbone is one for which amino acid sequences that stably fold into it exist. To design such backbones, a general method is much needed for continuous sampling and optimization in the backbone conformational space without specific amino acid sequence information. The energy functions driving such sampling and optimization must faithfully recapitulate the characteristically coupled distributions of multiplexes of local and non-local conformational variables in designable backbones. It is also desired that the energy surfaces are continuous and smooth, with easily computable gradients. We combine statistical and neural network (NN) approaches to derive a model named SCUBA, standing for Side-Chain-Unspecialized-Backbone-Arrangement. In this approach, high-dimensional statistical energy surfaces learned from known protein structures are analytically represented as NNs. SCUBA is composed as a sum of NN terms describing local and non-local conformational energies, each NN term derived by first estimating the statistical energies in the corresponding multi-variable space via neighbor-counting (NC) with adaptive cutoffs, and then training the NN with the NC-estimated energies. To determine the relative weights of different energy terms, SCUBA-driven stochastic dynamics (SD) simulations of natural proteins are considered. As initial computational tests of SCUBA, we apply SD simulated annealing to automatically optimize artificially constructed polypeptide backbones of different fold classes. For a majority of the resulting backbones, structurally matching native backbones can be found with Dali Z-scores above 6 and less than 2 Å displacements of main chain atoms in aligned secondary structures. The results suggest that SCUBA-driven sampling and optimization can be a general tool for protein backbone design with complete conformational flexibility. In addition, the NC-NN approach can be generally applied to develop continuous, noise-filtered multi-variable statistical models from structural data. Linux executables to set up and run SCUBA SD simulations are publicly available (http://biocomp.ustc.edu.cn/servers/download_scuba.php). Interested readers may contact the authors for source code availability.
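
The neighbor-counting idea mentioned in the abstract can be illustrated with a toy sketch. This is not the SCUBA implementation: the function name `nc_energy`, the fixed `radius` (SCUBA uses adaptive cutoffs), the Euclidean distance, and the choice of a reference sample are all assumptions made for illustration. The sketch estimates a statistical energy at a query point as the negative log of the ratio between the observed and reference neighbor densities.

```python
import math

def nc_energy(query, observed, reference, radius):
    """Toy neighbor-counting energy: -ln(observed density / reference density).

    Hypothetical helper. A fixed radius is used here for simplicity,
    whereas the abstract describes adaptive cutoffs.
    """
    def count(points):
        # Number of sample points within `radius` of the query point.
        return sum(1 for p in points if math.dist(p, query) <= radius)

    n_obs = count(observed)
    n_ref = count(reference)
    if n_ref == 0:
        raise ValueError("no reference samples near the query point")
    if n_obs == 0:
        return math.inf  # region never observed: treated as forbidden
    # Normalize counts by sample size so the two sets are comparable.
    ratio = (n_obs / len(observed)) / (n_ref / len(reference))
    return -math.log(ratio)
```

Energies estimated this way on many query points could then serve as training targets for a smooth regression model, which is the role the abstract assigns to the NN terms.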

Enumeration and comprehensive in-silico modeling of three-helix bundle structures composed of typical αα-hairpins

BMC Bioinformatics ◽

10.1186/s12859-021-04380-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Koya Sakuma ◽

Shintaro Minami

Keyword(s):

Amino Acid ◽

Protein Structures ◽

Amino Acid Sequences ◽

Conformational Space ◽

Single Chain ◽

Helix Bundle ◽

Building Simulations ◽

In Silico Modeling ◽

Foldable Structures ◽

Α Helix

Abstract Background The design of protein structures from scratch requires special attention to the combination of the types and lengths of the secondary structures and the loops required to build highly designable backbone structure models. However, it is difficult to predict the combinations that result in globular and protein-like conformations without simulations. In this study, we used single-chain three-helix bundles as simple models of protein tertiary structures and sought to thoroughly investigate the conditions required to construct them, starting from the identification of the typical αα-hairpin motifs. Results First, by statistical analysis of naturally occurring protein structures, we identified three αα-hairpin motifs that were specifically related to the left- and right-handedness of helix-helix packing. Second, specifying these αα-hairpin motifs as junctions, we performed sequence-independent backbone-building simulations to comparatively build single-chain three-helix bundle structures and identified the promising combinations of α-helix lengths and αα-hairpin types that result in tight packing between the first and third α-helices. Third, using those single-chain three-helix bundle backbone structures as template structures, we designed amino acid sequences that were predicted to fold into the target topologies, which supports that the compact single-chain three-helix bundle structures that we sampled show sufficient quality to allow amino-acid sequence design. Conclusion The enumeration of the dominant subsets of possible backbone structures for small single-chain three-helical bundle topologies revealed that the compact foldable structures are discontinuously and sparsely distributed in the conformational space. Additionally, although the designs have not been experimentally validated in the present research, the comprehensive set of computational structural models generated also offers protein designers the opportunity to skip building similar structures by themselves and enables them to quickly focus on building specialized designs using the prebuilt structure models. The backbone and best design models in this study are publicly accessible from the following URL: https://doi.org/10.5281/zenodo.4321632.

Identification of Enzymes-specific Protein Domain Based on DDE, and Convolutional Neural Network

Frontiers in Genetics ◽

10.3389/fgene.2021.759384 ◽

2021 ◽

Vol 12 ◽

Author(s):

Rahu Sikander ◽

Yuping Wang ◽

Ali Ghulam ◽

Xianjuan Wu

Keyword(s):

Neural Network ◽

Amino Acid ◽

Convolutional Neural Network ◽

Cross Validation ◽

Amino Acid Sequences ◽

Protein Domain ◽

Biological Information ◽

Sequence Information ◽

Accuracy Score ◽

Feature Maps

Predicting the protein sequence information of enzymes and non-enzymes is an important but very challenging task. Existing methods use protein geometric structures only or protein sequences alone to predict enzymatic functions. Thus, their prediction results are unsatisfactory. In this paper, we propose a novel approach for predicting the amino acid sequences of enzymes and non-enzymes via Convolutional Neural Network (CNN). In CNN, the roles of enzymes are predicted from multiple sides of biological information, including information on sequences and structures. We propose the use of two-dimensional data via 2DCNN to predict the proteins of enzymes and non-enzymes by using the same fivefold cross-validation function. We also use an independent dataset to test the performance of our model, and the results demonstrate that we are able to solve the overfitting problem. We used the CNN model proposed herein to demonstrate the superiority of our model for classifying an entire set of filters, such as 32, 64, and 128 parameters, with the fivefold validation test set as the independent classification. Via the Dipeptide Deviation from Expected Mean (DDE) matrix, mutation information is extracted from amino acid sequences and structural information with the distance and angle of amino acids is conveyed. The derived feature maps are then encoded in DDE exploitation. The independent datasets are then compared with two other methods, namely, GRU and XGBOOST. All analyses were conducted using 32, 64 and 128 filters on our proposed CNN method. The cross-validation datasets achieved an accuracy score of 0.8762%, whereas the accuracy of independent datasets was 0.7621%. Additional variables were derived on the basis of ROC AUC; with fivefold cross-validation, the score achieved was 0.95%. The performance of our model and that of other models in terms of sensitivity (0.9028%) and specificity (0.8497%) was compared. The overall accuracy of our model was 0.9133% compared with 0.8310% for the other model.
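
The abstract names the DDE descriptor without giving its formula. The sketch below follows the definition commonly used in the literature for Dipeptide Deviation from Expected Mean (observed dipeptide frequency, standardized by a theoretical mean and variance derived from codon counts); whether the authors used exactly this variant is an assumption, and the function name `dde` is hypothetical.

```python
import math
from itertools import product

# Number of codons encoding each amino acid (standard genetic code, 61 sense codons).
CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}
N_CODONS = 61

def dde(sequence):
    """Return a dict mapping each of the 400 dipeptides to its DDE value."""
    n_pairs = len(sequence) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    # Count overlapping dipeptides along the sequence.
    counts = {}
    for a, b in zip(sequence, sequence[1:]):
        counts[a + b] = counts.get(a + b, 0) + 1
    features = {}
    for a, b in product(CODONS, repeat=2):
        dc = counts.get(a + b, 0) / n_pairs                    # observed composition
        tm = (CODONS[a] / N_CODONS) * (CODONS[b] / N_CODONS)   # theoretical mean
        tv = tm * (1.0 - tm) / n_pairs                         # theoretical variance
        features[a + b] = (dc - tm) / math.sqrt(tv)
    return features
```

The resulting 400 values can be reshaped into a 20 x 20 matrix, which is one plausible way to obtain the two-dimensional input that a 2D CNN expects.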

A Deep Learning Approach for Predicting Antigenic Variation of Influenza A H3N2

Computational and Mathematical Methods in Medicine ◽

10.1155/2021/9997669 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Yuan-Ling Xia ◽

Weihua Li ◽

Yongping Li ◽

Xing-Lai Ji ◽

Yun-Xin Fu ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

Amino Acid ◽

Antigenic Variation ◽

Amino Acid Sequences ◽

Validation Dataset ◽

Sequence Information ◽

Learning Approach ◽

Long Distance ◽

Flu Virus

Modeling antigenic variation in influenza (flu) virus A H3N2 using amino acid sequences is a promising approach for improving the prediction accuracy of immune efficacy of vaccines and increasing the efficiency of vaccine screening. Antigenic drift and antigenic jump/shift, which arise from the accumulation of mutations with small or moderate effects and from a major, abrupt change with large effects on the surface antigen hemagglutinin (HA), respectively, are two types of antigenic variation that facilitate immune evasion of flu virus A and make it challenging to predict the antigenic properties of new viral strains. Despite considerable progress in modeling antigenic variation based on the amino acid sequences, few studies focus on the deep learning framework which could be most suitable to be applied to this task. Here, we propose a novel deep learning approach that incorporates a convolutional neural network (CNN) and bidirectional long-short-term memory (BLSTM) neural network to predict antigenic variation. In this approach, CNN extracts the complex local contexts of amino acids while the BLSTM neural network captures the long-distance sequence information. When compared to the existing methods, our deep learning approach achieves the overall highest prediction performance on the validation dataset, and more encouragingly, it achieves prediction agreements of 99.20% and 96.46% for the strains in the forthcoming year and in the next two years included in an existing set of chronological amino acid sequences, respectively. These results indicate that our deep learning approach is promising to be applied to antigenic variation prediction of flu virus A H3N2.

Predicting secondary structures, contact numbers, and residue-wise contact orders of native protein structures from amino acid sequences using critical random networks

BIOPHYSICS ◽

10.2142/biophysics.1.67 ◽

2005 ◽

Vol 1 ◽

pp. 67-74 ◽

Cited By ~ 14

Author(s):

Akira R. Kinjo ◽

Ken Nishikawa

Keyword(s):

Amino Acid ◽

Protein Structures ◽

Secondary Structures ◽

Amino Acid Sequences ◽

Random Networks ◽

Native Protein

In-silico prediction and modeling of the Entamoeba histolytica proteins: Serine-rich Entamoeba histolytica protein and 29 kDa Cysteine-rich protease

PeerJ ◽

10.7717/peerj.3160 ◽

2017 ◽

Vol 5 ◽

pp. e3160 ◽

Cited By ~ 5

Author(s):

Kumar Manochitra ◽

Subhash Chandra Parija

Keyword(s):

Amino Acid ◽

Structure Prediction ◽

Tertiary Structure ◽

Protein Structures ◽

Amino Acid Sequences ◽

Treatment Modalities ◽

Bioinformatic Tools ◽

Complex Protein ◽

A Cell ◽

Quaternary Structures

Background Amoebiasis is the third most common parasitic cause of morbidity and mortality, particularly in countries with poor hygienic settings. There exists an ambiguity in the diagnosis of amoebiasis, and hence there arises a necessity for a better diagnostic approach. Serine-rich Entamoeba histolytica protein (SREHP), peroxiredoxin and Gal/GalNAc lectin are pivotal in E. histolytica virulence and are extensively studied as diagnostic and vaccine targets. For elucidating the cellular function of these proteins, details regarding their respective quaternary structures are essential. However, studies in this aspect are scant. Hence, this study was carried out to predict the structure of these target proteins and characterize them structurally as well as functionally using appropriate in-silico methods. Methods The amino acid sequences of the proteins were retrieved from National Centre for Biotechnology Information database and aligned using ClustalW. Bioinformatic tools were employed in the secondary structure and tertiary structure prediction. The predicted structure was validated, and final refinement was carried out. Results The protein structures predicted by i-TASSER were found to be more accurate than Phyre2 based on the validation using SAVES server. The prediction suggests SREHP to be an extracellular protein, peroxiredoxin a peripheral membrane protein while Gal/GalNAc lectin was found to be a cell-wall protein. Signal peptides were found in the amino-acid sequences of SREHP and Gal/GalNAc lectin, whereas they were not present in the peroxiredoxin sequence. Gal/GalNAc lectin showed better antigenicity than the other two proteins studied. All the three proteins exhibited similarity in their structures and were mostly composed of loops. Discussion The structures of SREHP and peroxiredoxin were predicted successfully, while the structure of Gal/GalNAc lectin could not be predicted as it was a complex protein composed of sub-units. Also, this protein showed less similarity with the available structural homologs. The quaternary structures of SREHP and peroxiredoxin predicted from this study would provide better structural and functional insights into these proteins and may aid in development of newer diagnostic assays or enhancement of the available treatment modalities.

Sars-Cov-2 Spike protein function prediction using a convolutional neural network ensemble

Design Engineering ◽

10.17762/de.vi.4293 ◽

2021 ◽

pp. 7831-7845

Author(s):

Raghad Monther Eid, Eman K. Elsayed, Fatma T. Ghanam

Keyword(s):

Neural Network ◽

Amino Acid ◽

Protein Function ◽

Protein Function Prediction ◽

Small Error ◽

Amino Acid Sequences ◽

Spike Protein ◽

Neural Network Ensemble ◽

Classification Problems ◽

Past Experiences

Introduction: SARS-CoV-2 has become a worldwide pandemic that affects all aspects of life; therefore, numerous organizations and open exploration foundations focus their efforts on research for viable therapeutics. Given past experiences and involvement in SARS, the essential focus has been the Spike protein, considered as the perfect objective for COVID-19 immunotherapies. Most of the vaccines being developed target the spike proteins because this protein covers the virus and helps it invade human cells. Methods: The application of deep neural networks is a quickly expanding field now reaching many areas including proteomics. Results: To be precise, convolutional neural networks have been used for identifying the functional role of amino acid sequences, because of their ability to give nearly accurate results for multi-label classification problems. Here we present a modified convolutional deep learning model that can identify whether a given amino acid sequence is a spike protein or not based on the length of the sequence and the function of the protein, with a short execution time and a relatively small error rate. Conclusion: CNN is an efficient tool for supervised multilabel classification problems.

Enhancing protein backbone angle prediction by using simpler models of deep neural networks

Scientific Reports ◽

10.1038/s41598-020-76317-6 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Fereshteh Mataeimoghadam ◽

M. A. Hakim Newton ◽

Abdollah Dehzangi ◽

Abdul Karim ◽

B. Jayaram ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Structure Prediction ◽

Protein Structures ◽

Absolute Error ◽

Grand Challenge ◽

Protein Backbone ◽

The Neural Network ◽

Benchmark Datasets ◽

The Neural Networks

Abstract Protein structure prediction is a grand challenge. Prediction of protein structures via the representations using backbone dihedral angles has recently achieved significant progress along with the on-going surge of deep neural network (DNN) research in general. However, we observe that in the protein backbone angle prediction research, there is an overall trend to employ more and more complex neural networks and then to throw more and more features to the neural networks. While more features might add more predictive power to the neural network, we argue that redundant features could rather clutter the scenario and more complex neural networks then just could counterbalance the noise. From artificial intelligence and machine learning perspectives, problem representations and solution approaches do mutually interact and thus affect performance. We also argue that comparatively simpler predictors can more easily be reconstructed than the more complex ones. With these arguments in mind, we present a deep learning method named Simpler Angle Predictor (SAP) to train simpler DNN models that enhance protein backbone angle prediction. We then empirically show that SAP can significantly outperform existing state-of-the-art methods on well-known benchmark datasets: for some types of angles, the differences are 6–8 in terms of mean absolute error (MAE). The SAP program along with its data is available from the website https://gitlab.com/mahnewton/sap.
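
Because backbone dihedral angles are periodic, a mean absolute error over angles is usually computed on wrapped differences, so that 179° and -179° count as 2° apart rather than 358°. The sketch below shows that convention; it is an assumption that SAP's reported MAE is computed this way, and the function name `angular_mae` is hypothetical.

```python
def angular_mae(predicted, actual):
    """Mean absolute error between two lists of angles in degrees,
    taking the shorter way around the circle for each pair."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        diff = abs(p - a) % 360.0          # raw difference folded into [0, 360)
        total += min(diff, 360.0 - diff)   # wrapped difference in [0, 180]
    return total / len(predicted)
```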

Molecular Identification of Family 38 α-Mannosidase of Bacillus sp. Strain GL1, Responsible for Complete Depolymerization of Xanthan

Applied and Environmental Microbiology ◽

10.1128/aem.68.6.2731-2736.2002 ◽

2002 ◽

Vol 68 (6) ◽

pp. 2731-2736 ◽

Cited By ~ 15

Author(s):

Hirokazu Nankai ◽

Wataru Hashimoto ◽

Kousaku Murata

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Cell Extract ◽

Amino Acid Sequences ◽

Glycoside Hydrolase Family ◽

Sequence Information ◽

Reading Frame ◽

A Cell ◽

Terminal Amino ◽

Amino Acid Sequence Information

ABSTRACT When cells of Bacillus sp. strain GL1 were grown in a medium containing xanthan as a carbon source, α-mannosidase exhibiting activity toward p-nitrophenyl-α-d-mannopyranoside (pNP-α-d-Man) was produced intracellularly. The 350-kDa α-mannosidase purified from a cell extract of the bacterium was a trimer comprising three identical subunits, each with a molecular mass of 110 kDa. The enzyme hydrolyzed pNP-α-d-Man (Km = 0.49 mM) and d-mannosyl-(α-1,3)-d-glucose most efficiently at pH 7.5 to 9.0, indicating that the enzyme catalyzes the last step of the xanthan depolymerization pathway of Bacillus sp. strain GL1. The gene for α-mannosidase cloned most by using N-terminal amino acid sequence information contained an open reading frame (3,144 bp) capable of coding for a polypeptide with a molecular weight of 119,239. The deduced amino acid sequence showed homology with the amino acid sequences of α-mannosidases belonging to glycoside hydrolase family 38.

Old versus new characters for systematics: Cautionary tales from virology

Australian Systematic Botany ◽

10.1071/sb9900159 ◽

1990 ◽

Vol 3 (1) ◽

pp. 159

Author(s):

A Gibbs ◽

A Ding ◽

J Howe ◽

P Keese ◽

A MacKenzie ◽

...

Keyword(s):

Amino Acid ◽

Genetic Recombination ◽

Large Subunit ◽

Molecular Taxonomy ◽

Amino Acid Sequences ◽

Sequence Information ◽

Molecular Sequence ◽

Taxonomic Methods ◽

Cautionary Tales

Molecular sequence information about viruses has mostly confirmed the groupings devised by traditional taxonomic methods, but shown in addition that the genes of related species may differ in number, arrangement, orientation and in sequence homology. It has also revealed that true genetic recombination between viruses has been common, even among those with RNA genomes; indeed, most virus groups seem to have arisen by recombination. Thus, there is an unexpected wealth of genetic chaos hidden behind the façade of the phenotype, and it is possible that the difficulties that plant taxonomists have had in identifying the relationships of the major groupings of plants could have similar causes. Nonetheless, molecular taxonomy does give sensible results and this is illustrated by a classification of the large subunit Rubisco proteins of 21 plant species based on their amino acid sequences.

PROTEIN METAL BINDING RESIDUE PREDICTION BASED ON NEURAL NETWORKS

International Journal of Neural Systems ◽

10.1142/s0129065705000116 ◽

2005 ◽

Vol 15 (01n02) ◽

pp. 71-84 ◽

Cited By ~ 37

Author(s):

CHIN-TENG LIN ◽

KEN-LI LIN ◽

CHIH-HSIEN YANG ◽

I-FANG CHUNG ◽

CHUEN-DER HUANG ◽

...

Keyword(s):

Neural Networks ◽

Amino Acid ◽

Metal Ions ◽

Metal Binding ◽

Protein Structures ◽

Direct Analysis ◽

Sequence Information ◽

Amino Acid Residues ◽

Binding Residue

Over one-third of protein structures contain metal ions, which are necessary elements in living systems. Traditionally, structural biologists investigated the properties of metalloproteins (proteins that bind metal ions) by physical means, interpreting the function formation and reaction mechanisms of enzymes from their structures and from observations in in vitro experiments. For most proteins, however, only the primary structure (amino acid sequence information) is known; the three-dimensional structure is not always available. In this paper, a direct analysis method is proposed to predict the metal-binding amino acid residues of a protein from its sequence information only, using neural networks with sliding window-based feature extraction and biological feature encoding techniques. For four major bulk elements (Calcium, Potassium, Magnesium, and Sodium), the metal-binding residues are identified by the proposed method with higher than 90% sensitivity and very good accuracy under 5-fold cross validation. With such promising results, the method can be extended and used as a powerful methodology for metal-binding characterization of the rapidly increasing number of protein sequences in the future.
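
Sliding window-based feature extraction, as named in the abstract, can be sketched as follows. The one-hot encoding, the zero padding at the sequence ends, and the names `window_features` and `half_width` are illustrative assumptions; the paper's actual biological feature encoding is not described in the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_features(sequence, half_width):
    """Build one feature vector per residue from a window centred on it.

    Each window position contributes a 20-element one-hot block;
    positions that fall outside the sequence contribute all zeros.
    """
    features = []
    for center in range(len(sequence)):
        vector = []
        for pos in range(center - half_width, center + half_width + 1):
            residue = sequence[pos] if 0 <= pos < len(sequence) else None
            vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
        features.append(vector)
    return features
```

Each vector has `20 * (2 * half_width + 1)` entries and would be paired with a binary label (binding or non-binding) for the central residue when training a classifier.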
