Developing structural profile matrices for protein secondary structure and solvent accessibility prediction

2019 ◽  
Vol 35 (20) ◽  
pp. 4004-4010 ◽  
Author(s):  
Zafer Aydin ◽  
Nuh Azginoglu ◽  
Halil Ibrahim Bilgin ◽  
Mete Celik

Abstract Motivation: Predicting the secondary structure and solvent accessibility of proteins is among the essential steps that precede more elaborate 3D structure prediction tasks. Incorporating class label information contained in templates with known structures has the potential to improve the accuracy of prediction methods. Building a structural profile matrix is one such technique; it provides a distribution over class labels at each amino acid position of the target. Results: In this paper, a new structural profiling technique is proposed that is based on deriving PFAM families and is combined with an existing approach. Cross-validation experiments on two benchmark datasets and at various similarity intervals demonstrate that the proposed profiling strategy performs significantly better than Homolpro, a state-of-the-art method for incorporating template information, as assessed by statistical hypothesis tests. Availability and implementation: The DSPRED method can be accessed by visiting the PSP server at http://psp.agu.edu.tr. Source code and binaries are freely available at https://github.com/yusufzaferaydin/dspred. Supplementary information: Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Aminur Rab Ratul ◽  
Marcel Turcotte ◽  
M. Hamed Mozaffari ◽  
WonSook Lee

Abstract: Protein secondary structure provides an information bridge between the primary structure and the tertiary (3D) structure. Precise prediction of 8-state protein secondary structure (PSS) is widely used in the structural and functional analysis of proteins in bioinformatics. Recently, deep learning techniques have been applied in this research area and have raised Q8 accuracy remarkably. Nevertheless, from a theoretical standpoint, there is still considerable room for improvement, specifically in 8-state (Q8) protein secondary structure prediction. In this paper, we present two deep learning architectures, namely 1D-Inception and BD-LSTM, to improve the performance of 8-class PSS prediction. The input to both architectures is a carefully constructed feature matrix built from the sequence features and profile features of the proteins. First, 1D-Inception is a deep convolutional neural network inspired by the InceptionV3 model and containing three inception modules. Second, BD-LSTM is a recurrent neural network model that includes bidirectional LSTM layers. Our proposed 1D-Inception method achieved 76.65%, 71.18%, 76.86%, and 74.07% Q8 accuracy on the benchmark CullPdb6133, CB513, CASP10, and CASP11 datasets, respectively. BD-LSTM achieved 74.71%, 69.49%, 74.07%, and 72.37% Q8 accuracy on the same four datasets, respectively. Both architectures enable efficient processing of local and global interdependencies between amino acids, which helps a deep neural network predict each class accurately. To the best of our knowledge, the experimental results show that the 1D-Inception model outperforms all state-of-the-art methods on the benchmark CullPdb6133, CB513, and CASP10 datasets.
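The Q8 accuracy quoted in this abstract is, in its usual definition, the fraction of residues whose predicted 8-state label matches the reference label. The following is a minimal sketch of that metric only; the function name `q8_accuracy` and the toy label strings are hypothetical and are not taken from the paper or its benchmarks.

```python
def q8_accuracy(true_ss: str, pred_ss: str) -> float:
    """Fraction of residues whose predicted 8-state label equals the reference label."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("label strings must have the same length")
    if not true_ss:
        raise ValueError("label strings must not be empty")
    matches = sum(t == p for t, p in zip(true_ss, pred_ss))
    return matches / len(true_ss)


# Toy DSSP-style label strings (assumed for illustration, one character per residue).
true_labels = "HHHHEEEELLTT"
pred_labels = "HHHGEEEELLTS"
print(f"Q8 = {q8_accuracy(true_labels, pred_labels):.3f}")
```

In practice the metric is usually pooled over all residues of a test set rather than averaged per protein, but the abstract does not say which convention was used.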


2021 ◽  
Author(s):  
Katarzyna Stapor ◽  
Krzysztof Kotowski ◽  
Tomasz Smolarczyk ◽  
Irena Roterman

Abstract Background: The importance of protein secondary structure (SS) prediction is widely known; solving it helps reveal the role of a protein in organisms. As experimental methods are expensive and sometimes impossible, many SS predictors, mainly based on different machine learning methods, have been proposed over the years. SS prediction, as an imbalanced classification problem, should not be judged by the commonly used Q3/Q8 metrics. Moreover, as the benchmark datasets are not random samples, classical statistical null hypothesis testing based on the Neyman-Pearson approach is not appropriate. Also, state-of-the-art predictors usually have relatively long prediction times. Results: We present a new deep network, ProteinUnet2, for SS prediction, which is based on the U-Net convolutional architecture. We also propose a new statistical methodology for prediction performance assessment based on the significance from Fisher-Pitman permutation tests accompanied by practical significance measured by Cohen’s effect size. Through an extensive evaluation study, we report the performance of ProteinUnet2 in comparison with two state-of-the-art methods, SAINT and SPOT-1D, on the benchmark datasets TEST2016, TEST2018, and CASP12. Conclusions: Our results suggest that ProteinUnet2 has much shorter prediction times while matching (or outperforming) the mentioned predictors. We strongly believe that our proposed statistical methodology will be adopted and used (and even expanded) by the research community.
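To make the assessment idea concrete, here is a small sketch of a paired sign-flip permutation test together with a paired Cohen's effect size. The abstract does not state which variant of the Fisher-Pitman test or of Cohen's d the authors use, so the paired design, the Monte Carlo resampling, the function names, and the per-protein scores below are all assumptions made for illustration.

```python
import random
import statistics


def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally likely
        # to have either sign, so flip signs at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def cohens_d_paired(a, b):
    """Paired effect size: mean difference divided by the std. dev. of differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Hypothetical per-protein accuracies for two predictors on the same proteins.
scores_a = [0.80, 0.82, 0.79, 0.85, 0.81, 0.83]
scores_b = [0.78, 0.80, 0.79, 0.82, 0.80, 0.80]
print("p-value:", paired_permutation_test(scores_a, scores_b))
print("Cohen's d (paired):", cohens_d_paired(scores_a, scores_b))
```

Reporting both numbers separates statistical significance (the p-value) from practical significance (the effect size), which is the distinction the abstract draws.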


2019 ◽  
Vol 12 (03) ◽  
pp. 1950017
Author(s):  
Ye Chen ◽  
Xiaoping Yuan ◽  
Xiaohui Cang

Protein structure prediction is the prediction of the 3D structure of a protein based on its amino acid sequence. It is a key component in disciplines such as medicine, biology, and biochemistry. The prediction of the protein secondary structure of Homo sapiens is one of the more important domains. Many methods use feed-forward neural networks or SVMs combined with a sliding window. The mechanisms of these methods are too complex to yield clear and straightforward physical meanings. This paper explores population-based incremental learning (PBIL), a method that combines the mechanisms of a generational genetic algorithm with simple competitive learning. The results show that its accuracy is particularly associated with the Homo species. This new perspective reveals a number of different possibilities for performance improvements.
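For readers unfamiliar with PBIL, the sketch below shows the generic algorithm on a toy bit-counting objective: a probability vector is sampled to form a population, and the vector is then nudged toward the best sample. It does not reproduce the encoding, fitness function, or parameters used in the paper, none of which are given in the abstract; the function name `pbil` and all default values are assumptions.

```python
import random


def pbil(fitness, n_bits, pop_size=50, generations=100, lr=0.1, seed=0):
    """Generic PBIL: evolve a probability vector over bit strings."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start with no preference for 0 or 1
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sample a population from the current probability vector.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        winner = max(population, key=fitness)
        winner_fit = fitness(winner)
        if winner_fit > best_fit:
            best, best_fit = winner, winner_fit
        # Competitive-learning step: move each probability toward the winner's bit.
        probs = [(1 - lr) * p + lr * bit for p, bit in zip(probs, winner)]
    return best, best_fit, probs


# Toy objective (assumed): maximise the number of ones in a 20-bit string.
best, score, probs = pbil(sum, n_bits=20)
print(score, best)
```

The learned probability vector is directly inspectable, which is one reason such a method can be easier to interpret than a trained neural network or SVM.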


2019 ◽  
Author(s):  
Diksha Priya Lotun ◽  
Charlotte Cochard ◽  
Fabio R.J Vieira ◽  
Juliana Silva Bernardes

2dSS is a web server for visualising and comparing secondary structure predictions. It provides two main functions: “2D-alignment” and “compare predictions”. The “2D-alignment” function is designed to visualise conserved secondary structure elements in a multiple sequence alignment (MSA). With it, we can study the secondary structure content of homologous proteins (a protein family) and highlight its structural patterns. The “compare predictions” function is designed to compare the output of several secondary structure prediction tools and to check their accuracy against real secondary structure elements extracted from the 3D structure. 2dSS provides a comprehensive representation of protein secondary structure elements, and it can be used to visualise and compare secondary structures from any prediction tool. Availability: http://genome.lcqb.upmc.fr/2dss/


2020 ◽  
Vol 18 (04) ◽  
pp. 2050022
Author(s):  
Nuh Azginoglu ◽  
Zafer Aydin ◽  
Mete Celik

Predicting structural properties of proteins plays a key role in predicting the 3D structure of proteins. In this study, new structural profile matrices (SPMs) are developed for protein secondary structure, solvent accessibility and torsion angle class predictions, which could be used as input to 3D prediction algorithms. The structural templates employed in computing SPMs are detected by eight alignment methods in the LOMETS server, an affine gap alignment method, ScanProsite, PfamScan, and HHblits. The contribution of each template is weighted by its similarity to the target, which is assessed by several sequence alignment scores. For comparison, the SPMs are also computed using Homolpro, which uses BLAST for target-template alignments and does not assign weights to templates. Incorporating the SPMs into the DSPRED classifier improves the prediction accuracy significantly, as demonstrated by cross-validation experiments on two difficult benchmarks. The most accurate predictions are obtained using the SPMs derived by threading methods in the LOMETS server. On the other hand, the computational cost of computing these SPMs was the highest.
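The idea of a weighted structural profile matrix can be sketched as follows: each template contributes its class label at every aligned target position, scaled by a similarity weight, and each position is then normalised to a distribution. This is an illustrative sketch, not the DSPRED implementation. The function name, the three-letter class alphabet, the use of "-" for unaligned positions, the uniform fallback for uncovered positions, and the example weights are all assumptions.

```python
def structural_profile(target_len, templates, classes="HEL"):
    """Build a per-position class distribution from weighted template labels.

    templates: list of (aligned_labels, weight) pairs, where aligned_labels has
    one character per target position and '-' marks an unaligned position.
    """
    counts = [{c: 0.0 for c in classes} for _ in range(target_len)]
    for labels, weight in templates:
        if len(labels) != target_len:
            raise ValueError("aligned labels must span the whole target")
        for i, label in enumerate(labels):
            if label in counts[i]:  # skips '-' and any unknown symbol
                counts[i][label] += weight
    profile = []
    for row in counts:
        total = sum(row.values())
        if total > 0:
            profile.append({c: v / total for c, v in row.items()})
        else:
            # Assumption: positions with no template coverage get a uniform row.
            profile.append({c: 1.0 / len(classes) for c in classes})
    return profile


# Two hypothetical templates aligned to a 4-residue target, with assumed weights.
templates = [("HHE-", 0.75), ("HEE-", 0.25)]
profile = structural_profile(4, templates)
for position, row in enumerate(profile):
    print(position, row)
```

Setting every weight to the same value gives an unweighted profile, which corresponds to the behaviour the abstract attributes to Homolpro (no weights assigned to templates).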


Author(s):  
Gang Xu ◽  
Qinghua Wang ◽  
Jianpeng Ma

Abstract Motivation: The development of an open-source platform to predict protein 1D features and 3D structure is an important task. In this paper, we report an open-source toolkit for protein 3D structure modeling, named OPUS-X. It contains three modules: OPUS-TASS2, which predicts protein torsion angles, secondary structure and solvent accessibility; OPUS-Contact, which measures the distance and orientation information between different residue pairs; and OPUS-Fold2, which uses the constraints derived from the first two modules to guide folding. Results: OPUS-TASS2 is an upgraded version of our previous method OPUS-TASS. OPUS-TASS2 integrates protein global structure information and significantly outperforms OPUS-TASS. OPUS-Contact combines multiple raw co-evolutionary features with protein 1D features predicted by OPUS-TASS2, and delivers better results than the open-source state-of-the-art method trRosetta. OPUS-Fold2 is a complementary version of our previous method OPUS-Fold. OPUS-Fold2 is a gradient-based protein folding framework based on differentiable energy terms, as opposed to OPUS-Fold, which is a sampling-based method used to deal with non-differentiable terms. OPUS-Fold2 exhibits comparable performance to the Rosetta folding protocol in trRosetta when using identical inputs. OPUS-Fold2 is written in Python and TensorFlow 2.4, which makes it amenable to source-code-level modification. Availability: The code and pre-trained models of OPUS-X can be downloaded from https://github.com/OPUS-MaLab/opus_x. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Saad O.A. Subair ◽  
Safaai Deris

Protein secondary-structure prediction is a fundamental step in determining the 3D structure of a protein. In this chapter, a new method for predicting protein secondary structure from amino-acid sequences is proposed and implemented. The Cuff and Barton 513-protein data set is used to train and test the prediction methods under the same hardware, platforms, and environments. The newly developed method combines the knowledge of the GOR-V information theory with the power of neural networks to classify a novel protein sequence into one of three secondary-structure classes (i.e., helices, strands, and coils). The newly developed method (NN-GORV-I) is further improved by applying a filtering mechanism to the searched database, and the improved version is named NN-GORV-II. The developed prediction methods are rigorously analyzed and tested together with five other well-known prediction methods in this domain to allow easy comparison and clear conclusions.


Author(s):  
Louis Becquey ◽  
Eric Angel ◽  
Fariza Tahi

Abstract Motivation: Applied research in machine learning progresses faster when a clean dataset is available and ready to use. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread across several databases depending on the level of interest: sequence, secondary structure, 3D structure or interactions with other macromolecules. To speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, to avoid spending time on data gathering and cleaning. Results: Here, we propose a first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived from the annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided. Availability and implementation: The dataset is updated every month and is available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification. Supplementary information: Supplementary data are available at Bioinformatics online.

