Evaluating Autoencoder-Based Featurization and Supervised Learning for Protein Decoy Selection

Molecules ◽  
2020 ◽  
Vol 25 (5) ◽  
pp. 1146 ◽  
Author(s):  
Fardina Fathmiul Alam ◽  
Taseef Rahman ◽  
Amarda Shehu

Rapid growth in molecular structure data is renewing interest in featurizing structure. Featurizations that retain information on biological activity are particularly sought for protein molecules, where decades of research have shown that structure indeed encodes function. Research on featurization of protein structure is active, but here we assess the promise of autoencoders. Motivated by rapid progress in neural network research, we investigate and evaluate autoencoders on yielding linear and nonlinear featurizations of protein tertiary structures. An additional reason we focus on autoencoders as the engine to obtain featurizations is the versatility of their architectures and the ease with which changes to architecture yield linear versus nonlinear features. While open-source neural network libraries, such as Keras, which we employ here, greatly facilitate constructing, training, and evaluating autoencoder architectures and conducting model search, autoencoders have not yet gained popularity in the structural biology community. Here we demonstrate their utility in a practical context. Employing autoencoder-based featurizations, we address the classic problem of decoy selection in protein structure prediction. Utilizing off-the-shelf supervised learning methods, we demonstrate that the featurizations are indeed meaningful and allow detecting active tertiary structures, thus opening the way for further avenues of research.

2021 ◽  
Author(s):  
Yong-Chang Xu ◽  
Tian-Jun ShangGuan ◽  
Xue-Ming Ding ◽  
Ngaam J. Cheung

The amino acid sequence of a protein contains all the necessary information to specify its shape, which dictates its biological activities. However, it is challenging and expensive to experimentally determine the three-dimensional structure of proteins. The backbone torsion angles, as an important structural constraint, play a critical role in protein structure prediction, and accurately predicting the angles can considerably advance tertiary structure prediction by accelerating efficient sampling of the large conformational space for low-energy structures. On account of the rapid growth of protein databases and striking breakthroughs in deep learning algorithms, computational advances allow us to extract knowledge from large-scale data to address key biological questions. Here we propose evolutionary signatures that are computed from protein sequence profiles, and a deep neural network, termed ESIDEN, that adopts a straightforward architecture of recurrent neural networks with a small number of learnable parameters. The proposed ESIDEN is validated on three benchmark datasets: D2020, TEST2016/2018, and the CASPs datasets. On D2020, using the combination of the four novel features and basic features, the ESIDEN achieves a mean absolute error (MAE) of 15.8 and 20.1 for ϕ and ψ, respectively. Compared with the best-so-far methods, we show that the ESIDEN significantly improves the angle ψ, with MAE decrements of more than 2 degrees on both TEST2016 and TEST2018, and achieves a closely comparable MAE for the angle ϕ, although it adopts a simple architecture and fewer learnable parameters. On fifty-nine template-free modeling targets in the CASPs, the ESIDEN achieves high accuracy by reducing the MAEs by about 0.4 and more than 2.5 degrees on average for the torsion angles ϕ and ψ, respectively. Using the predicted torsion angles, we infer the tertiary structures of four representative template-free modeling targets that achieve high precision with regard to the root-mean-square deviation and TM-score when compared with the native structures. The results demonstrate that the ESIDEN can make accurate predictions of the torsion angles by leveraging the evolutionary signatures, compared with widely used classical features. The proposed evolutionary signatures could also be used as alternative features in predicting residue-residue distance, protein structure, and protein-ligand binding sites. Moreover, the high-precision torsion angles predicted by the ESIDEN can be used to accurately infer protein tertiary structures, and the ESIDEN could pave the way to improved protein structure prediction.
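The MAE figures quoted above for ϕ and ψ are averages of per-residue angular errors. The following is a minimal sketch of how such an error can be computed for angles in degrees. It assumes the common convention of wrapping differences to the shorter arc (so 179° and -179° differ by 2°, not 358°); the abstract does not state whether ESIDEN's evaluation uses this convention, and the function name `angular_mae` is hypothetical.

```python
def angular_mae(predicted, native):
    """Mean absolute error between two angle lists, in degrees.

    Assumption (not stated in the abstract): differences are wrapped to the
    shorter arc of the circle, so the result is always in [0, 180].
    """
    if len(predicted) != len(native):
        raise ValueError("angle lists must have the same length")
    if not predicted:
        raise ValueError("angle lists must not be empty")
    total = 0.0
    for p, n in zip(predicted, native):
        diff = abs(p - n) % 360.0           # raw difference folded into [0, 360)
        total += min(diff, 360.0 - diff)    # take the shorter way around the circle
    return total / len(predicted)


# Hypothetical per-residue psi angles for a four-residue fragment.
psi_predicted = [140.0, -175.0, -40.0, 120.0]
psi_native = [150.0, 178.0, -45.0, 135.0]
print(angular_mae(psi_predicted, psi_native))
```

Without the wrap-around step, the second residue above would contribute an error of 353° instead of 7°, which would dominate the average.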


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lupeng Kong ◽  
Fusong Ju ◽  
Haicang Zhang ◽  
Shiwei Sun ◽  
Dongbo Bu

Background: Accurate prediction of protein tertiary structures is highly desired, as knowledge of protein structures provides invaluable insights into protein functions. We have designed two approaches to protein structure prediction: a template-based modeling approach (called ProALIGN) and an ab initio prediction approach (called ProFOLD). Briefly, ProALIGN aligns a target protein with templates by exploiting the patterns of context-specific alignment motifs and then builds the final structure with reference to the homologous templates. In contrast, ProFOLD uses an end-to-end neural network to estimate inter-residue distances of target proteins and builds structures that satisfy these distance constraints. These two approaches emphasize different characteristics of target proteins: ProALIGN exploits structure information of homologous templates of target proteins, while ProFOLD exploits the co-evolutionary information carried by homologous protein sequences. Recent progress has shown that the combination of template-based modeling and ab initio approaches is promising. Results: In this study, we present FALCON2, a web server that integrates ProALIGN and ProFOLD to provide a high-quality protein structure prediction service. For a target protein, FALCON2 executes ProALIGN and ProFOLD simultaneously to predict possible structures and selects the most likely one as the final prediction result. We evaluated FALCON2 on widely used benchmarks, including 104 CASP13 (the 13th Critical Assessment of protein Structure Prediction) targets and 91 CASP14 targets. In-depth examination suggests that when high-quality templates are available, ProALIGN is superior to ProFOLD, and in other cases ProFOLD shows better performance. By integrating these two approaches with different emphases, the FALCON2 server outperforms the two individual approaches and also achieves state-of-the-art performance compared with existing approaches. Conclusions: By integrating template-based modeling and ab initio approaches, FALCON2 provides an easy-to-use and high-quality protein structure prediction service for the community, and we expect it to enable a deeper understanding of protein functions.


Author(s):  
Lina Yang ◽  
Pu Wei ◽  
Cheng Zhong ◽  
Xichun Li ◽  
Yuan Yan Tang

The spatial structure of a protein reflects its biological function and mechanism of activity. Predicting the secondary structure of a protein is the basis for predicting its spatial structure. Traditional methods based on statistics and sequential patterns do not achieve high accuracy. This paper discusses the application of a BN-GRU neural network to protein structure prediction. The main idea is to construct a Gated Recurrent Unit (GRU) neural network, which can learn long-term dependencies and handles long sequences better than traditional methods. On this basis, BN is combined with GRU to construct a new network. A Position Specific Scoring Matrix (PSSM) is combined with other features to build a new feature set. The results show that applying BN to the GRU improves accuracy. The idea in this paper can also be applied to similarity analysis of other sequences.


Molecules ◽  
2020 ◽  
Vol 25 (9) ◽  
pp. 2228
Author(s):  
Ahmed Bin Zaman ◽  
Parastoo Kamranfar ◽  
Carlotta Domeniconi ◽  
Amarda Shehu

Controlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, effectively acknowledging that having more structures increases the likelihood that some will reside near the sought biologically active structure. A major drawback of this approach is that computing a large number of structures imposes time and space costs. In this paper, we propose a novel clustering-based approach that we demonstrate significantly reduces an ensemble of generated structures without sacrificing quality. Evaluations are reported on both benchmark and CASP target proteins. Structure ensembles subjected to the proposed approach and the source code of the proposed approach are publicly available at the links provided in Section 1.


2021 ◽  
Vol 11 (Suppl_1) ◽  
pp. S13-S13
Author(s):  
Valery Novoseletsky ◽  
Mikhail Lozhnikov ◽  
Grigoriy Armeev ◽  
Aleksandr Kudriavtsev ◽  
Alexey Shaytan ◽  
...  

Background: Protein structure determination using an X-ray free-electron laser (XFEL) involves analyzing and merging a large number of snapshot diffraction patterns. Convolutional neural networks (CNNs) are widely used to solve numerous computer vision problems, e.g. image classification, and can be used for diffraction pattern analysis. However, the task of protein structure determination using CNNs alone has not yet been solved. Methods: We simulated the diffraction patterns using the Condor software library and obtained more than 1000 diffraction patterns for each structure with simulation parameters resembling real ones. To classify diffraction patterns, we tried two approaches that are widely known in the area of image classification: a classic VGG network and residual networks. Results: 1. Recognition of a protein class (GPCRs vs. globins). Globins and GPCR-like proteins are typical α-helical proteins. Each of these protein families has a large number of representatives (including those with known structure), but we used only 8 structures from each family. 12,000 diffraction patterns were used for training and 4,000 patterns for testing. Results indicate that all considered networks are able to recognize the protein family type with high accuracy. 2. Recognition of the number of protein molecules in the liposome. We considered the use of liposomes as carriers of membrane or globular proteins for sample delivery in XFEL experiments in order to improve the X-ray beam hit rate. Three sets of diffractograms for liposomes of various radii were calculated: diffractograms for empty liposomes, liposomes loaded with 5 bacteriorhodopsin molecules, and liposomes loaded with 10 bacteriorhodopsin molecules. The training set consisted of 23,625 diffraction patterns, and the test set of 7,875 patterns. We found that all networks used in our study were able to identify the number of protein molecules in liposomes independent of the liposome radius. Our findings make this approach rather promising for the use of liposomes as protein carriers in XFEL experiments. Conclusion: The performed numerical experiments show that the use of neural network algorithms for the recognition of diffraction images from single macromolecular particles makes it possible to determine changes in the structure at the angstrom scale.


2018 ◽  
Author(s):  
Hiroyuki Fukuda ◽  
Kentaro Tomii

Protein contact prediction is a crucially important step for protein structure prediction. Two types of approaches are used to predict contacts: evolutionary coupling analysis (ECA) and supervised learning. ECA uses a large multiple sequence alignment (MSA) of homologous sequences and extracts correlation information between residues. Supervised learning uses ECA results as input features and can produce higher accuracy. Here we present a new approach to contact prediction that can both extract correlation information and predict contacts in a supervised manner directly from an MSA using a deep neural network (DNN). Using the DNN, we obtain higher accuracy than with earlier ECA methods. At the same time, we can weight each sequence in the MSA to eliminate noisy sequences automatically in a supervised way. We expect that combining our method with other meta-learning methods can provide much higher contact prediction accuracy.


10.29007/j5p9 ◽  
2019 ◽  
Author(s):  
Ahmed Bin Zaman ◽  
Amarda Shehu

A central challenge in template-free protein structure prediction is controlling the quality of computed tertiary structures, also known as decoys. Given the size, dimensionality, and inherent characteristics of the protein structure space, this is non-trivial. The current mechanism employed by decoy generation algorithms relies on generating as many decoys as can be afforded. This is impractical and uninformed by any metrics of interest on a decoy dataset. In this paper, we propose to equip a decoy generation algorithm with an evolving map of the protein structure space. The map utilizes low-dimensional representations of protein structure and serves as a memory whose granularity can be controlled. Evaluations on diverse target sequences show that drastic reductions in storage do not sacrifice decoy quality, indicating the promise of the proposed mechanism for decoy generation algorithms in template-free protein structure prediction.
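One way to picture a map that "serves as a memory whose granularity can be controlled" is a grid laid over a low-dimensional embedding of the decoys, where each grid cell remembers a single decoy. The sketch below is an illustration under stated assumptions, not the authors' mechanism: the class name `DecoyMap`, the uniform grid, and the rule of keeping the lowest-energy decoy per cell are all hypothetical choices that the abstract does not specify.

```python
import math


class DecoyMap:
    """Grid over a low-dimensional embedding; each cell remembers one decoy.

    Hypothetical illustration: cell_size plays the role of the controllable
    granularity, and the keep-lowest-energy rule is an assumption.
    """

    def __init__(self, cell_size):
        if cell_size <= 0:
            raise ValueError("cell_size must be positive")
        self.cell_size = cell_size
        self.cells = {}  # cell index tuple -> (energy, decoy_id)

    def _cell(self, coords):
        # Map continuous coordinates to the integer index of their grid cell.
        return tuple(math.floor(c / self.cell_size) for c in coords)

    def offer(self, decoy_id, coords, energy):
        """Store the decoy if its cell is empty or it beats the stored energy.

        Returns True if the decoy was stored, False if it was discarded.
        """
        key = self._cell(coords)
        current = self.cells.get(key)
        if current is None or energy < current[0]:
            self.cells[key] = (energy, decoy_id)
            return True
        return False

    def retained(self):
        """Sorted identifiers of the decoys currently held in the map."""
        return sorted(decoy_id for _, decoy_id in self.cells.values())


# Hypothetical decoys: (identifier, 2-D embedding coordinates, energy).
decoys = [
    ("a", (0.2, 0.3), -10.0),
    ("b", (0.8, 0.9), -12.0),
    ("c", (0.5, 0.5), -11.0),
    ("d", (1.5, -0.5), -5.0),
]
decoy_map = DecoyMap(cell_size=1.0)
for decoy_id, coords, energy in decoys:
    decoy_map.offer(decoy_id, coords, energy)
print(decoy_map.retained())
```

In this sketch, storage is bounded by the number of occupied cells rather than by the number of decoys generated, and a larger `cell_size` gives a coarser map that retains fewer decoys.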


2019 ◽  
Author(s):  
Max Staples ◽  
Leong Chan ◽  
Dong Si ◽  
Kasey Johnson ◽  
Connor Whyte ◽  
...  

AI has recently shown great promise in bioinformatics, for example in protein structure prediction. The Critical Assessment of protein Structure Prediction (CASP) is a community-wide experiment that takes place every two years and centers on assessing the best current systems for predicting protein tertiary structures. In this paper, we survey available AI methods and features, and then explore novel methods based on reinforcement learning. Such methods will have profound implications for R&D in bioinformatics and add an additional platform to the management of innovation in biotechnology.

