A general framework to learn tertiary structure for protein sequence annotation

2021 ◽  
Author(s):  
Mu Gao ◽  
Jeffrey Skolnick

During the past five years, deep-learning algorithms have enabled ground-breaking progress towards the prediction of tertiary structure from a protein sequence. Very recently, we developed SAdLSA, a new computational algorithm for protein sequence comparison via deep-learning of protein structural alignments. SAdLSA shows significant improvement over established sequence alignment methods. In this contribution, we show that SAdLSA provides a general machine-learning framework for structurally characterizing protein sequences. By aligning a protein sequence against itself, SAdLSA generates a fold distogram for the input sequence, including challenging cases whose structural folds were not present in the training set. About 70% of the predicted distograms are statistically significant. Although at present the accuracy of the intra-sequence distogram predicted by SAdLSA self-alignment is not as good as that of deep-learning algorithms specifically trained for distogram prediction, it is remarkable that the prediction of single protein structures is encoded by an algorithm that learns ensembles of pairwise structural comparisons, without being explicitly trained to recognize individual structural folds. As such, SAdLSA can not only predict protein folds for individual sequences, but also detect subtle, yet significant, structural relationships between multiple protein sequences using the same deep-learning neural network. The former reduces to a special case of this general framework for protein sequence annotation.


SLEEP ◽  
2021 ◽  
Vol 44 (Supplement_2) ◽  
pp. A161-A162
Author(s):  
Soonhyun Yook ◽  
Chaitanya Gupte ◽  
Zhixian Han ◽  
Eun Yeon Joo ◽  
Hea Ree Park ◽  
...  

Abstract Introduction Using deep learning algorithms, we investigated the univariate and multivariate effects of four polysomnography features, heart rate (HR), electrocardiogram (ECG), oxygen saturation (SpO2), and nasal airflow (NAF), on the identification of sleep apnea and hypopnea events. Such an explanatory analysis, which may clarify the sensitivity and specificity of these features to apnea and hypopnea events, has not previously been performed. Methods We studied 804 polysomnography samples from 704 patients with obstructive sleep apnea and 100 controls. The input data were converted into scalograms as 4-channel 2D images to train Xception networks. For training, 77,638 patches with a 30-second time width were sampled from the original 6-hour sleep recordings; 10% of these patches were set aside as the test set. With each feature set, we tested the following classifications: 1) normal vs. apnea vs. hypopnea; 2) normal vs. apnea+hypopnea; 3) normal vs. apnea; and 4) normal vs. hypopnea. Results SpO2 classified normal vs. apnea most accurately (98%), followed by NAF (85%), ECG (77%), and HR (63%). SpO2 also showed the highest accuracy in classifying normal vs. hypopnea (87%), normal vs. apnea+hypopnea (96%), and the three groups (82%). When the combination of all four features was used, the classification accuracies generally improved compared to using SpO2 alone (normal vs. apnea: 99%; vs. hypopnea: 89%; vs. apnea+hypopnea: 94%; three groups: 86%). Conclusion Deep learning with the SpO2 or NAF feature most accurately distinguished apneas from normal sleep events, suggesting that these features characterize sleep apnea events well. Oxygen desaturation, a typical pattern of hypopnea, was the only feature showing reliable accuracy in classifying hypopnea vs. normal. Nevertheless, combining the four polysomnography features could improve the identification of sleep apnea and hypopnea. Furthermore, classifying normal vs. apnea+hypopnea was more accurate than separately classifying the three groups, suggesting deep learning approaches as a primary screening tool. Since the classification accuracy using SpO2 was higher than with any other feature, portable equipment that measures SpO2 and runs deep learning algorithms has the potential to provide inexpensive, accurate diagnosis of obstructive sleep apnea syndrome. Support (if any) This study was supported by USC Stevens Center for Innovation Technology Advancement Grants (TAG) and a BrightFocus Foundation Award (A2019052S).
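A minimal sketch of how such a scalogram-plus-Xception pipeline could be assembled is shown below; the wavelet, patch dimensions, and label coding are illustrative assumptions, not the study's reported configuration.

```python
# Hypothetical sketch: 4-channel scalogram patches (HR, ECG, SpO2, NAF) -> Xception classifier.
# Patch shape, wavelet choice, and class labels are assumptions for illustration only.
import numpy as np
import pywt
import tensorflow as tf

def to_scalogram(signal_window, n_scales=96, out_len=192):
    """Continuous wavelet transform of one 30-second signal window, resized to a fixed 2-D patch."""
    coeffs, _ = pywt.cwt(signal_window, np.arange(1, n_scales + 1), "morl")  # (n_scales, len(window))
    coeffs = np.abs(coeffs)
    idx = np.linspace(0, coeffs.shape[1] - 1, out_len).astype(int)           # crude time-axis resize
    return coeffs[:, idx]

def make_patch(hr, ecg, spo2, naf):
    """Stack the four per-signal scalograms into one 4-channel image patch."""
    return np.stack([to_scalogram(c) for c in (hr, ecg, spo2, naf)], axis=-1)  # (96, 192, 4)

# Xception trained from scratch (weights=None) because the input has 4 channels rather than RGB.
model = tf.keras.applications.Xception(weights=None, input_shape=(96, 192, 4), classes=3)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_patches, train_labels, validation_split=0.1)  # labels: 0=normal, 1=apnea, 2=hypopnea
```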


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Johannes Linder ◽  
Georg Seelig

Abstract Background Optimization of DNA and protein sequences based on Machine Learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence. Results Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp’s capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor. Conclusions Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
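The straight-through activation maximization idea described above can be sketched as follows; this is a simplified stand-in against a generic differentiable fitness predictor, not the published Fast SeqProp implementation, and the logit normalization is a plain per-position standardization assumed for illustration.

```python
# Sketch of straight-through activation maximization over a one-hot sequence parameterization.
# `predictor` is any differentiable fitness model mapping (batch, length, alphabet) -> score.
import torch
import torch.nn.functional as F

def design_sequence(predictor, seq_len=100, n_chars=4, steps=1000, lr=0.1):
    logits = torch.randn(seq_len, n_chars, requires_grad=True)   # trainable sequence parameters
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        # Normalize logits per position so the input distribution parameters cannot drift or saturate.
        norm = (logits - logits.mean(dim=-1, keepdim=True)) / (logits.std(dim=-1, keepdim=True) + 1e-6)
        probs = F.softmax(norm, dim=-1)
        # Straight-through: the forward pass sees a discrete one-hot sequence,
        # the backward pass routes gradients through the soft probabilities.
        one_hot = F.one_hot(probs.argmax(dim=-1), n_chars).float()
        st_sample = one_hot + probs - probs.detach()
        fitness = predictor(st_sample.unsqueeze(0)).mean()
        opt.zero_grad()
        (-fitness).backward()                                     # gradient ascent on predicted fitness
        opt.step()
    return logits.argmax(dim=-1)                                  # designed sequence as character indices
```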


2021 ◽  
Author(s):  
Fatemeh Zare-Mirakabad ◽  
Armin Behjati ◽  
Seyed Shahriar Arab ◽  
Abbas Nowzari-Dalini

Protein sequences can be viewed as a language; therefore, we can benefit from models initially developed for natural languages, such as transformers. ProtAlbert is one of the best pre-trained transformers on protein sequences, and its efficiency enables us to run the model on longer sequences with less computational power while achieving performance similar to other pre-trained transformers. This paper includes two main parts: transformer analysis and profile prediction. In the first part, we propose five algorithms to assess the attention heads in different layers of ProtAlbert with respect to five protein characteristics: nearest-neighbor interactions, amino acid type, biochemical and biophysical properties of amino acids, protein secondary structure, and protein tertiary structure. These algorithms are applied to 55 proteins extracted from CASP13 and three case-study proteins whose sequences, experimental tertiary structures, and HSSP profiles are available. This assessment shows that, although the model is only pre-trained on protein sequences, attention heads in the layers of ProtAlbert are representative of some protein family characteristics. This conclusion leads to the second part of our work. We propose an algorithm called PA_SPP for protein sequence profile prediction by pre-trained ProtAlbert using masked-language modeling. The PA_SPP algorithm can help researchers predict an HSSP profile for a query sequence when the database contains no similar sequences from which to build such a profile.
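The masked-language-modeling principle behind PA_SPP can be illustrated with the Hugging Face transformers API, assuming the publicly released ProtAlbert checkpoint (referenced here as Rostlab/prot_albert); this is a sketch of the idea, not the authors' PA_SPP code, and the resulting distributions are over the tokenizer vocabulary rather than a finished HSSP profile.

```python
# Sketch: mask each position in turn and read ProtAlbert's output distribution at that position.
# One forward pass per residue, so this is only practical for short sequences as written.
import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

tokenizer = AlbertTokenizer.from_pretrained("Rostlab/prot_albert", do_lower_case=False)
model = AlbertForMaskedLM.from_pretrained("Rostlab/prot_albert").eval()

def predict_profile(sequence):
    """Return a (len(sequence), vocab_size) matrix of per-position token probabilities."""
    rows = []
    for i in range(len(sequence)):
        residues = list(sequence)
        residues[i] = tokenizer.mask_token                           # mask one position at a time
        inputs = tokenizer(" ".join(residues), return_tensors="pt")  # ProtAlbert expects spaced residues
        with torch.no_grad():
            logits = model(**inputs).logits
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
        rows.append(torch.softmax(logits[0, mask_pos], dim=-1))
    return torch.stack(rows)

profile = predict_profile("MKTAYIAKQR")                              # toy sequence for illustration
```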


2020 ◽  
Vol 10 (5) ◽  
pp. 6306-6316

Protein fold prediction is a milestone step towards predicting protein tertiary structure from a protein sequence. It is one of the most researched topics in computational biology, with applications in structural biology and medicine. Extracting informative features is a key step in protein fold prediction. Here, actionable features are extracted from keywords in the sequence header and from secondary structure representations of the protein sequence. Keywords holding species information are used as features after verification against the UniRef100 dataset using the taxonomy ID (TaxId). Prominent patterns are identified experimentally based on the nature of the protein structural class and protein fold, and global and native features are extracted to capture these patterns. It is found that keyword-based features are positively correlated with protein folds. Keywords indicating species are important for observing functional differences, which helps guide the prediction process. The SCOPe 2.07 and EDD datasets are used: EDD is a benchmark dataset, and SCOPe 2.07 is the latest and largest dataset of ASTRAL protein sequences. A Random Forest classifier is trained on the SCOPe 2.07 training set using a 93-dimensional feature vector. On the SCOPe 2.07 test set, the prediction accuracy is better than 95%; on the benchmark EDD dataset, it is better than 93%, which, to the best of our knowledge, is the best reported result.
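For illustration only, the final classification step described above, a Random Forest over 93-dimensional feature vectors, might look like the sketch below; the paper's actual feature extraction (header keywords verified by TaxId, secondary structure patterns) is replaced here by placeholder arrays.

```python
# Placeholder sketch: Random Forest fold classification on 93-dimensional feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 93)), rng.integers(0, 30, 1000)  # stand-ins for real features/folds
X_test, y_test = rng.random((200, 93)), rng.integers(0, 30, 200)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("fold prediction accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```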


2019 ◽  
Vol 116 (28) ◽  
pp. 13996-14001 ◽  
Author(s):  
Jae Yong Ryu ◽  
Hyun Uk Kim ◽  
Sang Yup Lee

High-quality and high-throughput prediction of enzyme commission (EC) numbers is essential for accurate understanding of enzyme functions, which have many implications in pathologies and industrial biotechnology. Several EC number prediction tools are currently available, but their prediction performance needs to be further improved to precisely and efficiently process an ever-increasing volume of protein sequence data. Here, we report DeepEC, a deep learning-based computational framework that predicts EC numbers for protein sequences with high precision and in a high-throughput manner. DeepEC takes a protein sequence as input and predicts EC numbers as output. DeepEC uses 3 convolutional neural networks (CNNs) as a major engine for the prediction of EC numbers, and also implements homology analysis for EC numbers that cannot be classified by the CNNs. Comparative analyses against 5 representative EC number prediction tools show that DeepEC allows the most precise prediction of EC numbers, and is the fastest and the lightest in terms of the disk space required. Furthermore, DeepEC is the most sensitive in detecting the effects of mutated domains/binding site residues of protein sequences. DeepEC can be used as an independent tool, and also as a third-party software component in combination with other computational platforms that examine metabolic reactions.
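A hedged sketch of a CNN-based, multi-label EC-number classifier in the spirit of DeepEC is given below; the filter sizes, pooling scheme, and class count are assumptions rather than DeepEC's published architecture, and the homology fallback for sequences the CNNs cannot classify is omitted.

```python
# Sketch: one-hot protein sequence -> 1D CNN -> per-EC-number probabilities (multi-label).
import torch
import torch.nn as nn

class ECConvNet(nn.Module):
    def __init__(self, n_ec_classes, n_amino_acids=21):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_amino_acids, 128, kernel_size=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                 # max-pool over the sequence length
        )
        self.fc = nn.Linear(128, n_ec_classes)

    def forward(self, one_hot_seq):                  # (batch, n_amino_acids, max_len)
        h = self.conv(one_hot_seq).squeeze(-1)
        return torch.sigmoid(self.fc(h))             # independent probability per EC number

model = ECConvNet(n_ec_classes=4000)                 # placeholder class count, not DeepEC's
scores = model(torch.zeros(2, 21, 1000))             # dummy one-hot batch of length-1000 sequences
predicted_ecs = scores > 0.5                         # assign every EC number above a threshold
```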


Author(s):  
Mu Gao ◽  
Jeffrey Skolnick

Abstract Motivation From evolutionary inference and function annotation to structural prediction, protein sequence comparison has provided crucial biological insights. While many sequence alignment algorithms have been developed, existing approaches often cannot detect hidden structural relationships in the ‘twilight zone’ of low sequence identity. To address this critical problem, we introduce a computational algorithm that performs protein Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA, silent ‘d’). The key idea is to implicitly learn the protein folding code from many thousands of structural alignments using experimentally determined protein structures. Results To demonstrate that the folding code was learned, we first show that SAdLSA trained on pure α-helical proteins successfully recognizes pairs of structurally related pure β-sheet protein domains. Subsequent training and benchmarking on larger, highly challenging datasets show significant improvement over established approaches. For challenging cases, SAdLSA is ∼150% better than HHsearch for generating pairwise alignments and ∼50% better for identifying the proteins with the best alignments in a sequence library. The time complexity of SAdLSA is O(N) thanks to GPU acceleration. Availability and implementation Datasets and source code of SAdLSA are available free of charge for academic users at http://sites.gatech.edu/cssb/sadlsa/. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Allan Costa ◽  
Manvitha Ponnapati ◽  
Joseph M Jacobson ◽  
Pranam Chatterjee

Determining the structure of proteins has been a long-standing goal in biology. Language models have recently been deployed to capture the evolutionary semantics of protein sequences. Enriched with multiple sequence alignments (MSA), these models can encode protein tertiary structure. In this work, we introduce an attention-based graph architecture that exploits MSA Transformer embeddings to directly produce three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yang Wang ◽  
Zhanchao Li ◽  
Yanfei Zhang ◽  
Yingjun Ma ◽  
Qixing Huang ◽  
...  

Abstract Background The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, laborious, and time-consuming to determine protein–protein interactions (PPIs), and there is a strong demand for effective bioinformatics approaches to identify potential PPIs. Given the large amount of PPI data, high-performance processors can be used to enhance the capability of deep learning methods that predict interactions directly from protein sequences. Results We propose the Sequence-Statistics-Content (SSC) protein sequence encoding format, based on information extracted from the original sequence, to further improve the performance of convolutional neural networks. The original protein sequences are encoded in a three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which adds unique sequence features that enhance the performance of the deep learning model. On protein–protein interaction prediction tasks, the results using a 2D convolutional neural network (2D CNN) with the SSC encoding method are better than those of a 1D CNN with one-hot encoding. Independent validation on new interactions from the HIPPIE database (version 2.1, published on July 18, 2017) and validation of directly predicted results with a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model. Conclusion The proposed protein sequence encoding method is efficient at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods. Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/.
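One plausible reading of the three-channel SSC idea is sketched below; the exact definitions of the statistical and bigram channels are given in the paper, so the formulas here are simplified assumptions for illustration.

```python
# Simplified three-channel encoding in the spirit of SSC (one-hot / frequency / bigram channels).
# Assumes the 20 standard amino acids; the real SSC channel definitions may differ.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def ssc_like_encode(seq, max_len=512):
    n = len(AMINO_ACIDS)
    enc = np.zeros((3, max_len, n), dtype=np.float32)
    counts = np.zeros(n)
    for aa in seq:
        counts[AA_INDEX[aa]] += 1
    freqs = counts / max(len(seq), 1)
    for pos, aa in enumerate(seq[:max_len]):
        i = AA_INDEX[aa]
        enc[0, pos, i] = 1.0                          # channel 1: one-hot residue identity
        enc[1, pos, i] = freqs[i]                     # channel 2: per-residue frequency (statistics)
        if pos + 1 < len(seq):
            j = AA_INDEX[seq[pos + 1]]
            enc[2, pos, i] = (i * n + j) / (n * n)    # channel 3: normalized bigram index
    return enc                                        # (3, max_len, 20) array, ready for a 2D CNN

x = ssc_like_encode("MKTAYIAKQR")                     # toy sequence
```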


2020 ◽  
Author(s):  
Javier Caceres-Delpiano ◽  
Roberto Ibañez ◽  
Patricio Alegre ◽  
Cynthia Sanhueza ◽  
Romualdo Paz-Fiblas ◽  
...  

Abstract Protein sequences are highly dimensional and present one of the main problems for the optimization and study of sequence-structure relations. The intrinsic degeneration of protein sequences is hard to follow, but the continued discovery of new protein structures has shown that there is convergence in terms of the possible folds that proteins can adopt, such that proteins with sequence identities lower than 30% may still fold into similar structures. Given that proteins share a set of conserved structural motifs, machine-learning algorithms can play an essential role in the study of sequence-structure relations. Deep-learning neural networks are becoming an important tool in the development of new techniques, such as protein modeling and design, and they continue to gain power as new algorithms are developed and as increasing amounts of data are released every day. Here, we trained a deep-learning model based on previous recurrent neural networks to design analog protein structures, using representation learning that draws on the evolutionary and structural information of proteins. We test the capabilities of this model by creating de novo variants of an antifungal peptide, with sequence identities of 50% or lower relative to the wild-type (WT) peptide. We show by in silico approximations, such as molecular dynamics, that the new variants and the WT peptide can successfully bind to a chitin surface with comparable relative binding energies. These results are supported by in vitro assays, where the de novo designed peptides showed antifungal activity that equaled or exceeded that of the WT peptide.

