Identifying molecular recognition features in intrinsically disordered regions of proteins by transfer learning

Abstract

Motivation: Protein intrinsic disorder describes the tendency of sequence residues not to fold into a rigid three-dimensional shape by themselves. However, some of these disordered regions can transition from disorder to order when interacting with another molecule, in segments known as molecular recognition features (MoRFs). Previous analysis has shown that these MoRF regions are indirectly encoded within the prediction of residue disorder as low-confidence predictions [i.e. in a semi-disordered state P(D)≈0.5]. Thus, what has been learned for disorder prediction may be transferable to MoRF prediction. Transferring the internal characterization of protein disorder to the prediction of MoRF residues would allow us to take advantage of the large training set available for disorder prediction, enabling the training of larger analytical models than is feasible on the small number of currently available annotated MoRF proteins. In this paper, we propose a new method for MoRF prediction by transfer learning from the SPOT-Disorder2 ensemble models built for disorder prediction.

Results: We confirm that directly training on the MoRF set with a randomly initialized model yields substantially poorer performance on independent test sets than the transfer-learning-based method SPOT-MoRF, for both deep and simple networks. Comparison with current state-of-the-art techniques shows that SPOT-MoRF performs better at identifying MoRF binding regions in proteins across two independent testing sets, including our new dataset of >800 protein chains. These test chains share <30% sequence similarity with all training and validation proteins used in SPOT-Disorder2 and SPOT-MoRF, and provide a much-needed large-scale update on the performance of current MoRF predictors. The method is expected to be useful in locating functional disordered regions in proteins.

Availability and implementation: SPOT-MoRF and its data are available as a web server and as a standalone program at: http://sparks-lab.org/jack/server/SPOT-MoRF/index.php.

Supplementary information: Supplementary data are available at Bioinformatics online.
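The observation that MoRF regions tend to appear as semi-disordered residues, with P(D) near 0.5, can be illustrated with a small sketch. This is not the SPOT-MoRF method, which uses transfer learning from SPOT-Disorder2; it is only a naive baseline that flags contiguous runs of residues whose predicted disorder probability falls inside a band around 0.5. The function name, the band half-width `tolerance`, the minimum run length `min_length` and the toy profile are all assumptions made for illustration.

```python
from typing import List, Tuple


def semi_disordered_segments(
    p_disorder: List[float],
    tolerance: float = 0.15,  # assumed half-width of the band around P(D) = 0.5
    min_length: int = 5,      # assumed minimum run length, in residues
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of runs with |P(D) - 0.5| <= tolerance."""
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current in-band run began, if any
    for i, p in enumerate(p_disorder):
        in_band = abs(p - 0.5) <= tolerance
        if in_band and start is None:
            start = i
        elif not in_band and start is not None:
            # The run ended at i; keep it only if it is long enough.
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(p_disorder) - start >= min_length:
        segments.append((start, len(p_disorder)))
    return segments


# Toy per-residue disorder profile (made-up values, not real predictor output).
profile = [0.05, 0.1, 0.45, 0.5, 0.55, 0.6, 0.48, 0.9, 0.95, 0.5, 0.1]
print(semi_disordered_segments(profile))
```

A threshold rule like this is at best a rough starting point; the abstract's argument is that a model pretrained on disorder prediction captures this signal far better than a model trained on the small MoRF set alone.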

A sequence-based computational method for prediction of MoRFs

RSC Advances ◽ 10.1039/c6ra27161h ◽ 2017 ◽ Vol 7 (31) ◽ pp. 18937-18945 ◽ Cited By ~ 5

Author(s): Yu Wang ◽ Yanzhi Guo ◽ Xuemei Pu ◽ Menglong Li

Keyword(s): Molecular Recognition ◽ Computational Method ◽ Intrinsically Disordered ◽ Intrinsically Disordered Regions ◽ Partner Proteins ◽ Molecular Recognition Features ◽ Disordered Regions

Molecular recognition features (MoRFs) are relatively short segments (10–70 residues) within intrinsically disordered regions (IDRs) that can undergo disorder-to-order transitions during binding to partner proteins.

IDP-Seq2Seq: identification of intrinsically disordered regions based on sequence to sequence learning

Bioinformatics ◽ 10.1093/bioinformatics/btaa667 ◽ 2020 ◽ Cited By ~ 3

Author(s): Yi-Jun Tang ◽ Yi-He Pang ◽ Bin Liu

Keyword(s): Language Processing ◽ Sequence Space ◽ Sequence Learning ◽ Function Analysis ◽ Predictive Performance ◽ Semantic Space ◽ Supplementary Information ◽ Intrinsically Disordered ◽ Intrinsically Disordered Regions ◽ Disordered Regions

Abstract

Motivation: Intrinsically disordered regions (IDRs) are widely distributed in proteins and are related to many important biological functions. Accurate prediction of IDRs is critical for protein structure and function analysis. However, existing computational methods construct their predictive models solely in the sequence space, failing to convert the sequence space into a ‘semantic space’ that reflects the structural characteristics of proteins. Furthermore, although length-dependent predictors have shown promising results, new fusion strategies should be explored to improve their predictive performance and generalization.

Results: In this study, we applied Sequence to Sequence Learning (Seq2Seq), derived from natural language processing (NLP), to map protein sequences to a ‘semantic space’ that reflects structural patterns with the help of predicted residue–residue contacts (CCMs) and other sequence-based features. Furthermore, the Attention mechanism was used to capture the global associations between all residue pairs in the proteins. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction and IDP-Seq2Seq-G for both long and short disordered region predictions. Finally, these three predictors were fused into one predictor called IDP-Seq2Seq to improve the discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive to the ratio of long to short disordered regions and outperforms other competing methods.

Availability and implementation: For the convenience of experimental scientists, a user-friendly and publicly accessible web server for the new predictor has been established at http://bliulab.net/IDP-Seq2Seq/. It is anticipated that IDP-Seq2Seq will become a very useful tool for the identification of IDRs.

Supplementary information: Supplementary data are available at Bioinformatics online.

Role of “dual-personality” fragments in HEV adaptation—analysis of Y-domain region

Journal of Genetic Engineering and Biotechnology ◽ 10.1186/s43141-021-00238-8 ◽ 2021 ◽ Vol 19 (1)

Author(s): Zoya Shafat ◽ Anwar Ahmed ◽ Mohammad K. Parvez ◽ Shama Parveen

Keyword(s): Molecular Recognition ◽ Functional Annotation ◽ RNA Binding ◽ Intrinsic Disorder ◽ Hepatitis E ◽ Intrinsically Disordered ◽ Intrinsically Disordered Regions ◽ Wide Range ◽ Disordered Regions

Abstract

Background: Hepatitis E is a liver disease caused by the pathogen hepatitis E virus (HEV). The largest polyprotein, open reading frame 1 (ORF1), contains a nonstructural Y-domain region (YDR) whose activity in HEV adaptation remains uncharted. Disordered regions in several nonstructural proteins have been shown to participate in the multiplication and multiple regulatory functions of viruses. Thus, the intrinsic disorder of YDR, including its structural and functional annotation, was comprehensively studied using computational methodologies to delineate its role in viral adaptation.

Results: Our findings showed that YDR contains significantly higher levels of ordered regions and a lower prevalence of disordered residues. Sequence-based analysis of YDR revealed it to be a “dual personality” (DP) protein owing to the presence of both structured and unstructured (intrinsically disordered) regions. The evolution of YDR was shaped by pressures that led to a predominance of both disorder-promoting and order-promoting amino acids (Ala, Arg, Gly, Ile, Leu, Phe, Pro, Ser, Tyr, Val). Additionally, the predominance of characteristic DP residues (Thr, Arg, Gly, and Pro) further showed that YDR possesses both ordered and disordered characteristics. The intrinsic disorder propensity analysis of YDR revealed it to be a moderately disordered protein. All the YDR sequences contained molecular recognition features (MoRFs), i.e., intrinsic disorder-based protein–protein interaction (PPI) sites, in addition to several nucleotide-binding sites. Thus, the presence of molecular recognition sites (PPI, RNA binding, and DNA binding) indicates that YDR interacts with specific partners and host membranes, leading to further viral infection. The presence of various disorder-based phosphorylation sites further indicates a role for YDR in various biological processes. Furthermore, functional annotation of YDR revealed it to be a multifunctional protein, owing to its capacity to bind a wide range of ligands and its involvement in various catalytic activities.

Conclusions: As DP proteins are targets for regulation, YDR contributes to cellular signaling processes through PPIs. As YDR is incompletely understood, our data on disorder-based function could help in better understanding its associated functions. Collectively, the data from this comprehensive investigation represent the first attempt to delineate the role of YDR in the regulation and pathogenesis of HEV.

IDRMutPred: predicting disease-associated germline nonsynonymous single nucleotide variants (nsSNVs) in intrinsically disordered regions

Bioinformatics ◽ 10.1093/bioinformatics/btaa618 ◽ 2020 ◽ Vol 36 (20) ◽ pp. 4977-4983 ◽ Cited By ~ 1

Author(s): Jing-Bo Zhou ◽ Yao Xiong ◽ Ke An ◽ Zhi-Qiang Ye ◽ Yun-Dong Wu

Keyword(s): Prediction Models ◽ Disease Association ◽ Training Data ◽ Supplementary Information ◽ Single Nucleotide Variants ◽ Sequence Alignments ◽ Single Nucleotide ◽ Intrinsically Disordered ◽ Intrinsically Disordered Regions ◽ Disordered Regions

Abstract

Motivation: Despite the lack of folded structure, intrinsically disordered regions (IDRs) of proteins play versatile roles in various biological processes, and many nonsynonymous single nucleotide variants (nsSNVs) in IDRs are associated with human diseases. The continuous accumulation of nsSNVs resulting from the wide application of NGS has driven the development of disease-association prediction methods for decades. However, their performance on nsSNVs in IDRs remains inferior, possibly because nsSNVs from structured regions dominate the training data. Therefore, there is a strong need for a disease-association predictor built specifically for nsSNVs in IDRs with better performance.

Results: We present IDRMutPred, a machine learning-based tool specifically for predicting disease-associated germline nsSNVs in IDRs. Based on 17 selected optimal features extracted from sequence alignments, protein annotations, hydrophobicity indices and disorder scores, IDRMutPred was trained using three ensemble learning algorithms on a training dataset containing only IDR nsSNVs. Evaluation on the two testing datasets shows that all three prediction models significantly outperform 17 other popular general predictors, achieving ACC between 0.856 and 0.868 and MCC between 0.713 and 0.737. IDRMutPred will prioritize disease-associated IDR germline nsSNVs more reliably than general predictors.

Availability and implementation: The software is freely available at http://www.wdspdb.com/IDRMutPred.

Supplementary information: Supplementary data are available at Bioinformatics online.
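For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how accuracy (ACC) and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix. The function names and the confusion-matrix counts are made up for illustration and are not taken from the IDRMutPred evaluation.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; ranges from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: define MCC as 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: disease-associated = positive, neutral = negative.
tp, tn, fp, fn = 90, 85, 15, 10
print(f"ACC = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated than ACC when the two classes are imbalanced, which is one reason both values are usually reported together.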

Promiscuity as a functional trait: intrinsically disordered regions as central players of interactomes

Biochemical Journal ◽ 10.1042/bj20130545 ◽ 2013 ◽ Vol 454 (3) ◽ pp. 361-369 ◽ Cited By ~ 108

Author(s): Alexander Cumberworth ◽ Guillaume Lamour ◽ M. Madan Babu ◽ Jörg Gsponer

Keyword(s): Low Complexity ◽ Functional Trait ◽ Protein Interaction Networks ◽ Altered Expression ◽ Intrinsically Disordered ◽ Intrinsically Disordered Regions ◽ Linear Motifs ◽ Eukaryotic Genomes ◽ Molecular Recognition Features ◽ Disordered Regions

Because of their pervasiveness in eukaryotic genomes and their unique properties, understanding the role that ID (intrinsically disordered) regions in proteins play in the interactome is essential for gaining a better understanding of the network. Especially critical in determining this role is their ability to bind more than one partner using the same region. Studies have revealed that proteins containing ID regions tend to take a central role in protein interaction networks; specifically, they act as hubs, interacting with multiple different partners across time and space, allowing for the co-ordination of many cellular activities. There appear to be three different modules within ID regions responsible for their functionally promiscuous behaviour: MoRFs (molecular recognition features), SLiMs (small linear motifs) and LCRs (low complexity regions). These regions allow for functionality such as engaging in the formation of dynamic heteromeric structures which can serve to increase local activity of an enzyme or store a collection of functionally related molecules for later use. However, the use of promiscuity does not come without a cost: a number of diseases that have been associated with ID-containing proteins seem to be caused by undesirable interactions occurring upon altered expression of the ID-containing protein.
