ODiNPred: comprehensive prediction of protein order and disorder

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Rupashree Dass ◽  
Frans A. A. Mulder ◽  
Jakob Toudahl Nielsen

Abstract Structural disorder is widespread in eukaryotic proteins and is vital for their function in diverse biological processes. It is therefore highly desirable to be able to predict the degree of order and disorder from amino acid sequence. It is, however, notoriously difficult to predict the degree of local flexibility within structured domains and the presence and nuances of localized rigidity within intrinsically disordered regions. To identify such instances, we used the CheZOD database, which encompasses accurate, balanced, and continuous-valued quantification of protein (dis)order at amino acid resolution based on NMR chemical shifts. To computationally forecast the spectrum of protein disorder in the most comprehensive manner possible, we constructed the sequence-based protein order/disorder predictor ODiNPred, trained on an expanded version of CheZOD. ODiNPred applies a deep neural network comprising 157 unique sequence features to 1325 protein sequences together with the experimental NMR chemical shift data. Cross-validation for 117 protein sequences shows that ODiNPred better predicts the continuous variation in order along the protein sequence, suggesting that contemporary predictors are limited by the quality of training data. The inclusion of evolutionary features reduces the performance gap between ODiNPred and its peers, but analysis shows that it retains greater accuracy for the more challenging prediction of intermediate disorder.

2020 ◽  
Vol 48 (W1) ◽  
pp. W77-W84 ◽  
Author(s):  
Patryk Jarnot ◽  
Joanna Ziemska-Legiecka ◽  
Laszlo Dobson ◽  
Matthew Merski ◽  
Pablo Mier ◽  
...  

Abstract Low complexity regions (LCRs) in protein sequences are characterized by a less diverse amino acid composition compared to typically observed sequence diversity. Recent studies have shown that LCRs may co-occur with intrinsically disordered regions, are highly conserved in many organisms, and often play important roles in protein functions and in diseases. In previous decades, several methods have been developed to identify regions with LCRs or amino acid bias, but most of them are stand-alone applications, and currently there is no web-based tool that allows users to explore LCRs in protein sequences with additional functional annotations. We aim to fill this gap by providing PlaToLoCo (PLAtform of TOols for LOw COmplexity), a meta-server that integrates and collects the output of five different state-of-the-art tools for discovering LCRs and provides functional annotations such as domain detection, transmembrane segment prediction, and calculation of amino acid frequencies. In addition, the union or intersection of the results of the search on a query sequence can be obtained. By developing the PlaToLoCo meta-server, we provide the community with a fast and easily accessible tool for the analysis of LCRs with additional information included to aid the interpretation of the results. The PlaToLoCo platform is available at: http://platoloco.aei.polsl.pl/.
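The union and intersection of per-tool results mentioned in this abstract can be illustrated with a minimal sketch. It is not PlaToLoCo's code or API: the function names, the tool labels, and the 1-based inclusive (start, end) region convention are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Assumed convention: 1-based, inclusive (start, end) residue ranges.
Region = Tuple[int, int]


def to_positions(regions: List[Region]) -> Set[int]:
    """Expand inclusive regions into the set of residue positions they cover."""
    positions: Set[int] = set()
    for start, end in regions:
        positions.update(range(start, end + 1))
    return positions


def to_regions(positions: Set[int]) -> List[Region]:
    """Collapse residue positions back into sorted, maximal inclusive regions."""
    regions: List[Region] = []
    for pos in sorted(positions):
        if regions and pos == regions[-1][1] + 1:
            # Position is adjacent to the last region, so extend it.
            regions[-1] = (regions[-1][0], pos)
        else:
            regions.append((pos, pos))
    return regions


def combine(calls: Dict[str, List[Region]], mode: str = "union") -> List[Region]:
    """Combine LCR calls from several tools by union or intersection."""
    sets = [to_positions(regions) for regions in calls.values()]
    if not sets:
        return []
    if mode == "union":
        merged = set().union(*sets)
    elif mode == "intersection":
        merged = set.intersection(*sets)
    else:
        raise ValueError("mode must be 'union' or 'intersection'")
    return to_regions(merged)


# Hypothetical calls from two tools on the same query sequence.
example_calls = {"tool_a": [(5, 20)], "tool_b": [(15, 30), (40, 45)]}
union_regions = combine(example_calls, "union")
shared_regions = combine(example_calls, "intersection")
```

In this sketch the intersection keeps only residues that every tool flags, which is the stricter choice; the union keeps residues flagged by at least one tool.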


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Maria Littmann ◽  
Michael Heinzinger ◽  
Christian Dallago ◽  
Tobias Olenyi ◽  
Burkhard Rost

Abstract Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
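Annotation transfer by proximity in embedding space can be sketched as a nearest-neighbour lookup. The following is only an illustration under stated assumptions: Euclidean distance, the parameter k, the function names, the toy two-dimensional vectors, and the placeholder term labels are all hypothetical, and the abstract does not specify how distances or neighbours are chosen.

```python
import math
from typing import Dict, Sequence, Set, Tuple

# Lookup entry: (embedding vector, set of GO term labels). Both are toy values here.
Entry = Tuple[Sequence[float], Set[str]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotations(query: Sequence[float],
                         lookup: Dict[str, Entry],
                         k: int = 1) -> Set[str]:
    """Return the pooled GO terms of the k annotated proteins closest to the query."""
    ranked = sorted(lookup.items(), key=lambda item: euclidean(query, item[1][0]))
    terms: Set[str] = set()
    for _, (_, go_terms) in ranked[:k]:
        terms |= go_terms
    return terms


# Hypothetical annotated proteins with placeholder term labels.
toy_lookup = {
    "protein_1": ([0.0, 0.0], {"GO:term_a"}),
    "protein_2": ([1.0, 1.0], {"GO:term_b"}),
}
nearest_terms = transfer_annotations([0.9, 1.1], toy_lookup, k=1)
```

Real per-protein embeddings would have many more dimensions than the two used here; the lookup logic stays the same.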


2019 ◽  
Vol 73 (12) ◽  
pp. 713-725 ◽  
Author(s):  
Ruth Hendus-Altenburger ◽  
Catarina B. Fernandes ◽  
Katrine Bugge ◽  
Micha B. A. Kunze ◽  
Wouter Boomsma ◽  
...  

Abstract Phosphorylation is one of the main regulators of cellular signaling, typically occurring in flexible parts of folded proteins and in intrinsically disordered regions. It can have distinct effects on the chemical environment as well as on the structural properties near the modification site. Secondary chemical shift analysis is the main NMR method for detection of transiently formed secondary structure in intrinsically disordered proteins (IDPs), and the reliability of the analysis depends on an appropriate choice of random coil model. Random coil chemical shifts and sequence correction factors were previously determined for an Ac-QQXQQ-NH2-peptide series with X being any of the 20 common amino acids. However, a matching dataset on the phosphorylated states has so far only been incompletely determined or determined only at a single pH value. Here we extend the database by the addition of the random coil chemical shifts of the phosphorylated states of serine, threonine and tyrosine measured over a range of pH values covering the pKas of the phosphates and at several temperatures (www.bio.ku.dk/sbinlab/randomcoil). The combined results allow for accurate random coil chemical shift determination of phosphorylated regions at any pH and temperature, minimizing systematic biases of the secondary chemical shifts. Comparison of chemical shifts using random coil sets with and without inclusion of the phosphoryl group revealed under- or overestimations of helicity of up to 33%. The expanded set of random coil values will improve the reliability in detection and quantification of transient secondary structure in phosphorylation-modified IDPs.
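A secondary chemical shift is the observed shift minus the random coil reference for the same nucleus and residue, which is why the choice of random coil set matters. The sketch below shows only that subtraction. The function name is hypothetical and the shift values are invented placeholders in ppm, not data from the study or its database.

```python
from typing import Dict


def secondary_shifts(observed: Dict[int, float],
                     random_coil: Dict[int, float]) -> Dict[int, float]:
    """Per-residue secondary shift: observed minus random coil reference (ppm).

    Residues missing from the reference set are skipped rather than guessed.
    """
    return {
        residue: observed[residue] - random_coil[residue]
        for residue in observed
        if residue in random_coil
    }


# Placeholder values for one nucleus type, keyed by residue number.
observed_shifts = {10: 58.9, 11: 56.1, 12: 55.0}
reference_shifts = {10: 58.2, 11: 56.4}
delta = secondary_shifts(observed_shifts, reference_shifts)
```

Swapping in a different reference dictionary (for example, one that accounts for the phosphoryl group and one that does not) changes every value in `delta`, which is the systematic bias the abstract describes.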


2020 ◽  
Vol 168 (1) ◽  
pp. 33-40
Author(s):  
Yuya Hirai ◽  
Eisuke Domae ◽  
Yoshihiro Yoshikawa ◽  
Keizo Tomonaga

Abstract The RNA helicase DDX17 is a member of the DEAD-box protein family. DDX17 has two isoforms: p72 and p82. The p82 isoform has additional amino acid sequences called intrinsically disordered regions (IDRs), which are related to the formation of membraneless organelles (MLOs). Here, we reveal that p72 is mostly localized to the nucleoplasm, while p82 is localized to the nucleoplasm and nucleoli. Additionally, p82 exhibited slower intranuclear mobility than p72. Furthermore, the enzymatic mutants of both p72 and p82 accumulate in stress granules. The enzymatic mutant of p82 abolishes nucleolar localization of p82. Our findings suggest the importance of IDRs and enzymatic activity of DEAD-box proteins in the intracellular distribution and formation of MLOs.


2019 ◽  
Vol 17 (01) ◽  
pp. 1950004 ◽  
Author(s):  
Chun Fang ◽  
Yoshitaka Moriwaki ◽  
Aikui Tian ◽  
Caihong Li ◽  
Kentaro Shimizu

Molecular recognition features (MoRFs) are key functional regions of intrinsically disordered proteins (IDPs), which play important roles in the molecular interaction network of cells and are implicated in many serious human diseases. Identifying MoRFs is essential for both functional studies of IDPs and drug design. This study adopts a deep learning approach to develop a model for improving MoRF prediction. We propose a method named en_DCNNMoRF (ensemble deep convolutional neural network-based MoRF predictor). It combines the outcomes of two independent deep convolutional neural network (DCNN) classifiers that take advantage of different features. The first, DCNNMoRF1, employs position-specific scoring matrix (PSSM) and 22 types of amino acid-related factors to describe protein sequences. The second, DCNNMoRF2, employs PSSM and 13 types of amino acid indexes to describe protein sequences. For both single classifiers, DCNN with a novel two-dimensional attention mechanism was adopted, and an average strategy was added to further process the output probabilities of each DCNN model. Finally, en_DCNNMoRF combined the two models by averaging their final scores. When compared with other well-known tools applied to the same datasets, the accuracy of the proposed method was comparable with that of state-of-the-art methods. The related web server can be accessed freely via http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/en_MoRFs.php .
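The final combination step, averaging the per-residue scores of two classifiers, can be sketched as follows. This is not the en_DCNNMoRF implementation. The abstract does not define its "average strategy", so the moving-window smoothing here is one assumed reading, and the window size, function names, and scores are hypothetical.

```python
from typing import List


def smooth(scores: List[float], window: int = 3) -> List[float]:
    """Moving-window mean over per-residue scores (assumed post-processing).

    The window is truncated at the sequence ends, so edge residues average
    over fewer neighbours.
    """
    half = window // 2
    smoothed: List[float] = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def ensemble(scores_1: List[float], scores_2: List[float],
             window: int = 3) -> List[float]:
    """Smooth each classifier's scores, then average the two per residue."""
    if len(scores_1) != len(scores_2):
        raise ValueError("both classifiers must score the same residues")
    s1 = smooth(scores_1, window)
    s2 = smooth(scores_2, window)
    return [(a + b) / 2 for a, b in zip(s1, s2)]


# Hypothetical per-residue probabilities from two classifiers.
combined = ensemble([0.0, 1.0, 0.0], [1.0, 1.0, 1.0])
```

Averaging gives both classifiers equal weight; a weighted combination would be a different design choice that the abstract does not mention.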


2019 ◽  
Author(s):  
Taraneh Zarin ◽  
Bob Strome ◽  
Alex N Nguyen Ba ◽  
Simon Alberti ◽  
Julie D Forman-Kay ◽  
...  

Abstract Intrinsically disordered regions make up a large part of the proteome, but the sequence-to-function relationship in these regions is poorly understood, in part because the primary amino acid sequences of these regions are poorly conserved in alignments. Here we use an evolutionary approach to detect molecular features that are preserved in the amino acid sequences of orthologous intrinsically disordered regions. We find that most disordered regions contain multiple molecular features that are preserved, and we define these as “evolutionary signatures” of disordered regions. We demonstrate that intrinsically disordered regions with similar evolutionary signatures can rescue function in vivo, and that groups of intrinsically disordered regions with similar evolutionary signatures are strongly enriched for functional annotations and phenotypes. We propose that evolutionary signatures can be used to predict function for many disordered regions from their amino acid sequences.


2017 ◽  
Author(s):  
Kamil Tamiola ◽  
Matthew M Heberling ◽  
Jan Domanski

Abstract An overwhelming amount of experimental evidence suggests that elucidations of protein function, interactions, and pathology are incomplete without inclusion of intrinsic protein disorder and structural dynamics. Thus, to expand our understanding of intrinsic protein disorder, we have created a database of secondary structure (SS) propensities for proteins (dSPP) as a reference resource for experimental research and computational biophysics. The dSPP comprises SS propensities of 7,094 unrelated proteins, as gauged from NMR chemical shift measurements in solution and solid state. Here, we explain the concept of SS propensity and analyze dSPP entries of therapeutic relevance, α-synuclein, MOAG-4, and the ZIKA NS2B-NS3 complex, to show: (1) how propensity mapping generates novel structural insights into intrinsically disordered regions of pathologically relevant proteins, (2) how computational biophysics tools can benefit from propensity mapping, and (3) how the residual disorder estimation based on NMR chemical shifts compares with sequence-based disorder predictors. This work demonstrates the benefit of propensity estimation as a method that reports on protein structure, lability, and disorder.


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Jérôme Tubiana ◽  
Simona Cocco ◽  
Rémi Monasson

Statistical analysis of evolutionarily related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helices and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and 'turning up' or 'turning down' the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families.

