Natural protein sequences are more intrinsically disordered than random sequences

2016 ◽  
Vol 73 (15) ◽  
pp. 2949-2957 ◽  
Author(s):  
Jia-Feng Yu ◽  
Zanxia Cao ◽  
Yuedong Yang ◽  
Chun-Ling Wang ◽  
Zhen-Dong Su ◽  
...  
2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Maria Littmann ◽  
Michael Heinzinger ◽  
Christian Dallago ◽  
Tobias Olenyi ◽  
Burkhard Rost

Abstract: Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
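The idea of transferring annotations by proximity in embedding space can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors stand in for real SeqVec embeddings, the labels `GO:A`, `GO:B`, and `GO:C` are placeholder terms, and the Euclidean distance, the `k` parameter, and the `1 / (1 + d)` confidence score are all assumptions chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transfer_annotations(query, lookup, k=1):
    """Return {term: score} copied from the k lookup proteins nearest to query.

    The score 1 / (1 + distance) is an illustrative choice, not the
    reliability measure used in the paper.
    """
    ranked = sorted(lookup, key=lambda e: euclidean(query, e["embedding"]))
    scores = {}
    for entry in ranked[:k]:
        confidence = 1.0 / (1.0 + euclidean(query, entry["embedding"]))
        for term in entry["terms"]:
            # Keep the best score if several neighbours share a term.
            scores[term] = max(scores.get(term, 0.0), confidence)
    return scores

# Hypothetical lookup set: toy embeddings with placeholder term labels.
LOOKUP = [
    {"id": "protA", "embedding": [0.0, 0.0, 1.0], "terms": {"GO:A"}},
    {"id": "protB", "embedding": [1.0, 1.0, 0.0], "terms": {"GO:B", "GO:C"}},
]

predicted = transfer_annotations([0.9, 1.0, 0.1], LOOKUP, k=1)
print(predicted)
```

In this toy case the query lies closest to `protB`, so it inherits that protein's two terms; no sequence alignment is involved at any point.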


PROTEOMICS ◽  
2018 ◽  
Vol 19 (6) ◽  
pp. 1800058 ◽  
Author(s):  
Ronesh Sharma ◽  
Alok Sharma ◽  
Gaurav Raicar ◽  
Tatsuhiko Tsunoda ◽  
Ashwini Patil

2019 ◽  
Vol 17 (01) ◽  
pp. 1950004 ◽  
Author(s):  
Chun Fang ◽  
Yoshitaka Moriwaki ◽  
Aikui Tian ◽  
Caihong Li ◽  
Kentaro Shimizu

Molecular recognition features (MoRFs) are key functional regions of intrinsically disordered proteins (IDPs), which play important roles in the molecular interaction network of cells and are implicated in many serious human diseases. Identifying MoRFs is essential for both functional studies of IDPs and drug design. This study uses deep learning to develop a model for improving MoRF prediction. We propose a method named en_DCNNMoRF (ensemble deep convolutional neural network-based MoRF predictor). It combines the outcomes of two independent deep convolutional neural network (DCNN) classifiers that take advantage of different features. The first, DCNNMoRF1, employs a position-specific scoring matrix (PSSM) and 22 types of amino acid-related factors to describe protein sequences. The second, DCNNMoRF2, employs a PSSM and 13 types of amino acid indexes to describe protein sequences. Both single classifiers use a DCNN with a novel two-dimensional attention mechanism, and an averaging strategy further processes the output probabilities of each DCNN model. Finally, en_DCNNMoRF combines the two models by averaging their final scores. When compared with other well-known tools applied to the same datasets, the accuracy of the proposed method was comparable with that of state-of-the-art methods. The related web server can be accessed freely via http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/en_MoRFs.php .
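The final combination step described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the averaging strategy, so the sliding-window mean in `smooth` and its `window` size are guesses, and the per-residue probabilities are made-up numbers rather than outputs of DCNNMoRF1 or DCNNMoRF2.

```python
def smooth(scores, window=3):
    """Replace each per-residue score by the mean of its neighbourhood.

    Windows are truncated at the sequence ends. The window size is a
    hypothetical parameter, not a value taken from the paper.
    """
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def ensemble(scores_a, scores_b):
    """Average two per-residue score lists of equal length."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same residues")
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]

# Made-up per-residue MoRF probabilities from two hypothetical classifiers.
model1 = [0.0, 0.3, 0.9, 0.6, 0.0]
model2 = [0.3, 0.3, 0.6, 0.9, 0.3]

final = ensemble(smooth(model1), smooth(model2))
print([round(s, 3) for s in final])
```

Averaging rather than voting keeps the output a continuous score per residue, so a single threshold can still be tuned on the ensemble afterwards.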


Biochemistry ◽  
2018 ◽  
Vol 57 (35) ◽  
pp. 5270-5270
Author(s):  
Nicholas G. Housden ◽  
Patrice Rassam ◽  
Sejeong Lee ◽  
Firdaus Samsudin ◽  
Renata Kaminska ◽  
...  

2011 ◽  
Vol 44 (4) ◽  
pp. 467-518 ◽  
Author(s):  
H. Jane Dyson

Abstract: Proteins provide much of the scaffolding for life, as well as undertaking a variety of essential catalytic reactions. These characteristic functions have led us to presuppose that proteins are in general functional only when well structured and correctly folded. As we begin to explore the repertoire of possible protein sequences inherent in the human and other genomes, two stark facts that belie this supposition become clear: firstly, the number of apparent open reading frames in the human genome is significantly smaller than appears to be necessary to code for all of the diverse proteins in higher organisms, and secondly that a significant proportion of the protein sequences that would be coded by the genome would not be expected to form stable three-dimensional (3D) structures. Clearly the genome must include coding for a multitude of alternative forms of proteins, some of which may be partly or fully disordered or incompletely structured in their functional states. At the same time as this likelihood was recognized, experimental studies also began to uncover examples of important protein molecules and domains that were incompletely structured or completely disordered in solution, yet remained perfectly functional. In the ensuing years, we have seen an explosion of experimental and genome-annotation studies that have mapped the extent of the intrinsic disorder phenomenon and explored the possible biological rationales for its widespread occurrence. Answers to the question ‘why would a particular domain need to be unstructured?’ are as varied as the systems where such domains are found. This review provides a survey of recent new directions in this field, and includes an evaluation of the role not only of intrinsically disordered proteins but also of partially structured and highly dynamic members of the disorder–order continuum.


2019 ◽  
Author(s):  
Laura Weidmann ◽  
Tjeerd Dijkstra ◽  
Oliver Kohlbacher ◽  
Andrei Lupas

Abstract: Biological sequences are the product of natural selection, raising the expectation that they differ substantially from random sequences. We test this expectation by analyzing all fragments of a given length derived from either a natural dataset or different random models. For this, we compile all distances in sequence space between fragments within each dataset and compare the resulting distance distributions between sets. Even for 100mers, 95.4% of all distances between natural fragments are in accordance with those of a random model incorporating the natural residue composition. Hence, natural sequences are distributed almost randomly in global sequence space. When further accounting for the specific residue composition of domain-sized fragments, 99.2% of all distances between natural fragments can be modeled. Local residue composition, which might reflect biophysical constraints on protein structure, is thus the predominant feature characterizing distances between natural sequences globally, whereas homologous effects are only barely detectable.
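The comparison of distance distributions can be sketched on a toy scale. Several choices here are assumptions rather than details from the abstract: Hamming distance as the metric in sequence space, a shuffled copy of the sequence as the composition-preserving random model, and histogram overlap as the measure of agreement. The sequence itself is an arbitrary illustrative string, not data from the study.

```python
import random
from collections import Counter
from itertools import combinations

def fragments(seq, length):
    """All overlapping fragments of the given length."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def distance_distribution(frags):
    """Normalised histogram of pairwise Hamming distances."""
    counts = Counter(hamming(a, b) for a, b in combinations(frags, 2))
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()}

def overlap(p, q):
    """Shared probability mass of two histograms (1.0 means identical)."""
    return sum(min(p.get(d, 0.0), q.get(d, 0.0)) for d in set(p) | set(q))

def shuffled(seq, rng):
    """Random model that keeps the exact residue composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

# Arbitrary toy sequence used only to make the sketch runnable.
natural = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
rng = random.Random(0)  # fixed seed for reproducibility
random_model = shuffled(natural, rng)

nat_dist = distance_distribution(fragments(natural, 5))
rnd_dist = distance_distribution(fragments(random_model, 5))
shared = overlap(nat_dist, rnd_dist)
print(f"shared fraction of distance distribution: {shared:.3f}")
```

A shared fraction near 1 would mean the random model reproduces almost all pairwise distances. Overlapping fragments from a single short sequence are strongly correlated, so the number printed here carries no biological meaning.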


2016 ◽  
Author(s):  
Sergei Spirin

Many algorithms and programs reconstruct the phylogeny of a set of proteins based on a multiple sequence alignment. Many programs allow users to choose a number of parameters, for example, the model used by maximum likelihood programs. Different programs and different parameters often produce different results. However, at the moment all published benchmarks for evaluating the relative accuracy of programs or parameter choices are based on simulated sequences. The aim of the present work is to create a benchmark that allows a comparison of phylogenetic programs on large sets of alignments of natural protein sequences.

