Multi-function Prediction of Unknown Protein Sequences Using Multilabel Classifiers and Augmented Sequence Features

Author(s):  
Saurabh Agrawal ◽  
Dilip Singh Sisodia ◽  
Naresh Kumar Nagwani
Author(s):  
H.M.Fazlul Haque ◽  
Muhammod Rafsanjani ◽  
Fariha Arifin ◽  
Sheikh Adilina ◽  
Swakkhar Shatabda

Genes ◽  
2020 ◽  
Vol 11 (11) ◽  
pp. 1264
Author(s):  
Stavros Makrodimitris ◽  
Roeland C. H. J. van Ham ◽  
Marcel J. T. Reinders

The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.


Author(s):  
Shinji Chiba ◽  
◽  
Ken Sugawara ◽  

The function of an unknown protein is currently most effectively determined by retrieving similar known sequences, and several effective techniques build on sequence retrieval. We propose retrieval using a finite state automaton (FSA). The FSA is created with accumulated amino acid residue scores that express a property of a protein family. We calculated the similarity of known and unknown protein sequences using the FSA and used it to determine protein functions. To improve accuracy, we optimized the FSA using a genetic algorithm. Results from determining protein functions indicated that our proposal was superior to general motif analysis.
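The abstract does not specify how the automaton is built, so the following is only a rough sketch of the general idea under stated assumptions: a linear automaton with one state per position of a gap-free family alignment, transition scores equal to smoothed log-odds of residue frequencies against a uniform background, and similarity defined as the best accumulated score over all windows of the query. The names `build_fsa` and `score_sequence`, the pseudocount, and the toy family are hypothetical, and the genetic-algorithm optimization step is not shown.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_fsa(family, pseudocount=1.0):
    """Build a linear automaton from equal-length family sequences.

    State i moves to state i + 1 on any residue; each transition carries a
    log-odds score (assumed scoring scheme, not taken from the paper).
    """
    length = len(family[0])
    if any(len(seq) != length for seq in family):
        raise ValueError("family sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(family) + pseudocount * len(AMINO_ACIDS)
    transitions = []
    for i in range(length):
        counts = Counter(seq[i] for seq in family)
        transitions.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return transitions


def score_sequence(fsa, sequence):
    """Return the best accumulated score over all windows of the sequence."""
    n = len(fsa)
    best = float("-inf")
    for start in range(len(sequence) - n + 1):
        # Residues outside the 20-letter alphabet contribute a neutral 0.0.
        total = sum(fsa[i].get(sequence[start + i], 0.0) for i in range(n))
        best = max(best, total)
    return best


# Toy family and queries (made-up sequences, for illustration only).
family = ["ACDE", "ACDE", "ACDF"]
fsa = build_fsa(family)
match_score = score_sequence(fsa, "GGACDEGG")
other_score = score_sequence(fsa, "GGWWWWGG")
```

A query containing the family pattern accumulates a higher score than an unrelated one, so ranking known families by this score gives a simple retrieval rule. A query shorter than the automaton has no window and scores negative infinity.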


2021 ◽  
Author(s):  
Grzegorz Chojnowski ◽  
Adam J. Simpkin ◽  
Diego A. Leonardo ◽  
Wolfram Seifert-Davila ◽  
Dan E. Vivas-Ruiz ◽  
...  

Abstract: Although experimental protein structure determination usually targets known proteins, chains of unknown sequence are often encountered. They can be purified from natural sources, appear as an unexpected fragment of a well characterized protein or as a contaminant. Regardless of the source of the problem, the unknown protein always requires tedious characterization. Here we present an automated pipeline for the identification of protein sequences from cryo-EM reconstructions and crystallographic data. We present the method’s application to characterize the crystal structure of an unknown protein purified from a snake venom. We also show that the approach can be successfully applied to the identification of protein sequences and validation of sequence assignments in cryo-EM protein structures.


2021 ◽  
Author(s):  
Irene van den Bent ◽  
Stavros Makrodimitris ◽  
Marcel Reinders

Abstract: Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labelled protein training data. A recently published supervised molecular function predicting model partly circumvents this limitation by making its predictions based on the universal (i.e. task-agnostic) contextualised protein embeddings from the deep pre-trained unsupervised protein language model SeqVec. SeqVec embeddings incorporate contextual information of amino acids, thereby modelling the underlying principles of protein sequences insensitive to the context of species.
We applied the existing SeqVec-based molecular function prediction model in a transfer learning task by training the model on annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalises knowledge about protein function from one eukaryotic species to various other species, proving itself an effective method for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms. Furthermore, we submitted the performance of our SeqVec-based prediction models to detailed characterisation, first to advance the understanding of protein language models and second to determine areas of improvement.
Author summary: Proteins are diverse molecules that regulate all processes in biology. The field of synthetic biology aims to understand these protein functions to solve problems in medicine, manufacturing, and agriculture. Unfortunately, for many proteins only their amino acid sequence is known whereas their function remains unknown. Only a few species have been well-studied such as mouse, human and yeast. Hence, we need to increase knowledge on protein functions.
Doing so is, however, complicated as determining protein functions experimentally is time-consuming, expensive, and technically limited. Computationally predicting protein functions offers a faster and more scalable approach but is hampered as it requires much data to design accurate function prediction algorithms. Here, we show that it is possible to computationally generalize knowledge on protein function from one well-studied training species to another test species. Additionally, we show that the quality of these protein function predictions depends on how structurally similar the proteins are between the species. Advantageously, the predictors require only the annotations of proteins from the training species and mere amino acid sequences of test species which may particularly benefit the function prediction of species from understudied taxonomic kingdoms such as the Plantae, Protozoa and Chromista.


2021 ◽  
Vol 17 ◽  
pp. 117693432110626
Author(s):  
Irene van den Bent ◽  
Stavros Makrodimitris ◽  
Marcel Reinders

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
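The cross-species setup described above can be illustrated with a small sketch, under clearly stated assumptions: the embeddings here are random synthetic vectors standing in for real pre-trained protein embeddings, the predictor is a plain one-vs-rest logistic regression trained by gradient descent (not the authors' model), and per-label accuracy is used as a simple stand-in for whatever evaluation metric the paper reports. All names (`train_multilabel_logreg`, `make_species`, the `noise` parameter modelling evolutionary distance) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_multilabel_logreg(X, Y, lr=0.5, epochs=300, l2=1e-3):
    """Fit one logistic regression per function label (one-vs-rest)."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        grad = sigmoid(X @ W + b) - Y          # gradient of the log-loss
        W -= lr * (X.T @ grad / n + l2 * W)    # L2-regularised weight update
        b -= lr * grad.mean(axis=0)
    return W, b


rng = np.random.default_rng(0)
dim, n_labels = 16, 3
true_W = rng.normal(size=(dim, n_labels))  # hidden embedding-to-function map


def make_species(n, noise):
    """Synthetic 'species': labels depend on a latent vector, and the
    observed embedding drifts from it by `noise` (assumed distance proxy)."""
    Z = rng.normal(size=(n, dim))
    Y = (Z @ true_W > 0).astype(float)
    X = Z + noise * rng.normal(size=(n, dim))
    return X, Y


# Train on one well-annotated species, test on another without retraining.
X_train, Y_train = make_species(400, noise=0.0)
X_test, Y_test = make_species(200, noise=0.3)

W, b = train_multilabel_logreg(X_train, Y_train)
probs = sigmoid(X_test @ W + b)
test_accuracy = float(((probs > 0.5) == (Y_test > 0.5)).mean())
```

Only the training species contributes labels; the test species contributes embeddings alone, which mirrors the situation for poorly annotated species. Raising `noise` in this toy setup gives a crude way to see how predictions degrade as the test species drifts further from the training species.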


2017 ◽  
Vol 86 (2) ◽  
pp. 135-151 ◽  
Author(s):  
Ahmet Sureyya Rifaioglu ◽  
Tunca Doğan ◽  
Ömer Sinan Saraç ◽  
Tulin Ersahin ◽  
Rabie Saidi ◽  
...  

IUCrJ ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Grzegorz Chojnowski ◽  
Adam J. Simpkin ◽  
Diego A. Leonardo ◽  
Wolfram Seifert-Davila ◽  
Dan E. Vivas-Ruiz ◽  
...  

Although experimental protein-structure determination usually targets known proteins, chains of unknown sequence are often encountered. They can be purified from natural sources, appear as an unexpected fragment of a well characterized protein or appear as a contaminant. Regardless of the source of the problem, the unknown protein always requires characterization. Here, an automated pipeline is presented for the identification of protein sequences from cryo-EM reconstructions and crystallographic data. The method's application to characterize the crystal structure of an unknown protein purified from a snake venom is presented. It is also shown that the approach can be successfully applied to the identification of protein sequences and validation of sequence assignments in cryo-EM protein structures.

