SubFeat: Feature Subspacing Ensemble Classifier for Function Prediction of DNA, RNA and Protein Sequences

The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.

The power of universal contextualised protein embeddings in cross-species protein function prediction

10.1101/2021.04.19.440461 ◽

2021 ◽

Author(s):

Irene van den Bent ◽

Stavros Makrodimitris ◽

Marcel Reinders

Keyword(s):

Amino Acid ◽

Protein Function ◽

Prediction Models ◽

Protein Sequences ◽

Function Prediction ◽

Training Data ◽

Language Models ◽

Molecular Function ◽

Test Species ◽

Protein Functions

Abstract

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labelled protein training data. A recently published supervised molecular function predicting model partly circumvents this limitation by making its predictions based on the universal (i.e. task-agnostic) contextualised protein embeddings from the deep pre-trained unsupervised protein language model SeqVec. SeqVec embeddings incorporate contextual information of amino acids, thereby modelling the underlying principles of protein sequences insensitive to the context of species.

We applied the existing SeqVec-based molecular function prediction model in a transfer learning task by training the model on annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalises knowledge about protein function from one eukaryotic species to various other species, proving itself an effective method for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms. Furthermore, we submitted the performance of our SeqVec-based prediction models to detailed characterisation, first to advance the understanding of protein language models and second to determine areas of improvement.

Author summary

Proteins are diverse molecules that regulate all processes in biology. The field of synthetic biology aims to understand these protein functions to solve problems in medicine, manufacturing, and agriculture. Unfortunately, for many proteins only their amino acid sequence is known whereas their function remains unknown. Only a few species have been well-studied such as mouse, human and yeast. Hence, we need to increase knowledge on protein functions. Doing so is, however, complicated as determining protein functions experimentally is time-consuming, expensive, and technically limited. Computationally predicting protein functions offers a faster and more scalable approach but is hampered as it requires much data to design accurate function prediction algorithms. Here, we show that it is possible to computationally generalize knowledge on protein function from one well-studied training species to another test species. Additionally, we show that the quality of these protein function predictions depends on how structurally similar the proteins are between the species. Advantageously, the predictors require only the annotations of proteins from the training species and mere amino acid sequences of test species which may particularly benefit the function prediction of species from understudied taxonomic kingdoms such as the Plantae, Protozoa and Chromista.

The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

Evolutionary Bioinformatics ◽

10.1177/11769343211062608 ◽

2021 ◽

Vol 17 ◽

pp. 117693432110626

Author(s):

Irene van den Bent ◽

Stavros Makrodimitris ◽

Marcel Reinders

Keyword(s):

Protein Function ◽

Contextual Information ◽

Protein Function Prediction ◽

Protein Sequences ◽

Learning Task ◽

Function Prediction ◽

Training Data ◽

Language Models ◽

Molecular Function ◽

Test Species

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
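
The cross-species setup described above (learn from the annotated proteins of one species, predict for the unannotated proteins of another) can be illustrated with a minimal sketch. It is not the authors' predictor: it assumes each protein has already been reduced to a fixed-length embedding vector, uses tiny made-up vectors in place of real language-model embeddings, and swaps in a plain nearest-neighbour label transfer by cosine similarity. The names `cosine` and `transfer_annotations` and the protein identifiers are hypothetical.

```python
import math
from typing import Dict, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transfer_annotations(
    train_emb: Dict[str, List[float]],
    train_labels: Dict[str, Set[str]],
    test_emb: Dict[str, List[float]],
) -> Dict[str, Set[str]]:
    """Copy the GO terms of the most similar training-species protein
    to each test-species protein (an assumed stand-in for a trained model)."""
    predictions = {}
    for test_id, vec in test_emb.items():
        nearest = max(train_emb, key=lambda pid: cosine(vec, train_emb[pid]))
        predictions[test_id] = set(train_labels[nearest])
    return predictions


# Illustrative toy data only: 2-dimensional "embeddings" for two annotated
# training-species proteins and one unannotated test-species protein.
train_emb = {"trainA": [1.0, 0.0], "trainB": [0.0, 1.0]}
train_labels = {"trainA": {"GO:0003677"}, "trainB": {"GO:0005524"}}
test_emb = {"testX": [0.9, 0.1]}

predicted = transfer_annotations(train_emb, train_labels, test_emb)
```

Only the test species' sequences (here, their embeddings) are needed at prediction time, which matches the requirement stated in the abstract; a real system would replace the nearest-neighbour step with a trained classifier.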

Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants

Proteins Structure Function and Bioinformatics ◽

10.1002/prot.25416 ◽

2017 ◽

Vol 86 (2) ◽

pp. 135-151 ◽

Cited By ~ 3

Author(s):

Ahmet Sureyya Rifaioglu ◽

Tunca Doğan ◽

Ömer Sinan Saraç ◽

Tulin Ersahin ◽

Rabie Saidi ◽

...

Keyword(s):

Large Scale ◽

Protein Sequences ◽

Function Prediction ◽

Transcript Variants

Implementation of Protein Sequence Classification for Globin family using Ensemble Learning

International Journal of Emerging Trends in Engineering Research ◽

10.30534/ijeter/2021/18942021 ◽

2021 ◽

Vol 9 (4) ◽

pp. 441-445

Keyword(s):

Feature Extraction ◽

Drug Discovery ◽

Protein Sequence ◽

Learning Algorithm ◽

Protein Sequences ◽

Ensemble Classifier ◽

Important Task ◽

Sequence Classification ◽

Feature Vectors ◽

Protein Sequence Classification

Feature extraction from protein sequences is a very important task in bioinformatics. The main focus of this work is protein sequence classification, which can be used to improve drug discovery and the identification of diseases so that patients can be treated in the early stages of diagnosis. In this paper, we propose a feature extraction method that converts the protein sequence of hemoglobin into feature vectors. The feature vectors are then given as input to an ensemble classifier, which combines several classifiers to achieve better performance than any constituent learning algorithm alone.
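
The abstract does not say which features are extracted. As one common way to turn a protein sequence into a fixed-length feature vector, the sketch below computes the amino acid composition (the fraction of each of the 20 standard residues). The choice of descriptor and the name `composition_vector` are assumptions made for illustration, not the paper's method.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition_vector(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Characters outside the 20 standard residues are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# A short toy sequence, not a real globin sequence.
vector = composition_vector("AACD")
```

Every sequence maps to a vector of length 20 regardless of its own length, which is what lets sequences of different lengths be passed to the same classifier.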

Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2019.2911609 ◽

2019 ◽

pp. 1-1 ◽

Cited By ~ 1

Author(s):

Ashish Ranjan ◽

Md Shah Fahad ◽

David Fernandez-Baca ◽

Akshay Deepak ◽

Sudhakar Tripathi

Keyword(s):

Protein Function ◽

Protein Function Prediction ◽

Protein Sequences ◽

Function Prediction ◽

Variable Length ◽

Robust Framework

Classification of Protein Sequences by Means of an Ensemble Classifier with an Improved Feature Selection Strategy

Advances in Intelligent Systems and Computing - Recent Findings in Intelligent Computing Techniques ◽

10.1007/978-981-10-8636-6_18 ◽

2018 ◽

pp. 167-174

Author(s):

Aditya Sriram ◽

Mounica Sanapala ◽

Ronak Patel ◽

Nagamma Patil

Keyword(s):

Feature Selection ◽

Protein Sequences ◽

Ensemble Classifier ◽

Selection Strategy

SubFeat: Feature Subspacing Ensemble Classifier for Function Prediction of DNA, RNA and Protein Sequences

10.1101/2020.08.04.228536 ◽

2020 ◽

Author(s):

H.M.Fazlul Haque ◽

Fariha Arifin ◽

Sheikh Adilina ◽

Muhammod Rafsanjani ◽

Swakkhar Shatabda

Keyword(s):

Genetic Material ◽

Protein Sequences ◽

Feature Space ◽

Ensemble Classifier ◽

Majority Voting ◽

Ensemble Classification ◽

Ensemble Classifiers ◽

Ribonucleic Acids ◽

Recent Developments

Abstract

The information of a cell is primarily contained in deoxyribonucleic acid (DNA). Information flows from DNA to protein sequences via ribonucleic acid (RNA) through transcription and translation. These entities are vital for the genetic process. Recent developments in epigenetics also show the importance of the genetic material and of knowledge of its attributes and functions. However, the number of known attributes and functions of these entities is growing slowly because in vitro experimental methods are time-consuming and expensive. In this paper, we propose an ensemble classification algorithm called SubFeat to predict the functionalities of biological entities from different types of datasets. Our model uses a novel feature-subspace-based ensemble method. It divides the feature space into subspaces, each of which is used to train an individual classifier model; the ensemble is then built on these base classifiers using a weighted majority voting mechanism. SubFeat was tested on four datasets (two DNA, one RNA, and one protein), and it outperformed all the existing single classifiers as well as the ensemble classifiers. SubFeat is available as a Python-based tool, and the package is freely accessible online along with a user manual: https://github.com/fazlulhaquejony/SubFeat.
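
The general idea of a feature-subspace ensemble with weighted majority voting can be sketched as follows. This is not the SubFeat package's code or API: the contiguous split of feature indices, the toy nearest-centroid base learner, and the use of training accuracy as the vote weight are all assumptions chosen to keep the example short, and every name in it is hypothetical.

```python
from collections import defaultdict
from typing import List


def split_features(n_features: int, n_subspaces: int) -> List[List[int]]:
    """Partition feature indices into roughly equal contiguous subspaces (assumed scheme)."""
    size, extra = divmod(n_features, n_subspaces)
    subspaces, start = [], 0
    for i in range(n_subspaces):
        end = start + size + (1 if i < extra else 0)
        subspaces.append(list(range(start, end)))
        start = end
    return subspaces


class CentroidClassifier:
    """Toy base learner: predict the class whose mean vector is nearest."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            for j, value in enumerate(row):
                sums[label][j] += value
            counts[label] += 1
        self.centroids = {c: [s / counts[c] for s in vec] for c, vec in sums.items()}
        return self

    def predict_one(self, row):
        def squared_distance(c):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[c]))
        return min(self.centroids, key=squared_distance)


def fit_ensemble(X, y, n_subspaces):
    """Train one base classifier per feature subspace; weight = training accuracy (assumption)."""
    members = []
    for idx in split_features(len(X[0]), n_subspaces):
        Xs = [[row[j] for j in idx] for row in X]
        clf = CentroidClassifier().fit(Xs, y)
        accuracy = sum(clf.predict_one(r) == t for r, t in zip(Xs, y)) / len(y)
        members.append((idx, clf, accuracy))
    return members


def predict(members, row):
    """Weighted majority vote over the base classifiers' predictions."""
    votes = defaultdict(float)
    for idx, clf, weight in members:
        votes[clf.predict_one([row[j] for j in idx])] += weight
    return max(votes, key=votes.get)


# Illustrative toy data: four samples, four features, two classes.
X = [[0, 0, 0, 0], [0, 1, 0, 1], [5, 5, 5, 5], [5, 4, 5, 4]]
y = [0, 0, 1, 1]
members = fit_ensemble(X, y, n_subspaces=2)
```

Each base classifier sees only its own slice of the feature vector, so a weak or noisy feature group lowers the weight of one voter rather than degrading a single model trained on all features. In practice the weights would be estimated on held-out data rather than on the training set.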

Neural network and random forest models in protein function prediction

10.1101/690271 ◽

2019 ◽

Cited By ~ 1

Author(s):

Kai Hakala ◽

Suwisa Kaewphan ◽

Jari Björne ◽

Farrokh Mehryary ◽

Hans Moen ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Protein Function ◽

Protein Function Prediction ◽

Protein Sequences ◽

Function Prediction ◽

Learning System ◽

Large Set ◽

Competitive Performance

Abstract

Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system automatically assigning Gene Ontology (GO) terms to the given input protein sequence.

We develop an ensemble system which combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data.

In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance ranking among top-10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models together with the data pre-processing and feature generation tools are publicly available as an open source software at https://github.com/TurkuNLP/CAFA3

Author summary

Understanding the role and function of proteins in biological processes is fundamental for new biological discoveries. Whereas modern sequencing methods have led to a rapid growth of protein databases, the function of these sequences is often unknown and expensive to determine experimentally. This has spurred a lot of interest in predictive modelling of protein functions.

We develop a machine learning system for annotating protein sequences with functional definitions selected from a vast set of predefined functions. The approach is based on a combination of neural network and random forest classifiers with features covering structural and taxonomic properties and sequence similarity. The system is thoroughly evaluated on a large set of manually curated functional annotations and shows competitive performance in comparison to other suggested approaches. We also analyze the predictions for different functional annotation and taxonomy categories and measure the importance of different features for the task. This analysis reveals that the system is particularly efficient for bacterial protein sequences.
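
The abstract does not specify how the RF and NN predictions are combined. One simple possibility, shown here purely as an assumption, is to average the per-term confidence scores of the two models and keep the GO terms whose mean score reaches a threshold. The function name `combine_scores`, the threshold, and the scores are illustrative and not taken from the paper or its repository.

```python
from typing import Dict, List


def combine_scores(
    score_sets: List[Dict[str, float]], threshold: float = 0.5
) -> Dict[str, float]:
    """Average per-GO-term scores across models and keep terms at or above `threshold`.

    A term missing from one model's output counts as a score of 0.0 for that model.
    """
    if not score_sets:
        return {}
    terms = set().union(*score_sets)
    combined = {}
    for term in terms:
        mean = sum(scores.get(term, 0.0) for scores in score_sets) / len(score_sets)
        if mean >= threshold:
            combined[term] = mean
    return combined


# Made-up scores for one protein from two hypothetical models.
rf_scores = {"GO:0003677": 0.9, "GO:0005524": 0.2}
nn_scores = {"GO:0003677": 0.7, "GO:0005524": 0.6}
ensemble = combine_scores([rf_scores, nn_scores])
```

Averaging treats both models as equally reliable; a weighted mean or a second-stage classifier over the two score vectors are natural alternatives when one model is known to be stronger for certain GO categories.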
