Improving human essential protein prediction using only protein sequences via ensemble learning

Author(s):  
Min Zeng ◽  
Nian Wang ◽  
Yifan Wu ◽  
Yiming Li ◽  
Fang-Xiang Wu ◽  
...  
2018 ◽  
Vol 17 (3) ◽  
pp. 243-250 ◽  
Author(s):  
Jiancheng Zhong ◽  
Yusui Sun ◽  
Wei Peng ◽  
Minzhu Xie ◽  
Jiahong Yang ◽  
...  

2020 ◽  
Author(s):  
Jian Zhang ◽  
Lixin Lv ◽  
Donglei Lu ◽  
Denan Kong ◽  
Mohammed Abdoh Ali Al-Alashaari ◽  
...  

Abstract Background: Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should be existent in advance. However, it has been hardly ever seen in prevailing researches. Therefore, unsupervised learning rather than supervised learning (e.g. classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset as a case, experiments are made to identify bacterial type IV secreted effectors from protein sequences, which indicates the effectiveness of our method. Conclusions: Certain variables other than an encoded feature they belong to do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.


Author(s):  
Nian Wang ◽  
Min Zeng ◽  
Yiming Li ◽  
Fang-xiang Wu ◽  
Min Li

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Jian Zhang ◽  
Lixin Lv ◽  
Donglei Lu ◽  
Denan Kong ◽  
Mohammed Abdoh Ali Al-Alashaari ◽  
...  

Abstract Background Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should be existent in advance. However, it has been hardly ever seen in prevailing researches. Therefore, unsupervised learning rather than supervised learning (e.g. classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. Results Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset containing 1947 protein sequences as a case, experiments are made to identify bacterial type IV secreted effectors (T4SE) from protein sequences, which are composed of 399 T4SE and 1548 non-T4SE. Comparable and quantified results are obtained only using certain components of the encoded feature, i.e., position-specific scoring matix, and that indicates the effectiveness of our method. Conclusions Certain variables other than an encoded feature they belong to do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 261 ◽  
Author(s):  
Wolfram Höps ◽  
Matt Jeffryes ◽  
Alex Bateman

We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation.  Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases.  We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes.  Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence’s likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio


Sign in / Sign up

Export Citation Format

Share Document