Abstract
Background: Moonlighting proteins (MPs) are a subclass of multifunctional proteins in which a single polypeptide chain performs more than one independent, usually distinct, function. MPs are studied in order to identify unknown cellular processes, understand novel protein mechanisms, improve protein function prediction, and gain insight into protein evolution. They also play an important role in disease pathways and drug-target discovery. Because detecting MPs experimentally is challenging, most known MPs were discovered by chance, which motivates a computational approach. Results: In this study, we present a model that distinguishes moonlighting from non-moonlighting proteins using features extracted from protein sequences, together with a scheme for detecting outlier proteins. We used 15 distinct feature vectors to study the effect of each on MP detection, and assessed 8 classification methods to find the best performer. To detect outliers, each classifier was run 100 times with 10-fold cross-validation on a feature vector, and the proteins misclassified 80 times or more were grouped. This process was applied to every feature vector, and the intersection of the resulting groups was taken as the set of outlier proteins. The results of 10-fold cross-validation on a dataset of 351 samples (215 moonlighting and 136 non-moonlighting proteins) show that the decision tree method on all feature vectors achieves the highest performance among the methods evaluated in this study and among other available methods. In addition, the outlier analysis shows that 57 of the 351 proteins in the dataset are candidate outliers. 
Among the outlier proteins are non-moonlighting proteins (such as P69797) that were misclassified by 8 different classification methods with 16 different feature types. Because these proteins were labeled as non-moonlighting by computational methods, the results of this study weaken the hypothesis that they are non-moonlighting. Conclusions: Moonlighting proteins are difficult to identify experimentally. Our method enables the identification of novel moonlighting proteins using distinct feature vectors. It also indicates that a number of proteins labeled as non-moonlighting are likely to be moonlighting.
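
The outlier-detection scheme described in the Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nearest_centroid_predict`, `misclassification_counts`, `find_outliers`), the nearest-centroid stand-in classifier, and the toy data are all hypothetical, and only one classifier is used, whereas the study assessed 8. What the sketch takes from the text is the repeated 10-fold cross-validation (100 runs), the threshold of 80 misclassifications, and the intersection across feature vectors.

```python
import random

def nearest_centroid_predict(train_X, train_y, test_X):
    """Hypothetical stand-in classifier: assign each test point to the nearest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in test_X]

def misclassification_counts(X, y, predict, n_repeats=100, n_folds=10, seed=0):
    """Count, per sample, how often it is misclassified over repeated k-fold CV."""
    rng = random.Random(seed)
    counts = [0] * len(X)
    for _ in range(n_repeats):
        order = list(range(len(X)))
        rng.shuffle(order)  # a fresh random fold assignment for every repeat
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            preds = predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
            for i, pred in zip(test_idx, preds):
                if pred != y[i]:
                    counts[i] += 1
    return counts

def find_outliers(feature_sets, y, predict, threshold=80, **cv_kwargs):
    """Intersect, across feature sets, the samples misclassified >= threshold times."""
    groups = []
    for X in feature_sets:
        counts = misclassification_counts(X, y, predict, **cv_kwargs)
        groups.append({i for i, c in enumerate(counts) if c >= threshold})
    return set.intersection(*groups)

# Toy data (invented for illustration): 20 samples, two 1-D feature sets.
# Sample 0 is labeled 1 but looks like class 0 under both feature sets.
# Sample 1 is labeled 0 but looks like class 1 under feature set A only.
y = [1, 0] + [0] * 9 + [1] * 9
class0 = [[0.1 * k] for k in range(9)]
class1 = [[10 + 0.1 * k] for k in range(9)]
features_a = [[0.0], [10.0]] + class0 + class1
features_b = [[0.0], [0.5]] + class0 + class1

outliers = find_outliers([features_a, features_b], y, nearest_centroid_predict)
print(sorted(outliers))
```

On the toy data, sample 1 is consistently misclassified under one feature set but not the other, so the intersection step drops it and keeps only the sample that is misclassified under every feature set. That is the design rationale the abstract implies: a protein counts as an outlier only if its label disagrees with the classifier regardless of the representation used.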