A Highly Accurate Machine Learning Meta-Strategy for the Prediction of Intrinsically Disordered Proteins
Abstract

Background: Many proteins, or partial regions of proteins, do not have stable, well-defined three-dimensional structures in vitro. Understanding intrinsically disordered proteins (IDPs) is important both for interpreting biological function and for studying many diseases. Although more than 70 disorder predictors have been developed, many of them are limited to particular protein characteristics and do not achieve very high accuracy. New strategies for disordered protein prediction are therefore needed.

Results: We propose a machine learning meta-strategy to improve the accuracy of predicting disordered proteins and disordered regions. We first use logistic forward parameter selection to select the eight most significant predictors from the currently available IDP predictors. We then design a novel meta-strategy using several machine learning models: a decision-tree-based algorithm, Naive Bayes, random forest, and a convolutional neural network (CNN). Across the strategies we applied, the results suggest that random forest significantly improves single amino acid prediction accuracy, to 93.35%. Using the combined vector data of the eight most significant predictors as input, the CNN improves whole-protein prediction accuracy to 95.62%.

Conclusion: Based on the performance of our machine learning meta-strategy, the random forest and CNN models can improve the accuracy of IDP prediction.
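To make the predictor-selection step concrete, the following is a minimal sketch of greedy forward selection over per-residue disorder scores. It is not the procedure used in this work: the paper selects predictors by logistic forward parameter selection, whereas this sketch substitutes a simpler criterion (accuracy of the averaged scores thresholded at 0.5) so that it runs with the Python standard library alone. The function names, the predictor names `pred_a` to `pred_c`, and the toy scores and labels are all hypothetical and chosen only for illustration.

```python
def ensemble_accuracy(scores, labels, chosen):
    """Accuracy of averaging the chosen predictors' scores, thresholded at 0.5."""
    correct = 0
    for i, label in enumerate(labels):
        mean_score = sum(scores[name][i] for name in chosen) / len(chosen)
        predicted = 1 if mean_score >= 0.5 else 0  # 1 = disordered residue
        correct += int(predicted == label)
    return correct / len(labels)


def forward_select(scores, labels, max_predictors):
    """Greedily add the predictor that most improves ensemble accuracy.

    Stops when no remaining predictor strictly improves the accuracy or
    when max_predictors have been chosen. Ties go to the first name in
    sorted order.
    """
    chosen, best_acc = [], 0.0
    while len(chosen) < max_predictors:
        step_name, step_acc = None, best_acc
        for name in sorted(scores):
            if name in chosen:
                continue
            acc = ensemble_accuracy(scores, labels, chosen + [name])
            if acc > step_acc:
                step_name, step_acc = name, acc
        if step_name is None:  # no candidate improves the ensemble
            break
        chosen.append(step_name)
        best_acc = step_acc
    return chosen, best_acc


# Hypothetical toy data: eight residues, 1 = disordered, 0 = ordered.
LABELS = [1, 1, 1, 0, 0, 0, 1, 0]
# Hypothetical per-residue disorder scores from three made-up predictors.
SCORES = {
    "pred_a": [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3],
    "pred_b": [0.6, 0.7, 0.9, 0.3, 0.4, 0.1, 0.2, 0.6],
    "pred_c": [0.2, 0.3, 0.7, 0.7, 0.8, 0.3, 0.2, 0.9],
}

if __name__ == "__main__":
    selected, accuracy = forward_select(SCORES, LABELS, max_predictors=3)
    print(selected, accuracy)  # → ['pred_a', 'pred_b'] 0.875
```

On this toy input, the two complementary predictors are kept and the third is rejected because adding it lowers the ensemble accuracy. In a setting like the one described above, the scoring function would presumably be replaced by a fitted logistic model, and the selected scores would then feed a downstream classifier such as a random forest or CNN.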