Prediction of protein subcellular localization using deep learning and data augmentation
Abstract
Identifying the subcellular localization of a protein is important for understanding its molecular function. It provides valuable insights that can greatly aid research on protein function and the detection of potential cell surface or secreted drug targets. Predicting protein subcellular localization with bioinformatics methods is an inexpensive alternative to experimental approaches. Many computational tools have been built during the past two decades; however, producing reliable predictions has always been a challenge. In this study, a deep learning (DL) technique is proposed to enhance the precision of the analytical engine of one of these tools, PSORTb v3.0. Its conventional support vector machine (SVM) model was replaced by a state-of-the-art DL method, a bidirectional long short-term memory network (BiLSTM), combined with a data augmentation technique (SeqGAN). As a result, the combination of BiLSTM and SeqGAN outperformed the SVM, improving precision from 57.4% to 75%. The method was applied to a dataset of 8230 protein sequences that was experimentally derived by the Brinkman Lab. The presented model provides promising outcomes for future research. The source code of the model is available at https://github.com/mgetech/SubLoc.