On the Depth of Deep Learning Models for Splice Site Identification
Abstract
The success of deep learning has been shown in various fields, including computer vision, speech recognition, natural language processing, and bioinformatics. Advances in deep learning for computer vision have been an important source of inspiration for other research fields. The objective of this work is to adapt known deep learning models borrowed from computer vision, such as VGGNet, ResNet, and AlexNet, to the classification of biological sequences. In particular, we are interested in the task of splice site identification based on raw DNA sequences. We focus on the role of model architecture depth in model training and classification performance.
We show that deep learning models outperform traditional classification methods (SVM, Random Forests, and Logistic Regression) for large training sets of raw DNA sequences. Three model families are analyzed in this work: VGGNet, AlexNet, and ResNet. Three depth levels are defined for each model family. The models are benchmarked using the following metrics: area under the ROC curve (AUC), number of model parameters, and number of floating-point operations. Our extensive experimental evaluation shows that shallow architectures have an overall better performance than deep models. We introduce a shallow version of ResNet, named S-ResNet, and show that it gives a good trade-off between model complexity and classification performance.
Author summary
Deep learning has been widely applied to various fields in research and industry. It has also been successfully applied to genomics, and in particular to splice site identification. We are interested in the use of advanced neural networks borrowed from computer vision. We explored well-known models and their usability for the problem of splice site identification from raw sequences. Our extensive experimental analysis shows that shallow models outperform deep models. We introduce a new model called S-ResNet, which gives a good trade-off between computational complexity and classification accuracy.
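To make the terms used above concrete, the following minimal Python sketch shows one common way a raw DNA sequence can be turned into network input and how two of the benchmark metrics can be computed. It is an illustration under stated assumptions, not the pipeline used in this work: the function names (one_hot, roc_auc, conv1d_params), the A, C, G, T channel order, and the all-zero encoding of unknown bases are hypothetical choices made for this example.

```python
# Illustrative sketch only; names and encoding choices are assumptions,
# not the implementation used in the paper.

NUCLEOTIDES = "ACGT"  # assumed channel order


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows.

    Bases outside A, C, G, T (for example N) become all-zero rows.
    """
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in seq.upper()]


def roc_auc(labels, scores):
    """Area under the ROC curve via its rank interpretation.

    Equals the probability that a random positive example is scored
    above a random negative one, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Trainable parameter count of a standard 1D convolution layer."""
    return out_channels * (in_channels * kernel_size + (1 if bias else 0))


if __name__ == "__main__":
    print(one_hot("ACGTN"))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
    print(conv1d_params(in_channels=4, out_channels=16, kernel_size=3))
```

The pairwise AUC computation is quadratic in the number of examples and is meant only to show the definition; a sorting-based implementation from a standard library would be the practical choice on large test sets.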