Human Action Recognition Based on a Spatio-Temporal Video Autoencoder
Due to rapid advances in the development of surveillance cameras with high sampling rates, low cost, small size and high resolution, video-based action recognition systems have become more commonly used in various computer vision applications. Human operators can be supported with the aid of such systems to detect events of interest in video sequences, improving recognition results and reducing failure cases. In this work, we propose and evaluate a method to learn two-dimensional (2D) representations from video sequences based on an autoencoder framework. Spatial and temporal information is explored through a multi-stream convolutional neural network in the context of human action recognition. Experimental results on the challenging UCF101 and HMDB51 datasets demonstrate that our representation is capable of achieving competitive accuracy rates when compared to other approaches available in the literature.