Feature representation is of vital importance for human action recognition.
In recent years, the application of deep learning to action recognition
has become popular. However, for action recognition in videos, the advantage
of a single convolutional feature over traditional hand-crafted features is not evident. In
this paper, a novel feature representation that combines spatial and
temporal features with global motion information is proposed. Specifically,
spatial and temporal features are extracted from RGB images by a convolutional
neural network (CNN) and a long short-term memory (LSTM) network. In addition,
global motion information is extracted from motion difference images using
a separate CNN. Here, a motion difference image is obtained by applying an
exclusive-or (XOR) operation to binarized video frames. Finally, a support vector machine
(SVM) is adopted as the classifier. Experimental results on the YouTube Action and
UCF-50 datasets demonstrate the superiority of the proposed method.
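The XOR-based motion difference image described above can be illustrated with a minimal sketch. This is not the authors' implementation; the binarization threshold and NumPy-based formulation are assumptions made only for illustration.

```python
import numpy as np

def motion_difference_image(frame_a, frame_b, threshold=128):
    """Sketch of a motion difference image from two grayscale frames.

    Each frame is binarized (the threshold of 128 is a hypothetical
    choice, not specified in the abstract), and the XOR of the two
    binary frames marks pixels whose state changed between frames,
    i.e. regions of motion.
    """
    binary_a = frame_a >= threshold  # binarize frame t
    binary_b = frame_b >= threshold  # binarize frame t+1
    return np.logical_xor(binary_a, binary_b).astype(np.uint8)

# Tiny synthetic example: a bright 2x2 block shifts one pixel right.
f1 = np.zeros((4, 4), dtype=np.uint8)
f2 = np.zeros((4, 4), dtype=np.uint8)
f1[1:3, 0:2] = 255
f2[1:3, 1:3] = 255
diff = motion_difference_image(f1, f2)
```

Only the leading and trailing columns of the moving block change state, so the XOR image is nonzero exactly at the edges of the motion, which is what makes it a compact cue for a CNN.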