Abstract
Extensive research effort has focused on extracting temporal patterns from videos to improve the accuracy of video classification using deep neural network based approaches. In this paper, we show that long-term dependency patterns alone may not be enough to achieve sufficiently improved results. We propose the Attention-based Spatio-Temporal (AST) model for video classification, a self-attention model that learns to attend to spatial features using a Convolutional Neural Network (CNN) and to temporal features using attention mechanisms. We evaluate our model on the motion-dependent action recognition dataset (UCF-101), the facial expression recognition dataset (MMI), the micro-expression recognition dataset (CASME2), and a generated real-life Facial Expression Recognition (FER) dataset. Compared with the state of the art, the model improves accuracy by 10%, 4.7%, and 5.6%, respectively, on the three standard datasets, and it also improves results on the synthetic dataset. In our experiments on detecting expressions and actions, the AST model plays a vital role in selecting frames and carrying the sequential context, including in real-time applications. We also experimented with extracting features using the Active Shape Model (ASM) for FER and found that the AST model surpasses other approaches.