A Spatio-Temporal Feature Descriptor for Action Recognition using Feature Relations

Author(s):  
A. Zweng ◽  
M. Kampel


Author(s):
C. Indhumathi ◽  
V. Murugan ◽  
G. Muthulakshmii

Action recognition has gained increasing attention from the computer vision community. To recognize human actions, spatial and temporal features are typically extracted, and two-stream convolutional neural networks are commonly used for human action recognition in videos. In this paper, an Adaptive motion Attentive Correlated Temporal Feature (ACTF) is used as the temporal feature extractor. Inter-frame temporal average pooling is used to extract the inter-frame regional correlation feature and the mean feature. The proposed method achieves accuracies of 96.9% on UCF101 and 74.6% on HMDB51, which are higher than those of other state-of-the-art methods.
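As a rough illustration of the idea described above (not the authors' ACTF implementation), the sketch below assumes per-frame feature maps of shape (T, C, H, W) and shows how inter-frame temporal average pooling could yield a mean feature and a regional correlation feature; the cosine-similarity measure and the tensor shapes are assumptions made for this example.

```python
# Minimal sketch, not the authors' ACTF code: a mean feature from temporal
# average pooling and a regional correlation feature between consecutive
# frames, computed from per-frame feature maps (assumed shapes).
import numpy as np

def temporal_mean_and_correlation(feats):
    """feats: array of shape (T, C, H, W) -- per-frame feature maps."""
    # Mean feature: average the feature maps over the temporal axis.
    mean_feature = feats.mean(axis=0)                      # (C, H, W)

    # Regional correlation: cosine similarity between corresponding spatial
    # positions of consecutive frames, averaged over time (an assumption).
    a, b = feats[:-1], feats[1:]                           # (T-1, C, H, W)
    num = (a * b).sum(axis=1)                              # (T-1, H, W)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    correlation_feature = (num / den).mean(axis=0)         # (H, W)
    return mean_feature, correlation_feature

feats = np.random.rand(16, 64, 7, 7).astype(np.float32)   # T=16 frames
mean_f, corr_f = temporal_mean_and_correlation(feats)
print(mean_f.shape, corr_f.shape)                          # (64, 7, 7) (7, 7)
```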


Author(s):  
Bo Lin ◽  
Bin Fang

Automatic human action recognition is a core functionality of systems for video surveillance and human-object interaction. Within such a recognition system, feature description and encoding are two crucial steps, and both must perform reliably for the overall framework to be effective. In this paper, we propose a new human action feature descriptor called spatio-temporal histograms of gradients (SPHOG). SPHOG is based on the spatial and temporal derivative signals, which capture the gradient changes between consecutive frames. Compared with traditional histograms-of-optical-flow descriptors, SPHOG requires less computation. To incorporate the distribution information of local descriptors into the Vector of Locally Aggregated Descriptors (VLAD), a popular encoding approach for Bag-of-Features representation, a Gaussian kernel is applied to compute weighted distance histograms of the local descriptors, which makes the encoding scheme for the bag-of-features (BOF) representation more effective. We validated the proposed algorithm for human action recognition on three publicly available datasets: KTH, UCF Sports and HMDB51. The experimental results indicate that the proposed descriptor and encoding method improve both the efficiency and the accuracy of human action recognition.
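A minimal sketch of the two steps described above, under stated assumptions rather than the paper's exact SPHOG/VLAD code: an orientation histogram of spatial gradients weighted by the temporal derivative between consecutive frames, followed by a VLAD encoding whose residuals are weighted by a Gaussian kernel on the descriptor-to-center distance. The bin count, sigma, and codebook size are illustrative choices.

```python
# Illustrative sketch only: an SPHOG-like descriptor and a Gaussian-weighted
# VLAD encoding. Parameters (n_bins, sigma, codebook size) are assumptions.
import numpy as np

def sphog_like_descriptor(prev_frame, frame, n_bins=9):
    """Orientation histogram of spatial gradients, weighted by |dI/dt|."""
    gy, gx = np.gradient(frame.astype(np.float32))          # spatial gradients
    gt = frame.astype(np.float32) - prev_frame.astype(np.float32)  # temporal derivative
    orientation = np.arctan2(gy, gx)                         # [-pi, pi]
    weight = np.hypot(gx, gy) * np.abs(gt)                   # spatial x temporal weight
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(-np.pi, np.pi), weights=weight)
    return hist / (np.linalg.norm(hist) + 1e-8)

def gaussian_weighted_vlad(descriptors, codebook, sigma=1.0):
    """VLAD where each residual is weighted by a Gaussian kernel on the distance."""
    k, d = codebook.shape
    vlad = np.zeros((k, d), dtype=np.float32)
    for x in descriptors:
        dist = np.linalg.norm(codebook - x, axis=1)          # distance to centers
        nearest = dist.argmin()
        w = np.exp(-dist[nearest] ** 2 / (2 * sigma ** 2))   # Gaussian weight
        vlad[nearest] += w * (x - codebook[nearest])         # weighted residual
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))             # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-8)

frames = np.random.randint(0, 255, size=(3, 120, 160), dtype=np.uint8)
descs = np.stack([sphog_like_descriptor(frames[i], frames[i + 1])
                  for i in range(len(frames) - 1)])
codebook = np.random.rand(8, descs.shape[1]).astype(np.float32)
print(gaussian_weighted_vlad(descs, codebook).shape)         # (72,)
```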


2021 ◽  
Author(s):  
Yongkang Huang ◽  
Meiyu Liang

Abstract: Inspired by the wide application of transformers in computer vision and their excellent ability in temporal feature learning, this paper proposes a novel and efficient spatio-temporal residual attention network for student action recognition in classroom teaching videos. It first fuses 2D spatial convolution and 1D temporal convolution to learn spatio-temporal features, then combines the powerful Reformer to better learn the deeper, visually significant spatio-temporal characteristics of student classroom actions. Based on the spatio-temporal residual attention network, a single-person action recognition model for classroom teaching videos is proposed. Since there are often multiple students in a classroom video scene, single-person action recognition is combined with object detection and tracking to associate the temporal and spatial characteristics of the same student target, thereby enabling multi-student action recognition in classroom video scenes. The experimental results on a classroom teaching video dataset and a public video dataset show that the proposed model achieves higher action recognition performance than existing state-of-the-art models and methods.
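As a hedged sketch of the factorized convolution described above (not the authors' network), the block below applies a 2D spatial convolution followed by a 1D temporal convolution with a residual connection. Channel and kernel sizes are illustrative assumptions, and the Reformer attention and detection-tracking stages are omitted.

```python
# Minimal sketch, under assumptions: a factorized spatio-temporal residual
# block combining a 2D spatial convolution with a 1D temporal convolution.
import torch
import torch.nn as nn

class SpatioTemporalResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 2D spatial convolution: kernel (1, 3, 3) acts only on H and W.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        # 1D temporal convolution: kernel (3, 1, 1) acts only on T.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                    # x: (N, C, T, H, W)
        out = self.relu(self.bn1(self.spatial(x)))
        out = self.bn2(self.temporal(out))
        return self.relu(out + x)            # residual connection

clip = torch.randn(2, 64, 16, 56, 56)        # batch of video feature clips
print(SpatioTemporalResidualBlock(64)(clip).shape)  # torch.Size([2, 64, 16, 56, 56])
```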

