MSST-RT: Multi-Stream Spatial-Temporal Relative Transformer for Skeleton-Based Action Recognition

Sensors ◽ 2021 ◽ Vol 21 (16) ◽ pp. 5339
Author(s): Yan Sun, Yixin Shen, Liyan Ma

Skeleton-based human action recognition has made great progress, especially with the development of graph convolutional networks (GCNs). The most influential work is ST-GCN, which automatically learns both spatial and temporal patterns from skeleton sequences. However, this method still has imperfections: because of the limited receptive field of graph convolution, only short-range correlations are captured, whereas long-range dependence is essential for recognizing human action. In this work, we propose a spatial-temporal relative transformer (ST-RT) to overcome these defects. By introducing relay nodes, ST-RT prevents the transformer architecture from breaking the inherent skeleton topology in the spatial dimension and the order of the skeleton sequence in the temporal dimension. Furthermore, we mine the dynamic information contained in motion at different scales. Finally, four ST-RTs, which extract spatial-temporal features from four kinds of skeleton sequence, are fused to form the final model, the multi-stream spatial-temporal relative transformer (MSST-RT), to enhance performance. Extensive experiments evaluate the proposed methods on three benchmarks for skeleton-based action recognition: NTU RGB+D, NTU RGB+D 120 and UAV-Human. The results demonstrate that MSST-RT is on par with state-of-the-art methods in terms of performance.
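A minimal sketch of the multi-stream late-fusion step described in this abstract, not the authors' released code: each ST-RT stream is assumed to output per-class logits for a clip, and the four streams are combined by (optionally weighted) softmax-score summation. The stream names and weights are illustrative assumptions.

```python
import torch

# Late fusion of several stream scores (sketch, not the MSST-RT implementation).
def fuse_streams(stream_logits, weights=None):
    """stream_logits: list of (num_classes,) tensors, one per skeleton stream."""
    weights = weights if weights is not None else [1.0] * len(stream_logits)
    scores = [w * torch.softmax(logits, dim=-1) for w, logits in zip(weights, stream_logits)]
    return torch.stack(scores).sum(dim=0)  # fused class scores

# Hypothetical usage with four streams (e.g. joint, bone, joint motion, bone motion)
# and the 60 action classes of NTU RGB+D.
logits = [torch.randn(60) for _ in range(4)]
predicted_class = fuse_streams(logits).argmax().item()
```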

Algorithms ◽ 2020 ◽ Vol 13 (11) ◽ pp. 301
Author(s): Guocheng Liu, Caixia Zhang, Qingyang Xu, Ruoshi Cheng, Yong Song, ...

Optical-flow-based human action recognition is difficult to apply in practice because of its heavy computational cost, so a human action recognition algorithm, the I3D-shufflenet model, is proposed, combining the advantages of the I3D neural network and the lightweight ShuffleNet. The 5 × 5 convolution kernel of I3D is replaced by two 3 × 3 convolution kernels, which reduces the amount of computation, and a shuffle layer is adopted to exchange features across channels. Recognition and classification of human actions are performed with the trained I3D-shufflenet model. The experimental results show that the shuffle layer improves the composition of features in each channel, which promotes the use of useful information. Histogram of Oriented Gradients (HOG) spatial-temporal features of the object are extracted for training, which significantly improves the expression of human actions and reduces the cost of feature extraction. I3D-shufflenet is evaluated on the UCF101 dataset and compared with other models; the final result shows that I3D-shufflenet achieves higher accuracy than the original I3D, reaching 96.4%.
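A minimal sketch (not the published I3D-shufflenet code) of the two ideas in this abstract, shown with 2D convolutions for brevity: a 5 × 5 kernel replaced by two stacked 3 × 3 kernels, which cover the same receptive field with fewer parameters, followed by a ShuffleNet-style channel shuffle that exchanges features across groups. Class and function names are illustrative.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Interleave channels across groups so information flows between them.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class DoubleConvShuffle(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        # Two 3x3 convolutions in place of one 5x5 convolution.
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return channel_shuffle(self.convs(x), self.groups)

# Usage: y = DoubleConvShuffle(64)(torch.randn(1, 64, 56, 56))
```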


Data ◽ 2020 ◽ Vol 5 (4) ◽ pp. 104
Author(s): Ashok Sarabu, Ajit Kumar Santra

The two-stream convolutional neural network (CNN) has proven highly successful for action recognition in videos. The main idea is to train two CNNs to learn spatial and temporal features separately, and to combine the two scores to obtain the final prediction. In the literature, we observed that most methods use similar CNNs for the two streams. In this paper, we design a two-stream CNN architecture with different CNNs for the two streams to learn spatial and temporal features. Temporal Segment Networks (TSN) are applied to capture long-range temporal features and to differentiate similar sub-actions in videos. Data augmentation techniques are employed to prevent over-fitting. Advanced cross-modal pre-training is discussed and introduced into the proposed architecture to enhance the accuracy of action recognition. The proposed two-stream model is evaluated on two challenging action recognition datasets, HMDB-51 and UCF-101. The findings show a significant performance increase, and the proposed architecture outperforms existing methods.
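A minimal sketch of the scoring pipeline this abstract describes, under two assumptions flagged here rather than taken from the paper: TSN's consensus is plain averaging of per-segment logits, and the two streams are fused with a fixed weight ratio. Function names and weights are illustrative.

```python
import torch

def segment_consensus(segment_logits):
    """Average per-segment logits, the simplest TSN-style consensus."""
    return torch.stack(segment_logits).mean(dim=0)

def fuse_two_streams(spatial_logits, temporal_logits, w_spatial=1.0, w_temporal=1.5):
    # Weighted late fusion of spatial (RGB) and temporal (flow) stream scores.
    return (w_spatial * torch.softmax(spatial_logits, dim=-1)
            + w_temporal * torch.softmax(temporal_logits, dim=-1))

# Hypothetical usage: 3 segments per video, 101 UCF-101 classes.
spatial = segment_consensus([torch.randn(101) for _ in range(3)])
temporal = segment_consensus([torch.randn(101) for _ in range(3)])
prediction = fuse_two_streams(spatial, temporal).argmax().item()
```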


Author(s): C. Indhumathi, V. Murugan, G. Muthulakshmii

Nowadays, action recognition has gained more attention from the computer vision community. To recognize human actions, spatial and temporal features are normally extracted, and two-stream convolutional neural networks are commonly used for human action recognition in videos. In this paper, an Adaptive motion Attentive Correlated Temporal Feature (ACTF) is used as the temporal feature extractor. Inter-frame temporal average pooling is used to extract the inter-frame regional correlation feature and the mean feature. The proposed method achieves accuracies of 96.9% on UCF101 and 74.6% on HMDB51, which are higher than other state-of-the-art methods.
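An illustrative sketch of the inter-frame temporal average pooling idea mentioned above; the function name and the correlation definition (a channel-wise product of each frame with the temporal mean) are assumptions for illustration, not the authors' ACTF implementation.

```python
import torch

def temporal_mean_and_correlation(features):
    """features: (T, C, H, W) per-frame feature maps from a backbone."""
    mean_feat = features.mean(dim=0)                       # temporal average pooling -> mean feature
    corr = (features * mean_feat.unsqueeze(0)).sum(dim=1)  # (T, H, W) frame-to-mean regional correlation
    return mean_feat, corr

# Hypothetical usage: 8 frames of 256-channel, 14x14 backbone features.
feats = torch.randn(8, 256, 14, 14)
mean_feat, corr = temporal_mean_and_correlation(feats)
```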


IEEE Access ◽ 2018 ◽ Vol 6 ◽ pp. 17913-17922
Author(s): Lei Wang, Yangyang Xu, Jun Cheng, Haiying Xia, Jianqin Yin, ...

Author(s): Haoze Wu, Jiawei Liu, Zheng-Jun Zha, Zhenzhong Chen, Xiaoyan Sun

Recent works use 3D convolutional neural networks to explore spatio-temporal information for human action recognition. However, they either ignore the correlation between spatial and temporal features or suffer from the high computational cost of spatio-temporal feature extraction. In this work, we propose a novel and efficient Mutually Reinforced Spatio-Temporal Convolutional Tube (MRST) for human action recognition. It decomposes 3D inputs into spatial and temporal representations, mutually enhances both by exploiting the interaction between spatial and temporal information, and selectively emphasizes informative spatial appearance and temporal motion, while reducing structural complexity. Moreover, we design three types of MRSTs according to the order in which spatial and temporal information is enhanced, each of which contains a spatio-temporal decomposition unit, a mutually reinforced unit and a spatio-temporal fusion unit. An end-to-end deep network, MRST-Net, is also proposed based on the MRSTs to better explore spatio-temporal information in human actions. Extensive experiments show that MRST-Net yields the best performance compared to state-of-the-art approaches.
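A minimal sketch of the decompose-reinforce-fuse pattern this abstract outlines, using a generic (2+1)D-style split with a simple mutual gating; it is not the exact MRST tube, and the block name and gating choice are assumptions.

```python
import torch
import torch.nn as nn

class DecomposedSTBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Spatio-temporal decomposition: spatial 1x3x3 and temporal 3x1x1 convolutions.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                       # x: (N, C, T, H, W)
        s = self.spatial(x)
        t = self.temporal(x)
        # Mutual reinforcement: each branch gates the other via a channel-wise sigmoid.
        s_gated = s * torch.sigmoid(t.mean(dim=(2, 3, 4), keepdim=True))
        t_gated = t * torch.sigmoid(s.mean(dim=(2, 3, 4), keepdim=True))
        return s_gated + t_gated                # simple spatio-temporal fusion

# Usage: y = DecomposedSTBlock(64)(torch.randn(1, 64, 8, 28, 28))
```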

