Spatio-Temporal Attention Networks for Action Recognition and Detection

2020 ◽  
Vol 22 (11) ◽  
pp. 2990-3001 ◽  
Author(s):  
Jun Li ◽  
Xianglong Liu ◽  
Wenxuan Zhang ◽  
Mingyuan Zhang ◽  
Jingkuan Song ◽  
...  
2019 ◽  
Vol 21 (2) ◽  
pp. 416-428 ◽  
Author(s):  
Dong Li ◽  
Ting Yao ◽  
Ling-Yu Duan ◽  
Tao Mei ◽  
Yong Rui

Author(s):  
Chunyu Xie ◽  
Ce Li ◽  
Baochang Zhang ◽  
Chen Chen ◽  
Jungong Han ◽  
...  

The skeleton-based action recognition task is entangled with complex spatio-temporal variations of skeleton joints, and remains challenging for Recurrent Neural Networks (RNNs). In this work, we propose a temporal-then-spatial recalibration scheme to alleviate such complex variations, resulting in end-to-end Memory Attention Networks (MANs), which consist of a Temporal Attention Recalibration Module (TARM) and a Spatio-Temporal Convolution Module (STCM). Specifically, the TARM is deployed in a residual learning module that employs a novel attention learning network to recalibrate the temporal attention of frames in a skeleton sequence. The STCM treats the attention-calibrated skeleton joint sequences as images and leverages Convolutional Neural Networks (CNNs) to further model the spatial and temporal information of skeleton data. These two modules (TARM and STCM) form a single network architecture that can be trained in an end-to-end fashion. MANs significantly boost the performance of skeleton-based action recognition and achieve the best results on four challenging benchmark datasets: NTU RGB+D, HDM05, SYSU-3D and UT-Kinect.
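The temporal-then-spatial idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the scoring vector `w` is a hypothetical stand-in for the learned attention network, and the function names and the 25-joint layout are assumptions made for illustration. The sketch shows only the residual recalibration of frames and the reshaping of the sequence into an image-like array that a CNN could consume.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def temporal_recalibration(seq, w):
    """Reweight frames with temporal attention inside a residual connection.

    seq: (T, D) array, one flattened joint vector per frame.
    w:   (D,) hypothetical scoring vector (assumption: stands in for a
         learned attention network).
    """
    scores = seq @ w                   # one scalar score per frame, shape (T,)
    attn = softmax(scores, axis=0)     # attention over frames, sums to 1
    # Residual form: original sequence plus its attention-weighted copy.
    return seq + attn[:, None] * seq, attn

def as_image(seq, num_joints, coords=3):
    """Reshape (T, J*C) into a (T, J, C) array: frames x joints x channels."""
    return seq.reshape(seq.shape[0], num_joints, coords)

rng = np.random.default_rng(0)
skeleton = rng.normal(size=(20, 25 * 3))   # 20 frames, 25 joints, xyz
w = rng.normal(size=25 * 3)
recalibrated, attn = temporal_recalibration(skeleton, w)
image = as_image(recalibrated, num_joints=25)
```

The residual form keeps the original signal intact, so frames with low attention are attenuated relative to others rather than zeroed out.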


2020 ◽  
Vol 10 (15) ◽  
pp. 5326
Author(s):  
Xiaolei Diao ◽  
Xiaoqiang Li ◽  
Chen Huang

The same action can take different amounts of time in different cases, and this difference affects the accuracy of action recognition to a certain extent. We propose an end-to-end deep neural network called “Multi-Term Attention Networks” (MTANs), which addresses this problem by extracting temporal features at different time scales. The network consists of a Multi-Term Attention Recurrent Neural Network (MTA-RNN) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). In MTA-RNN, a method for fusing multi-term temporal features is proposed to extract the temporal dependence of different time scales, and the weighted fused temporal feature is recalibrated by the attention mechanism. Ablation studies show that this network has powerful spatio-temporal dynamic modeling capabilities for actions with different time scales. We perform extensive experiments on four challenging benchmark datasets: NTU RGB+D, UT-Kinect, Northwestern-UCLA, and UWA3DII. Our method achieves better results than the state-of-the-art benchmarks, which demonstrates the effectiveness of MTANs.
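A rough sketch of multi-term fusion followed by attention recalibration is given below. It deliberately swaps the recurrent network of the abstract for causal moving averages, so it illustrates only the data flow (several time scales, weighted fusion, attention over frames). The window sizes, the fusion logits, and the norm-based attention score are all assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def moving_average(seq, window):
    """Causal moving average over time; seq has shape (T, D)."""
    out = np.empty_like(seq, dtype=float)
    for t in range(seq.shape[0]):
        start = max(0, t - window + 1)
        out[t] = seq[start:t + 1].mean(axis=0)
    return out

def multi_term_fusion(seq, windows, fusion_logits):
    """Fuse features from several time scales, then recalibrate per frame.

    windows:       window lengths, one per time scale (assumed values).
    fusion_logits: one logit per scale (assumption: learned in practice).
    """
    terms = np.stack([moving_average(seq, w) for w in windows])  # (K, T, D)
    weights = softmax(np.asarray(fusion_logits, dtype=float))    # (K,)
    fused = np.tensordot(weights, terms, axes=1)                 # (T, D)
    # Hypothetical attention score: the feature norm of each frame.
    attn = softmax(np.linalg.norm(fused, axis=1), axis=0)        # (T,)
    return fused * attn[:, None], weights, attn

rng = np.random.default_rng(0)
sequence = rng.normal(size=(30, 8))
features, weights, attn = multi_term_fusion(sequence, [2, 5, 10], [0.0, 0.5, 1.0])
```

Short windows follow fast motions while long windows smooth over slow ones, which is the intuition behind combining several terms before applying attention.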


Author(s):  
Zhou Zhao ◽  
Qifan Yang ◽  
Deng Cai ◽  
Xiaofei He ◽  
Yueting Zhuang

Open-ended video question answering is a challenging problem in visual information retrieval, in which a natural language answer is automatically generated from the referenced video content according to the question. However, existing visual question answering work focuses only on static images and may not transfer effectively to video question answering because of the temporal dynamics of video content. In this paper, we consider the problem of open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video content according to the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset. Extensive experiments show the effectiveness of our method.
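One way to picture question-guided hierarchical attention is the sketch below: spatial attention first pools region features within each frame, then temporal attention pools the resulting frame features. This is an illustrative assumption rather than the paper's architecture. Dot-product scoring against a question vector is used for simplicity, and all shapes and names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def hierarchical_attention(video, question):
    """Pool a video into one vector, guided by a question embedding.

    video:    (T, R, D) array of R region features for each of T frames.
    question: (D,) question embedding (assumed to share the feature space).
    """
    spatial_scores = video @ question                  # (T, R)
    spatial_attn = softmax(spatial_scores, axis=1)     # attention over regions
    frame_feats = np.einsum("tr,trd->td", spatial_attn, video)  # (T, D)
    temporal_scores = frame_feats @ question           # (T,)
    temporal_attn = softmax(temporal_scores, axis=0)   # attention over frames
    context = temporal_attn @ frame_feats              # (D,)
    return context, spatial_attn, temporal_attn

rng = np.random.default_rng(0)
video = rng.normal(size=(12, 6, 16))   # 12 frames, 6 regions, 16-dim features
question = rng.normal(size=16)
context, spatial_attn, temporal_attn = hierarchical_attention(video, question)
```

In an encoder-decoder setting, a vector like `context` would be passed to the decoder that generates the answer; that step is omitted here.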


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 88604-88616 ◽  
Author(s):  
Yun Han ◽  
Sheng-Luen Chung ◽  
Qiang Xiao ◽  
Wei You Lin ◽  
Shun-Feng Su

Sensors ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 1005
Author(s):  
Pau Climent-Pérez ◽  
Francisco Florez-Revuelta

The potential benefits of recognising activities of daily living from video for active and assisted living have yet to be fully tapped. These technologies can be used for behaviour understanding and lifelogging, for caregivers and end users alike. The recent publication of realistic datasets for this purpose, such as the Toyota Smarthomes dataset, calls for pushing forward the efforts to improve action recognition. Using the separable spatio-temporal attention network proposed in the literature, this paper introduces a view-invariant normalisation of skeletal pose data and full activity crops for RGB data, which improve the baseline results by 9.5% (on the cross-subject experiments), outperforming state-of-the-art techniques in this field when using the original unmodified skeletal data in the dataset. Our code and data are available online.
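The abstract does not spell out the normalisation, so the following is only a sketch of one common approach to view invariance: translate the skeleton so a root joint sits at the origin, then rotate about the vertical axis so the hip line is parallel to the x axis. The joint indices, the choice of y as the vertical axis, and the use of the first frame as reference are all assumptions.

```python
import numpy as np

def normalise_view(skeleton, root=0, left_hip=1, right_hip=2):
    """Remove camera translation and rotation about the vertical (y) axis.

    skeleton: (T, J, 3) array of joint positions.
    root, left_hip, right_hip: hypothetical joint indices.
    """
    skeleton = np.asarray(skeleton, dtype=float)
    # Translate so the root joint of the first frame is at the origin.
    centred = skeleton - skeleton[0, root]
    # Hip line in the first frame defines the body's facing direction.
    hip = centred[0, right_hip] - centred[0, left_hip]
    angle = np.arctan2(hip[2], hip[0])
    c, s = np.cos(angle), np.sin(angle)
    # Rotation about y that maps the hip line onto the positive x axis.
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return centred @ rot.T

rng = np.random.default_rng(1)
pose = rng.normal(size=(5, 4, 3))   # 5 frames, 4 joints
normalised = normalise_view(pose)
```

Under these assumptions, two recordings of the same motion that differ only by a camera shift and a rotation about the vertical axis map to the same normalised skeleton.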


2018 ◽  
Vol 27 (7) ◽  
pp. 3459-3471 ◽  
Author(s):  
Sijie Song ◽  
Cuiling Lan ◽  
Junliang Xing ◽  
Wenjun Zeng ◽  
Jiaying Liu
