Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition

2020 ◽  
Vol 10 (2) ◽  
pp. 557 ◽  
Author(s):  
Mei Chee Leong ◽  
Dilip K. Prasad ◽  
Yong Tsui Lee ◽  
Feng Lin

This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models: VGG-16, ResNets and DenseNets, and compare its performance with that of the corresponding 3D models. Our empirical results on the action recognition dataset UCF-101 demonstrate that our fusion of 1D, 2D and 3D convolutions outperforms the 3D model of the same depth while using fewer parameters and reducing overfitting. Our semi-CNN architecture achieves an average boost of 16–30% in top-1 accuracy when evaluated on input videos of 16 frames.
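The parameter saving claimed by the abstract follows directly from kernel shapes: a k x k x k 3D kernel holds k times the weights of a k x k 2D kernel for the same channel counts. A minimal sketch, assuming an illustrative VGG-style channel plan (the exact layer configuration is a hypothetical stand-in, not the authors' architecture):

```python
# Hypothetical sketch (not the authors' code): comparing the weight count of
# a fully 3D convolutional stack against a semi-CNN-style stack that keeps
# 2D convolutions in the lower layers and a 3D convolution only at the top.

def conv2d_params(c_in, c_out, k=3):
    """Weights in one 2D convolution layer (biases ignored)."""
    return c_in * c_out * k * k

def conv3d_params(c_in, c_out, k=3):
    """Weights in one 3D convolution layer (biases ignored)."""
    return c_in * c_out * k * k * k

# Illustrative (in, out) channel plan, loosely VGG-like -- an assumption.
channels = [(3, 64), (64, 128), (128, 256), (256, 512)]

full_3d = sum(conv3d_params(ci, co) for ci, co in channels)

# Semi-CNN idea: 2D convolutions for spatial features in the lower layers,
# a 3D convolution for temporal encoding only at the top.
semi = (sum(conv2d_params(ci, co) for ci, co in channels[:3])
        + conv3d_params(*channels[3]))

print(full_3d, semi, semi < full_3d)
```

Even with the 3D kernel retained at the widest (top) layer, the semi stack needs fewer weights than the all-3D stack of the same depth, which is the trade the abstract describes.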


IEEE Access ◽  
2018 ◽  
Vol 6 ◽  
pp. 17913-17922 ◽  
Author(s):  
Lei Wang ◽  
Yangyang Xu ◽  
Jun Cheng ◽  
Haiying Xia ◽  
Jianqin Yin ◽  
...  

2020 ◽  
Vol 10 (15) ◽  
pp. 5326
Author(s):  
Xiaolei Diao ◽  
Xiaoqiang Li ◽  
Chen Huang

The same action takes a different amount of time in different cases. This variation affects the accuracy of action recognition to a certain extent. We propose an end-to-end deep neural network called "Multi-Term Attention Networks" (MTANs), which addresses this problem by extracting temporal features at different time scales. The network consists of a Multi-Term Attention Recurrent Neural Network (MTA-RNN) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). In MTA-RNN, a method for fusing multi-term temporal features is proposed to capture temporal dependencies at different time scales, and the weighted fused temporal feature is recalibrated by an attention mechanism. Ablation studies show that this network has powerful spatio-temporal dynamic modeling capabilities for actions with different time scales. We perform extensive experiments on four challenging benchmark datasets: the NTU RGB+D dataset, UT-Kinect dataset, Northwestern-UCLA dataset, and UWA3DII dataset. Our method achieves better results than the state-of-the-art benchmarks, which demonstrates the effectiveness of MTANs.
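The core idea of the abstract, features pooled over several temporal windows and then fused by attention weights, can be sketched in a few lines. This is a hypothetical illustration, not the authors' MTA-RNN: the window sizes, mean pooling, and dot-product scoring against a random "query" are all stand-in assumptions.

```python
# Hypothetical sketch of attention-weighted fusion of multi-term temporal
# features. Windows (2, 8, 16), mean pooling, and the random query vector
# are illustrative assumptions, not the published MTAN design.
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 8                       # number of frames, feature dimension
frames = rng.standard_normal((T, D))

def term_feature(x, window):
    """Mean-pool the last `window` frames: one temporal scale ('term')."""
    return x[-window:].mean(axis=0)

# Multi-term features: short-, mid-, and long-range temporal context.
terms = np.stack([term_feature(frames, w) for w in (2, 8, 16)])

# Attention: score each term against a (hypothetical, here random) query,
# normalize with a softmax, and fuse the terms by their weights.
query = rng.standard_normal(D)
scores = terms @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()
fused = weights @ terms            # shape (D,): recalibrated temporal feature

print(fused.shape, float(weights.sum()))
```

In the actual model the query and pooling would be learned jointly with the recurrent network; the sketch only shows how differently-scaled temporal summaries are combined into a single attention-recalibrated feature.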

