Human Action Recognition Based on a Spatio-Temporal Video Autoencoder

Due to rapid advances in the development of surveillance cameras with high sampling rates, low cost, small size and high resolution, video-based action recognition systems have become more commonly used in various computer vision applications. Such systems can help human operators detect events of interest in video sequences, improving recognition results and reducing failure cases. In this work, we propose and evaluate a method to learn two-dimensional (2D) representations from video sequences based on an autoencoder framework. Spatial and temporal information is explored through a multi-stream convolutional neural network in the context of human action recognition. Experimental results on the challenging UCF101 and HMDB51 datasets demonstrate that our representation achieves competitive accuracy rates compared with other approaches available in the literature.

Low-Cost Embedded System Using Convolutional Neural Networks-Based Spatiotemporal Feature Map for Real-Time Human Action Recognition

Applied Sciences ◽

10.3390/app11114940 ◽

2021 ◽

Vol 11 (11) ◽

pp. 4940

Author(s):

Jinsoo Kim ◽

Jeongho Cho

Keyword(s):

Embedded System ◽

Real Time ◽

Action Recognition ◽

Processing Speed ◽

Recognition Accuracy ◽

Low Cost ◽

Human Action Recognition ◽

Human Action ◽

Video Data ◽

Feature Maps

Research on video data is challenging because both spatial and temporal features must be extracted, and human action recognition (HAR) is a representative area in which convolutional neural networks (CNNs) are applied to video data. Action recognition performance has improved, but owing to model complexity, some limitations on real-time operation still persist. Therefore, a lightweight CNN-based single-stream HAR model that can operate in real time is proposed. The proposed model extracts spatial feature maps by applying a CNN to the images that make up the video and uses the frame change rate of sequential images as temporal information. The spatial feature maps are weighted-averaged by frame change, transformed into spatiotemporal features, and input into multilayer perceptrons, which have relatively lower complexity than other HAR models; thus, the method has high utility in a single embedded system connected to CCTV. An evaluation of action recognition accuracy and data processing speed on the challenging UCF-101 benchmark showed higher action recognition accuracy than the HAR model using long short-term memory with a small number of video frames, and the fast data processing speed confirmed that real-time operation is feasible. In addition, the proposed weighted mean-based HAR model was tested on a Jetson NANO to confirm that it can be used in low-cost GPU-based embedded systems.
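
The weighted-averaging step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define the frame change rate, so here it is assumed to be the mean absolute pixel difference between consecutive frames, and the first frame is assumed to reuse the first measured change. The function names `frame_change_weights` and `weighted_spatiotemporal_feature` are hypothetical, and the per-frame CNN features are taken as given.

```python
import numpy as np


def frame_change_weights(frames):
    """Return one weight per frame, summing to 1.

    ASSUMPTION: "frame change rate" is taken to be the mean absolute
    pixel difference between consecutive frames. The abstract does not
    give the exact definition.
    frames: array of shape (T, H, W) or (T, H, W, C).
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    if num_frames == 1:
        return np.ones(1)
    # Mean absolute difference between frame t and frame t-1.
    diffs = np.abs(frames[1:] - frames[:-1]).reshape(num_frames - 1, -1).mean(axis=1)
    # ASSUMPTION: the first frame has no predecessor, so it reuses the
    # first measured change value.
    change = np.concatenate([diffs[:1], diffs])
    total = change.sum()
    if total == 0:
        # A static clip falls back to a plain (uniform) average.
        return np.full(num_frames, 1.0 / num_frames)
    return change / total


def weighted_spatiotemporal_feature(features, weights):
    """Collapse per-frame features (T, ...) into one feature (...)

    by a weighted average over the time axis.
    """
    features = np.asarray(features, dtype=np.float64)
    return np.tensordot(weights, features, axes=(0, 0))
```

In such a sketch, `features` would hold the per-frame CNN feature maps (for example an array of shape `(T, D)` or `(T, h, w, c)`), and the returned array would be flattened and passed to the classifier. Frames that differ more from their predecessor contribute more to the pooled feature, which is one plausible reading of "weighted-averaged by frame change".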

Human action recognition based on spatio-temporal three-dimensional scattering transform descriptor and an improved VLAD feature encoding algorithm

Neurocomputing ◽

10.1016/j.neucom.2018.05.121 ◽

2019 ◽

Vol 348 ◽

pp. 145-157 ◽

Cited By ~ 1

Author(s):

Bo Lin ◽

Bin Fang ◽

Weibin Yang ◽

Jiye Qian

Keyword(s):

Action Recognition ◽

Three Dimensional ◽

Human Action Recognition ◽

Human Action ◽

Scattering Transform ◽

Feature Encoding ◽

Spatio Temporal

VIEW-ROBUST HUMAN ACTION RECOGNITION BASED ON SPATIO-TEMPORAL SELF SIMILARITIES

JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL SCIENCES ◽

10.26782/jmcms.2020.01.00010 ◽

2020 ◽

Vol 15 (1) ◽

Author(s):

K. Pradeep Reddy

Keyword(s):

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Spatio Temporal

Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos

MultiMedia Modeling - Lecture Notes in Computer Science ◽

10.1007/978-3-319-51811-4_30 ◽

2016 ◽

pp. 365-378 ◽

Cited By ~ 13

Author(s):

Ionut C. Duta ◽

Bogdan Ionescu ◽

Kiyoharu Aizawa ◽

Nicu Sebe

Keyword(s):

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Spatio Temporal

Spatio-temporal SRU with global context-aware attention for 3D human action recognition

Multimedia Tools and Applications ◽

10.1007/s11042-019-08587-w ◽

2020 ◽

Vol 79 (17-18) ◽

pp. 12349-12371

Author(s):

Qingshan She ◽

Gaoyuan Mu ◽

Haitao Gan ◽

Yingle Fan

Keyword(s):

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Context Aware ◽

Global Context ◽

Spatio Temporal

Agglomerative Clustering and Residual-VLAD Encoding for Human Action Recognition

Applied Sciences ◽

10.3390/app10124412 ◽

2020 ◽

Vol 10 (12) ◽

pp. 4412

Author(s):

Ammar Mohsin Butt ◽

Muhammad Haroon Yousaf ◽

Fiza Murtaza ◽

Saima Nazir ◽

Serestina Viriri ◽

...

Keyword(s):

Action Recognition ◽

Feature Vector ◽

Human Action Recognition ◽

Human Action ◽

Compact Representation ◽

Agglomerative Clustering ◽

Residual Vector ◽

Benchmark Datasets ◽

Codebook Generation ◽

Spatio Temporal

Human action recognition has attracted significant attention in recent years due to its high demand in various application domains. In this work, we propose a novel codebook generation and hybrid encoding scheme for classification of action videos. The proposed scheme develops a discriminative codebook and a hybrid feature vector by encoding the features extracted from convolutional neural networks (CNNs). We explore different CNN architectures for extracting spatio-temporal features. We employ an agglomerative clustering approach for codebook generation, which is intended to combine the advantages of global and class-specific codebooks. We propose a Residual Vector of Locally Aggregated Descriptors (R-VLAD) and fuse it with locality-based coding to form a hybrid feature vector. It provides a compact representation along with high-order statistics. We evaluated our work on two publicly available standard benchmark datasets, HMDB51 and UCF101. The proposed method achieves 72.6% and 96.2% on HMDB51 and UCF101, respectively. We conclude that the proposed scheme is able to boost recognition accuracy for human action recognition.
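
For readers unfamiliar with residual-based encoding, the following is a minimal sketch of standard VLAD, the baseline that this abstract builds on. It is not the R-VLAD or hybrid scheme proposed by the authors, whose details are not given here. The codebook is assumed to be already available (the abstract builds it with agglomerative clustering; any set of cluster centers works for this illustration), and the function name `vlad_encode` is hypothetical.

```python
import numpy as np


def vlad_encode(descriptors, codebook):
    """Standard VLAD encoding (NOT the paper's R-VLAD variant).

    descriptors: (N, D) local features, e.g. CNN activations.
    codebook:    (K, D) cluster centers, assumed precomputed.
    Returns an L2-normalized vector of length K * D.
    """
    X = np.asarray(descriptors, dtype=np.float64)
    C = np.asarray(codebook, dtype=np.float64)
    # Squared Euclidean distance from every descriptor to every center.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    num_centers, dim = C.shape
    vlad = np.zeros((num_centers, dim))
    for k in range(num_centers):
        members = X[nearest == k]
        if len(members):
            # Accumulate residuals (descriptor minus its nearest center).
            vlad[k] = (members - C[k]).sum(axis=0)
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The output has a fixed length of K * D regardless of how many descriptors a video yields, which is what makes this family of encodings a compact video-level representation suitable for a standard classifier.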
