RGB-D Human Action Recognition of Deep Feature Enhancement and Fusion Using Two-Stream ConvNet

2021, Vol 2021, pp. 1-10
Author(s): Yun Liu, Ruidi Ma, Hui Li, Chuanxu Wang, Ye Tao

Action recognition is an important research direction in computer vision. Recognition from RGB video is easily affected by factors such as background and lighting, while depth video can better suppress such interference and improve recognition accuracy. This paper therefore makes full use of video and depth skeleton data and proposes an RGB-D action recognition two-stream network (SV-GCN), a two-stream architecture that operates on two different data modalities. The skeleton stream (S-Stream) is a Nonlocal-STGCN, which adds nonlocal blocks to capture dependency relationships between a wider range of joints and thus provides richer skeleton-point features to the model. The video stream (V-Stream) is a Dilated-SlowFastNet, which replaces the traditional random sampling layer with dilated convolutional layers to make better use of the depth features. Finally, the information from the two streams is fused to perform action recognition. Experimental results on the NTU-RGB+D dataset show that the proposed method significantly improves recognition accuracy and outperforms ST-GCN and SlowFastNet on both the CS and CV benchmarks.
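The final fusion step described above combines the two streams' predictions. A minimal late score-fusion sketch in NumPy (the weights, class count, and logits here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_two_streams(skeleton_logits, video_logits, w_s=0.5, w_v=0.5):
    """Late fusion: weighted sum of per-stream class probabilities."""
    p_s = softmax(skeleton_logits)
    p_v = softmax(video_logits)
    return w_s * p_s + w_v * p_v

# Toy 4-class example: the streams disagree, fusion arbitrates.
s = np.array([2.0, 0.1, 0.1, 0.1])   # skeleton stream favours class 0
v = np.array([0.1, 0.1, 0.1, 1.0])   # video stream favours class 3
fused = fuse_two_streams(s, v)
pred = int(np.argmax(fused))
```

Because the skeleton stream is more confident here, the fused probabilities still pick class 0; equal stream weights are a common default, but they would normally be tuned on validation data.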

2021, Vol 11 (11), pp. 4940
Author(s): Jinsoo Kim, Jeongho Cho

Research on video data faces the difficulty of extracting not only spatial but also temporal features, and human action recognition (HAR) is a representative field that applies convolutional neural networks (CNNs) to video data. Although action recognition performance has improved, the complexity of such models still limits real-time operation. Therefore, a lightweight CNN-based single-stream HAR model that can operate in real time is proposed. The proposed model extracts spatial feature maps by applying a CNN to the images that compose the video and uses the frame change rate of sequential images as temporal information. The spatial feature maps are weighted-averaged by frame change rate, transformed into spatiotemporal features, and fed into a multilayer perceptron, which has relatively lower complexity than other HAR models; the method is therefore well suited to a single embedded system connected to CCTV. Evaluation of recognition accuracy and data processing speed on the challenging UCF-101 action recognition benchmark showed higher accuracy than an LSTM-based HAR model when using a small number of video frames, and the fast processing speed confirmed the possibility of real-time operation. In addition, the proposed weighted-mean-based HAR model was verified on a Jetson Nano to confirm its suitability for low-cost GPU-based embedded systems.
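The core idea of weighting per-frame spatial features by the frame change rate can be sketched as follows; this is an assumed reading of the abstract (mean absolute inter-frame difference as the change rate), not the authors' exact formulation:

```python
import numpy as np

def frame_change_weights(frames):
    """Per-frame change rate: mean absolute difference to the previous
    frame, normalized to sum to 1. The first frame, which has no
    predecessor, reuses the second frame's rate."""
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    diffs = np.concatenate([[diffs[0]], diffs])
    return diffs / diffs.sum()

def weighted_spatiotemporal_feature(feature_maps, frames):
    """Collapse per-frame CNN features into one spatiotemporal vector
    by weighting each frame with its change rate."""
    w = frame_change_weights(frames)
    return np.tensordot(w, feature_maps, axes=1)

rng = np.random.default_rng(0)
frames = rng.random((8, 16, 16))   # 8 grayscale frames (toy data)
feats = rng.random((8, 32))        # one 32-d CNN feature per frame
fused = weighted_spatiotemporal_feature(feats, frames)
```

The resulting single vector would then be the input to the multilayer perceptron, which keeps the temporal modelling far cheaper than a recurrent layer.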


2013, Vol 631-632, pp. 1303-1308
Author(s): He Jin Yuan

A novel human action recognition algorithm based on key postures is proposed in this paper. In the method, the mesh features of each image in a human action sequence are first calculated; the key postures are then generated from the mesh features through k-medoids clustering; and each motion sequence is represented as a vector of key postures, where each component is the number of occurrences of the corresponding posture in the action. For recognition, the observed action is first converted into a key-posture vector; the correlation coefficients with the training samples are then calculated, and the action that best matches the observed sequence is chosen as the final category. Experiments on the Weizmann dataset demonstrate that the method is effective for human action recognition, with an average recognition accuracy exceeding 90%.
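The occurrence-count representation and correlation matching can be sketched in a few lines; the clustering itself is omitted and frames are assumed to be already labelled with their nearest key posture (an assumption for illustration):

```python
import numpy as np

def posture_histogram(frame_labels, n_postures):
    """Represent an action sequence as occurrence counts of each key posture."""
    return np.bincount(frame_labels, minlength=n_postures).astype(float)

def recognize(observed_labels, train_histograms, n_postures):
    """Pick the training action whose posture histogram correlates best
    with the observed sequence's histogram."""
    h = posture_histogram(observed_labels, n_postures)
    scores = [np.corrcoef(h, t)[0, 1] for t in train_histograms]
    return int(np.argmax(scores))

# Toy example: 3 key postures, two training actions.
train = [posture_histogram(np.array([0, 0, 1]), 3),   # action 0: mostly posture 0
         posture_histogram(np.array([2, 2, 1]), 3)]   # action 1: mostly posture 2
obs = np.array([0, 0, 0, 1])                          # resembles action 0
best = recognize(obs, train, 3)
```

Correlation (rather than raw Euclidean distance) makes the match insensitive to sequence length, since only the relative posture proportions matter.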


2021, Vol 2021, pp. 1-6
Author(s): Qiulin Wang, Baole Tao, Fulei Han, Wenting Wei

The extraction and recognition of human actions has always been a research hotspot in the field of state recognition, with broad application prospects in many fields. In sports, it can reduce the occurrence of accidental injuries and improve the training level of basketball players, so extracting effective features from players' dynamic body movements is of great significance. To improve the fairness of basketball games, accurately recognize athletes' movements, and at the same time raise athletes' skill level and standardize their movements during training, this article uses deep learning to extract and recognize the movements of basketball players. The paper implements a human action recognition algorithm based on deep learning, which automatically extracts image features through convolution kernels and thus greatly improves efficiency compared with traditional hand-crafted feature extraction. The method uses the deep convolutional neural network VGG model on the TensorFlow platform to extract and recognize human actions. On the MATLAB platform, the KTH and Weizmann datasets are preprocessed to obtain the input image sets; the preprocessed datasets are then used to train the model, and the optimal network model and corresponding results are obtained by testing on the two datasets. Finally, the two datasets are analyzed in detail, the specific cause of each action confusion is given, and the recognition accuracy and average recognition accuracy of each action category are calculated. The experimental results show that the deep-learning-based human action recognition algorithm achieves a high recognition accuracy.
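The claim that convolution kernels extract features automatically can be made concrete with a hand-rolled 2D correlation; the edge-detecting kernel below is a standard Sobel filter chosen for illustration, not a kernel from the paper's trained VGG model:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image
    and take the weighted sum at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds strongly at this image's boundary.
img = np.zeros((5, 6))
img[:, 3:] = 1.0                       # left half dark, right half bright
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], float)
response = conv2d(img, sobel_x)
```

In a trained CNN, kernels like this are not hand-designed but learned from data, which is exactly what replaces the traditional manual feature-engineering step.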


Author(s): Xueping Liu, Yibo Li, Qingjun Wang

Human action recognition based on depth video sequences is an important research direction in computer vision. The present study proposes a classification framework based on hierarchical multi-view learning for depth-video-based action recognition. Considering the distinguishing features of 3D human action space, we project the 3D human action image onto the three coordinate planes, converting the 3D depth image into three 2D images that are fed to three subnets, respectively. As the number of layers increases, the representations of the subnets are hierarchically fused to form the inputs of the next layers. The final representation of the depth video sequence is fed into a single-layer perceptron, and the final result is decided by accumulating the perceptron's output over time. We compare against other methods on two publicly available datasets and also verify the proposed method on a human action database acquired with our own Kinect system. The experimental results demonstrate that our model has high computational efficiency and matches the performance of state-of-the-art methods.
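The three-plane projection step can be sketched on a binary occupancy grid; using a max-projection along each axis is one common reading of this operation, assumed here for illustration:

```python
import numpy as np

def three_view_projections(voxels):
    """Project a binary 3D occupancy grid (x, y, z) onto the three
    coordinate planes, giving front, side, and top 2D views."""
    front = voxels.max(axis=2)   # project along z -> (x, y) plane
    side = voxels.max(axis=0)    # project along x -> (y, z) plane
    top = voxels.max(axis=1)     # project along y -> (x, z) plane
    return front, side, top

vox = np.zeros((4, 5, 6))
vox[1, 2, 3] = 1.0               # a single occupied cell
front, side, top = three_view_projections(vox)
```

Each 2D view then feeds one of the three subnets, so the framework works with ordinary 2D convolutions instead of costlier 3D ones.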


Animals, 2021, Vol 11 (2), pp. 485
Author(s): Liqi Feng, Yaqin Zhao, Yichao Sun, Wenxuan Zhao, Jiaxi Tang

Behavior analysis of wild felines is significant for protecting grassland ecosystems. Compared with human action recognition, fewer researchers have focused on feline behavior analysis. This paper proposes a novel two-stream architecture that incorporates spatial and temporal networks for wild feline action recognition. The spatial stream outlines the object region extracted by a Mask region-based convolutional neural network (R-CNN) and builds a Tiny Visual Geometry Group (VGG) network for static action recognition; compared with VGG16, the Tiny VGG network reduces the number of network parameters and avoids overfitting. The temporal stream presents a novel skeleton-based action recognition model based on the fluctuation amplitude of the knee joints' bending angle over a video clip. Owing to these temporal features, the model can effectively distinguish between different upright actions, such as standing, ambling, and galloping, particularly when the felines are occluded by objects such as plants or fallen trees. The experimental results showed that the proposed two-stream network model can effectively outline wild feline targets in captured images and significantly improves the performance of wild feline action recognition thanks to its combined spatial and temporal features.
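The knee-bending-angle feature can be sketched from three skeleton joints; the joint names and the max-minus-min fluctuation measure are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def joint_angle(hip, knee, ankle):
    """Bending angle at the knee (degrees): the angle between the
    knee->hip and knee->ankle vectors."""
    v1 = np.asarray(hip, float) - np.asarray(knee, float)
    v2 = np.asarray(ankle, float) - np.asarray(knee, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def angle_fluctuation(angles):
    """Fluctuation amplitude of the angle over a clip: max minus min."""
    return float(np.max(angles) - np.min(angles))

# Straight leg: hip, knee, ankle collinear -> 180 degrees.
straight = joint_angle([0, 2], [0, 1], [0, 0])
# Right-angle bend at the knee -> 90 degrees.
bent = joint_angle([0, 2], [0, 1], [1, 1])
```

A standing animal produces a small fluctuation amplitude, while ambling or galloping produces a large one, which is why the feature separates these upright actions even under partial occlusion.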


2020, Vol 8 (6), pp. 1556-1566

Human action recognition is a key research direction and a trending topic in fields such as machine learning and computer vision. The main objective of this research is to recognize human actions in images or video. However, existing approaches have limitations such as low recognition accuracy and poor robustness. Hence, this paper develops a novel and robust human action recognition framework with a new feature extraction technique based on the Gabor transform and the dual-tree complex wavelet transform. These two feature extraction techniques help extract highly discriminative features by which the actions present in an image or video are correctly recognized. The framework then employs the support vector machine algorithm as a classifier. Simulation experiments are conducted on two standard datasets, KTH and Weizmann, and the results reveal that the proposed framework achieves better performance than state-of-the-art recognition methods.
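A Gabor filter bank is the first of the two feature extractors named above. A minimal sketch of the real part of a Gabor kernel and a tiny multi-orientation descriptor follows; the kernel size, sigma, wavelength, and the mean-absolute-response pooling are illustrative choices, not the paper's parameters:

```python
import numpy as np

def gabor_kernel(size, theta, sigma=2.0, lam=4.0):
    """Real part of a Gabor filter: a Gaussian envelope times an
    oriented cosine wave."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xr / lam)

def filter_response(image, kern):
    """Mean absolute valid-mode correlation response of one filter."""
    kh, kw = kern.shape
    rows = image.shape[0] - kh + 1
    cols = image.shape[1] - kw + 1
    acc = 0.0
    for i in range(rows):
        for j in range(cols):
            acc += abs(np.sum(image[i:i + kh, j:j + kw] * kern))
    return acc / (rows * cols)

def gabor_features(image, n_orient=4, size=7):
    """Tiny texture descriptor: mean response at several orientations."""
    return np.array([filter_response(image, gabor_kernel(size, np.pi * k / n_orient))
                     for k in range(n_orient)])

rng = np.random.default_rng(1)
img = rng.random((16, 16))
feat = gabor_features(img)
```

In the full framework, such per-orientation responses (together with dual-tree complex wavelet coefficients) would form the feature vector handed to the SVM classifier.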


2021, Vol 2021, pp. 1-10
Author(s): Xunyun Chang, Liangqing Peng

The purpose is to study the interactive teaching mode of human action recognition technology in music and dance teaching under computer vision. A human action detection and recognition system based on a three-dimensional (3D) convolutional neural network (CNN) is established. A dual-channel human action recognition model is then proposed on the basis of the CNN, and a visual attention mechanism using an interframe differential channel is introduced into the model. Through experiments on the Kungliga Tekniska Högskolan (KTH) dataset, the system's performance on human dance image recognition is verified. The results show that, after the frame-difference channel is added, the dual-channel 3D CNN achieves high accuracy within the first few rounds of training, reduces error quickly, and begins to converge quickly; its recognition accuracy on the KTH dataset is 96.6%, higher than that of other methods; and with a 3 × 3 × 3 basic convolution kernel, the classification network reaches its best performance 0.0091 seconds earlier in computation. Thus, the dual-channel 3D CNN recognition system provides good human action recognition accuracy for the dance interactive teaching mode in music teaching.
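The interframe differential channel mentioned above can be computed very simply; taking the absolute difference of consecutive frames is an assumed reading of the mechanism, shown here as a minimal sketch:

```python
import numpy as np

def frame_difference_channel(clip):
    """Interframe differential channel: absolute difference between
    consecutive frames, highlighting moving regions for the second
    channel of the dual-channel network."""
    return np.abs(np.diff(clip, axis=0))

clip = np.zeros((4, 8, 8))
clip[1, 2:4, 2:4] = 1.0      # a small patch appears in frame 1, then vanishes
diff = frame_difference_channel(clip)
```

Static background cancels to zero in this channel, so the network's attention is steered toward the dancer's motion, which is consistent with the reported faster early-training convergence.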

