Viewpoint-Aware Action Recognition using Skeleton-Based Features from Still Images

Electronics, 2021, Vol 10 (9), pp. 1118
Author(s): Seong-heum Kim, Donghyeon Cho

In this paper, we propose a viewpoint-aware action recognition method using skeleton-based features from static images. Our method consists of three main steps. First, we categorize the viewpoint from an input static image. Second, we extract 2D/3D joints using state-of-the-art convolutional neural networks and analyze the geometric relationships of the joints to compute 2D and 3D skeleton features. Finally, we perform view-specific action classification per person, based on the viewpoint category and the extracted 2D and 3D skeleton features. To train and validate our method, we implement two multi-view data acquisition systems and create a new action recognition dataset containing viewpoint labels. The robustness of the proposed method to viewpoint changes is quantitatively confirmed on two multi-view datasets, and a real-world application for recognizing various actions is qualitatively demonstrated.
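
A minimal sketch of the three-step pipeline, assuming hypothetical viewpoint, pose, and per-view classifier models; the pairwise-distance feature below is an illustrative stand-in for the paper's geometric skeleton features:

```python
# Sketch only: viewpoint_model, pose_model and view_classifiers are
# hypothetical placeholders, not the authors' implementation.
import numpy as np

def skeleton_features(joints: np.ndarray) -> np.ndarray:
    """Pairwise joint distances as a simple geometric feature.
    joints: (J, D) array of J joint positions in D dimensions."""
    diffs = joints[:, None, :] - joints[None, :, :]   # (J, J, D)
    dists = np.linalg.norm(diffs, axis=-1)            # (J, J)
    iu = np.triu_indices(len(joints), k=1)
    return dists[iu]  # upper triangle: one scalar per joint pair

def classify_action(image, viewpoint_model, pose_model, view_classifiers):
    view = viewpoint_model(image)              # step 1: categorize viewpoint
    joints_2d, joints_3d = pose_model(image)   # step 2: extract 2D/3D joints
    feat = np.concatenate([skeleton_features(joints_2d),
                           skeleton_features(joints_3d)])
    return view_classifiers[view](feat)        # step 3: view-specific classifier
```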

2019, Vol 277, pp. 02034
Author(s): Sophie Aubry, Sohaib Laraba, Joëlle Tilmanne, Thierry Dutoit

In this paper, a methodology to recognize actions from RGB videos is proposed that takes advantage of recent breakthroughs in deep learning. Following the development of Convolutional Neural Networks (CNNs), research was conducted on transforming skeletal motion data into 2D images. In this work, a solution is proposed that requires only RGB videos instead of RGB-D videos, building on multiple works studying the conversion of RGB-D data into 2D images. From a video stream (RGB images), a two-dimensional skeleton of 18 joints is extracted for each detected body with OpenPose, a DNN-based human pose estimator. The skeleton data are encoded into the red, green, and blue channels of images, and different ways of encoding motion data into images were studied. We successfully use state-of-the-art deep neural networks designed for image classification to recognize actions. Based on a study of related works, we chose the image classification models SqueezeNet, AlexNet, DenseNet, ResNet, Inception, and VGG, and retrained them to perform action recognition. For all tests, the NTU RGB+D database is used. The highest accuracy is obtained with ResNet: 83.317% cross-subject and 88.780% cross-view, which outperforms most state-of-the-art results.
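
A sketch of one plausible joint-to-image encoding in the spirit described above; the exact channel layout and normalization studied in the paper may differ:

```python
# Hypothetical encoding: rows = joints, columns = frames, RGB channels =
# min-max normalized (x, y, confidence) values from OpenPose.
import numpy as np

def skeleton_to_image(seq: np.ndarray) -> np.ndarray:
    """seq: (T, 18, 3) per-frame joints as (x, y, confidence).
    Returns an (18, T, 3) uint8 image ready for an image classifier."""
    img = np.empty((seq.shape[1], seq.shape[0], 3), dtype=np.float32)
    for c in range(3):
        ch = seq[..., c]                       # (T, 18) one coordinate channel
        lo, hi = ch.min(), ch.max()
        img[..., c] = ((ch - lo) / (hi - lo + 1e-8) * 255.0).T
    return img.astype(np.uint8)

# The resulting image can then be resized (e.g. to 224x224) and fed to a
# retrained image classifier such as torchvision's ResNet.
```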


2019, Vol 9 (10), pp. 2042
Author(s): Rachida Tobji, Wu Di, Naeem Ayoub

Recent works in deep learning show that neural networks have high potential in the field of biometric security. The advantage of this type of architecture, besides its robustness, is that the network learns feature vectors by automatically creating discriminative filters in its convolution layers. In this paper, we propose an algorithm, FMnet, for iris recognition using a Fully Convolutional Network (FCN) and a Multi-scale Convolutional Neural Network (MCNN). By exploiting the ability of convolutional neural networks to learn and operate at different resolutions, our proposed iris recognition method overcomes the limitations of classical methods, which rely only on handcrafted feature extraction, by performing feature extraction and classification jointly. Our proposed algorithm shows better classification results than other state-of-the-art iris recognition approaches.
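
The abstract does not detail FMnet's layers, so the following is only a minimal multi-scale sketch in PyTorch: one shared convolutional branch applied to the input at several resolutions, with features concatenated before classification. All layer sizes and the scale set are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleIrisNet(nn.Module):
    def __init__(self, n_classes: int, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.branch = nn.Sequential(          # shared weights across scales
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32 * len(scales), n_classes)

    def forward(self, x):                      # x: (N, 1, H, W) iris images
        feats = []
        for s in self.scales:
            xs = x if s == 1.0 else F.interpolate(
                x, scale_factor=s, mode='bilinear', align_corners=False)
            feats.append(self.branch(xs).flatten(1))
        return self.fc(torch.cat(feats, dim=1))
```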


2020, Vol 10 (2), pp. 557
Author(s): Mei Chee Leong, Dilip K. Prasad, Yong Tsui Lee, Feng Lin

This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial feature extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models (VGG-16, ResNet, and DenseNet) and compare its performance with the corresponding 3D models. Our empirical results on the action recognition dataset UCF-101 demonstrate that our fusion of 1D, 2D, and 3D convolutions outperforms the 3D model of the same depth, with fewer parameters and reduced overfitting. Our semi-CNN architecture achieves an average boost of 16–30% in top-1 accuracy when evaluated on input videos of 16 frames.
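
A reduced sketch of the 2D-then-3D fusion idea, assuming toy layer sizes (the actual semi-CNN transfers pre-trained VGG/ResNet/DenseNet layers): per-frame 2D convolutions run with time folded into the batch, and their stacked feature maps feed 3D convolutions.

```python
import torch
import torch.nn as nn

class Fusion2D3D(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.spatial = nn.Sequential(          # 2D stage, per frame
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.temporal = nn.Sequential(         # 3D stage on stacked features
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                      # x: (N, 3, T, H, W) video clip
        n, c, t, h, w = x.shape
        frames = x.transpose(1, 2).reshape(n * t, c, h, w)  # fold time into batch
        f = self.spatial(frames)                            # (N*T, 32, H/2, W/2)
        f = f.reshape(n, t, *f.shape[1:]).transpose(1, 2)   # (N, 32, T, H/2, W/2)
        return self.fc(self.temporal(f).flatten(1))
```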


2021
Author(s): Fakhrul Aniq Hakimi Nasrul ’Alam, Mohd. Ibrahim Shapiai, Uzma Batool, Ahmad Kamal Ramli, Khairil Ashraf Elias

Recognition of human behaviour is critical in video monitoring, human-computer interaction, video comprehension, and virtual reality. The key problem with behaviour recognition in video surveillance is the high degree of variation between and within subjects. Numerous studies have suggested background-insensitive, skeleton-based approaches as a proven detection technique. The present state-of-the-art approaches to skeleton-based action recognition rely primarily on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), both of which take dynamic human skeletons as network input. We chose to handle skeleton data differently, relying solely on the skeleton joint coordinates as input, with each joint position defined in (x, y) coordinates. In this paper, we investigate the incorporation of the Neural Oblivious Decision Ensemble (NODE) into our proposed action classifier network. The skeleton is extracted using a pose estimation technique based on the Residual Network (ResNet), which extracts a 2D skeleton of 18 joints for each detected body. The joint coordinates are stored in a table of rows and columns, where each row represents one set of joint positions, and this structured data is fed into NODE for label prediction. With the proposed network, we obtain 97.5% accuracy on the RealWorld (HAR) dataset. Experimental results show that the proposed network outperforms one of the state-of-the-art approaches by 1.3%. In conclusion, NODE is a promising deep learning technique for structured data analysis compared to its machine learning counterparts, such as the GBDT packages CatBoost and XGBoost.
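
A sketch of the tabular preparation described above. Since NODE implementations vary, a scikit-learn gradient-boosting model stands in for the NODE classifier here, and the (N, 36) table layout is an assumption consistent with 18 (x, y) joints per row.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for NODE

def joints_to_rows(skeletons: np.ndarray) -> np.ndarray:
    """skeletons: (N, 18, 2) array of (x, y) joints, one skeleton per sample.
    Returns an (N, 36) table: each row flattens the 18 joint coordinates."""
    return skeletons.reshape(len(skeletons), -1)

# Toy usage with random data in place of real pose-estimator output.
rng = np.random.default_rng(0)
X = joints_to_rows(rng.random((100, 18, 2)))   # (100, 36) structured table
y = rng.integers(0, 4, size=100)               # action labels
clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict(X[:5]))
```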


Author(s): Dahun Kim, Donghyeon Cho, In So Kweon

Self-supervised tasks such as colorization, inpainting, and jigsaw puzzles have been utilized for visual representation learning on still images when labeled images are limited or entirely absent. Recently, this worthwhile stream of study has extended to the video domain, where the cost of human labeling is even higher. However, most existing methods are still based on 2D CNN architectures that cannot directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task called Space-Time Cubic Puzzles for training 3D CNNs on large-scale video datasets. This task requires a network to arrange permuted 3D spatio-temporal crops. By completing Space-Time Cubic Puzzles, the network learns both the spatial appearance and temporal relations of video frames, which is our final goal. In experiments, we demonstrate that our learned 3D representation transfers well to action recognition tasks and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.
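
A toy version of a puzzle-style pretext task, assuming a 2x2 spatial crop grid and even frame dimensions; the paper's actual crop layout along space and time, and its permutation set, may differ.

```python
# Cut a clip into spatio-temporal crops, shuffle them, and have a network
# predict the permutation index as the self-supervised label.
import itertools
import torch

PERMS = list(itertools.permutations(range(4)))  # 24 possible orderings

def make_cubic_puzzle(clip: torch.Tensor):
    """clip: (C, T, H, W) with even H and W. Returns (crops, label): four
    crops from a 2x2 spatial grid, shuffled by a random permutation whose
    index is the prediction target."""
    c, t, h, w = clip.shape
    crops = [clip[:, :, i * h // 2:(i + 1) * h // 2,
                        j * w // 2:(j + 1) * w // 2]
             for i in range(2) for j in range(2)]
    label = torch.randint(len(PERMS), ()).item()
    shuffled = torch.stack([crops[k] for k in PERMS[label]])  # (4, C, T, h/2, w/2)
    return shuffled, label
```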


2017, Vol 2017, pp. 1-13
Author(s): Shiyang Yan, Yizhang Xia, Jeremy S. Smith, Wenjin Lu, Bailing Zhang

Unconstrained hand detection in still images plays an important role in many hand-related vision problems, for example, hand tracking, gesture analysis, human action recognition, human-machine interaction, and sign language recognition. Although hand detection has been extensively studied for decades, it is still a challenging task with many problems to be tackled. The contributing factors to this complexity include heavy occlusion, low resolution, varying illumination conditions, different hand gestures, and the complex interactions between hands and objects or other hands. In this paper, we propose a multiscale deep learning model for unconstrained hand detection in still images. Deep learning models, and deep convolutional neural networks (CNNs) in particular, have achieved state-of-the-art performance on many vision benchmarks. Building on the region-based CNN (R-CNN) model, we propose a hand detection scheme based on candidate regions generated by a generic region proposal algorithm, followed by multiscale information fusion from the popular VGG16 model. Two benchmark datasets were used to validate the proposed method: the Oxford Hand Detection Dataset and the VIVA Hand Detection Challenge. We achieved state-of-the-art results on the Oxford Hand Detection Dataset and satisfactory performance on the VIVA Hand Detection Challenge.
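
A simplified sketch of per-region multiscale fusion with VGG16 features, using torchvision; the proposal generator is omitted, and the choice of fusion depths (conv3_3 and conv4_3) is an assumption, not necessarily the paper's.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MultiScaleHandClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features
        self.stage1 = feats[:16]              # up to conv3_3 (256 channels)
        self.stage2 = feats[16:23]            # up to conv4_3 (512 channels)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256 + 512, 2)     # hand / not-hand

    def forward(self, region):                # region: (N, 3, 224, 224) crops
        f1 = self.stage1(region)              # shallower, finer features
        f2 = self.stage2(f1)                  # deeper, more semantic features
        fused = torch.cat([self.pool(f1).flatten(1),
                           self.pool(f2).flatten(1)], dim=1)
        return self.fc(fused)                 # classify each candidate region
```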


Author(s): Abdelouahid Ben Tamou, Lahoucine Ballihi, Driss Aboutajdine

In this paper, we present a new approach for human action recognition using 3D skeleton joints recovered from RGB-D cameras. We propose a descriptor based on differences of skeleton joints that combines two characteristics, static posture and overall dynamics, encoding spatial and temporal aspects respectively. We then apply the mean function to these characteristics to form the feature vector, which is used as input to a Random Forest classifier for action classification. Experimental results on two datasets, MSR Action 3D and MSR Daily Activity 3D, demonstrate that our approach is efficient and gives promising results compared to state-of-the-art approaches.
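
A sketch of a joint-difference descriptor under stated assumptions: static posture as intra-frame joint differences, overall dynamics as differences against the first frame, each averaged over time before a Random Forest. The paper's exact difference scheme may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def joint_difference_descriptor(seq: np.ndarray) -> np.ndarray:
    """seq: (T, J, 3) 3D joints over T frames. Returns one feature vector."""
    posture = seq[:, :, None, :] - seq[:, None, :, :]   # (T, J, J, 3) intra-frame
    dynamics = seq - seq[0]                             # (T, J, 3) vs first frame
    return np.concatenate([posture.mean(axis=0).ravel(),  # mean over time
                           dynamics.mean(axis=0).ravel()])

# Toy usage: 20 random sequences of 30 frames x 20 joints, 3 action classes.
rng = np.random.default_rng(0)
X = np.stack([joint_difference_descriptor(rng.random((30, 20, 3)))
              for _ in range(20)])
y = rng.integers(0, 3, size=20)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```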

