A Large Scale RGB-D Dataset for Action Recognition

Author(s):  
Jing Zhang ◽  
Wanqing Li ◽  
Pichao Wang ◽  
Philip Ogunbona ◽  
Song Liu ◽  
...  


2021 ◽  
Vol 11 (10) ◽  
pp. 4426
Author(s):  
Chunyan Ma ◽  
Ji Fan ◽  
Jinghao Yao ◽  
Tao Zhang

Computer vision-based action recognition of basketball players in basketball training and competition has gradually become a research hotspot. However, owing to complex technical actions, diverse backgrounds, and limb occlusion, it remains a challenging task without effective solutions or public dataset benchmarks. In this study, we defined 32 kinds of atomic actions covering most of the complex actions of basketball players and built the NPU RGB+D dataset (a large-scale dataset for basketball action recognition, with RGB image data and depth data captured at Northwestern Polytechnical University) for 12 kinds of actions of 10 professional basketball players, comprising 2169 RGB+D videos and 75 thousand frames, including RGB frame sequences, depth maps, and skeleton coordinates. By extracting spatial features from the distances and angles between the joints of basketball players, we created a new feature-enhanced skeleton-based method for basketball player action recognition, called LSTM-DGCN, which combines a deep graph convolutional network (DGCN) with long short-term memory (LSTM). Many advanced action recognition methods were evaluated on our dataset and compared with our proposed method. The experimental results show that the NPU RGB+D dataset presents a significant challenge to current action recognition algorithms and that our LSTM-DGCN outperforms the state-of-the-art action recognition methods on various evaluation criteria on our dataset. Our action classification and the NPU RGB+D dataset are valuable resources for basketball player action recognition. The feature-enhanced LSTM-DGCN recognizes actions more accurately because the enhanced features improve the expressiveness of the skeleton data.
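As a concrete illustration of the feature-enhancement step described above (distances and angles between joints), the following minimal NumPy sketch computes such features from a skeleton sequence. It is not the authors' code; the array layout and the bone connectivity are illustrative assumptions.

import numpy as np

def enhance_skeleton_features(joints):
    """joints: (T, J, 3) array of 3D joint coordinates over T frames (assumed layout)."""
    T, J, _ = joints.shape
    # Pairwise Euclidean distances between all joints, per frame: (T, J, J)
    diffs = joints[:, :, None, :] - joints[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Angles at intermediate joints of a hypothetical kinematic chain
    chain = [(i - 1, i, i + 1) for i in range(1, J - 1)]  # placeholder bone connectivity
    angles = np.zeros((T, len(chain)))
    for k, (a, b, c) in enumerate(chain):
        v1 = joints[:, a] - joints[:, b]
        v2 = joints[:, c] - joints[:, b]
        cos = np.sum(v1 * v2, axis=-1) / (
            np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1) + 1e-8)
        angles[:, k] = np.arccos(np.clip(cos, -1.0, 1.0))
    # Concatenate raw coordinates, distances, and angles into one feature vector per frame
    return np.concatenate([joints.reshape(T, -1), dists.reshape(T, -1), angles], axis=-1)

The enhanced per-frame feature vectors would then be fed to a spatio-temporal model such as the LSTM-DGCN described above.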


2020 ◽  
Vol 34 (07) ◽  
pp. 12862-12869
Author(s):  
Shiwen Zhang ◽  
Sheng Guo ◽  
Limin Wang ◽  
Weilin Huang ◽  
Matthew Scott

In this work, we propose Knowledge Integration Networks (referred to as KINet) for video action recognition. KINet is capable of aggregating meaningful context features that are of great importance for identifying an action, such as human information and scene context. We design a three-branch architecture consisting of a main branch for action recognition and two auxiliary branches for human parsing and scene recognition, which allow the model to encode knowledge of humans and scenes for action recognition. We explore two pre-trained models as teacher networks to distill human and scene knowledge for training the auxiliary tasks of KINet. Furthermore, we propose a two-level knowledge encoding mechanism that contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into mid-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information. This results in an end-to-end trainable framework in which the three tasks can be trained collaboratively, allowing the model to compute strong context knowledge efficiently. The proposed KINet achieves state-of-the-art performance on the large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8%. We further demonstrate that KINet has strong transfer capability: the Kinetics-trained model obtains 97.8% top-1 accuracy on UCF-101.
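A rough PyTorch sketch of the three-branch idea described above (not the published KINet implementation): a shared stem feeds a main action branch plus human-parsing and scene branches, with a simplified stand-in for the Cross Branch Integration step that folds auxiliary predictions back into the main features. All layer shapes and class counts are placeholders.

import torch
import torch.nn as nn

class ThreeBranchSketch(nn.Module):
    def __init__(self, num_actions=400, num_scenes=365, num_parts=20, dim=256):
        super().__init__()
        self.stem = nn.Conv3d(3, dim, kernel_size=3, padding=1)         # shared video stem
        self.parsing_branch = nn.Conv3d(dim, num_parts, kernel_size=1)  # human-parsing logits (distilled)
        self.scene_branch = nn.Linear(dim, num_scenes)                  # scene logits (distilled)
        self.cbi = nn.Conv3d(dim + num_parts, dim, kernel_size=1)       # simplified CBI-style fusion
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, clip):                        # clip: (N, 3, T, H, W)
        feat = self.stem(clip)                      # (N, dim, T, H, W)
        parsing = self.parsing_branch(feat)         # auxiliary human-parsing prediction
        # Inject auxiliary knowledge into the main branch at a mid-level feature map
        main = self.cbi(torch.cat([feat, parsing], dim=1))
        pooled = main.mean(dim=[2, 3, 4])           # global average pool over T, H, W
        scene = self.scene_branch(feat.mean(dim=[2, 3, 4]))
        return self.action_head(pooled), scene, parsing

The three outputs would be supervised jointly (action labels plus the two distilled teacher signals), mirroring the collaborative training described in the abstract.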


2020 ◽  
Vol 34 (03) ◽  
pp. 2669-2676 ◽  
Author(s):  
Wei Peng ◽  
Xiaopeng Hong ◽  
Haoyu Chen ◽  
Guoying Zhao

Human action recognition from skeleton data, fuelled by the Graph Convolutional Network (GCN) with its powerful capability for modeling non-Euclidean data, has attracted a lot of attention. However, many existing GCNs use a pre-defined graph structure shared across the entire network, which can lose implicit joint correlations, especially in higher-level features. Besides, the mainstream spectral GCN is approximated with a first-order hop, so higher-order connections are not well captured. Addressing all of these issues requires huge effort in designing a better GCN architecture. To tackle these problems, we turn to Neural Architecture Search (NAS) and propose the first automatically designed GCN for this task. Specifically, we explore the spatial-temporal correlations between nodes and build a search space with multiple dynamic graph modules. In addition, we introduce multiple-hop modules, expecting to break the limitation on representational capacity caused by the first-order approximation. Moreover, a corresponding sampling- and memory-efficient evolution strategy is proposed to search in this space. The resulting architecture demonstrates the effectiveness of the higher-order approximation and the layer-wise dynamic graph modules. To evaluate the performance of the searched model, we conduct extensive experiments on two very large-scale skeleton-based action recognition datasets. The results show that our model achieves state-of-the-art results in terms of the given metrics.
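The multiple-hop idea can be sketched as aggregating over powers of the adjacency matrix instead of only first-order neighbours. The PyTorch module below is an illustrative stand-in, not the searched architecture; adj is assumed to be a normalized (J, J) joint adjacency tensor.

import torch
import torch.nn as nn

class MultiHopGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, adj, max_hop=3):
        super().__init__()
        # Pre-compute A^0 ... A^max_hop as fixed aggregation supports
        hops = [torch.linalg.matrix_power(adj, k) for k in range(max_hop + 1)]
        self.register_buffer("supports", torch.stack(hops))        # (K, J, J)
        self.proj = nn.ModuleList(
            nn.Linear(in_dim, out_dim) for _ in range(max_hop + 1))

    def forward(self, x):                   # x: (N, T, J, in_dim) skeleton features
        out = 0
        for k, lin in enumerate(self.proj):
            # Aggregate features from k-hop neighbours, then transform per hop order
            agg = torch.einsum("uv,ntvc->ntuc", self.supports[k], x)
            out = out + lin(agg)
        return torch.relu(out)

In the paper's setting the graph supports are additionally made dynamic and layer-wise; here they are fixed buffers purely for illustration.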


2020 ◽  
Vol 34 (03) ◽  
pp. 2677-2684
Author(s):  
Marjaneh Safaei ◽  
Pooyan Balouchian ◽  
Hassan Foroosh

Action recognition in still images poses a great challenge due to (i) limited available training data and (ii) the absence of temporal information. To address the first challenge, we introduce a dataset for STill image Action Recognition (UCF-STAR), containing over 1M images across 50 different human body-motion action categories. UCF-STAR is the largest dataset in the literature for action recognition in still images. The key characteristics of UCF-STAR include (1) focusing on human body motion rather than relatively static human-object interaction categories, (2) collecting images from the wild to benefit from a varied set of action representations, (3) appending multiple human-annotated labels per image rather than just the action label, and (4) inclusion of a rich, structured, and multi-modal set of metadata for each image. This departs from existing datasets, which typically provide a single annotation for a smaller number of images and categories, with no metadata. UCF-STAR exposes the intrinsic difficulty of action recognition through its realistic scene and action complexity. To benchmark and demonstrate the benefits of UCF-STAR as a large-scale dataset, and to show the role of “latent” motion information in recognizing human actions in still images, we present a novel approach relying on predicting temporal information, yielding higher accuracy on five widely used datasets.
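The multi-label, metadata-rich annotation style described above could be represented by a record along the following lines. The field names are assumptions for illustration, not the released UCF-STAR annotation schema.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class StillImageAnnotation:
    image_id: str
    action_label: str                                        # one of the 50 body-motion categories
    extra_labels: List[str] = field(default_factory=list)    # additional human-annotated labels
    metadata: Dict[str, str] = field(default_factory=dict)   # structured, multi-modal metadata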


Author(s):  
Yu-Hui Wen ◽  
Lin Gao ◽  
Hongbo Fu ◽  
Fang-Lue Zhang ◽  
Shihong Xia

The hierarchical structure and the different semantic roles of joints in the human skeleton convey important information for action recognition. Conventional graph convolution methods for modeling skeleton structure consider only the physically connected neighbors of each joint and joints of the same type, thus failing to capture high-order information. In this work, we propose a novel model with motif-based graph convolution to encode hierarchical spatial structure, and a variable temporal dense block to exploit local temporal information over different ranges of human skeleton sequences. Moreover, we employ a non-local block to capture global dependencies in the temporal domain via an attention mechanism. Our model achieves improvements over the state-of-the-art methods on two large-scale datasets.
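A compact sketch of a non-local (self-attention) block over the temporal dimension, in the generic sense used above; this is an illustrative module, not the authors' exact block, and the feature layout is an assumption.

import torch
import torch.nn as nn

class TemporalNonLocal(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (N, T, dim) per-frame skeleton features
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Attention over all pairs of time steps captures global temporal dependencies
        attn = torch.softmax(q @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)   # (N, T, T)
        return x + self.out(attn @ v)           # residual connection, as in non-local blocks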


Sensors ◽  
2020 ◽  
Vol 20 (11) ◽  
pp. 3305 ◽  
Author(s):  
Huogen Wang ◽  
Zhanjie Song ◽  
Wanqing Li ◽  
Pichao Wang

The paper presents a novel hybrid network for large-scale action recognition from multiple modalities. The network is built upon the proposed weighted dynamic images. It effectively leverages the strengths of the emerging Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based approaches to specifically address the challenges that occur in large-scale action recognition and are not fully dealt with by the state-of-the-art methods. Specifically, the proposed hybrid network consists of a CNN-based component and an RNN-based component. Features extracted by the two components are fused through canonical correlation analysis and then fed to a linear Support Vector Machine (SVM) for classification. The proposed network achieved state-of-the-art results on the ChaLearn LAP IsoGD, NTU RGB+D, and Multi-modal & Multi-view & Interactive (M²I) datasets and outperformed existing methods by a large margin (over 10 percentage points in some cases).
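A minimal sketch of the fusion-and-classification stage described above, assuming the CNN and RNN features have already been extracted as arrays of shape (num_samples, d1) and (num_samples, d2). Function and variable names are placeholders, not the authors' code.

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

def fuse_and_classify(cnn_feats, rnn_feats, labels, n_components=64):
    # Project both feature sets into a shared, maximally correlated subspace (CCA)
    # n_components must not exceed min(d1, d2, num_samples)
    cca = CCA(n_components=n_components)
    cnn_c, rnn_c = cca.fit_transform(cnn_feats, rnn_feats)
    fused = np.concatenate([cnn_c, rnn_c], axis=1)
    # Linear SVM on the fused representation
    clf = LinearSVC()
    clf.fit(fused, labels)
    return cca, clf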


2019 ◽  
Vol 1 (11) ◽  
pp. 530-537 ◽  
Author(s):  
Piotr Antonik ◽  
Nicolas Marsal ◽  
Daniel Brunner ◽  
Damien Rontani

2020 ◽  
Author(s):  
Kai J. Sandbrink ◽  
Pranav Mamidanna ◽  
Claudio Michaelis ◽  
Mackenzie Weygandt Mathis ◽  
Matthias Bethge ◽  
...  

Biological motor control is versatile and efficient. Muscles are flexible and undergo continuous changes, requiring distributed adaptive control mechanisms. How proprioception solves this problem in the brain is unknown. Here we pursue a task-driven modeling approach that has provided important insights into other sensory systems. However, unlike for vision and audition, where large annotated datasets of raw images or sound are readily available, data of relevant proprioceptive stimuli are not. We generated a large-scale dataset of human arm trajectories as the hand traces the alphabet in 3D space, and then used a musculoskeletal model to derive the muscle spindle firing rates during these movements. We propose an action recognition task that allows training of hierarchical models to classify the character identity from the spindle firing patterns. Artificial neural networks could robustly solve this task, and the networks' units show directional movement tuning akin to neurons in the primate somatosensory cortex. The same architectures with random weights also show similar kinematic feature tuning, but do not reproduce the diversity of preferred directional tuning, nor do they have invariant tuning across 3D space. Taken together, our model is the first to link tuning properties in the proprioceptive system to the behavioral level.

Highlights: We provide a normative approach to derive neural tuning of proprioceptive features from behaviorally defined objectives. We propose a method for creating a scalable muscle spindle dataset based on kinematic data and define an action recognition task as a benchmark. Hierarchical neural networks solve the recognition task from muscle spindle inputs. Individual neural network units in middle layers resemble neurons in primate somatosensory cortex and make predictions for neurons along the proprioceptive pathway.
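A toy sketch of the classification task described above: a small hierarchical (convolutional) network mapping simulated muscle-spindle firing-rate sequences to character labels. The input shape, number of spindle channels, and layer sizes are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class SpindleToCharacter(nn.Module):
    def __init__(self, n_spindles=25, n_chars=26):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_spindles, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                 # pool over time
            nn.Flatten(),
            nn.Linear(128, n_chars))                 # one logit per traced character

    def forward(self, firing_rates):                 # firing_rates: (N, n_spindles, T)
        return self.net(firing_rates)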


Sensors ◽  
2021 ◽  
Vol 21 (16) ◽  
pp. 5613
Author(s):  
Amirreza Farnoosh ◽  
Zhouping Wang ◽  
Shaotong Zhu ◽  
Sarah Ostadabbas

We introduce a generative Bayesian switching dynamical model for action recognition in 3D skeletal data. Our model encodes highly correlated skeletal data into a few sets of low-dimensional switching temporal processes and from there decodes to the motion data and their associated action labels. We parameterize these temporal processes with a switching deep autoregressive prior to accommodate both multimodal and higher-order nonlinear inter-dependencies. This results in a dynamical deep generative latent model that parses meaningful intrinsic states in skeletal dynamics and enables action recognition. These sequences of states provide visual and quantitative interpretations of the motion primitives that give rise to each action class, which have not been explored previously. In contrast to previous works, which often overlook temporal dynamics, our method explicitly models temporal transitions and is generative. Our experiments on two large-scale 3D skeletal datasets substantiate the superior performance of our model in comparison with the state-of-the-art methods. Specifically, our method achieved 6.3% higher action classification accuracy (by incorporating a dynamical generative framework) and 3.5% better predictive error (by employing a nonlinear second-order dynamical transition model) when compared with the best-performing competitors.
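An illustrative generative sketch, in NumPy, of a switching autoregressive latent model in the spirit of the description above: a discrete switching state selects which (here, simple linear first-order) dynamics generate the low-dimensional latent trajectory. All dimensions and parameters are made up for illustration; the paper's model uses a switching deep autoregressive prior rather than linear dynamics.

import numpy as np

def sample_switching_ar(T=100, K=3, D=4, seed=0):
    rng = np.random.default_rng(seed)
    trans = rng.dirichlet(np.ones(K), size=K)        # switching-state transition matrix (K, K)
    A = rng.normal(scale=0.3, size=(K, D, D))        # per-state latent dynamics
    z, x = np.zeros(T, dtype=int), np.zeros((T, D))
    x[0] = rng.normal(size=D)
    for t in range(1, T):
        z[t] = rng.choice(K, p=trans[z[t - 1]])                  # sample next discrete state
        x[t] = A[z[t]] @ x[t - 1] + 0.1 * rng.normal(size=D)     # latent dynamics plus noise
    return z, x                                      # discrete states and latent trajectory

In the full model, the latent trajectory would additionally be decoded back to skeletal motion and action labels, and inference would recover the state sequence from observed data.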

