Global Co-Occurrence Feature and Local Spatial Feature Learning for Skeleton-Based Action Recognition

Entropy ◽  
2020 ◽  
Vol 22 (10) ◽  
pp. 1135
Author(s):  
Jun Xie ◽  
Wentian Xin ◽  
Ruyi Liu ◽  
Qiguang Miao ◽  
Lijie Sheng ◽  
...  

Recent progress on skeleton-based action recognition has been substantial, benefiting mostly from the explosive development of Graph Convolutional Networks (GCNs). However, prevailing GCN-based methods may not effectively capture the global co-occurrence features among joints or the local spatial structure features composed of adjacent bones. They also ignore the effect of channels unrelated to action recognition on model performance. To address these issues, we propose a Global Co-occurrence feature and Local Spatial feature learning model (GCLS) consisting of two branches. The first, a Vertex Attention Mechanism branch (VAM-branch), effectively captures the global co-occurrence features of actions; the second, a Cross-kernel Feature Fusion branch (CFF-branch), extracts local spatial structure features composed of adjacent bones and suppresses the channels unrelated to action recognition. Extensive experiments on two large-scale datasets, NTU-RGB+D and Kinetics, demonstrate that GCLS achieves the best performance compared to mainstream approaches.
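To make the vertex-attention idea concrete, here is a minimal sketch of an attention block that reweights each joint (vertex) by a learned global score, assuming skeleton features of shape (N, C, T, V) for batch, channels, frames, and joints. All names and design details are illustrative assumptions, not the authors' GCLS code.

```python
import torch
import torch.nn as nn

class VertexAttention(nn.Module):
    """Reweights each joint (vertex) by a learned global attention score."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 1),
        )

    def forward(self, x):                             # x: (N, C, T, V)
        ctx = x.mean(dim=2)                           # pool over time -> (N, C, V)
        scores = self.fc(ctx.transpose(1, 2))         # (N, V, 1)
        attn = torch.softmax(scores, dim=1)           # attention over joints
        return x * attn.transpose(1, 2).unsqueeze(2)  # broadcast over C and T

x = torch.randn(8, 64, 300, 25)         # e.g., 25 joints as in NTU-RGB+D
print(VertexAttention(64)(x).shape)     # torch.Size([8, 64, 300, 25])
```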

Electronics ◽  
2021 ◽  
Vol 10 (18) ◽  
pp. 2198
Author(s):  
Chaoyue Li ◽  
Lian Zou ◽  
Cien Fan ◽  
Hao Jiang ◽  
Yifeng Liu

Graph convolutional networks (GCNs), which model human actions as a series of spatial-temporal graphs, have recently achieved superior performance in skeleton-based action recognition. However, existing methods mostly use the physical connections of joints to construct a spatial graph, resulting in limited topological information about the human skeleton. In addition, action features in the time domain have not been fully explored. To better extract spatial-temporal features, we propose a multi-stage attention-enhanced sparse graph convolutional network (MS-ASGCN) for skeleton-based action recognition. To capture more abundant joint dependencies, we propose a new strategy for constructing skeleton graphs, which simulates bidirectional information flows between neighboring joints and pays greater attention to the information transmission between sparse joints. In addition, a part attention mechanism is proposed to learn the weight of each part and enhance part-level feature learning. We introduce multiple streams of different stages and merge them in specific layers of the network to further improve the performance of the model. Our model is finally verified on two large-scale datasets, namely NTU-RGB+D and Skeleton-Kinetics. Experiments demonstrate that the proposed MS-ASGCN outperforms previous state-of-the-art methods on both datasets.
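A small sketch of the graph-construction step described above: building a degree-normalized skeleton adjacency matrix with bidirectional edges and self-loops. The five-joint chain here is a toy edge list, not the NTU skeleton, and the normalization choice is an assumption.

```python
import numpy as np

def build_adjacency(num_joints, edges):
    A = np.eye(num_joints, dtype=np.float32)   # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0                # bidirectional information flow
    D = np.diag(1.0 / A.sum(axis=1))           # degree normalization
    return D @ A

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]       # toy 5-joint chain
print(build_adjacency(5, edges))
```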


Author(s):  
Yu-Hui Wen ◽  
Lin Gao ◽  
Hongbo Fu ◽  
Fang-Lue Zhang ◽  
Shihong Xia

The hierarchical structure and the different semantic roles of joints in the human skeleton convey important information for action recognition. Conventional graph convolution methods for modeling skeleton structure consider only the physically connected neighbors of each joint and the joints of the same type, thus failing to capture high-order information. In this work, we propose a novel model with motif-based graph convolution to encode hierarchical spatial structure, and a variable temporal dense block to exploit local temporal information over different ranges of human skeleton sequences. Moreover, we employ a non-local block to capture global dependencies in the temporal domain via an attention mechanism. Our model achieves improvements over the state-of-the-art methods on two large-scale datasets.
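An illustrative non-local block over the temporal axis, assuming features of shape (N, C, T); this is a generic self-attention sketch of the mechanism named above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TemporalNonLocal(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv1d(channels, channels // 2, 1)
        self.phi = nn.Conv1d(channels, channels // 2, 1)
        self.g = nn.Conv1d(channels, channels // 2, 1)
        self.out = nn.Conv1d(channels // 2, channels, 1)

    def forward(self, x):                        # x: (N, C, T)
        q = self.theta(x).transpose(1, 2)        # (N, T, C/2)
        k = self.phi(x)                          # (N, C/2, T)
        v = self.g(x).transpose(1, 2)            # (N, T, C/2)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (N, T, T)
        y = (attn @ v).transpose(1, 2)           # (N, C/2, T)
        return x + self.out(y)                   # residual connection

x = torch.randn(4, 64, 300)
print(TemporalNonLocal(64)(x).shape)             # torch.Size([4, 64, 300])
```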


2021 ◽  
Vol 17 (11) ◽  
pp. 155014772110570
Author(s):  
Yiting Li ◽  
Liyuan Sun ◽  
Jianping Gou ◽  
Lan Du ◽  
Weihua Ou

Deep neural networks have achieved great success in a variety of applications, such as self-driving cars and intelligent robotics. Meanwhile, knowledge distillation has received increasing attention as an effective model compression technique for training very efficient deep models. The performance of the student network obtained through knowledge distillation heavily depends on whether the transfer of the teacher's knowledge can effectively guide the student training. However, most existing knowledge distillation schemes require a large teacher network pre-trained on large-scale data sets, which can increase the difficulty of knowledge distillation in different applications. In this article, we propose a feature fusion-based collaborative learning approach for knowledge distillation. Specifically, during knowledge distillation, it enables networks to learn from each other using the feature/response-based knowledge in different network layers. We concatenate the features learned by the teacher and the student networks to obtain a more representative feature map for knowledge transfer. In addition, we introduce a network regularization method to further improve model performance by providing positive knowledge during training. Experiments and ablation studies on two widely used data sets demonstrate that the proposed method, feature fusion-based collaborative learning, significantly outperforms recent state-of-the-art knowledge distillation methods.
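A hedged sketch of the two ingredients named above: a standard response-based distillation loss (cross-entropy on labels plus KL divergence to the softened teacher), and a fusion of concatenated teacher/student feature maps. The shapes, temperature, and 1x1 fusion head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: CE on hard labels + KL to softened teacher logits."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(16, 10)
teacher_logits = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
print(distillation_loss(student_logits, teacher_logits, labels))

# Fused representation from concatenated teacher/student feature maps.
f_teacher = torch.randn(16, 256, 8, 8)
f_student = torch.randn(16, 128, 8, 8)
fusion_head = nn.Conv2d(256 + 128, 128, kernel_size=1)    # hypothetical head
fused = fusion_head(torch.cat([f_teacher, f_student], dim=1))
print(fused.shape)                                        # (16, 128, 8, 8)
```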


2020 ◽  
Vol 34 (04) ◽  
pp. 5800-5809 ◽  
Author(s):  
Henrique Siqueira ◽  
Sven Magg ◽  
Stefan Wermter

Ensemble methods, traditionally built with independently trained, de-correlated models, have proven efficient at reducing the remaining residual generalization error, resulting in robust and accurate models for real-world applications. In the context of deep learning, however, training an ensemble of deep networks is costly and generates high redundancy, which is inefficient. In this paper, we present experiments on Ensembles with Shared Representations (ESRs) based on convolutional networks to demonstrate, quantitatively and qualitatively, their data processing efficiency and scalability to large-scale datasets of facial expressions. We show that redundancy and computational load can be dramatically reduced by varying the branching level of the ESR without loss of diversity and generalization power, both of which are important for ensemble performance. Experiments on large-scale datasets suggest that ESRs reduce the remaining residual generalization error on the AffectNet and FER+ datasets, reach human-level performance, and outperform state-of-the-art methods on facial expression recognition in the wild using emotion and affect concepts.
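A minimal sketch of the shared-representation idea: one convolutional trunk feeds several lightweight branches whose predictions are averaged, so early-layer computation is shared rather than duplicated across ensemble members. Layer sizes are illustrative, not the ESR architecture.

```python
import torch
import torch.nn as nn

class SharedEnsemble(nn.Module):
    def __init__(self, num_classes=8, num_branches=4):
        super().__init__()
        self.trunk = nn.Sequential(                  # shared representation
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.branches = nn.ModuleList([              # independent heads
            nn.Sequential(
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, num_classes),
            )
            for _ in range(num_branches)
        ])

    def forward(self, x):
        h = self.trunk(x)                            # computed once
        logits = torch.stack([b(h) for b in self.branches])
        return logits.mean(dim=0)                    # ensemble average

print(SharedEnsemble()(torch.randn(2, 1, 48, 48)).shape)  # torch.Size([2, 8])
```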


Sensors ◽  
2020 ◽  
Vol 20 (12) ◽  
pp. 3499 ◽  
Author(s):  
Wensong Chan ◽  
Zhiqiang Tian ◽  
Yang Wu

Skeleton-based action recognition has achieved great advances with the development of graph convolutional networks (GCNs). Many existing GCN-based models use only a fixed hand-crafted adjacency matrix to describe the connections between human body joints. This omits the important implicit connections between joints, which contain discriminative information for different actions. In this paper, we propose an action-specific graph convolutional module, which is able to extract the implicit connections and properly balance them for each action. In addition, to filter out useless and redundant information in the temporal dimension, we propose a simple yet effective operation named gated temporal convolution. These two major novelties ensure the superiority of our proposed method, as demonstrated on three large-scale public datasets (NTU-RGB+D, Kinetics, and NTU-RGB+D 120) and in detailed ablation studies.
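An illustrative gated temporal convolution: one convolution produces features along the time axis and a parallel convolution produces a sigmoid gate that filters them. The kernel size and shapes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GatedTemporalConv(nn.Module):
    def __init__(self, channels, kernel_size=9):
        super().__init__()
        pad = (kernel_size // 2, 0)                  # pad only the time axis
        self.feat = nn.Conv2d(channels, channels, (kernel_size, 1), padding=pad)
        self.gate = nn.Conv2d(channels, channels, (kernel_size, 1), padding=pad)

    def forward(self, x):                            # x: (N, C, T, V)
        return self.feat(x) * torch.sigmoid(self.gate(x))

x = torch.randn(4, 64, 300, 25)
print(GatedTemporalConv(64)(x).shape)                # torch.Size([4, 64, 300, 25])
```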


Author(s):  
Chao Li ◽  
Qiaoyong Zhong ◽  
Di Xie ◽  
Shiliang Pu

Skeleton-based human action recognition has recently drawn increasing attention with the availability of large-scale skeleton datasets. The most crucial factors for this task lie in two aspects: the intra-frame representation of joint co-occurrences and the inter-frame representation of the skeletons' temporal evolution. In this paper, we propose an end-to-end convolutional co-occurrence feature learning framework. The co-occurrence features are learned with a hierarchical methodology, in which different levels of contextual information are aggregated gradually. First, point-level information of each joint is encoded independently; the per-joint features are then assembled into semantic representations in both the spatial and temporal domains. Specifically, we introduce a global spatial aggregation scheme, which is able to learn superior joint co-occurrence features over local aggregation. In addition, raw skeleton coordinates as well as their temporal differences are integrated in a two-stream paradigm. Experiments show that our approach consistently outperforms other state-of-the-art methods on action recognition and detection benchmarks such as NTU RGB+D, SBU Kinect Interaction, and PKU-MMD.
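One common reading of global spatial aggregation, sketched below under that assumption: after point-level encoding, the joint axis is moved into the channel dimension, so every subsequent convolution mixes information from all joints at once rather than only from spatial neighbors. Shapes and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

N, C, T, V = 4, 32, 64, 25            # batch, channels, frames, joints
x = torch.randn(N, C, T, V)           # point-level features per joint

# Local aggregation: a convolution mixes only neighboring joints.
local = nn.Conv2d(C, 64, kernel_size=3, padding=1)(x)

# Global aggregation: joints become channels, so every output channel
# sees all joints jointly (co-occurrence across the whole skeleton).
x_global = x.permute(0, 3, 2, 1)      # (N, V, T, C)
global_feat = nn.Conv2d(V, 64, kernel_size=3, padding=1)(x_global)

print(local.shape, global_feat.shape)
```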


Author(s):  
Guoyong Cai ◽  
Yumeng Cai

Deep-learning-based action recognition for short videos has made significant progress; most of the proposed methods are based on 3D convolutional neural networks (3D CNNs) and two-stream architectures. However, 3D CNNs have a large number of parameters, and two-stream networks cannot learn features well enough. This work aims to build a network that learns better features with fewer parameters. A Hierarchy Spatial-Temporal Transformer model is proposed, which is based on a two-stream architecture and hierarchical inference. The model is divided into three modules: a Hierarchy Residual Reformer, a Spatial Attention Module, and a Temporal-Spatial Attention Module. In the model, each frame's image is first transformed into a spatial visual feature map. Second, spatial feature learning is performed by spatial attention to generate attention spatial feature maps. Finally, the generated attention spatial feature map is fused with temporal feature vectors to generate a final representation for classification. Experimental results on the HMDB51 and UCF101 datasets show that the proposed model achieves better accuracy than the state-of-the-art baseline models.
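A minimal spatial-attention sketch of the step described above: a 1x1 convolution produces a per-location weight map, normalized over spatial positions, that reweights each frame's visual features. This is a generic illustration, not the proposed module.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                            # x: (N, C, H, W)
        n, c, h, w = x.shape
        attn = torch.softmax(
            self.score(x).view(n, 1, h * w), dim=-1  # normalize over locations
        ).view(n, 1, h, w)
        return x * attn                              # attention feature map

x = torch.randn(2, 256, 14, 14)                      # one frame's feature map
print(SpatialAttention(256)(x).shape)                # torch.Size([2, 256, 14, 14])
```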


2020 ◽  
Vol 10 (4) ◽  
pp. 1482 ◽  
Author(s):  
Jiuqing Dong ◽  
Yongbin Gao ◽  
Hyo Jong Lee ◽  
Heng Zhou ◽  
Yifan Yao ◽  
...  

Skeleton-based action recognition is a widely used task in action-related research because of its compact representation and its invariance to human appearance and illumination. It can also effectively improve the robustness of action recognition. Graph convolutional networks have been applied to such skeletal data to recognize actions. Recent studies have shown that graph convolutional neural networks work well in the action recognition task using the spatial and temporal features of skeleton data. Prevalent methods extract these spatial and temporal features by relying purely on a deep network to learn from primitive 3D joint positions. In this paper, we propose a novel action recognition method applying high-order spatial and temporal features from skeleton data, such as velocity features, acceleration features, and relative distances between 3D joints. A multi-stream feature fusion method is adopted to fuse these high-order features. Extensive experiments on two large and challenging datasets, NTU-RGB+D and NTU-RGB+D 120, indicate that our model achieves state-of-the-art performance.
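A sketch of the high-order features named above, computed from raw joint positions of shape (T, V, 3): velocity and acceleration as first and second temporal differences, and pairwise relative distances from joint coordinates. The shapes and the lack of boundary padding are assumptions for illustration.

```python
import torch

T, V = 300, 25
pos = torch.randn(T, V, 3)                       # 3D joint positions per frame

velocity = pos[1:] - pos[:-1]                    # first difference: (T-1, V, 3)
acceleration = velocity[1:] - velocity[:-1]      # second difference: (T-2, V, 3)

# Relative distance between every pair of joints in each frame: (T, V, V)
rel_dist = torch.cdist(pos, pos)

print(velocity.shape, acceleration.shape, rel_dist.shape)
```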

