Global Co-Occurrence Feature and Local Spatial Feature Learning for Skeleton-Based Action Recognition

Entropy ◽  
2020 ◽  
Vol 22 (10) ◽  
pp. 1135
Author(s):  
Jun Xie ◽  
Wentian Xin ◽  
Ruyi Liu ◽  
Qiguang Miao ◽  
Lijie Sheng ◽  
...  

Recent progress on skeleton-based action recognition has been substantial, benefiting mostly from the explosive development of Graph Convolutional Networks (GCNs). However, prevailing GCN-based methods may not effectively capture the global co-occurrence features among joints or the local spatial structure features composed of adjacent bones. They also ignore the effect of channels unrelated to action recognition on model performance. To address these issues, we propose a Global Co-occurrence feature and Local Spatial feature learning model (GCLS) consisting of two branches. The first, a Vertex Attention Mechanism branch (VAM-branch), effectively captures the global co-occurrence features of actions; the second, a Cross-kernel Feature Fusion branch (CFF-branch), extracts local spatial structure features composed of adjacent bones and suppresses the channels unrelated to action recognition. Extensive experiments on two large-scale datasets, NTU-RGB+D and Kinetics, demonstrate that GCLS achieves the best performance compared to mainstream approaches.
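To make the vertex-attention idea concrete, here is a minimal sketch of an attention block that reweights each joint (vertex) by a learned global score, assuming skeleton features of shape (N, C, T, V) for batch, channels, frames, and joints. All names and design details are illustrative assumptions, not the authors' GCLS code.

```python
import torch
import torch.nn as nn

class VertexAttention(nn.Module):
    """Reweights each joint (vertex) by a learned global attention score."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 1),
        )

    def forward(self, x):                             # x: (N, C, T, V)
        ctx = x.mean(dim=2)                           # pool over time -> (N, C, V)
        scores = self.fc(ctx.transpose(1, 2))         # (N, V, 1)
        attn = torch.softmax(scores, dim=1)           # attention over joints
        return x * attn.transpose(1, 2).unsqueeze(2)  # broadcast over C and T

x = torch.randn(8, 64, 300, 25)         # e.g., 25 joints as in NTU-RGB+D
print(VertexAttention(64)(x).shape)     # torch.Size([8, 64, 300, 25])
```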

Electronics ◽  
2021 ◽  
Vol 10 (18) ◽  
pp. 2198
Author(s):  
Chaoyue Li ◽  
Lian Zou ◽  
Cien Fan ◽  
Hao Jiang ◽  
Yifeng Liu

Graph convolutional networks (GCNs), which model human actions as a series of spatial-temporal graphs, have recently achieved superior performance in skeleton-based action recognition. However, existing methods mostly use the physical connections of joints to construct a spatial graph, resulting in limited topological information about the human skeleton. In addition, action features in the time domain have not been fully explored. To better extract spatial-temporal features, we propose a multi-stage attention-enhanced sparse graph convolutional network (MS-ASGCN) for skeleton-based action recognition. To capture more abundant joint dependencies, we propose a new strategy for constructing skeleton graphs, which simulates bidirectional information flows between neighboring joints and pays greater attention to the information transmission between sparse joints. In addition, a part attention mechanism is proposed to learn the weight of each part and enhance part-level feature learning. We introduce multiple streams of different stages and merge them in specific layers of the network to further improve the performance of the model. Our model is finally verified on two large-scale datasets, namely NTU-RGB+D and Skeleton-Kinetics. Experiments demonstrate that the proposed MS-ASGCN outperforms previous state-of-the-art methods on both datasets.
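A small sketch of the graph-construction step described above: building a degree-normalized skeleton adjacency matrix with bidirectional edges and self-loops. The five-joint chain here is a toy edge list, not the NTU skeleton, and the normalization choice is an assumption.

```python
import numpy as np

def build_adjacency(num_joints, edges):
    A = np.eye(num_joints, dtype=np.float32)   # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0                # bidirectional information flow
    D = np.diag(1.0 / A.sum(axis=1))           # degree normalization
    return D @ A

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]       # toy 5-joint chain
print(build_adjacency(5, edges))
```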


Author(s):  
Yu-Hui Wen ◽  
Lin Gao ◽  
Hongbo Fu ◽  
Fang-Lue Zhang ◽  
Shihong Xia

The hierarchical structure and the different semantic roles of joints in the human skeleton convey important information for action recognition. Conventional graph convolution methods for modeling skeleton structure consider only the physically connected neighbors of each joint and the joints of the same type, thus failing to capture high-order information. In this work, we propose a novel model with motif-based graph convolution to encode hierarchical spatial structure, and a variable temporal dense block to exploit local temporal information over different ranges of human skeleton sequences. Moreover, we employ a non-local block to capture global dependencies in the temporal domain via an attention mechanism. Our model achieves improvements over the state-of-the-art methods on two large-scale datasets.
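An illustrative non-local block over the temporal axis, assuming features of shape (N, C, T); this is a generic self-attention sketch of the mechanism named above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TemporalNonLocal(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv1d(channels, channels // 2, 1)
        self.phi = nn.Conv1d(channels, channels // 2, 1)
        self.g = nn.Conv1d(channels, channels // 2, 1)
        self.out = nn.Conv1d(channels // 2, channels, 1)

    def forward(self, x):                        # x: (N, C, T)
        q = self.theta(x).transpose(1, 2)        # (N, T, C/2)
        k = self.phi(x)                          # (N, C/2, T)
        v = self.g(x).transpose(1, 2)            # (N, T, C/2)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (N, T, T)
        y = (attn @ v).transpose(1, 2)           # (N, C/2, T)
        return x + self.out(y)                   # residual connection

x = torch.randn(4, 64, 300)
print(TemporalNonLocal(64)(x).shape)             # torch.Size([4, 64, 300])
```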


2021 ◽  
Vol 17 (11) ◽  
pp. 155014772110570
Author(s):  
Yiting Li ◽  
Liyuan Sun ◽  
Jianping Gou ◽  
Lan Du ◽  
Weihua Ou

Deep neural networks have achieved great success in a variety of applications, such as self-driving cars and intelligent robotics. Meanwhile, knowledge distillation has received increasing attention as an effective model compression technique for training very efficient deep models. The performance of the student network obtained through knowledge distillation heavily depends on whether the transfer of the teacher's knowledge can effectively guide the student training. However, most existing knowledge distillation schemes require a large teacher network pre-trained on large-scale data sets, which can increase the difficulty of knowledge distillation in different applications. In this article, we propose a feature fusion-based collaborative learning approach for knowledge distillation. Specifically, during knowledge distillation, it enables networks to learn from each other using the feature/response-based knowledge in different network layers. We concatenate the features learned by the teacher and the student networks to obtain a more representative feature map for knowledge transfer. In addition, we introduce a network regularization method to further improve model performance by providing positive knowledge during training. Experiments and ablation studies on two widely used data sets demonstrate that the proposed method, feature fusion-based collaborative learning, significantly outperforms recent state-of-the-art knowledge distillation methods.
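A hedged sketch of the two ingredients named above: a standard response-based distillation loss (cross-entropy on labels plus KL divergence to the softened teacher), and a fusion of concatenated teacher/student feature maps. The shapes, temperature, and 1x1 fusion head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: CE on hard labels + KL to softened teacher logits."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(16, 10)
teacher_logits = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
print(distillation_loss(student_logits, teacher_logits, labels))

# Fused representation from concatenated teacher/student feature maps.
f_teacher = torch.randn(16, 256, 8, 8)
f_student = torch.randn(16, 128, 8, 8)
fusion_head = nn.Conv2d(256 + 128, 128, kernel_size=1)    # hypothetical head
fused = fusion_head(torch.cat([f_teacher, f_student], dim=1))
print(fused.shape)                                        # (16, 128, 8, 8)
```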


2020 ◽  
Vol 34 (04) ◽  
pp. 5800-5809 ◽  
Author(s):  
Henrique Siqueira ◽  
Sven Magg ◽  
Stefan Wermter

Ensemble methods, traditionally built with independently trained, de-correlated models, have proven efficient at reducing the remaining residual generalization error, resulting in robust and accurate models for real-world applications. In the context of deep learning, however, training an ensemble of deep networks is costly and generates high redundancy, which is inefficient. In this paper, we present experiments on Ensembles with Shared Representations (ESRs) based on convolutional networks to demonstrate, quantitatively and qualitatively, their data processing efficiency and scalability to large-scale datasets of facial expressions. We show that redundancy and computational load can be dramatically reduced by varying the branching level of the ESR without loss of diversity and generalization power, both of which are important for ensemble performance. Experiments on large-scale datasets suggest that ESRs reduce the remaining residual generalization error on the AffectNet and FER+ datasets, reach human-level performance, and outperform state-of-the-art methods on facial expression recognition in the wild using emotion and affect concepts.
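A minimal sketch of the shared-representation idea: one convolutional trunk feeds several lightweight branches whose predictions are averaged, so early-layer computation is shared rather than duplicated across ensemble members. Layer sizes are illustrative, not the ESR architecture.

```python
import torch
import torch.nn as nn

class SharedEnsemble(nn.Module):
    def __init__(self, num_classes=8, num_branches=4):
        super().__init__()
        self.trunk = nn.Sequential(                  # shared representation
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.branches = nn.ModuleList([              # independent heads
            nn.Sequential(
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, num_classes),
            )
            for _ in range(num_branches)
        ])

    def forward(self, x):
        h = self.trunk(x)                            # computed once
        logits = torch.stack([b(h) for b in self.branches])
        return logits.mean(dim=0)                    # ensemble average

print(SharedEnsemble()(torch.randn(2, 1, 48, 48)).shape)  # torch.Size([2, 8])
```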


Sensors ◽  
2020 ◽  
Vol 20 (12) ◽  
pp. 3499 ◽  
Author(s):  
Wensong Chan ◽  
Zhiqiang Tian ◽  
Yang Wu

Skeleton-based action recognition has achieved great advances with the development of graph convolutional networks (GCNs). Many existing GCN-based models use only a fixed hand-crafted adjacency matrix to describe the connections between human body joints. This omits the important implicit connections between joints, which contain discriminative information for different actions. In this paper, we propose an action-specific graph convolutional module, which is able to extract the implicit connections and properly balance them for each action. In addition, to filter out useless and redundant information in the temporal dimension, we propose a simple yet effective operation named gated temporal convolution. These two major novelties ensure the superiority of our proposed method, as demonstrated on three large-scale public datasets (NTU-RGB+D, Kinetics, and NTU-RGB+D 120) and in detailed ablation studies.
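An illustrative gated temporal convolution: one convolution produces features along the time axis and a parallel convolution produces a sigmoid gate that filters them. The kernel size and shapes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GatedTemporalConv(nn.Module):
    def __init__(self, channels, kernel_size=9):
        super().__init__()
        pad = (kernel_size // 2, 0)                  # pad only the time axis
        self.feat = nn.Conv2d(channels, channels, (kernel_size, 1), padding=pad)
        self.gate = nn.Conv2d(channels, channels, (kernel_size, 1), padding=pad)

    def forward(self, x):                            # x: (N, C, T, V)
        return self.feat(x) * torch.sigmoid(self.gate(x))

x = torch.randn(4, 64, 300, 25)
print(GatedTemporalConv(64)(x).shape)                # torch.Size([4, 64, 300, 25])
```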


Author(s):  
Chao Li ◽  
Qiaoyong Zhong ◽  
Di Xie ◽  
Shiliang Pu

Skeleton-based human action recognition has recently drawn increasing attention with the availability of large-scale skeleton datasets. The most crucial factors for this task lie in two aspects: the intra-frame representation of joint co-occurrences and the inter-frame representation of the skeletons' temporal evolution. In this paper, we propose an end-to-end convolutional co-occurrence feature learning framework. The co-occurrence features are learned with a hierarchical methodology, in which different levels of contextual information are aggregated gradually. First, point-level information of each joint is encoded independently; the per-joint features are then assembled into semantic representations in both the spatial and temporal domains. Specifically, we introduce a global spatial aggregation scheme, which is able to learn superior joint co-occurrence features over local aggregation. In addition, raw skeleton coordinates as well as their temporal differences are integrated in a two-stream paradigm. Experiments show that our approach consistently outperforms other state-of-the-art methods on action recognition and detection benchmarks such as NTU RGB+D, SBU Kinect Interaction, and PKU-MMD.
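One common reading of global spatial aggregation, sketched below under that assumption: after point-level encoding, the joint axis is moved into the channel dimension, so every subsequent convolution mixes information from all joints at once rather than only from spatial neighbors. Shapes and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

N, C, T, V = 4, 32, 64, 25            # batch, channels, frames, joints
x = torch.randn(N, C, T, V)           # point-level features per joint

# Local aggregation: a convolution mixes only neighboring joints.
local = nn.Conv2d(C, 64, kernel_size=3, padding=1)(x)

# Global aggregation: joints become channels, so every output channel
# sees all joints jointly (co-occurrence across the whole skeleton).
x_global = x.permute(0, 3, 2, 1)      # (N, V, T, C)
global_feat = nn.Conv2d(V, 64, kernel_size=3, padding=1)(x_global)

print(local.shape, global_feat.shape)
```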


Author(s):  
Guoyong Cai ◽  
Yumeng Cai

Deep-learning-based action recognition for short videos has made significant progress; most of the proposed methods are based on 3D convolutional neural networks (3D CNNs) and two-stream architectures. However, 3D CNNs have a large number of parameters, and two-stream networks cannot learn features well enough. This work aims to build a network that learns better features with fewer parameters. A Hierarchy Spatial-Temporal Transformer model is proposed, which is based on a two-stream architecture and hierarchical inference. The model is divided into three modules: a Hierarchy Residual Reformer, a Spatial Attention Module, and a Temporal-Spatial Attention Module. In the model, each frame's image is first transformed into a spatial visual feature map. Second, spatial feature learning is performed by spatial attention to generate attention spatial feature maps. Finally, the generated attention spatial feature map is fused with temporal feature vectors to generate a final representation for classification. Experimental results on the HMDB51 and UCF101 datasets show that the proposed model achieves better accuracy than the state-of-the-art baseline models.
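A minimal spatial-attention sketch of the step described above: a 1x1 convolution produces a per-location weight map, normalized over spatial positions, that reweights each frame's visual features. This is a generic illustration, not the proposed module.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                            # x: (N, C, H, W)
        n, c, h, w = x.shape
        attn = torch.softmax(
            self.score(x).view(n, 1, h * w), dim=-1  # normalize over locations
        ).view(n, 1, h, w)
        return x * attn                              # attention feature map

x = torch.randn(2, 256, 14, 14)                      # one frame's feature map
print(SpatialAttention(256)(x).shape)                # torch.Size([2, 256, 14, 14])
```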


2020 ◽  
Vol 10 (4) ◽  
pp. 1482 ◽  
Author(s):  
Jiuqing Dong ◽  
Yongbin Gao ◽  
Hyo Jong Lee ◽  
Heng Zhou ◽  
Yifan Yao ◽  
...  

Skeleton-based action recognition is a widely used task in action-related research because of its compact representation and its invariance to human appearance and illumination. It can also effectively improve the robustness of action recognition. Graph convolutional networks have been applied to such skeletal data to recognize actions. Recent studies have shown that graph convolutional neural networks work well in the action recognition task using the spatial and temporal features of skeleton data. Prevalent methods extract these spatial and temporal features by relying purely on a deep network to learn from primitive 3D joint positions. In this paper, we propose a novel action recognition method applying high-order spatial and temporal features from skeleton data, such as velocity features, acceleration features, and relative distances between 3D joints. A multi-stream feature fusion method is adopted to fuse these high-order features. Extensive experiments on two large and challenging datasets, NTU-RGB+D and NTU-RGB+D 120, indicate that our model achieves state-of-the-art performance.
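A sketch of the high-order features named above, computed from raw joint positions of shape (T, V, 3): velocity and acceleration as first and second temporal differences, and pairwise relative distances from joint coordinates. The shapes and the lack of boundary padding are assumptions for illustration.

```python
import torch

T, V = 300, 25
pos = torch.randn(T, V, 3)                       # 3D joint positions per frame

velocity = pos[1:] - pos[:-1]                    # first difference: (T-1, V, 3)
acceleration = velocity[1:] - velocity[:-1]      # second difference: (T-2, V, 3)

# Relative distance between every pair of joints in each frame: (T, V, V)
rel_dist = torch.cdist(pos, pos)

print(velocity.shape, acceleration.shape, rel_dist.shape)
```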

