Action recognition in video using a spatial-temporal graph-based feature representation

Author(s):  
Iveel Jargalsaikhan ◽  
Suzanne Little ◽  
Remi Trichet ◽  
Noel E. O'Connor
Author(s):  
Jianhai Zhang ◽  
Zhiyong Feng ◽  
Yong Su ◽  
Meng Xing

Owing to the merits of high-order statistics and Riemannian geometry, the covariance matrix has become a generic feature representation for action recognition. An individual action can be represented by empirical statistics computed over all of its pose samples. Covariance has two major problems: (1) it is prone to singularity, so actions may fail to be represented properly, and (2) it lacks global action/pose-aware information, which limits its expressive and discriminative power. In this article, we propose a novel Bayesian covariance representation that uses prior regularization to solve these problems. Specifically, the covariance is viewed as a parametric maximum likelihood estimate of a Gaussian distribution over the local poses of an individual action. A Global Informative Prior (GIP) with sufficient statistics is then generated over global poses to regularize the covariance. In this way, (1) singularity is greatly relieved thanks to the sufficient statistics, and (2) the global pose information in the GIP makes the Bayesian covariance theoretically equivalent to a saliency-weighted covariance over global action poses, so the discriminative characteristics of actions are represented more clearly. Experimental results show that our Bayesian covariance with GIP efficiently improves action recognition performance. On some databases, it outperforms state-of-the-art variants based on kernels, temporal-order structures, and saliency-weighted attention, among others.
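
A minimal sketch of the kind of prior-regularized covariance estimate described above is given below. The conjugate, inverse-Wishart-style blending, the `prior_count` weighting, and the helper names are assumptions for illustration, not the authors' exact formulation of the GIP.

```python
import numpy as np

def global_informative_prior(all_poses, strength=1.0):
    """Build a prior scatter matrix from global pose statistics (hypothetical helper)."""
    mu = all_poses.mean(axis=0)
    centered = all_poses - mu
    return strength * (centered.T @ centered) / len(all_poses)

def bayesian_covariance(action_poses, prior_scatter, prior_count=10):
    """MAP-style covariance of one action's poses, regularized by the global prior.

    Blending the prior scatter with the empirical scatter keeps the estimate
    non-singular even when the action has only a few pose samples.
    """
    n = len(action_poses)
    mu = action_poses.mean(axis=0)
    centered = action_poses - mu
    scatter = centered.T @ centered
    return (prior_count * prior_scatter + scatter) / (prior_count + n)
```

As a usage sketch, `global_informative_prior` would be fit once over the pose samples of the whole training set, and `bayesian_covariance` applied per action clip to produce the regularized descriptor.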


Author(s):  
Jiajia Luo ◽  
Wei Wang ◽  
Hairong Qi

Multi-view human action recognition has gained considerable attention in recent years for its superior performance compared to single-view recognition. In this paper, we propose a new framework for real-time human action recognition in distributed camera networks (DCNs). We first present a new feature descriptor (Mltp-hist) that is tolerant to illumination change, robust in homogeneous regions, and computationally efficient. Taking advantage of the proposed Mltp-hist, non-informative 3-D patches generated from the background can be removed automatically, which effectively highlights the foreground patches. Next, a new feature representation method based on sparse coding is presented to generate the histogram representation of local videos, which is transmitted to the base station for classification. Due to the sparse representation of the extracted features, the approximation error is reduced. Finally, at the base station, a probability model fuses the information from the various views and assigns a class label accordingly. Compared to existing algorithms, the proposed framework has three advantages while requiring less memory and bandwidth: 1) no preprocessing is required; 2) communication among cameras is unnecessary; and 3) the positions and orientations of the cameras do not need to be fixed. We further evaluate the proposed framework on the most popular multi-view action dataset, IXMAS. Experimental results indicate that our framework consistently achieves state-of-the-art results when various numbers of views are tested. In addition, our approach is tolerant to various combinations of views and benefits from introducing more views at the testing stage. In particular, our results remain satisfactory even when large misalignment exists between the training and testing samples.
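
The sketch below illustrates the two transmitted/fused quantities the abstract describes: a sparse-coding histogram computed per camera, and a per-view probability fusion at the base station. The Mltp-hist descriptor itself is not shown, the max-pooling of codes and the product-rule fusion are stand-ins for the paper's own choices, and the dictionary is assumed to be learned offline.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def encode_histogram(descriptors, dictionary, alpha=0.5):
    """Sparse-code local descriptors against a fixed dictionary and pool the
    absolute codes into a fixed-length histogram (the per-camera representation
    sent to the base station)."""
    coder = SparseCoder(dictionary=dictionary,
                        transform_algorithm="lasso_lars",
                        transform_alpha=alpha)
    codes = coder.transform(descriptors)   # shape: (n_patches, n_atoms)
    return np.abs(codes).max(axis=0)       # shape: (n_atoms,)

def fuse_views(per_view_probs):
    """Combine per-view class posteriors with a simple product rule and return
    the fused class label."""
    fused = np.prod(np.stack(per_view_probs), axis=0)
    return int(np.argmax(fused))
```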


2014 ◽  
Vol 11 (01) ◽  
pp. 1450005
Author(s):  
Yangyang Wang ◽  
Yibo Li ◽  
Xiaofei Ji

Vision-based human action recognition is currently one of the most active research topics in computer vision, and the feature representation has a crucial impact on recognition performance. Representations based on bag-of-words are popular in current research, but the spatial and temporal relationships among the features are usually discarded. To address this issue, a novel feature representation based on normalized interest points, called the super-interest point, is proposed and used to recognize human actions. The novelty of the proposed feature is that, by introducing normalized point clustering, the spatial-temporal correlation between the interest points and the human body can be added to the representation directly, without considering the scale and location variance of the points. The novelty concerns three tasks. First, to handle the diversity of human location and scale, interest points are normalized based on the normalization of the human region. Second, to capture the spatial-temporal correlation among the interest points, normalized points with similar spatial and temporal distances are grouped into a super-interest point using a three-dimensional clustering algorithm. Finally, a new feature representation is obtained by describing the appearance characteristics of the super-interest points and the location relationships among them. The proposed representation establishes the relationship between local features and the human figure. Experiments on the Weizmann, KTH, and UCF Sports datasets demonstrate that the proposed feature is effective for human action recognition.
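
A minimal sketch of the normalization and 3-D grouping steps is given below. The bounding-box normalization, the use of DBSCAN as the clustering algorithm, and the temporal scaling factor are assumptions standing in for the paper's own choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def normalize_points(points, bbox_per_frame):
    """Map each (x, y, t) interest point into the unit square of its frame's
    person bounding box, removing scale and location variance.
    Assumes t indexes bbox_per_frame as (x0, y0, width, height)."""
    out = []
    for x, y, t in points:
        x0, y0, w, h = bbox_per_frame[int(t)]
        out.append(((x - x0) / w, (y - y0) / h, t))
    return np.asarray(out)

def super_interest_points(norm_points, eps=0.15, min_pts=3, t_scale=0.1):
    """Group normalized points that are close in space and time into
    super-interest points via 3-D density clustering."""
    scaled = norm_points * np.array([1.0, 1.0, t_scale])
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(scaled)
    return [norm_points[labels == k] for k in set(labels) if k != -1]
```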


Sensors ◽  
2020 ◽  
Vol 20 (18) ◽  
pp. 5260 ◽  
Author(s):  
Fanjia Li ◽  
Juanjuan Li ◽  
Aichun Zhu ◽  
Yonggang Xu ◽  
Hongsheng Yin ◽  
...  

In the skeleton-based human action recognition domain, spatial-temporal graph convolutional networks (ST-GCNs) have made great progress recently. However, they use only one fixed temporal convolution kernel, which is not enough to extract temporal cues comprehensively. Moreover, simply connecting the spatial graph convolution layer (GCL) and the temporal GCL in series is not the optimal solution. To this end, we propose a novel enhanced spatial and extended temporal graph convolutional network (EE-GCN) in this paper. Three convolution kernels of different sizes are chosen to extract discriminative temporal features over shorter to longer terms. The corresponding GCLs are then concatenated by a powerful yet efficient one-shot aggregation (OSA) + effective squeeze-excitation (eSE) structure. The OSA module aggregates the features from each layer once into the output, and the eSE module explores the interdependency between the channels of the output. In addition, we propose a new connection paradigm to enhance the spatial features, which expands the serial connection into a combination of serial and parallel connections by adding a spatial GCL in parallel with the temporal GCLs. The proposed method is evaluated on three large-scale datasets, and the experimental results show that its performance exceeds previous state-of-the-art methods.
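
The module below sketches the multi-kernel temporal branch with OSA-style concatenation and an eSE channel gate, as described in the abstract. The kernel sizes (3, 5, 7), the 1x1 fusion convolution, and the layer layout are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultiKernelTemporalOSA(nn.Module):
    """Temporal convolutions with several kernel sizes, aggregated once (OSA),
    followed by an effective squeeze-excitation (eSE) channel gate.

    Input/output shape: (N, C, T, V) -- batch, channels, frames, joints.
    """
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))
            for k in kernel_sizes
        )
        concat = channels * len(kernel_sizes)
        self.aggregate = nn.Conv2d(concat, channels, kernel_size=1)  # OSA: concatenate once, then fuse
        self.ese = nn.Sequential(                                    # eSE: pooled 1x1 conv + sigmoid gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        out = self.aggregate(feats)
        return out * self.ese(out)
```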


2021 ◽  
pp. 1-13
Author(s):  
Cong Pei ◽  
Feng Jiang ◽  
Mao Li

With the advent of cost-efficient depth cameras, many effective feature descriptors have been proposed for action recognition from depth sequences. However, most of them are based on a single feature and are thus unable to capture action information comprehensively; for example, some descriptors can represent the area where motion occurs but cannot describe the order in which the action is performed. In this paper, a new feature representation scheme that combines different feature descriptors is proposed to capture various aspects of action cues simultaneously. First, a depth sequence is divided into a series of sub-sequences using a motion-energy-based spatial-temporal pyramid. For each sub-sequence, on the one hand, depth motion map (DMM)-based completed local binary pattern (CLBP) descriptors are calculated through a patch-based strategy; on the other hand, the sub-sequence is partitioned into spatial grids and polynormal descriptors are obtained for each grid sequence. The sparse representation vectors of the DMM-based CLBP and the polynormals are then calculated separately. After pooling, the final representation vector of the sample is generated as the input to the classifier. Finally, two different fusion strategies are applied. Extensive experiments on two benchmark datasets show that the proposed method performs better than each single-feature-based recognition method.
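
The abstract names two fusion strategies without specifying their form. The sketch below assumes the two most common interpretations, feature-level concatenation and weighted score averaging, as hypothetical stand-ins.

```python
import numpy as np

def feature_level_fusion(clbp_vec, polynormal_vec):
    """Concatenate the pooled sparse codes of the two descriptors into one
    vector before classification."""
    return np.concatenate([clbp_vec, polynormal_vec])

def decision_level_fusion(scores_clbp, scores_polynormal, w=0.5):
    """Blend per-class scores from two single-descriptor classifiers and pick
    the winning class."""
    fused = w * scores_clbp + (1.0 - w) * scores_polynormal
    return int(np.argmax(fused))
```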


2021 ◽  
Vol 11 (18) ◽  
pp. 8641
Author(s):  
Jianping Guo ◽  
Hong Liu ◽  
Xi Li ◽  
Dahong Xu ◽  
Yihan Zhang

With the increasing popularity of artificial intelligence applications, artificial intelligence technology has begun to be applied in competitive sports. These applications have promoted improvements in athletes' competitive ability as well as general fitness. Human action recognition technology based on deep learning has gradually been applied to the analysis of the technical actions and tactics of competitive athletes. In this paper, a new graph convolution model is proposed. Delaunay's partitioning algorithm is used to construct a new spatiotemporal topology that effectively captures the structural information and spatiotemporal features of athletes' technical actions. At the same time, an attention mechanism is integrated into the model and different weight coefficients are assigned to the joints, which significantly improves the accuracy of technical action recognition. First, a comparison with current state-of-the-art methods was undertaken on the general Kinect and NTU-RGB+D datasets, where the performance of the new model was slightly improved. Then, the performance of our algorithm was compared with spatial temporal graph convolutional networks (ST-GCN) on a karate technique action dataset, where the accuracy of our algorithm was significantly improved.
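
A minimal sketch of a Delaunay-based joint topology with per-joint attention weighting is shown below. Building the triangulation from 2-D joint positions per frame and scaling adjacency columns by given attention weights are assumptions about how the abstract's ideas could be realized, not the authors' exact construction.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_adjacency(joints_2d):
    """Build a symmetric joint adjacency matrix (with self-loops) from a
    Delaunay triangulation of the 2-D joint positions in one frame.
    Requires at least three non-collinear joints."""
    n = len(joints_2d)
    adj = np.eye(n)
    tri = Delaunay(joints_2d)
    for simplex in tri.simplices:      # each simplex is a triangle of joint indices
        for i in simplex:
            for j in simplex:
                adj[i, j] = 1.0
    return adj

def attention_weighted_adjacency(adj, joint_attention):
    """Scale the columns of the adjacency (edges into each joint) by a
    per-joint attention weight."""
    return adj * joint_attention[None, :]
```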


Author(s):  
Kumbala Reddy ◽  
Gullipalli Naidu ◽  
Bulusu Vardhan ◽  
...  

2021 ◽  
Author(s):  
Tailin Chen ◽  
Desen Zhou ◽  
Jian Wang ◽  
Shidong Wang ◽  
Yu Guan ◽  
...  
