Human Action Retrieval via Spatio-temporal Cuboids
Author(s): Qingshan Luo, Guihua Zeng

2019, Vol 19 (03), pp. 1950018
Author(s): Christos Veinidis, Antonios Danelakis, Ioannis Pratikakis, Theoharis Theoharis
Two novel methods for fully unsupervised human action retrieval from 3D mesh sequences are presented. The first achieves high accuracy but is suited to sequences of clean meshes, such as artificial sequences or highly post-processed real sequences; the second is robust and suited to noisy meshes, such as those that often result from unprocessed scanning or 3D surface reconstruction errors. The first method uses a spatio-temporal descriptor based on the trajectories of six salient points of the human body (the centroid, the top of the head, and the ends of the two upper and two lower limbs), from which a set of kinematic features is extracted. These features are transformed with the wavelet transform at different scales, and a set of statistics over the coefficients yields the descriptor. An important property of this descriptor is that its length is constant, independent of the number of frames in the sequence. The second descriptor consists of two complementary sub-descriptors: one based on the trajectory of the centroid of the human body across frames, and the other based on the Hybrid static shape descriptor adapted to mesh sequences. The robustness of the second descriptor derives from the robustness with which the centroid and the Hybrid sub-descriptors are extracted. Performance figures on publicly available real and artificial datasets support our accuracy and robustness claims, and in most cases the results outperform the state of the art.
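As a minimal illustration of how a constant-length descriptor can be obtained from trajectories of arbitrary duration, the sketch below wavelet-transforms a simple kinematic feature (per-point speed) with a hand-rolled Haar decomposition and summarises each scale with statistics. The choice of feature, wavelet, number of levels, and statistics here is illustrative and need not match the paper's.

```python
import numpy as np

def haar_decompose(signal, levels=3):
    """One-dimensional Haar wavelet decomposition.

    Returns the detail coefficients at each scale plus the final
    approximation; pads to an even length at every level.
    """
    details = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        if len(approx) % 2:                          # pad to even length
            approx = np.append(approx, approx[-1])
        pairs = approx.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    return details, approx

def fixed_length_descriptor(trajectories, levels=3):
    """Constant-length descriptor from salient-point trajectories.

    trajectories: array of shape (n_frames, n_points, 3).
    A simple kinematic feature (per-point speed) is wavelet-transformed
    at several scales and summarised by per-scale mean and standard
    deviation, so the output length depends only on n_points and
    levels, never on n_frames.
    """
    traj = np.asarray(trajectories, dtype=float)
    # speed of each salient point between consecutive frames
    speeds = np.linalg.norm(np.diff(traj, axis=0), axis=2)
    stats = []
    for p in range(speeds.shape[1]):
        details, approx = haar_decompose(speeds[:, p], levels)
        for band in details + [approx]:
            stats.extend([band.mean(), band.std()])
    return np.array(stats)
```

With six salient points and three levels, the descriptor has 6 × (3 + 1) × 2 = 48 entries whether the sequence has 77 frames or 120.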


2012, Vol 33 (4), pp. 446-452
Author(s): Simon Jones, Ling Shao, Jianguo Zhang, Yan Liu

2020, Vol 79 (17-18), pp. 12349-12371
Author(s): Qingshan She, Gaoyuan Mu, Haitao Gan, Yingle Fan

2020, Vol 10 (12), pp. 4412
Author(s): Ammar Mohsin Butt, Muhammad Haroon Yousaf, Fiza Murtaza, Saima Nazir, Serestina Viriri, ...

Human action recognition has attracted significant attention in recent years due to high demand across application domains. In this work, we propose a novel codebook generation and hybrid encoding scheme for the classification of action videos. The scheme builds a discriminative codebook and a hybrid feature vector by encoding features extracted from convolutional neural networks (CNNs). We explore different CNN architectures for extracting spatio-temporal features. We employ agglomerative clustering for codebook generation, which aims to combine the advantages of global and class-specific codebooks. We propose a Residual Vector of Locally Aggregated Descriptors (R-VLAD) and fuse it with locality-based coding to form a hybrid feature vector, which provides a compact representation along with high-order statistics. We evaluated our work on two publicly available standard benchmark datasets, HMDB-51 and UCF-101, on which the proposed method achieves 72.6% and 96.2% accuracy, respectively. We conclude that the proposed scheme boosts recognition accuracy for human action recognition.
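The residual-aggregation idea underlying VLAD-style encodings can be sketched as follows. This is plain VLAD with power and L2 normalisation, not the paper's R-VLAD or its fusion with locality-based coding, and the codebook is assumed to be given (e.g. cluster centres produced by agglomerative clustering over CNN features).

```python
import numpy as np

def vlad_encode(features, codebook):
    """VLAD-style encoding of one video's local features.

    features: (n, d) local descriptors.
    codebook: (k, d) cluster centres.
    Each feature is assigned to its nearest centre; residuals are
    accumulated per centre, then power- and L2-normalised, giving a
    fixed (k*d,) vector regardless of n.
    """
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # nearest-centre assignment by squared Euclidean distance
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    k, d = codebook.shape
    vlad = np.zeros((k, d))
    for i, c in enumerate(assign):
        vlad[c] += features[i] - codebook[c]      # residual aggregation
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))  # power normalisation
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

Because residuals rather than counts are accumulated, the encoding carries first-order statistics of the features around each codeword, which is what makes VLAD variants more informative than a bag-of-words histogram of the same codebook.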


2020, Vol 34 (07), pp. 12886-12893
Author(s): Xiao-Yu Zhang, Haichao Shi, Changsheng Li, Peng Li

Weakly supervised action recognition and localization in untrimmed videos is a challenging problem with extensive applications. The overwhelming amount of irrelevant background content in untrimmed videos severely hampers effective identification of the actions of interest. In this paper, we propose a novel multi-instance multi-label modeling network based on spatio-temporal pre-trimming to recognize actions and locate the corresponding frames in untrimmed videos. Motivated by the fact that the person is the key factor in a human action, we spatially and temporally segment each untrimmed video into person-centric clips using pose estimation and tracking techniques. Given the bag-of-instances structure associated with video-level labels, action recognition is naturally formulated as a multi-instance multi-label learning problem. The network is optimized iteratively with selective coarse-to-fine pre-trimming based on instance-label activation. After convergence, temporal localization is further achieved with a local-global temporal class activation map. Extensive experiments on two benchmark datasets, THUMOS14 and ActivityNet1.3, clearly corroborate the efficacy of our method compared with the state of the art.
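The final localisation step, turning a temporal class activation sequence into action segments, is commonly done by thresholding and grouping contiguous frames. A minimal sketch of that post-processing is below; the paper's local-global activation map is more involved, and the fixed threshold here is an illustrative assumption.

```python
import numpy as np

def temporal_segments(cam, threshold=0.5):
    """Group frames whose activation exceeds a threshold into segments.

    cam: (n_frames,) activation scores for one action class.
    Returns a list of (start, end) frame indices, end exclusive —
    a common post-processing step in weakly supervised temporal
    localisation.
    """
    above = np.asarray(cam) > threshold
    segments, start = [], None
    for t, on in enumerate(above):
        if on and start is None:
            start = t                      # segment opens
        elif not on and start is not None:
            segments.append((start, t))    # segment closes
            start = None
    if start is not None:                  # segment runs to the end
        segments.append((start, len(above)))
    return segments
```

For example, the activation sequence [0.1, 0.8, 0.9, 0.2, 0.7, 0.6, 0.1] with threshold 0.5 yields the two segments (1, 3) and (4, 6).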

