Fine-Grained Action Recognition by Motion Saliency and Mid-Level Patches

2020 ◽ Vol 10 (8) ◽ pp. 2811
Author(s): Fang Liu ◽ Liang Zhao ◽ Xiaochun Cheng ◽ Qin Dai ◽ Xiangbin Shi ◽ ...

Effective extraction of the human body parts and manipulated objects participating in an action is the key issue in fine-grained action recognition. However, most existing methods require intensive manual annotation to train detectors for these interaction components. In this paper, we represent videos by mid-level patches to avoid manual annotation, where each patch corresponds to an action-related interaction component. To capture mid-level patches more accurately and rapidly, candidate motion regions are extracted by motion saliency. First, the motion regions containing interaction components are segmented by a threshold calculated adaptively from the saliency histogram of the motion saliency map. Second, we introduce a mid-level patch mining algorithm for interaction component detection, comprising object proposal generation and mid-level patch detection. The object proposal generation algorithm obtains multi-granularity object proposals, inspired by the idea of the Huffman algorithm. Based on these proposals, mid-level patch detectors are trained by K-means clustering and SVM. Finally, we build a fine-grained action recognition model that uses a graph structure to describe relationships between the mid-level patches. To recognize actions, the model computes the appearance and motion features of the mid-level patches and the binary motion cooperation relationships between adjacent patches in the graph. Extensive experiments on the MPII Cooking dataset demonstrate that the proposed method achieves better fine-grained action recognition results.
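The first step above, segmenting motion regions by a threshold derived from the saliency histogram, can be sketched as follows. The abstract does not specify the exact adaptive rule, so Otsu's between-class-variance criterion is used here as a plausible stand-in; `adaptive_threshold` and `motion_regions` are illustrative names, not the authors' code.

```python
# Sketch (assumption): adaptive thresholding of a motion saliency map from
# its histogram, using Otsu's criterion as a stand-in for the paper's rule.
import numpy as np

def adaptive_threshold(saliency, bins=256):
    """Pick the threshold maximising between-class variance (Otsu)."""
    hist, edges = np.histogram(saliency.ravel(), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()                      # normalised histogram
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                          # background class weight
    w1 = 1.0 - w0                              # foreground class weight
    mu0 = np.cumsum(p * centers)               # cumulative (unnormalised) mean
    mu_total = mu0[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_total * w0 - mu0) ** 2 / (w0 * w1)
    between[~np.isfinite(between)] = 0.0
    # Return the upper edge of the best split bin, so that values strictly
    # above the split fall into the foreground class.
    return edges[np.argmax(between) + 1]

def motion_regions(saliency):
    """Binary mask of candidate motion regions."""
    return saliency >= adaptive_threshold(saliency)
```

For a clearly bimodal saliency map the returned threshold falls between the two modes, so the mask keeps only the high-saliency (motion) pixels.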

2021 ◽ Vol 11 (8) ◽ pp. 996
Author(s): James P. Trujillo ◽ Judith Holler

During natural conversation, people must quickly understand the meaning of what the other speaker is saying. This concerns not just the semantic content of an utterance, but also the social action (i.e., what the utterance is doing: requesting information, offering, evaluating, checking mutual understanding, etc.) that the utterance is performing. The multimodal nature of human language raises the question of whether visual signals may contribute to the rapid processing of such social actions. However, while previous research has shown that how we move reveals the intentions underlying instrumental actions, we do not know whether the intentions underlying fine-grained social actions in conversation are also revealed in our bodily movements. Using a corpus of dyadic conversations combined with manual annotation and motion tracking, we analyzed the kinematics of the torso, head, and hands during the asking of questions. Manual annotation categorized these questions into six fine-grained social action types (i.e., request for information, other-initiated repair, understanding check, stance or sentiment, self-directed, active participation). We demonstrate, for the first time, that the kinematics of the torso, head, and hands differ between some of these social action categories within a 900 ms time window that captures movements starting slightly before, or within 600 ms after, utterance onset. These results provide novel insight into the extent to which our intentions shape the way we move, and suggest new avenues for understanding how this phenomenon may facilitate the fast communication of meaning in conversational interaction.
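As a rough illustration of the kind of feature such a kinematic analysis computes, the sketch below extracts the mean speed of one tracked joint within a 900 ms window around utterance onset. Splitting the window as 300 ms before and 600 ms after onset, the sampling rate, and the function name are all illustrative assumptions, not details taken from the study.

```python
# Sketch (assumptions labelled): mean joint speed in a 900 ms window around
# utterance onset, from motion-tracking positions sampled at a known fps.
import numpy as np

def window_mean_speed(positions, fps, onset_frame, pre_ms=300, post_ms=600):
    """positions: (frames, 3) array for one tracked joint (e.g., a wrist)."""
    start = max(0, onset_frame - int(pre_ms / 1000 * fps))
    stop = min(len(positions), onset_frame + int(post_ms / 1000 * fps))
    seg = positions[start:stop]
    step = np.linalg.norm(np.diff(seg, axis=0), axis=1)  # per-frame displacement
    return float(step.mean() * fps)            # speed in position units per second
```

Summary statistics like this (per joint, per question) are what a classifier or statistical test would then compare across social action categories.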


Sensors ◽ 2021 ◽ Vol 21 (4) ◽ pp. 1280
Author(s): Hyeonseok Lee ◽ Sungchan Kim

Explaining the predictions of deep neural networks makes them more understandable and trustworthy, enabling their use in mission-critical tasks. Recent progress in the learning capability of networks has come primarily from an enormous number of model parameters, so it is usually hard to interpret their operation, in contrast to classical white-box models. To this end, generating saliency maps is a popular approach to identifying the input features that are important for a model's prediction. Existing explanation methods typically use only the output of the model's last convolution layer to generate a saliency map, discarding the information contained in intermediate layers; the resulting explanations are coarse and of limited accuracy. Although accuracy can be improved by developing a saliency map iteratively, this is too time-consuming to be practical. To address these problems, we propose a novel approach that explains the model's prediction through an attentive surrogate network trained via knowledge distillation. The surrogate network generates a fine-grained saliency map corresponding to the model prediction using meaningful regional information present across all network layers. Experiments demonstrate that the saliency maps are the result of spatially attentive features learned from the distillation, making them useful for fine-grained classification tasks. Moreover, the proposed method runs at 24.3 frames per second, orders of magnitude faster than existing methods.
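The surrogate network is trained by knowledge distillation; its core objective can be sketched as the standard temperature-softened KL loss of Hinton-style distillation. This is a generic sketch only: the paper's attention mechanism and surrogate architecture are not reproduced, and the temperature value is illustrative.

```python
# Sketch (assumption): the generic knowledge-distillation objective that a
# surrogate network could be trained with; not the paper's exact loss.
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher_soft || student_soft), scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, T)             # softened teacher targets
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

The loss is zero when the student reproduces the teacher's soft output exactly and positive otherwise, which is what drives the surrogate to mimic the black-box model.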


2019 ◽ Vol 29 (10) ◽ pp. 2986-3000
Author(s): Zhanpeng Shao ◽ Youfu Li ◽ Yao Guo ◽ Xiaolong Zhou ◽ Shengyong Chen

2021 ◽ pp. 620-631
Author(s): Xiang Li ◽ Shenglan Liu ◽ Yunheng Li ◽ Hao Liu ◽ Jinjing Zhao ◽ ...

Author(s): Yang Zhou ◽ Bingbing Ni ◽ Shuicheng Yan ◽ Pierre Moulin ◽ Qi Tian

Symmetry ◽ 2020 ◽ Vol 12 (9) ◽ pp. 1397
Author(s): Thien-Thu Ngo ◽ VanDung Nguyen ◽ Xuan-Qui Pham ◽ Md-Alamgir Hossain ◽ Eui-Nam Huh

Intelligent surveillance systems enable secure visibility features in the smart-city era. A major pre-processing model in intelligent surveillance systems is saliency detection, which supports tasks such as object detection, object segmentation, video coding, image re-targeting, image-quality assessment, and image compression. Traditional models focus on improving detection accuracy at the cost of high complexity, making them computationally expensive for real-world systems. To cope with this issue, we propose a fast motion-saliency method for surveillance systems under various background conditions. Our method is derived from streaming dynamic mode decomposition (s-DMD), a powerful tool in data science. First, DMD computes a set of modes in a streaming manner to derive spatial-temporal features, and a raw saliency map is generated by a sparse reconstruction process. Second, the final saliency map is refined using a difference-of-Gaussians filter in the frequency domain. The effectiveness of the proposed method is validated on a standard benchmark dataset. The experimental results show that it achieves competitive accuracy with lower complexity than state-of-the-art methods, satisfying the requirements of real-time applications.
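The second stage, refining the raw saliency map with a difference-of-Gaussians (DoG) filter in the frequency domain, might look like the sketch below. The Gaussian widths are illustrative assumptions, and the s-DMD stage that would produce `raw_saliency` is omitted.

```python
# Sketch (assumption): DoG band-pass refinement of a raw saliency map,
# applied in the frequency domain via the FFT. Sigmas are illustrative.
import numpy as np

def dog_refine(raw_saliency, sigma_small=1.0, sigma_large=4.0):
    h, w = raw_saliency.shape
    fy = np.fft.fftfreq(h)[:, None]            # frequency grid (cycles/sample)
    fx = np.fft.fftfreq(w)[None, :]
    f2 = fx ** 2 + fy ** 2
    # The Fourier transform of a spatial Gaussian is a Gaussian in frequency.
    g_small = np.exp(-2 * (np.pi ** 2) * (sigma_small ** 2) * f2)
    g_large = np.exp(-2 * (np.pi ** 2) * (sigma_large ** 2) * f2)
    spectrum = np.fft.fft2(raw_saliency)
    refined = np.fft.ifft2(spectrum * (g_small - g_large)).real
    refined -= refined.min()                   # rescale to [0, 1]
    return refined / (refined.max() + 1e-12)
```

Subtracting the two Gaussians yields a band-pass filter that suppresses both slowly varying background and pixel-level noise, keeping mid-frequency salient structure.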


2016 ◽ Vol 2016 ◽ pp. 1-11
Author(s): Qingwu Li ◽ Haisu Cheng ◽ Yan Zhou ◽ Guanying Huo

Human action recognition in videos is a topic of active research in computer vision. Dense trajectory (DT) features have been shown to be effective for representing videos in state-of-the-art approaches. In this paper, we present a more effective video representation using improved salient dense trajectories. First, we detect the motion-salient region and extract dense trajectories by tracking interest points in each spatial scale separately, and then refine the trajectories through motion-saliency analysis. Next, we compute several descriptors (trajectory displacement, HOG, HOF, and MBH) in the spatiotemporal volume aligned with each trajectory. Finally, to represent the videos better, we optimize the bag-of-words framework according to the motion-salient intensity distribution and the idea of sparse coefficient reconstruction. Our architecture is trained and evaluated on four standard action datasets (KTH, UCF Sports, HMDB51, and UCF50), and the experimental results show that our approach performs competitively with the state of the art.
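The baseline bag-of-words step that the paper then optimizes can be sketched as hard assignment of per-trajectory descriptors (e.g., HOG/HOF/MBH vectors) to a pre-trained codebook. The saliency-weighted and sparse-reconstruction refinements are omitted, and the names are illustrative.

```python
# Sketch (assumption): plain bag-of-words encoding of trajectory descriptors
# against a K-means codebook; the paper's refinements are not reproduced.
import numpy as np

def bow_encode(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and histogram."""
    # Pairwise squared distances: (n_descriptors, n_codewords).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                  # nearest-codeword index
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-12)         # L1-normalised video signature
```

The resulting fixed-length histogram is the per-video feature that a classifier (e.g., an SVM) would consume, regardless of how many trajectories the video contains.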

