Fine-Grained Action Recognition by Motion Saliency and Mid-Level Patches

2020 ◽ Vol 10 (8) ◽ pp. 2811
Author(s): Fang Liu ◽ Liang Zhao ◽ Xiaochun Cheng ◽ Qin Dai ◽ Xiangbin Shi ◽ ...

Effective extraction of the human body parts and manipulated objects participating in an action is the key issue in fine-grained action recognition. However, most existing methods require intensive manual annotation to train detectors for these interaction components. In this paper, we represent videos by mid-level patches to avoid manual annotation, where each patch corresponds to an action-related interaction component. To capture mid-level patches more accurately and rapidly, candidate motion regions are extracted by motion saliency. First, the motion regions containing interaction components are segmented by a threshold calculated adaptively from the saliency histogram of the motion saliency map. Second, we introduce a mid-level patch mining algorithm for interaction component detection, comprising object proposal generation and mid-level patch detection. The object proposal generation algorithm obtains multi-granularity object proposals, inspired by the idea of the Huffman algorithm. Based on these proposals, mid-level patch detectors are trained by K-means clustering and SVM. Finally, we build a fine-grained action recognition model that uses a graph structure to describe relationships between the mid-level patches. To recognize actions, the model computes the appearance and motion features of the mid-level patches and the binary motion cooperation relationships between adjacent patches in the graph. Extensive experiments on the MPII Cooking dataset demonstrate that the proposed method achieves better fine-grained action recognition results.
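The first step above, segmenting motion regions by a threshold derived from the saliency histogram, can be sketched as follows. The abstract does not specify the exact adaptive rule, so Otsu's between-class-variance criterion is used here as a plausible stand-in; `adaptive_threshold` and `motion_regions` are illustrative names, not the authors' code.

```python
# Sketch (assumption): adaptive thresholding of a motion saliency map from
# its histogram, using Otsu's criterion as a stand-in for the paper's rule.
import numpy as np

def adaptive_threshold(saliency, bins=256):
    """Pick the threshold maximising between-class variance (Otsu)."""
    hist, edges = np.histogram(saliency.ravel(), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()                      # normalised histogram
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                          # background class weight
    w1 = 1.0 - w0                              # foreground class weight
    mu0 = np.cumsum(p * centers)               # cumulative (unnormalised) mean
    mu_total = mu0[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_total * w0 - mu0) ** 2 / (w0 * w1)
    between[~np.isfinite(between)] = 0.0
    # Return the upper edge of the best split bin, so that values strictly
    # above the split fall into the foreground class.
    return edges[np.argmax(between) + 1]

def motion_regions(saliency):
    """Binary mask of candidate motion regions."""
    return saliency >= adaptive_threshold(saliency)
```

For a clearly bimodal saliency map the returned threshold falls between the two modes, so the mask keeps only the high-saliency (motion) pixels.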

2021 ◽ Vol 11 (8) ◽ pp. 996
Author(s): James P. Trujillo ◽ Judith Holler

During natural conversation, people must quickly understand the meaning of what the other speaker is saying. This concerns not just the semantic content of an utterance, but also the social action (i.e., what the utterance is doing: requesting information, offering, evaluating, checking mutual understanding, etc.) that the utterance is performing. The multimodal nature of human language raises the question of whether visual signals may contribute to the rapid processing of such social actions. However, while previous research has shown that how we move reveals the intentions underlying instrumental actions, we do not know whether the intentions underlying fine-grained social actions in conversation are also revealed in our bodily movements. Using a corpus of dyadic conversations combined with manual annotation and motion tracking, we analyzed the kinematics of the torso, head, and hands during the asking of questions. Manual annotation categorized these questions into six fine-grained social action types (i.e., request for information, other-initiated repair, understanding check, stance or sentiment, self-directed, active participation). We demonstrate, for the first time, that the kinematics of the torso, head, and hands differ between some of these social action categories within a 900 ms time window that captures movements starting slightly before, or within 600 ms after, utterance onset. These results provide novel insight into the extent to which our intentions shape the way we move, and suggest new avenues for understanding how this phenomenon may facilitate the fast communication of meaning in conversational interaction.
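As a rough illustration of the kind of feature such a kinematic analysis computes, the sketch below extracts the mean speed of one tracked joint within a 900 ms window around utterance onset. Splitting the window as 300 ms before and 600 ms after onset, the sampling rate, and the function name are all illustrative assumptions, not details taken from the study.

```python
# Sketch (assumptions labelled): mean joint speed in a 900 ms window around
# utterance onset, from motion-tracking positions sampled at a known fps.
import numpy as np

def window_mean_speed(positions, fps, onset_frame, pre_ms=300, post_ms=600):
    """positions: (frames, 3) array for one tracked joint (e.g., a wrist)."""
    start = max(0, onset_frame - int(pre_ms / 1000 * fps))
    stop = min(len(positions), onset_frame + int(post_ms / 1000 * fps))
    seg = positions[start:stop]
    step = np.linalg.norm(np.diff(seg, axis=0), axis=1)  # per-frame displacement
    return float(step.mean() * fps)            # speed in position units per second
```

Summary statistics like this (per joint, per question) are what a classifier or statistical test would then compare across social action categories.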


Sensors ◽ 2021 ◽ Vol 21 (4) ◽ pp. 1280
Author(s): Hyeonseok Lee ◽ Sungchan Kim

Explaining the predictions of deep neural networks makes them more understandable and trustworthy, enabling their use in mission-critical tasks. Recent progress in the learning capability of networks has come primarily from an enormous number of model parameters, so it is usually hard to interpret their operation, in contrast to classical white-box models. To this end, generating saliency maps is a popular approach to identifying the input features that are important for a model's prediction. Existing explanation methods typically use only the output of the model's last convolution layer to generate a saliency map, discarding the information contained in intermediate layers; the resulting explanations are coarse and of limited accuracy. Although accuracy can be improved by developing a saliency map iteratively, this is too time-consuming to be practical. To address these problems, we propose a novel approach that explains the model's prediction through an attentive surrogate network trained via knowledge distillation. The surrogate network generates a fine-grained saliency map corresponding to the model prediction using meaningful regional information present across all network layers. Experiments demonstrate that the saliency maps are the result of spatially attentive features learned from the distillation, making them useful for fine-grained classification tasks. Moreover, the proposed method runs at 24.3 frames per second, orders of magnitude faster than existing methods.
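The surrogate network is trained by knowledge distillation; its core objective can be sketched as the standard temperature-softened KL loss of Hinton-style distillation. This is a generic sketch only: the paper's attention mechanism and surrogate architecture are not reproduced, and the temperature value is illustrative.

```python
# Sketch (assumption): the generic knowledge-distillation objective that a
# surrogate network could be trained with; not the paper's exact loss.
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher_soft || student_soft), scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, T)             # softened teacher targets
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

The loss is zero when the student reproduces the teacher's soft output exactly and positive otherwise, which is what drives the surrogate to mimic the black-box model.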


2019 ◽ Vol 29 (10) ◽ pp. 2986-3000
Author(s): Zhanpeng Shao ◽ Youfu Li ◽ Yao Guo ◽ Xiaolong Zhou ◽ Shengyong Chen

2021 ◽ pp. 620-631
Author(s): Xiang Li ◽ Shenglan Liu ◽ Yunheng Li ◽ Hao Liu ◽ Jinjing Zhao ◽ ...

Author(s): Yang Zhou ◽ Bingbing Ni ◽ Shuicheng Yan ◽ Pierre Moulin ◽ Qi Tian

Symmetry ◽ 2020 ◽ Vol 12 (9) ◽ pp. 1397
Author(s): Thien-Thu Ngo ◽ VanDung Nguyen ◽ Xuan-Qui Pham ◽ Md-Alamgir Hossain ◽ Eui-Nam Huh

Intelligent surveillance systems enable secure visibility features in the smart-city era. A major pre-processing model in intelligent surveillance systems is saliency detection, which supports tasks such as object detection, object segmentation, video coding, image re-targeting, image-quality assessment, and image compression. Traditional models focus on improving detection accuracy at the cost of high complexity, making them computationally expensive for real-world systems. To cope with this issue, we propose a fast motion-saliency method for surveillance systems under various background conditions. Our method is derived from streaming dynamic mode decomposition (s-DMD), a powerful tool in data science. First, DMD computes a set of modes in a streaming manner to derive spatial-temporal features, and a raw saliency map is generated by a sparse reconstruction process. Second, the final saliency map is refined using a difference-of-Gaussians filter in the frequency domain. The effectiveness of the proposed method is validated on a standard benchmark dataset. The experimental results show that it achieves competitive accuracy with lower complexity than state-of-the-art methods, satisfying the requirements of real-time applications.
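The second stage, refining the raw saliency map with a difference-of-Gaussians (DoG) filter in the frequency domain, might look like the sketch below. The Gaussian widths are illustrative assumptions, and the s-DMD stage that would produce `raw_saliency` is omitted.

```python
# Sketch (assumption): DoG band-pass refinement of a raw saliency map,
# applied in the frequency domain via the FFT. Sigmas are illustrative.
import numpy as np

def dog_refine(raw_saliency, sigma_small=1.0, sigma_large=4.0):
    h, w = raw_saliency.shape
    fy = np.fft.fftfreq(h)[:, None]            # frequency grid (cycles/sample)
    fx = np.fft.fftfreq(w)[None, :]
    f2 = fx ** 2 + fy ** 2
    # The Fourier transform of a spatial Gaussian is a Gaussian in frequency.
    g_small = np.exp(-2 * (np.pi ** 2) * (sigma_small ** 2) * f2)
    g_large = np.exp(-2 * (np.pi ** 2) * (sigma_large ** 2) * f2)
    spectrum = np.fft.fft2(raw_saliency)
    refined = np.fft.ifft2(spectrum * (g_small - g_large)).real
    refined -= refined.min()                   # rescale to [0, 1]
    return refined / (refined.max() + 1e-12)
```

Subtracting the two Gaussians yields a band-pass filter that suppresses both slowly varying background and pixel-level noise, keeping mid-frequency salient structure.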


2016 ◽ Vol 2016 ◽ pp. 1-11
Author(s): Qingwu Li ◽ Haisu Cheng ◽ Yan Zhou ◽ Guanying Huo

Human action recognition in videos is a topic of active research in computer vision. Dense trajectory (DT) features have been shown to be effective for representing videos in state-of-the-art approaches. In this paper, we present a more effective video representation using improved salient dense trajectories. First, we detect the motion-salient region and extract dense trajectories by tracking interest points in each spatial scale separately, and then refine the trajectories through motion-saliency analysis. Next, we compute several descriptors (trajectory displacement, HOG, HOF, and MBH) in the spatiotemporal volume aligned with each trajectory. Finally, to represent the videos better, we optimize the bag-of-words framework according to the motion-salient intensity distribution and the idea of sparse coefficient reconstruction. Our architecture is trained and evaluated on four standard action datasets (KTH, UCF Sports, HMDB51, and UCF50), and the experimental results show that our approach performs competitively with the state of the art.
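The baseline bag-of-words step that the paper then optimizes can be sketched as hard assignment of per-trajectory descriptors (e.g., HOG/HOF/MBH vectors) to a pre-trained codebook. The saliency-weighted and sparse-reconstruction refinements are omitted, and the names are illustrative.

```python
# Sketch (assumption): plain bag-of-words encoding of trajectory descriptors
# against a K-means codebook; the paper's refinements are not reproduced.
import numpy as np

def bow_encode(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and histogram."""
    # Pairwise squared distances: (n_descriptors, n_codewords).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                  # nearest-codeword index
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-12)         # L1-normalised video signature
```

The resulting fixed-length histogram is the per-video feature that a classifier (e.g., an SVM) would consume, regardless of how many trajectories the video contains.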

