Relational Prototypical Network for Weakly Supervised Temporal Action Localization

Linjiang Huang; Yan Huang; Wanli Ouyang; Liang Wang

doi:10.1609/aaai.v34i07.6760

Relational Prototypical Network for Weakly Supervised Temporal Action Localization

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6760 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11053-11060

Author(s):

Linjiang Huang ◽

Yan Huang ◽

Wanli Ouyang ◽

Liang Wang

Keyword(s):

State Of The Art ◽

Previous Method ◽

The State ◽

Localization Method ◽

Action Localization ◽

Art Methods ◽

Prototype Learning ◽

Weakly Supervised ◽

Temporal Action

In this paper, we propose a weakly supervised temporal action localization method on untrimmed videos based on prototypical networks. We observe two challenges posed by weakly supervision, namely action-background separation and action relation construction. Unlike the previous method, we propose to achieve action-background separation only by the original videos. To achieve this, a clustering loss is adopted to separate actions from backgrounds and learn intra-compact features, which helps in detecting complete action instances. Besides, a similarity weighting module is devised to further separate actions from backgrounds. To effectively identify actions, we propose to construct relations among actions for prototype learning. A GCN-based prototype embedding module is introduced to generate relational prototypes. Experiments on THUMOS14 and ActivityNet1.2 datasets show that our method outperforms the state-of-the-art methods.

Download Full-text

Self-Supervised Video Action Localization with Adversarial Temporal Transforms

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/96 ◽

2021 ◽

Author(s):

Guoqiang Gong ◽

Liangfeng Zheng ◽

Wenhao Jiang ◽

Yadong Mu

Keyword(s):

State Of The Art ◽

Experimental Results ◽

Policy Network ◽

Video Classification ◽

Time Warping ◽

Action Localization ◽

Temporal Boundary ◽

Consistency Constraint ◽

Weakly Supervised ◽

Temporal Action

Weakly-supervised temporal action localization aims to locate intervals of action instances with only video-level action labels for training. However, the localization results generated from video classification networks are often not accurate due to the lack of temporal boundary annotation of actions. Our motivating insight is that the temporal boundary of action should be stably predicted under various temporal transforms. This inspires a self-supervised equivariant transform consistency constraint. We design a set of temporal transform operations, including naive temporal down-sampling to learnable attention-piloted time warping. In our model, a localization network aims to perform well under all transforms, and another policy network is designed to choose a temporal transform at each iteration that adversarially brings localization result inconsistent with the localization network's. Additionally, we devise a self-refine module to enhance the completeness of action intervals harnessing temporal and semantic contexts. Experimental results on THUMOS14 and ActivityNet demonstrate that our model consistently outperforms the state-of-the-art weakly-supervised temporal action localization methods.

Download Full-text

Background Suppression Network for Weakly-Supervised Temporal Action Localization

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6793 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11320-11327 ◽

Cited By ~ 4

Author(s):

Pilhyeon Lee ◽

Youngjung Uh ◽

Hyeran Byun

Keyword(s):

State Of The Art ◽

Background Suppression ◽

Not Given ◽

Training Strategy ◽

Training Stage ◽

Action Localization ◽

Localization Performance ◽

Weakly Supervised ◽

Branch Weight ◽

Temporal Action

Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given in the training stage while the only hint is video-level labels: whether each video contains action frames of interest. Previous methods aggregate frame-level class scores to produce video-level prediction and learn from video-level action labels. This formulation does not fully model the problem in that background frames are forced to be misclassified as action classes to predict video-level labels accurately. In this paper, we design Background Suppression Network (BaS-Net) which introduces an auxiliary class for background and has a two-branch weight-sharing architecture with an asymmetrical training strategy. This enables BaS-Net to suppress activations from background frames to improve localization performance. Extensive experiments demonstrate the effectiveness of BaS-Net and its superiority over the state-of-the-art methods on the most popular benchmarks – THUMOS'14 and ActivityNet. Our code and the trained model are available at https://github.com/Pilhyeon/BaSNet-pytorch.

Download Full-text

PFWNet: Pretraining Neural Network via Feature Jigsaw Puzzle for Weakly-supervised Temporal Action Localization

Neurocomputing ◽

10.1016/j.neucom.2021.02.086 ◽

2021 ◽

Author(s):

Binglu Wang ◽

Yongqiang Zhao ◽

Yani Zhang

Keyword(s):

Neural Network ◽

Jigsaw Puzzle ◽

Action Localization ◽

Weakly Supervised ◽

Temporal Action

Download Full-text

Weakly-Supervised Temporal Action Localization via Cross-Stream Collaborative Learning

10.1145/3474085.3475261 ◽

2021 ◽

Author(s):

Yuan Ji ◽

Xu Jia ◽

Huchuan Lu ◽

Xiang Ruan

Keyword(s):

Collaborative Learning ◽

Action Localization ◽

Weakly Supervised ◽

Temporal Action

Download Full-text

Smoothing vs. sharpening of colour images: Together or separated

Applied Mathematics and Nonlinear Sciences ◽

10.21042/amns.2017.1.00025 ◽

2017 ◽

Vol 2 (1) ◽

pp. 299-316 ◽

Cited By ~ 8

Author(s):

Cristina Pérez-Benito ◽

Samuel Morillas ◽

Cristina Jordán ◽

J. Alberto Conejero

Keyword(s):

Image Denoising ◽

Gaussian Distribution ◽

State Of The Art ◽

Visual Quality ◽

The State ◽

Art Methods ◽

Efficiency And Effectiveness

AbstractIt is still a challenge to improve the efficiency and effectiveness of image denoising and enhancement methods. There exists denoising and enhancement methods that are able to improve visual quality of images. This is usually obtained by removing noise while sharpening details and improving edges contrast. Smoothing refers to the case of denoising when noise follows a Gaussian distribution.Both operations, smoothing noise and sharpening, have an opposite nature. Therefore, there are few approaches that simultaneously respond to both goals. We will review these methods and we will also provide a detailed study of the state-of-the-art methods that attack both problems in colour images, separately.

Download Full-text

Comparative Quality Estimation for Machine Translation Observations on Machine Learning and Features

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0029 ◽

2017 ◽

Vol 108 (1) ◽

pp. 307-318 ◽

Cited By ~ 1

Author(s):

Eleftherios Avramidis

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Machine Translation ◽

State Of The Art ◽

Linear Method ◽

The State ◽

Quality Estimation ◽

Art Methods ◽

Improved Performance

AbstractA deeper analysis on Comparative Quality Estimation is presented by extending the state-of-the-art methods with adequacy and grammatical features from other Quality Estimation tasks. The previously used linear method, unable to cope with the augmented features, is replaced with a boosting classifier assisted by feature selection. The methods indicated show improved performance for 6 language pairs, when applied on the output from MT systems developed over 7 years. The improved models compete better with reference-aware metrics.Notable conclusions are reached through the examination of the contribution of the features in the models, whereas it is possible to identify common MT errors that are captured by the features. Many grammatical/fluency features have a good contribution, few adequacy features have some contribution, whereas source complexity features are of no use. The importance of many fluency and adequacy features is language-specific.

Download Full-text

Multi-Hierarchical Category Supervision for Weakly-Supervised Temporal Action Localization

IEEE Transactions on Image Processing ◽

10.1109/tip.2021.3124671 ◽

2021 ◽

pp. 1-1

Author(s):

Guozhang Li ◽

Jie Li ◽

Nannan Wang ◽

Xinpeng Ding ◽

Zhifeng Li ◽

...

Keyword(s):

Action Localization ◽

Weakly Supervised ◽

Temporal Action

Download Full-text

Towards Zero Defect Manufacturing paradigm: A review of the state-of-the-art methods and open challenges

Computers in Industry ◽

10.1016/j.compind.2021.103548 ◽

2022 ◽

Vol 134 ◽

pp. 103548

Author(s):

Bianca Caiazzo ◽

Mario Di Nardo ◽

Teresa Murino ◽

Alberto Petrillo ◽

Gianluca Piccirillo ◽

...

Keyword(s):

State Of The Art ◽

The State ◽

Art Methods

Download Full-text

Uniformity Attentive Learning-Based Siamese Network for Person Re-Identification

Sensors ◽

10.3390/s20123603 ◽

2020 ◽

Vol 20 (12) ◽

pp. 3603

Author(s):

Dasol Jeong ◽

Hasil Park ◽

Joongchol Shin ◽

Donggoo Kang ◽

Joonki Paik

Keyword(s):

Large Scale ◽

Body Shape ◽

State Of The Art ◽

The State ◽

Whole Body ◽

Distinctive Features ◽

Common Features ◽

Siamese Network ◽

Art Methods ◽

Triplet Loss

Person re-identification (Re-ID) has a problem that makes learning difficult such as misalignment and occlusion. To solve these problems, it is important to focus on robust features in intra-class variation. Existing attention-based Re-ID methods focus only on common features without considering distinctive features. In this paper, we present a novel attentive learning-based Siamese network for person Re-ID. Unlike existing methods, we designed an attention module and attention loss using the properties of the Siamese network to concentrate attention on common and distinctive features. The attention module consists of channel attention to select important channels and encoder-decoder attention to observe the whole body shape. We modified the triplet loss into an attention loss, called uniformity loss. The uniformity loss generates a unique attention map, which focuses on both common and discriminative features. Extensive experiments show that the proposed network compares favorably to the state-of-the-art methods on three large-scale benchmarks including Market-1501, CUHK03 and DukeMTMC-ReID datasets.

Download Full-text

Weakly Supervised Temporal Action Localization Using Deep Metric Learning

2020 IEEE Winter Conference on Applications of Computer Vision (WACV) ◽

10.1109/wacv45572.2020.9093620 ◽

2020 ◽

Author(s):

Ashraful Islam ◽

Richard J. Radke

Keyword(s):

Metric Learning ◽

Action Localization ◽

Deep Metric Learning ◽

Weakly Supervised ◽

Temporal Action

Download Full-text