Deep feature enhancing and selecting network for weakly supervised temporal action localization

Author(s): Jiaruo Yu, Yongxin Ge, Xiaolei Qin, Ziqiang Li, Sheng Huang, ...

Author(s): Guozhang Li, Jie Li, Nannan Wang, Xinpeng Ding, Zhifeng Li, ...

2020, Vol 34 (07), pp. 11053-11060
Author(s): Linjiang Huang, Yan Huang, Wanli Ouyang, Liang Wang

In this paper, we propose a weakly supervised temporal action localization method for untrimmed videos based on prototypical networks. We observe two challenges posed by weak supervision, namely action-background separation and action relation construction. Unlike previous methods, we propose to achieve action-background separation using only the original videos. To this end, a clustering loss is adopted to separate actions from backgrounds and to learn intra-compact features, which helps in detecting complete action instances. In addition, a similarity weighting module is devised to further separate actions from backgrounds. To identify actions effectively, we propose to construct relations among actions for prototype learning, and a GCN-based prototype embedding module is introduced to generate relational prototypes. Experiments on the THUMOS14 and ActivityNet1.2 datasets show that our method outperforms state-of-the-art methods.
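
The abstract names a clustering loss, a similarity weighting module, and GCN-based relational prototypes without giving their form. Below is a minimal PyTorch-style sketch, under assumed tensor shapes and a cosine-distance formulation, of how a prototype clustering loss and similarity weighting could be realized; the function names are hypothetical and this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clustering_loss(features, prototypes, assignments):
    """Pull each snippet feature toward its assigned prototype (action or
    background) so clusters become intra-compact and well separated.

    features:    (T, D) snippet features of one untrimmed video
    prototypes:  (K, D) learnable prototypes (action classes + background)
    assignments: (T,)   index of the prototype assigned to each snippet
    """
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    # cosine distance between each snippet and its own prototype
    dist = 1.0 - (feats * protos[assignments]).sum(dim=-1)
    return dist.mean()

def similarity_weights(features, prototypes):
    """Soft snippet-to-prototype similarities, usable as foreground weights
    to further suppress background snippets."""
    sim = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return torch.softmax(sim, dim=-1)  # (T, K) soft assignment
```

In the paper itself the prototypes are additionally refined by a GCN over action relations to yield relational prototypes, which is omitted from this sketch.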


2021, pp. 42-54
Author(s): Xinpeng Ding, Nannan Wang, Jie Li, Xinbo Gao

Author(s): Guoqiang Gong, Liangfeng Zheng, Wenhao Jiang, Yadong Mu

Weakly-supervised temporal action localization aims to locate the intervals of action instances using only video-level action labels for training. However, the localization results generated by video classification networks are often inaccurate due to the lack of temporal boundary annotations. Our motivating insight is that the temporal boundaries of an action should be predicted stably under various temporal transforms, which inspires a self-supervised equivariant transform consistency constraint. We design a set of temporal transform operations, ranging from naive temporal down-sampling to learnable attention-piloted time warping. In our model, a localization network aims to perform well under all transforms, while a policy network is trained to choose, at each iteration, the temporal transform that adversarially makes the localization results most inconsistent with those of the localization network. Additionally, we devise a self-refine module that enhances the completeness of action intervals by harnessing temporal and semantic contexts. Experimental results on THUMOS14 and ActivityNet demonstrate that our model consistently outperforms state-of-the-art weakly-supervised temporal action localization methods.
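
The consistency constraint described above can be illustrated with a short sketch: localizing a temporally transformed video should agree with applying the same transform to the localization of the original video. The sketch below assumes snippet-level score maps of shape (B, T, C), uses plain down-sampling as the transform and an MSE consistency term, and omits the adversarial policy network and the self-refine module; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def temporal_downsample(x, stride=2):
    """A simple temporal transform: keep every `stride`-th snippet.
    Works on features (B, T, D) and on score maps (B, T, C) alike."""
    return x[:, ::stride]

def transform_consistency_loss(localizer, snippets, transform=temporal_downsample):
    """Equivariance constraint: localize(transform(video)) should match
    transform(localize(video))."""
    scores_orig = localizer(snippets)              # (B, T, C) class activations
    scores_trans = localizer(transform(snippets))  # (B, T', C)
    target = transform(scores_orig).detach()       # transform the original scores
    return F.mse_loss(scores_trans, target)
```

In the full model, the policy network would pick, among several such transforms, the one that currently maximizes this inconsistency, and the localization network is then trained to minimize it.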


2020, Vol 34 (07), pp. 11320-11327
Author(s): Pilhyeon Lee, Youngjung Uh, Hyeran Byun

Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given during training; the only hint is the video-level label indicating whether each video contains action frames of interest. Previous methods aggregate frame-level class scores to produce video-level predictions and learn from video-level action labels. This formulation does not fully model the problem, in that background frames are forced to be misclassified as action classes in order to predict video-level labels accurately. In this paper, we design the Background Suppression Network (BaS-Net), which introduces an auxiliary class for background and has a two-branch weight-sharing architecture with an asymmetrical training strategy. This enables BaS-Net to suppress activations from background frames and thereby improve localization performance. Extensive experiments demonstrate the effectiveness of BaS-Net and its superiority over state-of-the-art methods on the most popular benchmarks, THUMOS'14 and ActivityNet. Our code and the trained model are available at https://github.com/Pilhyeon/BaSNet-pytorch.
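
As a rough illustration of the two-branch weight-sharing design with an auxiliary background class, the sketch below feeds the same snippet features to a shared classifier twice, once as-is and once multiplied by a learned foreground attention. The module name, the 1-D convolutional attention, and the tensor layout are assumptions made here for illustration; the authors' actual implementation is available at the repository linked above.

```python
import torch
import torch.nn as nn

class BackgroundSuppressionSketch(nn.Module):
    """Two branches share one classifier over C action classes plus an
    auxiliary background class; the suppression branch attenuates
    background snippets with a foreground attention before classifying."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.attention = nn.Sequential(nn.Conv1d(feat_dim, 1, 1), nn.Sigmoid())
        self.classifier = nn.Conv1d(feat_dim, num_classes + 1, 1)  # +1: background

    def forward(self, x):                          # x: (B, D, T) snippet features
        cas_base = self.classifier(x)              # base branch sees every snippet
        fg_weight = self.attention(x)              # (B, 1, T) foreground attention
        cas_supp = self.classifier(x * fg_weight)  # suppression branch
        return cas_base, cas_supp, fg_weight
```

The asymmetrical training strategy mentioned in the abstract can be read as supervising the two branches with opposite targets for the background class, so that the attention learns to suppress background activations.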

