Deep feature enhancing and selecting network for weakly supervised temporal action localization

Author(s): Jiaruo Yu, Yongxin Ge, Xiaolei Qin, Ziqiang Li, Sheng Huang, ...

Author(s): Guozhang Li, Jie Li, Nannan Wang, Xinpeng Ding, Zhifeng Li, ...

2020, Vol 34 (07), pp. 11053-11060
Author(s): Linjiang Huang, Yan Huang, Wanli Ouyang, Liang Wang

In this paper, we propose a weakly supervised temporal action localization method for untrimmed videos based on prototypical networks. We observe two challenges posed by weak supervision, namely action-background separation and action relation construction. Unlike previous methods, we propose to achieve action-background separation using only the original videos. To this end, a clustering loss is adopted to separate actions from backgrounds and to learn intra-compact features, which helps in detecting complete action instances. In addition, a similarity weighting module is devised to further separate actions from backgrounds. To identify actions effectively, we propose to construct relations among actions for prototype learning, and a GCN-based prototype embedding module is introduced to generate relational prototypes. Experiments on the THUMOS14 and ActivityNet1.2 datasets show that our method outperforms state-of-the-art methods.
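
The abstract names a clustering loss, a similarity weighting module, and GCN-based relational prototypes without giving their form. Below is a minimal PyTorch-style sketch, under assumed tensor shapes and a cosine-distance formulation, of how a prototype clustering loss and similarity weighting could be realized; the function names are hypothetical and this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clustering_loss(features, prototypes, assignments):
    """Pull each snippet feature toward its assigned prototype (action or
    background) so clusters become intra-compact and well separated.

    features:    (T, D) snippet features of one untrimmed video
    prototypes:  (K, D) learnable prototypes (action classes + background)
    assignments: (T,)   index of the prototype assigned to each snippet
    """
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    # cosine distance between each snippet and its own prototype
    dist = 1.0 - (feats * protos[assignments]).sum(dim=-1)
    return dist.mean()

def similarity_weights(features, prototypes):
    """Soft snippet-to-prototype similarities, usable as foreground weights
    to further suppress background snippets."""
    sim = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return torch.softmax(sim, dim=-1)  # (T, K) soft assignment
```

In the paper itself the prototypes are additionally refined by a GCN over action relations to yield relational prototypes, which is omitted from this sketch.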


2021, pp. 42-54
Author(s): Xinpeng Ding, Nannan Wang, Jie Li, Xinbo Gao

Author(s): Guoqiang Gong, Liangfeng Zheng, Wenhao Jiang, Yadong Mu

Weakly-supervised temporal action localization aims to locate the intervals of action instances using only video-level action labels for training. However, the localization results generated by video classification networks are often inaccurate due to the lack of temporal boundary annotations. Our motivating insight is that the temporal boundaries of an action should be predicted stably under various temporal transforms, which inspires a self-supervised equivariant transform consistency constraint. We design a set of temporal transform operations, ranging from naive temporal down-sampling to learnable attention-piloted time warping. In our model, a localization network aims to perform well under all transforms, while a policy network is trained to choose, at each iteration, the temporal transform that adversarially makes the localization results most inconsistent with those of the localization network. Additionally, we devise a self-refine module that enhances the completeness of action intervals by harnessing temporal and semantic contexts. Experimental results on THUMOS14 and ActivityNet demonstrate that our model consistently outperforms state-of-the-art weakly-supervised temporal action localization methods.
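
The consistency constraint described above can be illustrated with a short sketch: localizing a temporally transformed video should agree with applying the same transform to the localization of the original video. The sketch below assumes snippet-level score maps of shape (B, T, C), uses plain down-sampling as the transform and an MSE consistency term, and omits the adversarial policy network and the self-refine module; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def temporal_downsample(x, stride=2):
    """A simple temporal transform: keep every `stride`-th snippet.
    Works on features (B, T, D) and on score maps (B, T, C) alike."""
    return x[:, ::stride]

def transform_consistency_loss(localizer, snippets, transform=temporal_downsample):
    """Equivariance constraint: localize(transform(video)) should match
    transform(localize(video))."""
    scores_orig = localizer(snippets)              # (B, T, C) class activations
    scores_trans = localizer(transform(snippets))  # (B, T', C)
    target = transform(scores_orig).detach()       # transform the original scores
    return F.mse_loss(scores_trans, target)
```

In the full model, the policy network would pick, among several such transforms, the one that currently maximizes this inconsistency, and the localization network is then trained to minimize it.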


2020, Vol 34 (07), pp. 11320-11327
Author(s): Pilhyeon Lee, Youngjung Uh, Hyeran Byun

Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given during training; the only hint is the video-level label indicating whether each video contains action frames of interest. Previous methods aggregate frame-level class scores to produce video-level predictions and learn from video-level action labels. This formulation does not fully model the problem, in that background frames are forced to be misclassified as action classes in order to predict video-level labels accurately. In this paper, we design the Background Suppression Network (BaS-Net), which introduces an auxiliary class for background and has a two-branch weight-sharing architecture with an asymmetrical training strategy. This enables BaS-Net to suppress activations from background frames and thereby improve localization performance. Extensive experiments demonstrate the effectiveness of BaS-Net and its superiority over state-of-the-art methods on the most popular benchmarks, THUMOS'14 and ActivityNet. Our code and the trained model are available at https://github.com/Pilhyeon/BaSNet-pytorch.
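
As a rough illustration of the two-branch weight-sharing design with an auxiliary background class, the sketch below feeds the same snippet features to a shared classifier twice, once as-is and once multiplied by a learned foreground attention. The module name, the 1-D convolutional attention, and the tensor layout are assumptions made here for illustration; the authors' actual implementation is available at the repository linked above.

```python
import torch
import torch.nn as nn

class BackgroundSuppressionSketch(nn.Module):
    """Two branches share one classifier over C action classes plus an
    auxiliary background class; the suppression branch attenuates
    background snippets with a foreground attention before classifying."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.attention = nn.Sequential(nn.Conv1d(feat_dim, 1, 1), nn.Sigmoid())
        self.classifier = nn.Conv1d(feat_dim, num_classes + 1, 1)  # +1: background

    def forward(self, x):                          # x: (B, D, T) snippet features
        cas_base = self.classifier(x)              # base branch sees every snippet
        fg_weight = self.attention(x)              # (B, 1, T) foreground attention
        cas_supp = self.classifier(x * fg_weight)  # suppression branch
        return cas_base, cas_supp, fg_weight
```

The asymmetrical training strategy mentioned in the abstract can be read as supervising the two branches with opposite targets for the background class, so that the attention learns to suppress background activations.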

