Hierarchical Attention-Based Video Captioning Using Key Frames

Recently many methods use encoder-decoder framework for video captioning, aiming to translate short videos into natural language. These methods usually use equal interval frame sampling. However, lacking a good efficiency in sampling, it has a high temporal and spatial redundancy, resulting in unnecessary computation cost. In addition, the existing approaches simply splice different visual features on the fully connection layer. Therefore, features cannot be effectively utilized. In order to solve the defects, we proposed filtration network (FN) to select key frames, which is trained by deep reinforcement learning algorithm actor-double-critic. According to behavior psychology, the core idea of actor-double-critic is that the behavior of agent is determined by both the external environment and the internal personality. It avoids the phenomenon of unclear reward and sparse feedback in training because it gives steady feedback after each action. The key frames are sent to combine codec network (CCN) to generate sentences. The operation of feature combination in CCN make fusion of visual features by complex number representation to make good semantic modeling. Experiments and comparisons with other methods on two datasets (MSVD/MSR-VTT) show that our approach achieves better performance in terms of four metrics, BLEU-4, METEOR, ROUGE-L and CIDEr.

Download Full-text

Understanding temporal structure for video captioning

Pattern Analysis and Applications ◽

10.1007/s10044-018-00770-3 ◽

2019 ◽

Vol 23 (1) ◽

pp. 147-159

Author(s):

Shagan Sah ◽

Thang Nguyen ◽

Ray Ptucha

Keyword(s):

Temporal Structure ◽

Video Captioning

Download Full-text

Correction to: Attention based video captioning framework for Hindi

Multimedia Systems ◽

10.1007/s00530-021-00834-1 ◽

2021 ◽

Author(s):

Alok Singh ◽

Thoudam Doren Singh ◽

Sivaji Bandyopadhyay

Keyword(s):

Video Captioning

Download Full-text

Syntax-guided Hierarchical Attention Network for Video Captioning

IEEE Transactions on Circuits and Systems for Video Technology ◽

10.1109/tcsvt.2021.3063423 ◽

2021 ◽

pp. 1-1

Author(s):

Jincan Deng ◽

Liang Li ◽

Beichen Zhang ◽

Shuhui Wang ◽

Zhengjun Zha ◽

...

Keyword(s):

Attention Network ◽

Video Captioning

Download Full-text

VideoTRM: Pre-training for Video Captioning Challenge 2020

Proceedings of the 28th ACM International Conference on Multimedia ◽

10.1145/3394171.3416291 ◽

2020 ◽

Author(s):

Jingwen Chen ◽

Hongyang Chao

Keyword(s):

Video Captioning

Download Full-text

Olympic Games Event Recognition via Transfer Learning with Photobombing Guided Data Augmentation

Journal of Imaging ◽

10.3390/jimaging7020012 ◽

2021 ◽

Vol 7 (2) ◽

pp. 12

Author(s):

Yousef I. Mohamad ◽

Samah S. Baraheem ◽

Tam V. Nguyen

Keyword(s):

Deep Learning ◽

Transfer Learning ◽

Data Augmentation ◽

Olympic Games ◽

Event Recognition ◽

Surveillance Systems ◽

Video Captioning ◽

Practical Applications ◽

Sport Events ◽

The Olympic Games

Automatic event recognition in sports photos is both an interesting and valuable research topic in the field of computer vision and deep learning. With the rapid increase and the explosive spread of data, which is being captured momentarily, the need for fast and precise access to the right information has become a challenging task with considerable importance for multiple practical applications, i.e., sports image and video search, sport data analysis, healthcare monitoring applications, monitoring and surveillance systems for indoor and outdoor activities, and video captioning. In this paper, we evaluate different deep learning models in recognizing and interpreting the sport events in the Olympic Games. To this end, we collect a dataset dubbed Olympic Games Event Image Dataset (OGED) including 10 different sport events scheduled for the Olympic Games Tokyo 2020. Then, the transfer learning is applied on three popular deep convolutional neural network architectures, namely, AlexNet, VGG-16 and ResNet-50 along with various data augmentation methods. Extensive experiments show that ResNet-50 with the proposed photobombing guided data augmentation achieves 90% in terms of accuracy.

Download Full-text