MIVCN: Multimodal interaction video captioning network based on semantic association graph

Author(s):  
Ying Wang ◽  
Guoheng Huang ◽  
Lin Yuming ◽  
Haoliang Yuan ◽  
Chi-Man Pun ◽  
...  
2019 ◽  
Vol 13 (4) ◽  
pp. 1-24 ◽  
Author(s):  
Wenmian Yang ◽  
Kun Wang ◽  
Na Ruan ◽  
Wenyuan Gao ◽  
Weijia Jia ◽  
...  

2020 ◽  
Vol 10 (8) ◽  
pp. 2641 ◽  
Author(s):  
Petra Đurović ◽  
Ivan Vidović ◽  
Robert Cupec

Most objects are composed of semantically distinctive parts that are, to a greater or lesser extent, geometrically distinctive as well. The points on an object that are relevant for a given robot operation are usually determined by the object's physical properties, such as its dimensions or weight distribution, and by the purpose of its parts. A robot operation defined for a particular part of a representative object can be transferred and adapted to other instances of the same object class by detecting the corresponding components. In this paper, a method for semantically associating an object's components within its object class is proposed. It is suitable for real-time robotic tasks and requires only a few previously annotated representative models. The proposed approach is based on a component association graph and a novel descriptor that captures the geometrical arrangement of the components. The method is evaluated experimentally on a challenging benchmark dataset.
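The abstract does not include an implementation; the following is a minimal, hypothetical Python sketch of the core idea of associating components by their geometric arrangement. The descriptor used here (sorted inter-component centroid distances) and all function names are illustrative assumptions, not the descriptor actually proposed in the paper.

import numpy as np

# Hypothetical sketch only: the paper's descriptor and graph construction
# are more elaborate. Here each component is reduced to its centroid, and
# the arrangement descriptor is the sorted list of distances from one
# component to all others, which is invariant to component ordering.

def arrangement_descriptor(centroids, index):
    """Sorted distances from component `index` to every other component."""
    diffs = np.delete(centroids, index, axis=0) - centroids[index]
    return np.sort(np.linalg.norm(diffs, axis=1))

def associate_components(model_centroids, scene_centroids):
    """Greedily pair scene components with annotated model components by
    nearest arrangement descriptor (assumes equal component counts)."""
    n = len(model_centroids)
    model_desc = [arrangement_descriptor(model_centroids, i) for i in range(n)]
    scene_desc = [arrangement_descriptor(scene_centroids, j) for j in range(n)]
    pairs, used = [], set()
    for i in range(n):
        j = min((k for k in range(n) if k not in used),
                key=lambda k: np.linalg.norm(model_desc[i] - scene_desc[k]))
        used.add(j)
        pairs.append((i, j))
    return pairs

# Toy usage: an asymmetric representative model and a slightly perturbed
# instance of the same class; the correct pairing is recovered.
model = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
scene = model + np.random.default_rng(0).normal(scale=0.02, size=model.shape)
print(associate_components(model, scene))  # [(0, 0), (1, 1), (2, 2)]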


Author(s):  
Wenmian Yang ◽  
Na Ruan ◽  
Wenyuan Gao ◽  
Kun Wang ◽  
Wensheng Ran ◽  
...  

Author(s):  
Tao Jin ◽  
Siyu Huang ◽  
Ming Chen ◽  
Yingming Li ◽  
Zhongfei Zhang

In this paper, we focus on applying the transformer architecture to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation, whereas video captioning is a multimodal learning problem in which the video features carry substantial redundancy across time steps. Motivated by these concerns, we propose a novel method called the sparse boundary-aware transformer (SBAT) to reduce the redundancy in the video representation. SBAT employs a boundary-aware pooling operation on the multi-head attention scores and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss caused by the sparse operation. On top of SBAT, we further propose an aligned cross-modal encoding scheme to strengthen the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms state-of-the-art methods on most metrics.
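As a rough illustration of the boundary-aware sparsification described above, the following Python sketch detects scene boundaries from adjacent-frame feature differences and mean-pools each resulting segment. This is an assumption-laden simplification: in SBAT the boundary-aware operation acts on multi-head attention scores inside the transformer encoder, not directly on raw frame features as shown here.

import numpy as np

# Hedged sketch, not the authors' code. The idea: treat large
# adjacent-frame feature differences as scene boundaries, then pool each
# segment so near-duplicate frames collapse into one representative
# vector, reducing temporal redundancy before decoding.

def boundary_aware_pool(features, num_segments):
    """features: (T, D) frame features -> (num_segments, D) pooled features."""
    assert num_segments >= 2, "sketch assumes at least two segments"
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)   # (T-1,)
    # A new segment starts after each of the (num_segments - 1) largest jumps.
    cuts = np.sort(np.argsort(diffs)[-(num_segments - 1):]) + 1
    segments = np.split(features, cuts, axis=0)
    return np.stack([segment.mean(axis=0) for segment in segments])

# Toy usage: three "scenes" of ten near-duplicate frames each reduce to
# three diverse feature vectors.
rng = np.random.default_rng(0)
video = np.concatenate([rng.normal(loc=c, scale=0.1, size=(10, 16))
                        for c in (0.0, 3.0, -3.0)], axis=0)     # (30, 16)
print(boundary_aware_pool(video, num_segments=3).shape)         # (3, 16)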


2019 ◽  
Vol 23 (1) ◽  
pp. 147-159
Author(s):  
Shagan Sah ◽  
Thang Nguyen ◽  
Ray Ptucha

Author(s):  
Alok Singh ◽  
Thoudam Doren Singh ◽  
Sivaji Bandyopadhyay

Author(s):  
Jincan Deng ◽  
Liang Li ◽  
Beichen Zhang ◽  
Shuhui Wang ◽  
Zhengjun Zha ◽  
...  
