Multimodal Video-text Matching using a Deep Bifurcation Network and Joint Embedding of Visual and Textual Features

2021
pp. 115541
Author(s):  
Masoomeh Nabati ◽  
Alireza Behrad
2021
Vol 11 (7)
pp. 3214
Author(s):  
Huy Manh Nguyen ◽  
Tomo Miyazaki ◽  
Yoshihiro Sugaya ◽  
Shinichiro Omachi

Visual-semantic embedding aims to learn a joint embedding space in which related video and sentence instances lie close to each other. Most existing methods place instances in a single embedding space. However, a single space struggles to accommodate the variety of videos and sentences because of the difficulty of matching the visual dynamics in videos to the textual features of sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces, allowing multiple relationships between instances to be captured and leading to compelling video retrieval. The final similarity between instances is produced by fusing the similarities measured in each embedding space with a weighted sum, where the weights are determined from the sentence; this lets the model flexibly emphasize particular embedding spaces. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, with results competitive with state-of-the-art methods. These experimental results demonstrate the effectiveness of the proposed multiple-embedding approach compared to existing methods.
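A minimal sketch of the multiple-embedding idea described above, assuming a PyTorch setup; the module names, feature dimensions, and number of spaces are illustrative assumptions rather than the authors' implementation:

```python
# Sketch (assumption, not the authors' code): video and sentence features are
# projected into K separate embedding spaces, a cosine similarity is computed in
# each space, and the similarities are fused with sentence-conditioned weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceMatcher(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512, num_spaces=3):
        super().__init__()
        # One projection pair per embedding space.
        self.video_proj = nn.ModuleList(
            [nn.Linear(video_dim, embed_dim) for _ in range(num_spaces)])
        self.text_proj = nn.ModuleList(
            [nn.Linear(text_dim, embed_dim) for _ in range(num_spaces)])
        # Sentence-conditioned weights over the spaces (softmax-normalized).
        self.weight_net = nn.Linear(text_dim, num_spaces)

    def forward(self, video_feat, text_feat):
        # video_feat: (B, video_dim), text_feat: (B, text_dim)
        sims = []
        for v_proj, t_proj in zip(self.video_proj, self.text_proj):
            v = F.normalize(v_proj(video_feat), dim=-1)   # (B, D)
            t = F.normalize(t_proj(text_feat), dim=-1)    # (B, D)
            sims.append(t @ v.t())                        # (B, B) cosine similarities
        sims = torch.stack(sims, dim=-1)                  # (B, B, K)
        weights = F.softmax(self.weight_net(text_feat), dim=-1)  # (B, K), per sentence
        # Weighted sum over spaces; weights are indexed by the query sentence (rows).
        return (sims * weights.unsqueeze(1)).sum(dim=-1)  # (B, B) fused similarity

# Usage: scores[i, j] is the fused similarity between sentence i and video j.
matcher = MultiSpaceMatcher()
scores = matcher(torch.randn(4, 2048), torch.randn(4, 768))
```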


Author(s):  
Yaxiong Wang ◽  
Hao Yang ◽  
Xueming Qian ◽  
Lin Ma ◽  
Jing Lu ◽  
...  

Image-text matching tasks have recently attracted a lot of attention in the computer vision field. The key point of this cross-domain problem is how to accurately measure the similarity between the visual and the textual contents, which demands a fine understanding of both modalities. In this paper, we propose a novel position focused attention network (PFAN) to investigate the relation between the visual and the textual views. In this work, we integrate the object position cue to enhance visual-text joint-embedding learning. We first split the image into blocks, from which we infer the relative position of each region in the image. Then, an attention mechanism is proposed to model the relations between image regions and blocks and to generate a valuable position feature, which is further utilized to enhance the region expression and to model a more reliable relationship between the visual image and the textual sentence. Experiments on the popular datasets Flickr30K and MS-COCO show the effectiveness of the proposed method. Besides the public datasets, we also conduct experiments on our collected practical news dataset (Tencent-News) to validate the practical application value of the proposed method. As far as we know, this is the first attempt to test the performance in a practical application. Our method achieves state-of-the-art performance on all three datasets.
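A minimal sketch of the position-focused attention idea, assuming a PyTorch setup; the grid size, dimensions, and module names are illustrative assumptions rather than the released PFAN code:

```python
# Sketch (assumption): each detected region attends over a grid of image blocks,
# the attended block (position) embedding is fused with the region feature, and
# the enhanced regions are pooled into an image embedding for matching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionFocusedAttention(nn.Module):
    def __init__(self, region_dim=2048, block_dim=512, num_blocks=16, out_dim=1024):
        super().__init__()
        # Learnable embedding for each spatial block of the grid.
        self.block_embed = nn.Embedding(num_blocks, block_dim)
        self.query = nn.Linear(region_dim, block_dim)
        self.fuse = nn.Linear(region_dim + block_dim, out_dim)

    def forward(self, region_feat, block_ids):
        # region_feat: (B, R, region_dim) -- features of R detected regions
        # block_ids:   (B, R, M)          -- indices of the M blocks each region overlaps
        blocks = self.block_embed(block_ids)                      # (B, R, M, D)
        q = self.query(region_feat).unsqueeze(2)                  # (B, R, 1, D)
        attn = F.softmax((q * blocks).sum(-1) / blocks.size(-1) ** 0.5, dim=-1)  # (B, R, M)
        pos_feat = (attn.unsqueeze(-1) * blocks).sum(2)           # (B, R, D) position feature
        enhanced = self.fuse(torch.cat([region_feat, pos_feat], dim=-1))  # (B, R, out_dim)
        return F.normalize(enhanced.mean(dim=1), dim=-1)          # pooled image embedding

# Usage with a 4 x 4 block grid: each of 36 regions lists the 4 blocks it overlaps.
pfa = PositionFocusedAttention()
img_emb = pfa(torch.randn(2, 36, 2048), torch.randint(0, 16, (2, 36, 4)))
```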


2021
Author(s):  
Yang Liu ◽  
Huaqiu Wang ◽  
Fanyang Meng ◽  
Mengyuan Liu ◽  
Hong Liu

2021
Author(s):  
Depeng Wang ◽  
Liejun Wang ◽  
Shiji Song ◽  
Gao Huang ◽  
Yuchen Guo ◽  
...  

Author(s):  
Muhammad Umer ◽  
Saima Sadiq ◽  
Malik Muhammad Saad Missen ◽  
Zahid Hameed ◽  
Zahid Aslam ◽  
...  

Author(s):  
Konstantinos Korovesis ◽  
Georgios Alexandridis ◽  
George Caridakis ◽  
Pavlos Polydoras ◽  
Panagiotis Tsantilas
