Graph-based multi-space semantic correlation propagation for video retrieval

2010 ◽ Vol 27 (1) ◽ pp. 21-34
Author(s): Bailan Feng, Juan Cao, Xiuguo Bao, Lei Bao, Yongdong Zhang, ...

Electronics ◽ 2020 ◽ Vol 9 (12) ◽ pp. 2125
Author(s): Xiaoyu Wu, Tiantian Wang, Shengjin Wang

Text-video retrieval tasks face a great challenge due to the semantic gap between cross-modal information. Some existing methods project the text and video into the same subspace to measure their similarity. However, these methods do not impose a semantic consistency constraint when associating the semantic encodings of the two modalities, so the association is weak. In this paper, we propose a multi-modal retrieval algorithm based on semantic association and multi-task learning. First, multi-level features of the video and text are extracted with multiple deep learning networks, so that the information in both modalities is fully encoded. Then, in the common feature space into which both modalities are mapped, we propose a multi-task learning framework combining semantic similarity measurement with semantic consistency classification over the text-video features. The semantic consistency classification task constrains the learning of the semantic association task, so multi-task learning guides better feature mapping for the two modalities and optimizes the construction of the unified feature subspace. Finally, on the Microsoft Video Description (MSVD) and MSR-Video to Text (MSR-VTT) datasets, our algorithm outperforms existing work, which shows that it improves cross-modal retrieval performance.
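
To make the described multi-task objective concrete, below is a minimal PyTorch sketch: a shared embedding space trained with a bidirectional triplet ranking loss (the semantic similarity task) together with an auxiliary concept-classification head applied to both embeddings (the semantic consistency task). The module names, feature dimensions, concept vocabulary size, and loss weight `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a semantic-association + semantic-consistency multi-task
# setup. All names, dimensions, and weights below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingModel(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512, num_concepts=300):
        super().__init__()
        # Map each modality's multi-level features into a shared subspace.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Auxiliary head: predicts semantic concepts from either embedding,
        # enforcing the semantic-consistency constraint.
        self.concept_head = nn.Linear(embed_dim, num_concepts)

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t

def multi_task_loss(model, v, t, concept_labels, margin=0.2, alpha=0.5):
    # Task 1: semantic similarity, via a bidirectional triplet ranking loss
    # with in-batch hardest negatives over the (B, B) similarity matrix.
    sim = v @ t.t()
    pos = sim.diag()
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_v2t = sim.masked_fill(mask, -1.0).max(dim=1).values
    neg_t2v = sim.masked_fill(mask, -1.0).max(dim=0).values
    rank_loss = (F.relu(margin + neg_v2t - pos) +
                 F.relu(margin + neg_t2v - pos)).mean()
    # Task 2: semantic consistency, where both embeddings of a matched pair
    # should predict the same multi-label concept vector.
    cls_loss = (F.binary_cross_entropy_with_logits(model.concept_head(v), concept_labels) +
                F.binary_cross_entropy_with_logits(model.concept_head(t), concept_labels))
    return rank_loss + alpha * cls_loss
```

Jointly minimizing both terms lets the classification task restrain the ranking task, which is the mechanism the abstract attributes to the multi-task framework.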


Author(s): Zerun Feng, Zhimin Zeng, Caili Guo, Zheng Li

Video retrieval is a challenging research topic bridging the vision and language areas, and it has attracted broad attention in recent years. Previous works have represented videos by directly encoding frame-level features. In fact, videos contain abundant semantic relations to which existing methods pay little attention. To address this issue, we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) that reasons over frame regions. Specifically, we treat frame regions as vertices and construct a fully connected semantic correlation graph. We then perform reasoning with novel random-walk-rule-based graph convolutional networks to generate region features enriched with semantic relations. Through this reasoning, semantic interactions between regions are captured while the impact of redundancy is suppressed. Finally, the region features are aggregated into frame-level features for further encoding to measure video-text similarity. Extensive experiments on two public benchmark datasets validate the effectiveness of our method, which achieves state-of-the-art performance thanks to its powerful semantic reasoning.
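
The reasoning step can be illustrated as one graph-convolution pass over a fully connected region graph with row-wise random-walk (D^{-1}A-style) normalization. The sketch below is a minimal PyTorch interpretation; the affinity function, the softmax normalization, the residual connection, and the mean pooling are common design choices assumed here, not necessarily the exact ViSERN operations.

```python
# Hedged sketch of graph reasoning over frame regions; layer sizes and the
# affinity function are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGraphReasoning(nn.Module):
    def __init__(self, region_dim=2048, hidden_dim=512):
        super().__init__()
        # Project regions before computing pairwise affinities.
        self.phi = nn.Linear(region_dim, hidden_dim)
        self.theta = nn.Linear(region_dim, hidden_dim)
        # GCN weight for propagating region features along the graph.
        self.gcn = nn.Linear(region_dim, region_dim)

    def forward(self, regions):
        # regions: (N, region_dim) features of N detected regions in a frame.
        # Fully connected semantic correlation graph from pairwise affinities.
        affinity = self.phi(regions) @ self.theta(regions).t()   # (N, N)
        # Row-wise softmax as a random-walk-style normalization: each row
        # sums to 1, so one propagation step mixes each region with its
        # semantically related neighbors.
        walk = F.softmax(affinity, dim=1)
        reasoned = F.relu(self.gcn(walk @ regions))
        # Residual connection keeps the original appearance information,
        # then mean-pool regions into a single frame-level feature.
        frame_feat = (regions + reasoned).mean(dim=0)
        return frame_feat
```

The normalized graph suppresses redundant region-to-region edges (they receive low affinity weight), which is how reasoning of this kind can limit the impact of redundancy mentioned in the abstract.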


2019
Author(s): Hongyin Luo, Mitra Mohtarami, James Glass, Karthik Krishnamurthy, Brigitte Richardson

Author(s): Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, ...

Author(s): Feng He, Qi Wang, Zhifan Feng, Wenbin Jiang, Yajuan Lü, ...
