Multimodal feature fusion by relational reasoning and attention for visual question answering

2020 ◽ Vol 55 ◽ pp. 116-126
Author(s): Weifeng Zhang, Jing Yu, Hua Hu, Haiyang Hu, Zengchang Qin

2021 ◽ Vol 7 ◽ pp. e353
Author(s): Zhiyang Ma, Wenfeng Zheng, Xiaobing Chen, Lirong Yin

Existing joint embedding Visual Question Answering models use different combinations of image representation, text representation, and feature fusion methods, but all of them rely on static word vectors for text representation. In real language use, however, the same word can carry different meanings and serve different grammatical roles depending on its context. Static word vectors cannot express these differences, so semantic and grammatical deviations may arise. To address this problem, this article constructs a joint embedding model based on dynamic word vectors, the None KB-Specific Network (N-KBSN) model, which differs from the commonly used Visual Question Answering models based on static word vectors. The N-KBSN model consists of three main parts: a question text and image feature extraction module, a self-attention and guided-attention module, and a feature fusion and classifier module. Its key components are image representation based on Faster R-CNN, text representation based on ELMo, and feature enhancement based on the multi-head attention mechanism. The experimental results show that the N-KBSN model constructed in our experiments outperforms the 2017-winner (GloVe) and 2019-winner (GloVe) models. The introduction of dynamic word vectors improves the accuracy of the overall results.
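
A minimal sketch of an N-KBSN-style joint embedding pipeline is given below, assuming the Faster R-CNN region features and ELMo word embeddings have already been extracted offline. The module names (GuidedAttentionBlock, NKBSNSketch), layer sizes, single-block depth, and the answer vocabulary size are illustrative assumptions, not the authors' exact configuration; the sketch only shows how self-attention on the question, question-guided attention over image regions, and a fusion-plus-classifier head fit together.

```python
# Sketch only: dimensions and module names are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class GuidedAttentionBlock(nn.Module):
    """Self-attention on the question, then question-guided attention over image regions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_feats, v_feats):
        # Question tokens attend to each other (self-attention).
        q, _ = self.self_attn(q_feats, q_feats, q_feats)
        # Image regions attend to the attended question (guided attention).
        v, _ = self.guided_attn(v_feats, q, q)
        return q, v

class NKBSNSketch(nn.Module):
    def __init__(self, elmo_dim=1024, region_dim=2048, dim=512, n_answers=3129):
        super().__init__()
        self.q_proj = nn.Linear(elmo_dim, dim)     # project ELMo word vectors
        self.v_proj = nn.Linear(region_dim, dim)   # project Faster R-CNN region features
        self.block = GuidedAttentionBlock(dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, n_answers))

    def forward(self, elmo_words, region_feats):
        q, v = self.block(self.q_proj(elmo_words), self.v_proj(region_feats))
        # Fuse by concatenating mean-pooled question and image representations.
        fused = torch.cat([q.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Toy usage: batch of 2 questions (14 words) and 36 detected regions per image.
logits = NKBSNSketch()(torch.randn(2, 14, 1024), torch.randn(2, 36, 2048))
print(logits.shape)  # torch.Size([2, 3129])
```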


Author(s): Mingrui Lao, Yanming Guo, Hui Wang, Xin Zhang

Visual question answering (VQA) is receiving increasing attention from researchers in both the computer vision and natural language processing fields. There are two key components in the VQA task: feature extraction and multi-modal fusion. For feature extraction, we introduce a novel co-attention scheme that combines Sentence-guide Word Attention (SWA) and Question-guide Image Attention (QIA) in a unified framework. Specifically, the textual attention SWA relies on the semantics of the whole question sentence to calculate the contribution of each question word to the text representation. For multi-modal fusion, we propose a Cross-modal Multistep Fusion (CMF) network that generates multistep features and achieves multiple interactions between the two modalities, rather than focusing on modeling complex interactions between them as most current feature fusion methods do. To avoid a linear increase in computational cost, the parameters are shared across the steps of the CMF. Extensive experiments demonstrate that the proposed method achieves competitive or better performance than the state of the art.
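
The sketch below illustrates the multistep-fusion idea under stated assumptions: a single fusion cell is applied for several steps with shared parameters, so the computational cost does not grow with the number of steps, and the per-step outputs are concatenated before classification. The gating form, step count, pooled inputs (the SWA/QIA attention is omitted for brevity), and the name CMFSketch are illustrative, not the paper's exact architecture.

```python
# Sketch only: the fusion cell, dimensions, and step count are assumptions.
import torch
import torch.nn as nn

class CMFSketch(nn.Module):
    def __init__(self, q_dim=1024, v_dim=2048, dim=512, steps=4, n_answers=3129):
        super().__init__()
        self.steps = steps
        self.q_proj = nn.Linear(q_dim, dim)
        self.v_proj = nn.Linear(v_dim, dim)
        # One fusion cell whose parameters are reused at every step,
        # so adding steps does not add parameters.
        self.cell = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.classifier = nn.Linear(dim * steps, n_answers)

    def forward(self, q_vec, v_vec):
        q = torch.tanh(self.q_proj(q_vec))
        v = torch.tanh(self.v_proj(v_vec))
        fused, step_feats = torch.zeros_like(q), []
        for _ in range(self.steps):
            # Each step re-interacts the two modalities, conditioned on the
            # previous fused state, using the same shared cell.
            fused = self.cell(torch.cat([q * v, fused], dim=-1))
            step_feats.append(fused)
        # Multistep features are concatenated before classification.
        return self.classifier(torch.cat(step_feats, dim=-1))

# Toy usage with pooled question and image vectors.
logits = CMFSketch()(torch.randn(2, 1024), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 3129])
```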

