Visual Question Answering Based on Question Attention Model

Visual question answering (VQA) requires a high-level understanding of both questions and images, along with visual reasoning to predict the correct answer. Therefore, it is important to design an effective attention model to associate key regions in an image with key words in a question. Up to now, most attention-based approaches only model the relationships between individual regions in an image and words in a question. It is not enough to predict the correct answer for VQA, as human beings always think in terms of global information, not only local information. In this paper, we propose a novel multi-modality global fusion attention network (MGFAN) consisting of stacked global fusion attention (GFA) blocks, which can capture information from global perspectives. Our proposed method computes co-attention and self-attention at the same time, rather than computing them individually. We validate our proposed method on the two most commonly used benchmarks, the VQA-v2 datasets. Experimental results show that the proposed method outperforms the previous state-of-the-art. Our best single model achieves 70.67% accuracy on the test-dev set of VQA-v2.

Download Full-text

Generative Attention Model with Adversarial Self-learning for Visual Question Answering

Proceedings of the on Thematic Workshops of ACM Multimedia 2017 - Thematic Workshops '17 ◽

10.1145/3126686.3126695 ◽

2017 ◽

Cited By ~ 1

Author(s):

Ilija Ilievski ◽

Jiashi Feng

Keyword(s):

Question Answering ◽

Attention Model ◽

Visual Question Answering ◽

Self Learning

Download Full-text

Dynamic Capsule Attention for Visual Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019324 ◽

2019 ◽

Vol 33 ◽

pp. 9324-9331 ◽

Cited By ~ 4

Author(s):

Yiyi Zhou ◽

Rongrong Ji ◽

Jinsong Su ◽

Xiaoshuai Sun ◽

Weiqiu Chen

Keyword(s):

Question Answering ◽

Visual Features ◽

Coupling Coefficients ◽

Image Captioning ◽

Attention Model ◽

Visual Question Answering ◽

Training Examples ◽

Attention Modeling ◽

Projection Matrices ◽

Simple Network

In visual question answering (VQA), recent advances have well advocated the use of attention mechanism to precisely link the question to the potential answer areas. As the difficulty of the question increases, more VQA models adopt multiple attention layers to capture the deeper visual-linguistic correlation. But a negative consequence is the explosion of parameters, which makes the model vulnerable to over-fitting, especially when limited training examples are given. In this paper, we propose an extremely compact alternative to this static multi-layer architecture towards accurate yet efficient attention modeling, termed as Dynamic Capsule Attention (CapsAtt). Inspired by the recent work of Capsule Network, CapsAtt treats visual features as capsules and obtains the attention output via dynamic routing, which updates the attention weights by calculating coupling coefficients between the underlying and output capsules. Meanwhile, CapsAtt also discards redundant projection matrices to make the model much more compact. We quantify CapsAtt on three benchmark VQA datasets, i.e., COCO-QA, VQA1.0 and VQA2.0. Compared to the traditional multi-layer attention model, CapsAtt achieves significant improvements of up to 4.1%, 5.2% and 2.2% on three datasets, respectively. Moreover, with much fewer parameters, our approach also yields competitive results compared to the latest VQA models. To further verify the generalization ability of CapsAtt, we also deploy it on another challenging multi-modal task of image captioning, where state-of-the-art performance is achieved with a simple network structure.

Download Full-text

An Effective Dense Co-Attention Networks for Visual Question Answering

Sensors ◽

10.3390/s20174897 ◽

2020 ◽

Vol 20 (17) ◽

pp. 4897 ◽

Cited By ~ 2

Author(s):

Shirong He ◽

Dezhi Han

Keyword(s):

Question Answering ◽

Short Term Memory ◽

Visual Object ◽

Attention Networks ◽

Attention Model ◽

Fine Grained ◽

Visual Question Answering ◽

Questions And Answers ◽

Long Short Term Memory ◽

The Relationship

At present, the state-of-the-art approaches of Visual Question Answering (VQA) mainly use the co-attention model to relate each visual object with text objects, which can achieve the coarse interactions between multimodalities. However, they ignore the dense self-attention within question modality. In order to solve this problem and improve the accuracy of VQA tasks, in the present paper, an effective Dense Co-Attention Networks (DCAN) is proposed. First, to better capture the relationship between words that are relatively far apart and make the extracted semantics more robust, the Bidirectional Long Short-Term Memory (Bi-LSTM) neural network is introduced to encode questions and answers; second, to realize the fine-grained interactions between the question words and image regions, a dense multimodal co-attention model is proposed. The model’s basic components include the self-attention unit and the guided-attention unit, which are cascaded in depth to form a hierarchical structure. The experimental results on the VQA-v2 dataset show that DCAN has obvious performance advantages, which makes VQA applicable to a wider range of AI scenarios.

Download Full-text