Visual question answering: a state-of-the-art review

Sruthy Manmadhan; Binsu C. Kovoor

doi:10.1007/s10462-020-09832-7

Unified Vision-Language Pre-Training for Image Captioning and VQA

Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol 34 (07), pp. 13041-13049

doi:10.1609/aaai.v34i07.7005

Cited By ~ 11

Author(s): Luowei Zhou; Hamid Palangi; Lei Zhang; Houdong Hu; Jason Corso; ...

Keyword(s): Unsupervised Learning; Question Answering; State Of The Art; Learning Objectives; Image Captioning; Language Generation; Visual Question Answering; Benchmark Datasets

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
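The role of the self-attention masks can be illustrated with a minimal sketch. It assumes a common layout in which image regions come first (the "source") and caption or question tokens follow (the "target"). The function names and the exact masking rule are assumptions for illustration, not code taken from the VLP repository.

```python
def bidirectional_mask(n_src, n_tgt):
    """Every position may attend to every other position."""
    n = n_src + n_tgt
    return [[1] * n for _ in range(n)]


def seq2seq_mask(n_src, n_tgt):
    """Source positions see only the source; target positions see the
    whole source plus target positions up to and including themselves."""
    n = n_src + n_tgt
    mask = [[0] * n for _ in range(n)]
    for i in range(n):          # i: the position doing the attending
        for j in range(n):      # j: the position being attended to
            if j < n_src:
                mask[i][j] = 1  # the source is visible to everyone
            elif i >= n_src and j <= i:
                mask[i][j] = 1  # a target token sees itself and its left context
    return mask


# Two image regions followed by two text tokens.
for row in seq2seq_mask(2, 2):
    print(row)
```

Under these assumptions the same transformer weights serve both objectives, and only the mask passed to the attention layers changes.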

Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence, 2019, Vol 33, pp. 8658-8665

doi:10.1609/aaai.v33i01.33018658

Cited By ~ 10

Author(s): Xiangpeng Li; Jingkuan Song; Lianli Gao; Xianglong Liu; Wenbing Huang; ...

Keyword(s): Question Answering; State Of The Art; Computation Time; Comparable Result; Video Encoding; Visual Question Answering; Proposed Model; Ablation Study; The Given; Video Question Answering

Most of the recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-Attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a Positional Self-Attention that calculates the response at each position by attending to all positions within the same sequence, and then adds representations of absolute positions. Therefore, PSAC can exploit the global dependencies of the question and the temporal information in the video, and allows question and video encoding to be executed in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize the co-attention mechanism by simultaneously modeling “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results for four tasks on the benchmark dataset show that our model significantly outperforms the state of the art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
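A minimal sketch of the idea described above: attend from each position to all positions of the same sequence, then add a representation of the absolute position. The sketch assumes scaled dot-product attention without learned projections and a sinusoidal position encoding. Both are illustrative choices, and the abstract does not say which form PSAC uses.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Each position attends to all positions of the same sequence."""
    d = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in seq]
        weights = softmax(scores)
        out.append([sum(w * value[c] for w, value in zip(weights, seq))
                    for c in range(d)])
    return out


def sinusoidal_position(pos, d):
    """Assumed absolute-position encoding (sine on even dims, cosine on odd)."""
    return [math.sin(pos / 10000 ** (c / d)) if c % 2 == 0
            else math.cos(pos / 10000 ** ((c - 1) / d))
            for c in range(d)]


def positional_self_attention(seq):
    """Self-attention response plus the absolute-position representation."""
    d = len(seq[0])
    attended = self_attention(seq)
    return [[a + p for a, p in zip(row, sinusoidal_position(i, d))]
            for i, row in enumerate(attended)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical frame features
print(positional_self_attention(frames))
```

No position depends on the output of the previous one, so the loop over queries could run in parallel. This is the property the abstract contrasts with RNNs.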

KVQA: Knowledge-Aware Visual Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence, 2019, Vol 33, pp. 8876-8884

doi:10.1609/aaai.v33i01.33018876

Cited By ~ 6

Author(s): Sanket Shah; Anand Mishra; Naganand Yadati; Partha Pratim Talukdar

Keyword(s): Language Processing; Question Answering; State Of The Art; World Knowledge; Named Entities; Commonsense Knowledge; White House; Visual Question Answering; Knowledge Graphs; Answering Questions

Visual Question Answering (VQA) has emerged as an important problem spanning Computer Vision, Natural Language Processing and Artificial Intelligence (AI). In conventional VQA, one may ask questions about an image which can be answered purely based on its content. For example, given an image with people in it, a typical VQA question may inquire about the number of people in the image. More recently, there is growing interest in answering questions which require commonsense knowledge involving common nouns (e.g., cats, dogs, microphones) present in the image. In spite of this progress, the important problem of answering questions requiring world knowledge about named entities (e.g., Barack Obama, White House, United Nations) in the image has not been addressed in prior research. We address this gap in this paper, and introduce KVQA – the first dataset for the task of (world) knowledge-aware VQA. KVQA consists of 183K question-answer pairs involving more than 18K named entities and 24K images. Questions in this dataset require multi-entity, multi-relation, and multi-hop reasoning over large Knowledge Graphs (KG) to arrive at an answer. To the best of our knowledge, KVQA is the largest dataset for exploring VQA over KG. Further, we also provide baseline performances using state-of-the-art methods on KVQA.
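Multi-hop reasoning over a knowledge graph can be pictured with a toy sketch. The graph below is a hypothetical two-fact example, not data from KVQA, and the sketch assumes the named entity in the image has already been identified by some other component.

```python
# Hypothetical toy knowledge graph: (head entity, relation) -> tail entity.
TOY_KG = {
    ("Barack Obama", "born_in"): "Honolulu",
    ("Honolulu", "located_in"): "Hawaii",
}


def multi_hop(kg, start_entity, relations):
    """Follow a chain of relations from a start entity.

    Returns the final entity, or None if some hop is missing from the graph.
    """
    entity = start_entity
    for relation in relations:
        entity = kg.get((entity, relation))
        if entity is None:
            return None
    return entity


# "In which state was the person in the image born?" as a two-hop query.
print(multi_hop(TOY_KG, "Barack Obama", ["born_in", "located_in"]))
```

A real system would also have to choose the relation chain from the question text, which this sketch takes as given.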

Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020

doi:10.24963/ijcai.2020/153

Author(s): Zihao Zhu; Jing Yu; Yujing Wang; Yajing Sun; Yue Hu; ...

Keyword(s): Question Answering; State Of The Art; Relevant Information; Convolutional Network; Graph Representations; Fine Grained; Visual Question Answering; Knowledge Reasoning; Final Answer; The Given

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning about the final answer. How to capture question-oriented and information-complementary evidence remains a key challenge. In this paper, we depict an image by a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to the visual, semantic and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question. Specifically, the intra-modal graph convolution selects evidence from each modality and the cross-modal graph convolution aggregates relevant information across different graph layers. By stacking this process multiple times, our model performs iterative reasoning across three modalities and predicts the optimal answer by analyzing all question-oriented evidence. We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments.

Densely Connected Attention Flow for Visual Question Answering

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

doi:10.24963/ijcai.2019/122

Cited By ~ 1

Author(s): Fei Liu; Jing Liu; Zhiwei Fang; Richang Hong; Hanqing Lu

Keyword(s): Question Answering; State Of The Art; Experimental Results; Effective Interactions; Fine Grained; Visual Question Answering; Complex Image; Hierarchical Levels; Answering Questions; Common Defect

Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common defect of existing VQA approaches is that they consider only a very limited number of interactions, which may not be enough to model the latent complex image-question relations that are necessary for accurately answering questions. Therefore, in this paper, we propose a novel DCAF (Densely Connected Attention Flow) framework for modeling dense interactions. It densely connects all pairwise layers of the network via Attention Connectors, capturing fine-grained interplay between image and question across all hierarchical levels. The proposed Attention Connector efficiently connects the multi-modal features at any two layers with symmetric co-attention, and produces interaction-aware attention features. Experimental results on three publicly available datasets show that the proposed method achieves state-of-the-art performance.
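Symmetric co-attention between two feature sets can be sketched as follows. The dot-product affinity and the function name are assumptions for illustration. The abstract does not specify how the Attention Connector computes its scores, and the dense layer-to-layer wiring of DCAF is not shown.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_co_attention(image_feats, question_feats):
    """Attend in both directions using one shared affinity matrix."""
    n_img, n_q = len(image_feats), len(question_feats)
    d = len(image_feats[0])
    # affinity[i][j]: similarity of image region i and question word j
    affinity = [[sum(a * b for a, b in zip(v, q)) for q in question_feats]
                for v in image_feats]
    # Image -> question: each region gets a weighted sum of word features.
    question_for_image = []
    for i in range(n_img):
        w = softmax(affinity[i])
        question_for_image.append(
            [sum(w[j] * question_feats[j][c] for j in range(n_q))
             for c in range(d)])
    # Question -> image: each word gets a weighted sum of region features.
    image_for_question = []
    for j in range(n_q):
        w = softmax([affinity[i][j] for i in range(n_img)])
        image_for_question.append(
            [sum(w[i] * image_feats[i][c] for i in range(n_img))
             for c in range(d)])
    return question_for_image, image_for_question


regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical region features
words = [[1.0, 0.0], [0.0, 1.0]]                # hypothetical word features
q_for_img, img_for_q = symmetric_co_attention(regions, words)
print(q_for_img)
print(img_for_q)
```

Both directions reuse the same affinity matrix, read row-wise in one case and column-wise in the other, which is the sense in which the attention is symmetric here.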

Overcoming Language Priors with Self-supervised Learning for Visual Question Answering

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020

doi:10.24963/ijcai.2020/151

Author(s): Xi Zhu; Zhendong Mao; Chunxiao Liu; Peng Zhang; Bin Wang; ...

Keyword(s): Supervised Learning; High Frequency; Question Answering; State Of The Art; Experimental Results; Learning Framework; Visual Question Answering; Art Performance

Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., what color is the banana?) based on high-frequency answers (e.g., yellow) while ignoring the image content. Existing approaches tackle this problem by creating carefully designed models or introducing additional visual annotations to reduce question dependency and strengthen image dependency. However, they are still subject to the language prior problem since the data biases have not been fundamentally addressed. In this paper, we introduce a self-supervised learning framework to solve this problem. Concretely, we first automatically generate labeled data to balance the biased data, and then propose a self-supervised auxiliary task that uses the balanced data to help the VQA model overcome language priors. Our method can compensate for the data biases by generating balanced data without introducing external annotations. Experimental results show that our method achieves state-of-the-art performance, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark VQA-CP v2. In other words, we can increase the performance of annotation-based methods by 16% without using external annotations. Our code is available on GitHub.
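The 16% figure is consistent with a relative gain computed from the two accuracies quoted above. This reading is an interpretation, since the abstract does not show the calculation. A short check, with a hypothetical helper name:

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


baseline_acc = 49.50  # overall accuracy before, in percent (from the abstract)
improved_acc = 57.59  # overall accuracy after, in percent (from the abstract)

absolute = improved_acc - baseline_acc
relative = relative_gain(baseline_acc, improved_acc)
print(f"absolute gain: {absolute:.2f} points")
print(f"relative gain: {relative * 100:.1f}%")
```

The absolute gain is 8.09 percentage points and the relative gain is about 16.3%, which rounds to the 16% reported.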

Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020

doi:10.24963/ijcai.2020/104

Author(s): Ganchao Tan; Daqing Liu; Meng Wang; Zheng-Jun Zha

Keyword(s): Natural Language; Question Answering; State Of The Art; Temporal Reasoning; Generation Process; Video Captioning; Visual Question Answering; Discrete Module; The Subject; Spatio Temporal

Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence “a man is shooting a basketball”, we need to first locate and describe the subject “man”, next reason out that the man is “shooting”, and then describe the object “basketball” of the shooting. However, existing visual reasoning methods designed for visual question answering are not appropriate for video captioning, because video captioning requires more complex visual reasoning on videos over both space and time, and dynamic module composition along the generation process. In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with the above reasoning capacity. Specifically, our RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate that the proposed RMN outperforms the state-of-the-art methods while providing an explicit and explainable generation process. Our code is available at https://github.com/tgc1997/RMN.
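A discrete choice among modules can be relaxed with the standard Gumbel-softmax trick, sketched below. This is the generic technique rather than the RMN implementation. The function names and the three placeholder logits are assumptions, and the abstract does not say which variant of the Gumbel approximation is used.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Perturb logits with Gumbel(0, 1) noise, then apply a softmax
    with temperature tau. Lower tau gives outputs closer to one-hot."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)     # clamp to avoid log(0)
        g = -math.log(-math.log(u))      # a Gumbel(0, 1) sample
        noisy.append((logit + g) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_module(logits, tau=1.0, rng=random):
    """Return the index of the chosen module and the relaxed probabilities."""
    probs = gumbel_softmax(logits, tau, rng)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs


# Hypothetical selector scores for three reasoning modules at one decoding step.
rng = random.Random(0)
index, probs = select_module([1.2, 0.3, -0.5], tau=0.5, rng=rng)
print(index, probs)
```

In an autograd framework the hard index is typically used in the forward pass while gradients flow through the relaxed probabilities. That part is not shown in this stdlib-only sketch.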

Re-Attention for Visual Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol 34 (01), pp. 91-98

doi:10.1609/aaai.v34i01.5338

Cited By ~ 1

Author(s): Wenya Guo; Ying Zhang; Xiaoping Wu; Jufeng Yang; Xiangrui Cai; ...

Keyword(s): Key Words; Question Answering; State Of The Art; Feature Space; The State; Well Performance; Visual Objects; Visual Question Answering; Rich Information; Satisfactory Answer

Visual Question Answering (VQA) requires a simultaneous understanding of images and questions. Existing methods achieve good performance by focusing on both key objects in images and key words in questions. However, the answer also contains rich information that can help to better describe the image and generate more accurate attention maps. In this paper, to utilize the information in the answer, we propose a re-attention framework for the VQA task. We first associate the image and the question by calculating the similarity of each object-word pair in the feature space. Then, based on the answer, the learned model re-attends to the corresponding visual objects in the image and reconstructs the initial attention map to produce consistent results. Benefiting from the re-attention procedure, the question can be better understood and a satisfactory answer is generated. Extensive experiments on the benchmark dataset demonstrate that the proposed method performs favorably against the state-of-the-art approaches.
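The first step, scoring every object-word pair in a shared feature space, can be sketched with cosine similarity. The similarity measure and the max-over-words pooling into an object attention map are assumptions for illustration, and the answer-driven re-attention step is not shown.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def object_word_similarity(objects, words):
    """sim[i][j]: similarity of visual object i and question word j."""
    return [[cosine(o, w) for w in words] for o in objects]


def object_attention(sim):
    """Assumed pooling: score each object by its best-matching word,
    then normalize the scores with a softmax."""
    scores = [max(row) for row in sim]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


objects = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical object features
words = [[1.0, 0.0]]                # hypothetical word features
sim = object_word_similarity(objects, words)
print(sim)
print(object_attention(sim))
```

In this toy input the first object matches the only word and so receives the larger attention weight.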
