Graph Convolutional Network for Visual Question Answering Based on Fine-grained Question Representation

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable to achieve general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noises for reasoning the final answer. How to capture the question-oriented and information-complementary evidence remains a key challenge to solve the problem. In this paper, we depict an image by a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to the visual, semantic and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question. Specifically, the intra-modal graph convolution selects evidence from each modality and cross-modal graph convolution aggregates relevant information across different graph layers. By stacking this process multiple times, our model performs iterative reasoning across three modalities and predicts the optimal answer by analyzing all question-oriented evidence. We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments.

Download Full-text

Plenty Is Plague: Fine-Grained Learning for Visual Question Answering

IEEE Transactions on Pattern Analysis and Machine Intelligence ◽

10.1109/tpami.2019.2956699 ◽

2019 ◽

pp. 1-1

Author(s):

Yiyi Zhou ◽

Rongrong Ji ◽

Xiaoshuai Sun ◽

Jinsong Su ◽

Deyu Meng ◽

...

Keyword(s):

Question Answering ◽

Fine Grained ◽

Visual Question Answering

Download Full-text

Double-layer affective visual question answering network

Computer Science and Information Systems ◽

10.2298/csis200515038g ◽

2020 ◽

pp. 38-38

Author(s):

Zihan Guo ◽

Dezhi Han ◽

Kuan-Ching Li

Keyword(s):

Double Layer ◽

Language Processing ◽

Affective Computing ◽

Question Answering ◽

Performance Comparison ◽

Image Feature ◽

Sources Of Information ◽

Emotional Information ◽

Fine Grained ◽

Visual Question Answering

Visual Question Answering (VQA) has attracted much attention recently in both natural language processing and computer vision communities, as it offers insight into the relationships between two relevant sources of information. Tremendous advances are seen in the field of VQA due to the success of deep learning. Based upon advances and improvements, the Affective Visual Question Answering Network (AVQAN) enriches the understanding and analysis of VQA models by making use of the emotional information contained in the images to produce sensitive answers, while maintaining the same level of accuracy as ordinary VQA baseline models. It is a reasonably new task to integrate the emotional information contained in the images into VQA. However, it is challenging to separate question guided-attention from mood-guided-attention due to the concatenation of the question words and the mood labels in AVQAN. Also, it is believed that this type of concatenation is harmful to the performance of the model. To mitigate such an effect, we propose the Double-Layer Affective Visual Question Answering Network (DAVQAN) that divides the task of generating emotional answers in VQA into two simpler subtasks: the generation of non-emotional responses and the production of mood labels, and two independent layers are utilized to tackle these subtasks. Comparative experimentation conducted on a preprocessed dataset to performance comparison shows that the overall performance of DAVQAN is 7.6% higher than AVQAN, demonstrating the effectiveness of the proposed model. We also introduce more advanced word embedding method and more fine-grained image feature extractor into AVQAN and DAVQAN to further improve their performance and obtain better results than their original models, which proves that VQA integrated with affective computing can improve the performance of the whole model by improving these two modules just like the general VQA.

Download Full-text

Aligned Dual Channel Graph Convolutional Network for Visual Question Answering

10.18653/v1/2020.acl-main.642 ◽

2020 ◽

Author(s):

Qingbao Huang ◽

Jielong Wei ◽

Yi Cai ◽

Changmeng Zheng ◽

Junying Chen ◽

...

Keyword(s):

Question Answering ◽

Convolutional Network ◽

Visual Question Answering ◽

Dual Channel

Download Full-text

Fine-Grained Unbalanced Interaction Network for Visual Question Answering

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-030-82153-1_8 ◽

2021 ◽

pp. 85-97

Author(s):

Xinxin Liao ◽

Mingyan Wu ◽

Heyan Chai ◽

Shuhan Qi ◽

Xuan Wang ◽

...

Keyword(s):

Question Answering ◽

Interaction Network ◽

Fine Grained ◽

Visual Question Answering

Download Full-text

An Effective Dense Co-Attention Networks for Visual Question Answering

Sensors ◽

10.3390/s20174897 ◽

2020 ◽

Vol 20 (17) ◽

pp. 4897 ◽

Cited By ~ 2

Author(s):

Shirong He ◽

Dezhi Han

Keyword(s):

Question Answering ◽

Short Term Memory ◽

Visual Object ◽

Attention Networks ◽

Attention Model ◽

Fine Grained ◽

Visual Question Answering ◽

Questions And Answers ◽

Long Short Term Memory ◽

The Relationship

At present, the state-of-the-art approaches of Visual Question Answering (VQA) mainly use the co-attention model to relate each visual object with text objects, which can achieve the coarse interactions between multimodalities. However, they ignore the dense self-attention within question modality. In order to solve this problem and improve the accuracy of VQA tasks, in the present paper, an effective Dense Co-Attention Networks (DCAN) is proposed. First, to better capture the relationship between words that are relatively far apart and make the extracted semantics more robust, the Bidirectional Long Short-Term Memory (Bi-LSTM) neural network is introduced to encode questions and answers; second, to realize the fine-grained interactions between the question words and image regions, a dense multimodal co-attention model is proposed. The model’s basic components include the self-attention unit and the guided-attention unit, which are cascaded in depth to form a hierarchical structure. The experimental results on the VQA-v2 dataset show that DCAN has obvious performance advantages, which makes VQA applicable to a wider range of AI scenarios.

Download Full-text

Densely Connected Attention Flow for Visual Question Answering

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/122 ◽

2019 ◽

Cited By ~ 1

Author(s):

Fei Liu ◽

Jing Liu ◽

Zhiwei Fang ◽

Richang Hong ◽

Hanqing Lu

Keyword(s):

Question Answering ◽

State Of The Art ◽

Experimental Results ◽

Effective Interactions ◽

Fine Grained ◽

Visual Question Answering ◽

Complex Image ◽

Hierarchical Levels ◽

Answering Questions ◽

Common Defect

Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common defect of the existing VQA approaches is that they only consider a very limited amount of interactions, which may be not enough to model latent complex image-question relations that are necessary for accurately answering questions. Therefore, in this paper, we propose a novel DCAF (Densely Connected Attention Flow) framework for modeling dense interactions. It densely connects all pairwise layers of the network via Attention Connectors, capturing fine-grained interplay between image and question across all hierarchical levels. The proposed Attention Connector efficiently connects the multi-modal features at any two layers with symmetric co-attention, and produces interaction-aware attention features. Experimental results on three publicly available datasets show that the proposed method achieves state-of-the-art performance.

Download Full-text