Visual Question Answering Through Adversarial Learning of Multi-modal Representation

10.36227/techrxiv.12731948.v1 ◽

2020 ◽

Author(s):

Iqbal Chowdhury ◽

Kien Nguyen Thanh ◽

Clinton fookes ◽

Sridha Sridharan

Keyword(s):

Natural Language ◽

Question Answering ◽

Feature Representation ◽

Fusion Method ◽

Adversarial Learning ◽

Visual Question Answering ◽

Proposed Model ◽

Fusion Methods ◽

Adversarial Training ◽

Multimodal Representation

Solving the Visual Question Answering (VQA) task is a step towards achieving human-like reasoning capability of the machines. This paper proposes an approach to learn multimodal feature representation with adversarial training. The purpose of the adversarial training allows the model to learn from standard fusion methods in an unsupervised manner. The discriminator model is equipped with a siamese combinatin of two standard fusion method namely multimodal compact bilinear pooling and multimodal tucker fusion. Output multimodal feature representation from generator is a resultant of graph convolutional operation. The resultant multimodal representation of the adversarial training allows the proposed model to infer the correct answers from open-ended natural language questions from the VQA 2.0 dataset. An overall accuracy of 69.86\% demonstrates the accuracy of the proposed model.

Download Full-text

Computational construction grammar for visual question answering

Linguistics Vanguard ◽

10.1515/lingvan-2018-0070 ◽

2019 ◽

Vol 5 (1) ◽

Author(s):

Jens Nevens ◽

Paul Van Eecke ◽

Katrien Beuls

Keyword(s):

Natural Language ◽

Question Answering ◽

Semantic Representation ◽

Construction Grammar ◽

Training Data ◽

Knowledge Sources ◽

Visual Question Answering ◽

Novel Approach ◽

Natural Language Question ◽

Grammar Model

AbstractIn order to be able to answer a natural language question, a computational system needs three main capabilities. First, the system needs to be able to analyze the question into a structured query, revealing its component parts and how these are combined. Second, it needs to have access to relevant knowledge sources, such as databases, texts or images. Third, it needs to be able to execute the query on these knowledge sources. This paper focuses on the first capability, presenting a novel approach to semantically parsing questions expressed in natural language. The method makes use of a computational construction grammar model for mapping questions onto their executable semantic representations. We demonstrate and evaluate the methodology on the CLEVR visual question answering benchmark task. Our system achieves a 100% accuracy, effectively solving the language understanding part of the benchmark task. Additionally, we demonstrate how this solution can be embedded in a full visual question answering system, in which a question is answered by executing its semantic representation on an image. The main advantages of the approach include (i) its transparent and interpretable properties, (ii) its extensibility, and (iii) the fact that the method does not rely on any annotated training data.

Download Full-text

Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018658 ◽

2019 ◽

Vol 33 ◽

pp. 8658-8665 ◽

Cited By ~ 10

Author(s):

Xiangpeng Li ◽

Jingkuan Song ◽

Lianli Gao ◽

Xianglong Liu ◽

Wenbing Huang ◽

...

Keyword(s):

Question Answering ◽

State Of The Art ◽

Computation Time ◽

Comparable Result ◽

Video Encoding ◽

Visual Question Answering ◽

Proposed Model ◽

Ablation Study ◽

The Given ◽

Video Question Answering

Most of the recent progresses on visual question answering are based on recurrent neural networks (RNNs) with attention. Despite the success, these models are often timeconsuming and having difficulties in modeling long range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Coattention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation task, we propose a Positional Self-Attention to calculate the response at each position by attending to all positions within the same sequence, and then add representations of absolute positions. Therefore, PSAC can exploit the global dependencies of question and temporal information in the video, and make the process of question and video encoding executed in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize the co-attention mechanism by simultaneously modeling “what words to listen to” (question attention). To the best of our knowledge, this is the first work of replacing RNNs with selfattention for the task of visual question answering. Experimental results of four tasks on the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains comparable result on the Count task. Our model requires less computation time and achieves better performance compared with the RNNs-based methods. Additional ablation study demonstrates the effect of each component of our proposed model.

Download Full-text

Graphhopper: Multi-hop Scene Graph Reasoning for Visual Question Answering

10.1007/978-3-030-88361-4_7 ◽

2021 ◽

pp. 111-127

Author(s):

Rajat Koner ◽

Hang Li ◽

Marcel Hildebrandt ◽

Deepan Das ◽

Volker Tresp ◽

...

Keyword(s):

Computer Vision ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Human Performance ◽

Question Answering ◽

Scene Graph ◽

Visual Question Answering ◽

Learning Agent ◽

Modal Reasoning

AbstractVisual Question Answering (VQA) is concerned with answering free-form questions about an image. Since it requires a deep semantic and linguistic understanding of the question and the ability to associate it with various objects that are present in the image, it is an ambitious task and requires multi-modal reasoning from both computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method is based on performing context-driven, sequential reasoning based on the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships. Subsequently, a reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasoning paths, which are the basis for deriving answers. We conduct an experimental study on the challenging dataset GQA, based on both manually curated and automatically generated scene graphs. Our results show that we keep up with human performance on manually curated scene graphs. Moreover, we find that Graphhopper outperforms another state-of-the-art scene graph reasoning model on both manually curated and automatically generated scene graphs by a significant margin.

Download Full-text

Explicit Knowledge-based Reasoning for Visual Question Answering

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/179 ◽

2017 ◽

Cited By ~ 14

Author(s):

Peng Wang ◽

Qi Wu ◽

Chunhua Shen ◽

Anthony Dick ◽

Anton van den Hengel

Keyword(s):

Natural Language ◽

Large Scale ◽

Question Answering ◽

Explicit Knowledge ◽

Short Term Memory ◽

Short Term ◽

Term Memory ◽

Visual Question Answering ◽

Knowledge Based ◽

Long Short Term Memory

We describe a method for visual question answering which is capable of reasoning about an image on the basis of information extracted from a large-scale knowledge base. The method not only answers natural language questions using concepts not contained in the image, but can explain the reasoning by which it developed its answer. It is capable of answering far more complex questions than the predominant long short-term memory-based approach, and outperforms it significantly in testing. We also provide a dataset and a protocol by which to evaluate general visual question answering methods.

Download Full-text

Multiple Context Learning Networks for Visual Question Answering

10.21203/rs.3.rs-955099/v1 ◽

2021 ◽

Author(s):

Pufen Zhang ◽

Hong Lan

Keyword(s):

Question Answering ◽

Learning Strategy ◽

Learning Networks ◽

Learning Modules ◽

Visual Question Answering ◽

Context Learning ◽

Multiple Context ◽

Proposed Model ◽

Previous State ◽

High Level

Abstract In recently years, some visual question answering (VQA) methods that emphasize the simultaneous understanding of both the context of image and question have been proposed. Despite the effectiveness of these methods, they fail to explore a more comprehensive and generalized context learning tactics. To address this issue, we propose a novel Multiple Context Learning Networks (MCLN) to model the multiple contexts for VQA. Three kinds of contexts are investigated, namely visual context, textual context and a special visual-textual context that ignored by previous methods. Moreover, three corresponding context learning modules are proposed. These modules endow image and text representations with context-aware information based on a uniform context learning strategy. And they work together to form a multiple context learning layer (MCL). Such MCL can be stacked in depth and which describe high-level context information by associating intra-modal contexts with inter-modal context. On the VQA v2.0 datasets, the proposed model achieves 71.05% and 71.48% on test-dev set and test-std set respectively, and gains better performance than the previous state-of-the-art methods. In addition, extensive ablation studies have been carried out to examine the effectiveness of the proposed method.

Download Full-text

Natural Language Processing based Visual Question Answering Efficient: an EfficientDet Approach

2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS) ◽

10.1109/iciccs48265.2020.9121068 ◽

2020 ◽

Author(s):

Rahul Gupta ◽

Parikshit Hooda ◽

Sanjeev ◽

Nikhil Kumar Chikkara

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Question Answering ◽

Visual Question Answering

Download Full-text

ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering

IEEE Transactions on Cybernetics ◽

10.1109/tcyb.2020.3029423 ◽

2020 ◽

pp. 1-14

Author(s):

Yun Liu ◽

Xiaoming Zhang ◽

Zhiyun Zhao ◽

Bo Zhang ◽

Lei Cheng ◽

...

Keyword(s):

Question Answering ◽

Adversarial Learning ◽

Visual Question Answering

Download Full-text

Adversarial Learning With Multi-Modal Attention for Visual Question Answering

IEEE Transactions on Neural Networks and Learning Systems ◽

10.1109/tnnls.2020.3016083 ◽

2020 ◽

pp. 1-15

Author(s):

Yun Liu ◽

Xiaoming Zhang ◽

Feiran Huang ◽

Lei Cheng ◽

Zhoujun Li

Keyword(s):

Question Answering ◽

Adversarial Learning ◽

Visual Question Answering

Download Full-text

BETTER GENERIC OBJECTS COUNTING WHEN ASKING QUESTIONS TO IMAGES: A MULTITASK APPROACH FOR REMOTE SENSING VISUAL QUESTION ANSWERING

ISPRS Annals of Photogrammetry Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-annals-v-2-2020-1021-2020 ◽

2020 ◽

Vol V-2-2020 ◽

pp. 1021-1027

Author(s):

S. Lobry ◽

D. Marcos ◽

B. Kellenberger ◽

D. Tuia

Keyword(s):

Remote Sensing ◽

Natural Language ◽

Question Answering ◽

Classification Problem ◽

Remote Sensing Images ◽

Regression Problem ◽

Visual Question Answering ◽

Multiple Orders ◽

Asking Questions ◽

The Cost

Abstract. Visual Question Answering for Remote Sensing (RSVQA) aims at extracting information from remote sensing images through queries formulated in natural language. Since the answer to the query is also provided in natural language, the system is accessible to non-experts, and therefore dramatically increases the value of remote sensing images as a source of information, for example for journalism purposes or interactive land planning. Ideally, an RSVQA system should be able to provide an answer to questions that vary both in terms of topic (presence, localization, counting) and image content. However, aiming at such flexibility generates problems related to the variability of the possible answers. A striking example is counting, where the number of objects present in a remote sensing image can vary by multiple orders of magnitude, depending on both the scene and type of objects. This represents a challenge for traditional Visual Question Answering (VQA) methods, which either become intractable or result in an accuracy loss, as the number of possible answers has to be limited. To this end, we introduce a new model that jointly solves a classification problem (which is the most common approach in VQA) and a regression problem (to answer numerical questions more precisely). An evaluation of this method on the RSVQA dataset shows that this finer numerical output comes at the cost of a small loss of performance on non-numerical questions.

Download Full-text