Can Image Captioning Help Passage Retrieval in Multimodal Question Answering?

Author(s):  
Shurong Sheng ◽  
Katrien Laenen ◽  
Marie-Francine Moens

Measuring Machine Intelligence Through Visual Question Answering
AI Magazine ◽
2016 ◽
Vol 37 (1) ◽
pp. 63-72
Author(s):  
C. Lawrence Zitnick ◽  
Aishwarya Agrawal ◽  
Stanislaw Antol ◽  
Margaret Mitchell ◽  
Dhruv Batra ◽  
...  

As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks at which humans excel but which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering, which tests a machine’s ability to reason about language and vision. We describe a dataset unprecedented in size created for the task that contains over 760,000 human-generated questions about images. Using around 10 million human-generated answers, machines may be easily evaluated.
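
The claim that machines "may be easily evaluated" refers to consensus scoring against the human answers collected for each question. A minimal Python sketch of that idea, simplified from the accuracy measure released with the VQA dataset (the official evaluation also normalizes answers and averages over subsets of the ten human answers):

def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy for one question: a predicted answer earns full
    credit when at least three of the ten human answers agree with it."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Ten human answers collected for a single question.
answers = ["red", "red", "red", "red", "maroon", "dark red",
           "red", "crimson", "red", "red"]
print(vqa_accuracy("red", answers))     # 1.0 (seven annotators agree)
print(vqa_accuracy("maroon", answers))  # 0.33... (one annotator agrees)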


Author(s):  
José Manuel Gómez Soriano ◽  
Manuel Montes y Gómez ◽  
Emilio Sanchis Arnal ◽  
Paolo Rosso

Unified Vision-Language Pre-Training for Image Captioning and VQA
Proceedings of the AAAI Conference on Artificial Intelligence ◽
2020 ◽
Vol 34 (07) ◽
pp. 13041-13049
Author(s):  
Luowei Zhou ◽  
Hamid Palangi ◽  
Lei Zhang ◽  
Houdong Hu ◽  
Jason Corso ◽  
...  

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
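
The abstract notes that the bidirectional and seq2seq objectives differ only in the context a prediction conditions on, and that this is controlled by self-attention masks over the shared transformer. A rough PyTorch sketch of that masking idea (the function name and layout are illustrative, not taken from the released VLP code):

import torch

def build_self_attention_mask(num_visual, num_text, mode="bidirectional"):
    """Boolean mask over a [visual tokens | text tokens] sequence, where
    mask[i, j] == True means position i may attend to position j.
    "bidirectional": every token attends to every token.
    "seq2seq": all tokens attend to the visual tokens, but each text token
    attends only to text tokens at or before its own position."""
    n = num_visual + num_text
    if mode == "bidirectional":
        return torch.ones(n, n, dtype=torch.bool)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :num_visual] = True  # full access to the image region tokens
    mask[num_visual:, num_visual:] = torch.tril(  # left-to-right text side
        torch.ones(num_text, num_text)).bool()
    return mask

# The two objectives share all transformer parameters; only the mask changes.
print(build_self_attention_mask(3, 4, mode="bidirectional").int())
print(build_self_attention_mask(3, 4, mode="seq2seq").int())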


Author(s):  
Carlos G. Figuerola ◽  
Angel F. Zazo ◽  
José L. Alonso Berrocal ◽  
Emilio Rodríguez Vázquez de Aldana

One-Shot Learning for Long-Tail Visual Relation Detection
Proceedings of the AAAI Conference on Artificial Intelligence ◽
2020 ◽
Vol 34 (07) ◽
pp. 12225-12232
Author(s):  
Weitao Wang ◽  
Meng Wang ◽  
Sen Wang ◽  
Guodong Long ◽  
Lina Yao ◽  
...  

The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene, and how they relate to each other, in <object-predicate-object> form; for example, <person-lean on-wall>. This ability is vital for image captioning, visual question answering, and many other applications. However, visual relationships have long-tailed distributions and, thus, the limited availability of training samples is hampering the practicability of conventional detection approaches. With this in mind, we designed a novel model for visual relation detection that works in one-shot settings. The embeddings of objects and predicates are extracted through a network that includes a feature-level attention mechanism. Attention alleviates some of the problems with feature sparsity, and the resulting representations capture more discriminative latent features. The core of our model is a dual graph neural network that passes and aggregates the context information of predicates and objects in an episodic training scheme to improve recognition of the one-shot predicates and then generate the triplets. To the best of our knowledge, we are the first to center on the viability of one-shot learning for visual relation detection. Extensive experiments on two newly constructed datasets show that our model significantly improves performance on the two tasks, PredCls and SGCls, by margins ranging from 2.8% to 12.2% compared with state-of-the-art baselines.
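
Two of the ingredients described above, feature-level attention over object and predicate embeddings and episodic one-shot recognition of predicates, can be illustrated with a short PyTorch sketch (class and function names are hypothetical, and the paper's dual graph neural network that propagates context between objects and predicates is not reproduced here):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureLevelAttention(nn.Module):
    """Illustrative feature-level attention: learn per-dimension weights that
    reweight an object or predicate feature vector so the more discriminative
    dimensions are emphasized."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):          # x: (batch, dim)
        return x * self.gate(x)    # element-wise attention weights in [0, 1]

def one_shot_predicate_scores(query_emb, support_embs):
    """Episodic one-shot matching: score each query predicate embedding against
    one support embedding per predicate class via cosine similarity."""
    q = F.normalize(query_emb, dim=-1)     # (num_queries, dim)
    s = F.normalize(support_embs, dim=-1)  # (num_classes, dim)
    return q @ s.t()                       # (num_queries, num_classes)

# One episode: five one-shot predicate classes, two query relation instances.
attend = FeatureLevelAttention(dim=64)
support = attend(torch.randn(5, 64))
queries = attend(torch.randn(2, 64))
print(one_shot_predicate_scores(queries, support).argmax(dim=-1))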

