Can Image Captioning Help Passage Retrieval in Multimodal Question Answering?

Author(s):  
Shurong Sheng ◽  
Katrien Laenen ◽  
Marie-Francine Moens

Measuring Machine Intelligence Through Visual Question Answering
AI Magazine ◽
2016 ◽
Vol 37 (1) ◽
pp. 63-72
Author(s):  
C. Lawrence Zitnick ◽  
Aishwarya Agrawal ◽  
Stanislaw Antol ◽  
Margaret Mitchell ◽  
Dhruv Batra ◽  
...  

As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks at which humans excel but which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering, which tests a machine’s ability to reason about language and vision. We describe a dataset unprecedented in size created for the task that contains over 760,000 human-generated questions about images. Using around 10 million human-generated answers, machines may be easily evaluated.
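
The claim that machines "may be easily evaluated" refers to consensus scoring against the human answers collected for each question. A minimal Python sketch of that idea, simplified from the accuracy measure released with the VQA dataset (the official evaluation also normalizes answers and averages over subsets of the ten human answers):

def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy for one question: a predicted answer earns full
    credit when at least three of the ten human answers agree with it."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Ten human answers collected for a single question.
answers = ["red", "red", "red", "red", "maroon", "dark red",
           "red", "crimson", "red", "red"]
print(vqa_accuracy("red", answers))     # 1.0 (seven annotators agree)
print(vqa_accuracy("maroon", answers))  # 0.33... (one annotator agrees)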


Author(s):  
José Manuel Gómez Soriano ◽  
Manuel Montes y Gómez ◽  
Emilio Sanchis Arnal ◽  
Paolo Rosso

Unified Vision-Language Pre-Training for Image Captioning and VQA
Proceedings of the AAAI Conference on Artificial Intelligence ◽
2020 ◽
Vol 34 (07) ◽
pp. 13041-13049
Author(s):  
Luowei Zhou ◽  
Hamid Palangi ◽  
Lei Zhang ◽  
Houdong Hu ◽  
Jason Corso ◽  
...  

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
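
The abstract notes that the bidirectional and seq2seq objectives differ only in the context a prediction conditions on, and that this is controlled by self-attention masks over the shared transformer. A rough PyTorch sketch of that masking idea (the function name and layout are illustrative, not taken from the released VLP code):

import torch

def build_self_attention_mask(num_visual, num_text, mode="bidirectional"):
    """Boolean mask over a [visual tokens | text tokens] sequence, where
    mask[i, j] == True means position i may attend to position j.
    "bidirectional": every token attends to every token.
    "seq2seq": all tokens attend to the visual tokens, but each text token
    attends only to text tokens at or before its own position."""
    n = num_visual + num_text
    if mode == "bidirectional":
        return torch.ones(n, n, dtype=torch.bool)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :num_visual] = True  # full access to the image region tokens
    mask[num_visual:, num_visual:] = torch.tril(  # left-to-right text side
        torch.ones(num_text, num_text)).bool()
    return mask

# The two objectives share all transformer parameters; only the mask changes.
print(build_self_attention_mask(3, 4, mode="bidirectional").int())
print(build_self_attention_mask(3, 4, mode="seq2seq").int())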


Author(s):  
Carlos G. Figuerola ◽  
Angel F. Zazo ◽  
José L. Alonso Berrocal ◽  
Emilio Rodríguez Vázquez de Aldana

One-Shot Learning for Long-Tail Visual Relation Detection
Proceedings of the AAAI Conference on Artificial Intelligence ◽
2020 ◽
Vol 34 (07) ◽
pp. 12225-12232
Author(s):  
Weitao Wang ◽  
Meng Wang ◽  
Sen Wang ◽  
Guodong Long ◽  
Lina Yao ◽  
...  

The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene, and how they relate to each other, in <object-predicate-object> form; for example, <person-lean on-wall>. This ability is vital for image captioning, visual question answering, and many other applications. However, visual relationships have long-tailed distributions and, thus, the limited availability of training samples is hampering the practicability of conventional detection approaches. With this in mind, we designed a novel model for visual relation detection that works in one-shot settings. The embeddings of objects and predicates are extracted through a network that includes a feature-level attention mechanism. Attention alleviates some of the problems with feature sparsity, and the resulting representations capture more discriminative latent features. The core of our model is a dual graph neural network that passes and aggregates the context information of predicates and objects in an episodic training scheme to improve recognition of the one-shot predicates and then generate the triplets. To the best of our knowledge, we are the first to center on the viability of one-shot learning for visual relation detection. Extensive experiments on two newly constructed datasets show that our model significantly improves performance on the two tasks, PredCls and SGCls, by margins ranging from 2.8% to 12.2% compared with state-of-the-art baselines.
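
Two of the ingredients described above, feature-level attention over object and predicate embeddings and episodic one-shot recognition of predicates, can be illustrated with a short PyTorch sketch (class and function names are hypothetical, and the paper's dual graph neural network that propagates context between objects and predicates is not reproduced here):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureLevelAttention(nn.Module):
    """Illustrative feature-level attention: learn per-dimension weights that
    reweight an object or predicate feature vector so the more discriminative
    dimensions are emphasized."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):          # x: (batch, dim)
        return x * self.gate(x)    # element-wise attention weights in [0, 1]

def one_shot_predicate_scores(query_emb, support_embs):
    """Episodic one-shot matching: score each query predicate embedding against
    one support embedding per predicate class via cosine similarity."""
    q = F.normalize(query_emb, dim=-1)     # (num_queries, dim)
    s = F.normalize(support_embs, dim=-1)  # (num_classes, dim)
    return q @ s.t()                       # (num_queries, num_classes)

# One episode: five one-shot predicate classes, two query relation instances.
attend = FeatureLevelAttention(dim=64)
support = attend(torch.randn(5, 64))
queries = attend(torch.randn(2, 64))
print(one_shot_predicate_scores(queries, support).argmax(dim=-1))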

