From image to language and back again
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Franket al.), multimodal machine translation (Madhyasthaet al., Franket al.), image caption generation (Madhyasthaet al., Tantiet al.), visual scene understanding (Silbereret al.), and multimodal learning of high-level attributes (Sorodocet al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).