multimodal attention
Recently Published Documents


TOTAL DOCUMENTS: 49 (five years: 36)

H-INDEX: 6 (five years: 4)

2021
Author(s): Ying An, Haojia Zhang, Yu Sheng, Jianxin Wang, Xianlai Chen

2021
pp. 108214
Author(s): Jiajia Wu, Jun Du, Fengren Wang, Chen Yang, Xinzhe Jiang, et al.

2021
Author(s): Sara E. Schroer, Chen Yu

Infant language acquisition is fundamentally an embodied process, relying on the body to select information from the learning environment. Infants show their attention to an object not merely by gazing at it, but also by orienting their body toward it and generating various manual actions on it, such as holding, touching, and shaking. The goal of the present study was to examine how multimodal attention shapes infant word learning in real time. Infants and their parents played in a home-like lab with unfamiliar objects that had been assigned labels. While playing, participants wore wireless head-mounted eye trackers to capture visual attention. Infants were then tested on their knowledge of the new words. We identified all the utterances in which parents labeled the learned or not-learned objects and analyzed infant multimodal attention during and around labeling. We found that the proportion of time spent in hand-eye coordination predicted learning outcomes. To understand the learning advantage that hand-eye coordination creates, we compared the size of objects in the infant's field of view. Although object size did not differ between learned and not-learned labeling utterances, hand-eye coordination created the most informative views. Together, these results suggest that in-the-moment word learning may be driven by the greater access to informative object views that hand-eye coordination affords.
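The study's key predictor is the proportion of time infants spend in hand-eye coordination while an object is being labeled. As a rough illustration only (the frame schema and field names below are hypothetical assumptions, not the authors' coding scheme), such a proportion could be computed from frame-level gaze and manual-action annotations:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Frame:
    """One frame of synchronized annotations (hypothetical schema)."""
    gaze_target: Optional[str]   # object the infant is fixating, if any
    hand_target: Optional[str]   # object the infant is holding/touching, if any

def hand_eye_proportion(frames: list[Frame], labeled_object: str) -> float:
    """Proportion of frames in a labeling window in which gaze and
    manual action are coordinated on the labeled object."""
    if not frames:
        return 0.0
    coordinated = sum(
        1 for f in frames
        if f.gaze_target == labeled_object and f.hand_target == labeled_object
    )
    return coordinated / len(frames)
```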


2021
Vol. 16 (1)
pp. 1-19
Author(s): Fenglin Liu, Xian Wu, Shen Ge, Xuancheng Ren, Wei Fan, et al.

Vision-and-language (V-L) tasks require a system to understand both visual content and natural language, so learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations and achieve improved results on many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices; as a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separate attention spaces to vision and language, so that the representations of the two modalities can be disentangled explicitly. To enhance the correlation between vision and language in the disentangled spaces, we introduce visual concepts to DiMBERT, which represent visual information in textual form. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large number of image–sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned on downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, with up to a 5% improvement on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM module and the introduced visual concepts.
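To make the idea of separate attention spaces concrete, here is a minimal sketch in PyTorch of a single-head attention layer that projects language and vision tokens with modality-specific weight matrices before letting them attend jointly. The class name, dimensions, and interface are illustrative assumptions, not the authors' DiM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledMultimodalAttention(nn.Module):
    """Illustrative single-head attention with separate projection
    matrices for language and vision tokens (not the authors' code)."""

    def __init__(self, dim: int):
        super().__init__()
        # Modality-specific attention spaces: one set of Q/K/V
        # projections for language tokens, another for vision tokens.
        self.qkv_lang = nn.ModuleDict({k: nn.Linear(dim, dim) for k in ("q", "k", "v")})
        self.qkv_vis = nn.ModuleDict({k: nn.Linear(dim, dim) for k in ("q", "k", "v")})
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, lang: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # lang: (batch, L_t, dim) language token embeddings
        # vis:  (batch, L_v, dim) visual region / concept embeddings
        q = torch.cat([self.qkv_lang["q"](lang), self.qkv_vis["q"](vis)], dim=1)
        k = torch.cat([self.qkv_lang["k"](lang), self.qkv_vis["k"](vis)], dim=1)
        v = torch.cat([self.qkv_lang["v"](lang), self.qkv_vis["v"](vis)], dim=1)
        # Joint attention over the concatenated sequence, computed from
        # modality-specific projections so the two spaces stay separate.
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out(attn @ v)
```

For example, with 512-dimensional embeddings, `DisentangledMultimodalAttention(512)(lang, vis)` applied to a (2, 16, 512) language batch and a (2, 36, 512) vision batch returns a (2, 52, 512) tensor over the concatenated sequence; the key design point is that the two modalities never share Q/K/V projection parameters.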


Author(s): Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara

2021
Vol. 427
pp. 40-49
Author(s): Shaokang Yang, Jianwei Niu, Jiyan Wu, Yong Wang, Xuefeng Liu, et al.
