Multimodal language processing: How preceding discourse constrains gesture interpretation and affects gesture integration when gestures do not synchronise with semantic affiliates

2021
Vol 117
pp. 104191
Author(s):
Isabella Fritz
Sotaro Kita
Jeannette Littlemore
Andrea Krott
2021
Author(s):
Wim Pouw
Jan de Wit
Sara Bögels
Marlou Rasenberg
Branka Milivojevic
...

Most manual communicative gestures that humans produce cannot be looked up in a dictionary: these gestures derive their meaning in large part from the communicative context and are not conventionalized. However, it remains understudied to what extent the communicative signal itself (bodily postures in movement, or kinematics) can inform us about gesture semantics. Can we construct, in principle, a distribution-based semantics of gesture kinematics, similar to how word vectorization methods in Natural Language Processing (NLP) are now widely used to study semantic properties of text and speech? For such a project to get off the ground, we need to know the extent to which semantically similar gestures are also more likely to be kinematically similar. In Study 1 we assess whether the word2vec semantic distances between concepts that participants were explicitly instructed to convey in silent gestures relate to the kinematic distances of those gestures, as obtained from Dynamic Time Warping (DTW). In a second, director-matcher dyadic study we assess kinematic similarity between spontaneous co-speech gestures produced by interacting participants. Participants were asked, before and after they interacted, how they would name the objects. The semantic distances between the resulting names were related to the kinematic distances of the gestures made while conveying those objects in the interaction. We find that the gestures' semantic relatedness reliably predicts their kinematic relatedness across these highly divergent studies, which suggests that deriving semantic relatedness from kinematics with NLP-style methods is a promising avenue for automated multimodal recognition. Deeper implications for statistical learning processes in multimodal language are discussed.
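The core analysis described here can be sketched in a few lines. The following is a minimal illustration, not the study's actual pipeline: it assumes precomputed word embeddings for the conveyed concepts and 2-D wrist trajectories for the corresponding gestures (the names `embeddings` and `trajectories`, and the random data, are placeholders), computes pairwise cosine distances between embeddings and DTW distances between trajectories, and then correlates the two sets of pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

def dtw_distance(a, b):
    """Plain dynamic-time-warping cost between two (time x features) trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Placeholder inputs: word vectors for the conveyed concepts and
# wrist-position trajectories (time x 2) for the corresponding gestures.
rng = np.random.default_rng(0)
embeddings = {c: rng.random(300) for c in ["hammer", "saw", "bird"]}
trajectories = {c: rng.random((rng.integers(40, 80), 2)) for c in embeddings}

concepts = list(embeddings)
semantic, kinematic = [], []
for i in range(len(concepts)):
    for j in range(i + 1, len(concepts)):
        a, b = concepts[i], concepts[j]
        semantic.append(cosine(embeddings[a], embeddings[b]))      # semantic distance
        kinematic.append(dtw_distance(trajectories[a], trajectories[b]))  # kinematic distance

# Does semantic distance predict kinematic distance across concept pairs?
rho, p = spearmanr(semantic, kinematic)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```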


2019
Vol 23 (8)
pp. 639-652
Author(s):
Judith Holler
Stephen C. Levinson

2004
Vol 16 (5)
pp. 715-726
Author(s):
Tamara Y. Swaab
C. Christine Camblin
Peter C. Gordon

Effects of word repetition are extremely robust, but can these effects be modulated by discourse context? We examined this in an ERP experiment that tested coreferential processing (when two expressions refer to the same person) with repeated names. ERPs were measured to repeated names and pronoun controls in two conditions: (1) In the prominent condition the repeated name or pronoun coreferred with the subject of the preceding sentence and was therefore prominent in the preceding discourse (e.g., “John went to the store after John/he …”); (2) in the nonprominent condition the repeated name or pronoun coreferred with a name that was embedded in a conjoined noun phrase, and was therefore nonprominent (e.g., “John and Mary went to the store after John/he …”). Relative to the prominent condition, the nonprominent condition always contained two extra words (e.g., “and Mary”), and the repetition lag was therefore smaller in the prominent condition. Typically, effects of repetition are larger with smaller lags. Nevertheless, the amplitude of the N400 was reduced to a coreferentially repeated name when the antecedent was nonprominent as compared to when it was prominent. No such difference was observed for the pronoun controls. Because the N400 effect reflects difficulties in lexical integration, this shows that the difficulty of achieving coreference with a name increased with the prominence of the referent. This finding is the reverse of repetition lag effects on N400 previously found with word lists, and shows that language context can override general memory mechanisms.


2009
Vol 35 (3)
pp. 345-397
Author(s):
Srinivas Bangalore
Michael Johnston

Multimodal grammars provide an effective mechanism for quickly creating integration and understanding capabilities for interactive systems supporting simultaneous use of multiple input modalities. However, like other approaches based on hand-crafted grammars, multimodal grammars can be brittle with respect to unexpected, erroneous, or disfluent input. In this article, we show how the finite-state approach to multimodal language processing can be extended to support multimodal applications combining speech with complex freehand pen input, and evaluate the approach in the context of a multimodal conversational system (MATCH). We explore a range of different techniques for improving the robustness of multimodal integration and understanding. These include techniques for building effective language models for speech recognition when little or no multimodal training data is available, and techniques for robust multimodal understanding that draw on classification, machine translation, and sequence edit methods. We also explore the use of edit-based methods to overcome mismatches between the gesture stream and the speech stream.
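The edit-based idea can be illustrated with a minimal sketch; this is not the MATCH system's actual implementation. A noisy recognized token stream is aligned against the token sequence a hand-crafted multimodal grammar expects, so that insertions, deletions, and substitutions in the speech or gesture stream do not break integration. The token names and unit costs below are illustrative assumptions.

```python
def edit_align(observed, expected, sub_cost=1, ins_cost=1, del_cost=1):
    """Minimum edit cost between an observed token sequence and the
    sequence a grammar rule expects (standard dynamic programming)."""
    n, m = len(observed), len(expected)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * del_cost
    for j in range(1, m + 1):
        D[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if observed[i - 1] == expected[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j - 1] + match,   # substitute / match
                          D[i - 1][j] + del_cost,    # drop an observed token
                          D[i][j - 1] + ins_cost)    # insert a missing token
    return D[n][m]

# Illustrative use: noisy speech recognition output versus the pattern a
# grammar rule expects, where the gesture stream supplies the area references.
observed = ["show", "uh", "cheap", "restaurants", "here"]
expected = ["show", "cheap", "restaurants", "AREA"]
print(edit_align(observed, expected))  # low cost despite the disfluency and mismatch
```

In a robust-understanding setup of this kind, the observed stream would be scored against several candidate grammar patterns and the lowest-cost alignment would drive integration.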


2016
Vol 39
Author(s):
Giosuè Baggio
Carmelo M. Vicario

We agree with Christiansen & Chater (C&C) that language processing and acquisition are tightly constrained by the limits of sensory and memory systems. However, the human brain supports a range of cognitive functions that mitigate the effects of information processing bottlenecks. The language system is partly organised around these moderating factors, not just around restrictions on storage and computation.


Author(s):
Jennifer M. Roche
Arkady Zgonnikov
Laura M. Morett

Purpose: The purpose of the current study was to evaluate the social and cognitive underpinnings of miscommunication during an interactive listening task. Method: An eye- and computer mouse-tracking visual-world paradigm was used to investigate how a listener's cognitive effort (local and global) and decision-making processes were affected by a speaker's use of ambiguity that led to a miscommunication. Results: Experiments 1 and 2 found that an environmental cue that made a miscommunication more or less salient impacted listener language processing effort (eye-tracking). Experiment 2 also indicated that listeners may develop different processing heuristics dependent upon the speaker's use of ambiguity that led to a miscommunication, exerting a significant impact on cognition and decision making. We also found that perspective-taking effort and decision-making complexity metrics (computer mouse tracking) predict language processing effort, indicating that instances of miscommunication produced cognitive consequences of indecision, thinking, and cognitive pull. Conclusion: Together, these results indicate that listeners behave both reciprocally and adaptively when miscommunications occur, but the way they respond is largely dependent upon the type of ambiguity and how often it is produced by the speaker.

