When eyes beat lips: speaker gaze affects audiovisual integration in the McGurk illusion

Author(s):  
Basil Wahn ◽  
Laura Schmitz ◽  
Alan Kingstone ◽  
Anne Böckler-Raettig

Abstract
Eye contact is a dynamic social signal that captures attention and plays a critical role in human communication. In particular, direct gaze often accompanies communicative acts in an ostensive function: a speaker directs her gaze towards the addressee to highlight the fact that this message is being intentionally communicated to her. The addressee, in turn, integrates the speaker’s auditory and visual speech signals (i.e., her vocal sounds and lip movements) into a unitary percept. It is an open question whether the speaker’s gaze affects how the addressee integrates the speaker’s multisensory speech signals. We investigated this question using the classic McGurk illusion, an illusory percept created by presenting mismatching auditory (vocal sounds) and visual information (speaker’s lip movements). Specifically, we manipulated whether the speaker (a) moved his eyelids up/down (i.e., opened/closed his eyes) prior to speaking or did not show any eye motion, and (b) spoke with open or closed eyes. When the speaker’s eyes moved (i.e., opened or closed) before an utterance, and when the speaker spoke with closed eyes, the McGurk illusion was weakened (i.e., addressees reported significantly fewer illusory percepts). In line with previous research, this suggests that motion (opening or closing), as well as the closed state of the speaker’s eyes, captured addressees’ attention, thereby reducing the influence of the speaker’s lip movements on the addressees’ audiovisual integration process. Our findings reaffirm the power of speaker gaze to guide attention, showing that its dynamics can modulate low-level processes such as the integration of multisensory speech signals.

2019 ◽  
Author(s):  
Violet Aurora Brown ◽  
Julia Feld Strand

The McGurk effect is a multisensory phenomenon in which discrepant auditory and visual speech signals typically result in an illusory percept (McGurk & MacDonald, 1976). McGurk stimuli are often used in studies assessing the attentional requirements of audiovisual integration (e.g., Alsius et al., 2005), but no study has directly compared the costs associated with integrating congruent versus incongruent audiovisual speech. Some evidence suggests that the McGurk effect may not be representative of naturalistic audiovisual speech processing—susceptibility to the McGurk effect is not associated with the ability to derive benefit from the addition of the visual signal (Van Engen et al., 2017), and distinct cortical regions are recruited when processing congruent versus incongruent speech (Erickson et al., 2014). In two experiments, one using response times to identify congruent and incongruent syllables and one using a dual-task paradigm, we assessed whether congruent and incongruent audiovisual speech incur different attentional costs. We demonstrated that response times to both the speech task (Experiment 1) and a secondary vibrotactile task (Experiment 2) were indistinguishable for congruent compared to incongruent syllables, but McGurk fusions were responded to more quickly than McGurk non-fusions. These results suggest that despite documented differences in how congruent and incongruent stimuli are processed (Erickson et al., 2014; Van Engen, Xie, & Chandrasekaran, 2017), they do not appear to differ in terms of processing time or effort. However, responses that result in McGurk fusions are processed more quickly than those that result in non-fusions, though attentional cost is comparable for the two response types.


2018 ◽  
Vol 5 (3) ◽  
pp. 170909 ◽  
Author(s):  
Claudia S. Lüttke ◽  
Alexis Pérez-Bellido ◽  
Floris P. de Lange

The human brain can quickly adapt to changes in the environment. One example is phonetic recalibration: a speech sound is interpreted differently depending on the visual speech and this interpretation persists in the absence of visual information. Here, we examined the mechanisms of phonetic recalibration. Participants categorized the auditory syllables /aba/ and /ada/, which were sometimes preceded by the so-called McGurk stimuli (in which an /aba/ sound, due to visual /aga/ input, is often perceived as ‘ada’). We found that only one trial of exposure to the McGurk illusion was sufficient to induce a recalibration effect, i.e. an auditory /aba/ stimulus was subsequently more often perceived as ‘ada’. Furthermore, phonetic recalibration took place only when auditory and visual inputs were integrated to ‘ada’ (McGurk illusion). Moreover, this recalibration depended on the sensory similarity between the preceding and current auditory stimulus. Finally, signal detection theoretical analysis showed that McGurk-induced phonetic recalibration resulted in both a criterion shift towards /ada/ and a reduced sensitivity to distinguish between /aba/ and /ada/ sounds. The current study shows that phonetic recalibration is dependent on the perceptual integration of audiovisual information and leads to a perceptual shift in phoneme categorization.
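The criterion shift and sensitivity reduction reported above are standard signal detection theory quantities. As a minimal illustration (not the authors' actual analysis pipeline; the response rates below are made up), d' and the criterion c can be computed from hit and false-alarm rates:

```python
from statistics import NormalDist

def sdt_measures(hit_rate, fa_rate):
    """Sensitivity d' and criterion c from hit and false-alarm rates.
    A criterion shift towards 'ada' appears as a change in c; reduced ability
    to distinguish /aba/ from /ada/ appears as a smaller d'."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion

# Hypothetical rates: 'ada' responses to /ada/ sounds (hits) and to /aba/ sounds (false alarms)
baseline = sdt_measures(0.80, 0.20)       # symmetric performance, c = 0
recalibrated = sdt_measures(0.90, 0.40)   # more 'ada' overall: lower d', negative c
```

Here a negative criterion after exposure to the McGurk stimulus would correspond to the reported shift towards /ada/, and the drop in d' to the reduced sensitivity.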


2015 ◽  
Vol 113 (7) ◽  
pp. 2342-2350 ◽  
Author(s):  
Yadira Roa Romero ◽  
Daniel Senkowski ◽  
Julian Keil

The McGurk illusion is a prominent example of audiovisual speech perception and the influence that visual stimuli can have on auditory perception. In this illusion, a visual speech stimulus influences the perception of an incongruent auditory stimulus, resulting in a fused novel percept. In this high-density electroencephalography (EEG) study, we were interested in the neural signatures of the subjective percept of the McGurk illusion as a phenomenon of speech-specific multisensory integration. Therefore, we examined the role of cortical oscillations and event-related responses in the perception of congruent and incongruent audiovisual speech. We compared the cortical activity elicited by objectively congruent syllables with incongruent audiovisual stimuli. Importantly, the latter elicited a subjectively congruent percept: the McGurk illusion. We found that early event-related responses (N1) to audiovisual stimuli were reduced during the perception of the McGurk illusion compared with congruent stimuli. Most interestingly, our study showed a stronger poststimulus suppression of beta-band power (13–30 Hz) at short (0–500 ms) and long (500–800 ms) latencies during the perception of the McGurk illusion compared with congruent stimuli. Our study demonstrates that auditory perception is influenced by visual context and that the subsequent formation of a McGurk illusion requires stronger audiovisual integration even at early processing stages. Our results provide evidence that beta-band suppression at early stages reflects stronger stimulus processing in the McGurk illusion. Moreover, stronger late beta-band suppression in the McGurk illusion indicates the resolution of incongruent physical audiovisual input and the formation of a coherent, illusory multisensory percept.
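The dependent measure here, beta-band power suppression, can be sketched as follows. This is only an illustration of the quantity involved, not the authors' EEG pipeline (which analyzed high-density recordings with far more elaborate time-frequency methods); the signals and sampling rate below are synthetic:

```python
import numpy as np

def band_power(x, fs, f_lo=13.0, f_hi=30.0):
    """Mean periodogram power of signal x in [f_lo, f_hi] Hz (default: beta band)."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return psd[mask].mean()

# Toy example: a 20 Hz (beta) oscillation whose amplitude halves after stimulus onset
fs = 500                                   # samples per second
t = np.arange(fs) / fs                     # one second of data
baseline = np.sin(2 * np.pi * 20 * t)      # prestimulus window
post = 0.5 * np.sin(2 * np.pi * 20 * t)    # poststimulus window, attenuated
suppression = band_power(post, fs) / band_power(baseline, fs)  # power ratio < 1
```

A ratio below 1 in a poststimulus window, relative to baseline, is the kind of beta-band suppression the study compares between McGurk and congruent trials.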


2012 ◽  
Vol 25 (0) ◽  
pp. 186
Author(s):  
Daniel Drebing ◽  
Jared Medina ◽  
H. Branch Coslett ◽  
Jeffrey T. Shenton ◽  
Roy H. Hamilton

Integrating sensory information across modalities is necessary for a cohesive experience of the world; disrupting the ability to bind the multisensory stimuli arising from an event leads to a disjointed and confusing percept. We previously reported (Hamilton et al., 2006) a patient, AWF, who suffered an acute neural incident after which he displayed a distinct inability to integrate auditory and visual speech information. While our prior experiments involving AWF suggested that he had a deficit of audiovisual speech processing, they did not explore the hypothesis that his deficits in audiovisual integration are restricted to speech. In order to test this notion, we conducted a series of experiments aimed at testing AWF’s ability to integrate cross-modal information from both speech and non-speech events. AWF was tasked with making temporal order judgments (TOJs) for videos of object noises (such as hands clapping) or speech, wherein the onsets of auditory and visual information were manipulated. Results from the experiments show that while AWF performed worse than controls in his ability to accurately judge even the most salient onset differences for speech videos, he did not differ significantly from controls in his ability to make TOJs for the object videos. These results illustrate the possibility of disruption of intermodal binding for audiovisual speech events with spared binding for real-world, non-speech events.


2011 ◽  
Vol 26 (S2) ◽  
pp. 1512-1512
Author(s):  
G.R. Szycik ◽  
Z. Ye ◽  
B. Mohammadi ◽  
W. Dillo ◽  
B.T. te Wildt ◽  
...  

Introduction: Natural speech perception relies on both auditory and visual information. The two sensory channels provide redundant and complementary information, such that speech perception in healthy subjects is enhanced when both channels are present. Objectives: Patients with schizophrenia have been reported to have problems with this audiovisual integration process, but little is known about which neural processes are altered. Aims: In this study we investigated the functional connectivity of Broca’s area in patients with schizophrenia. Methods: Functional magnetic resonance imaging (fMRI) was performed in 15 schizophrenia patients and 15 healthy controls to study the functional connectivity of Broca’s area during the perception of videos of bisyllabic German nouns in which audio and video either matched (congruent condition) or did not match (incongruent condition; e.g. video = hotel, audio = island). Results: Connectivity differed between experimental groups and between conditions. Broca’s area showed connections to more brain areas in the patient group than in the control group. This difference was more prominent in the incongruent condition, for which control participants showed only one connection, between Broca's area and the supplementary motor area, whereas patients showed connections to eight widely distributed brain areas. Conclusions: The findings imply that audiovisual integration problems in schizophrenia result from maladaptive connectivity of Broca's area, particularly when confronted with incongruent stimuli, and are discussed in light of recent audiovisual speech models.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246986
Author(s):  
Alma Lindborg ◽  
Tobias S. Andersen

Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is exemplified by the McGurk illusion, where an auditory stimulus such as “ba” dubbed onto a visual stimulus such as “ga” produces the illusion of hearing “da”. Bayesian models of multisensory perception suggest that both the enhancement and the illusion can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the reliability of each sensory cue). However, no study to date has accounted for how binding and fusion each contribute to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and fusion stages simultaneously by varying both the temporal offset (binding) and the auditory and visual signal-to-noise ratios (fusion). We fit two Bayesian models to the behavioural data and show that both can account for the enhancement effect in congruent audiovisual speech as well as for the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced-fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of both the McGurk illusion and congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
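The fusion stage of such models is commonly formalized as reliability-weighted averaging of Gaussian cue estimates. A minimal sketch of that stage alone (the paper's full models additionally include a binding stage on top of this; the numbers below are illustrative):

```python
def fuse(mu_a, var_a, mu_v, var_v):
    """Maximum-likelihood fusion of an auditory and a visual cue.
    Each cue is a Gaussian estimate (mean, variance); the fused estimate
    weights each cue by its precision (1 / variance), so the less reliable
    cue contributes less."""
    prec_a, prec_v = 1.0 / var_a, 1.0 / var_v
    mu = (prec_a * mu_a + prec_v * mu_v) / (prec_a + prec_v)
    var = 1.0 / (prec_a + prec_v)
    return mu, var

# A noisier visual cue (lower visual SNR) pulls the fused percept less:
fuse(0.0, 1.0, 1.0, 4.0)  # fused mean 0.2, fused variance 0.8
```

Lowering the visual signal-to-noise ratio raises var_v and shifts the fused percept towards the auditory cue, which is how the fusion manipulation in the study operates within this class of model.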


2009 ◽  
Vol 23 (2) ◽  
pp. 63-76 ◽  
Author(s):  
Silke Paulmann ◽  
Sarah Jessen ◽  
Sonja A. Kotz

The multimodal nature of human communication is well established. Yet few empirical studies have systematically examined the widely held belief that multimodal perception is facilitated in comparison to unimodal or bimodal perception. In the current experiment we first explored the processing of unimodally presented facial expressions. Auditory (prosodic and/or lexical-semantic) information was then presented together with the visual information to investigate the processing of bimodal (facial and prosodic cues) and multimodal (facial, lexical, and prosodic cues) human communication. Participants engaged in an identity identification task while event-related potentials (ERPs) were recorded to examine early processing mechanisms as reflected in the P200 and N300 components. While the former component has repeatedly been linked to the processing of physical stimulus properties, the latter has been linked to more evaluative, “meaning-related” processing. A direct relationship between P200 and N300 amplitudes and the number of information channels present was found: the multimodal condition elicited the smallest P200 and N300 amplitudes, followed by larger amplitudes in the bimodal condition, with the largest amplitudes observed in the unimodal condition. These data suggest that multimodal information induces clear facilitation in comparison to unimodal or bimodal information. The advantage of multimodal perception, as reflected in the P200 and N300 components, may thus reflect one of the mechanisms allowing for fast and accurate information processing in human communication.


2017 ◽  
Vol 38 (11) ◽  
pp. 5691-5705 ◽  
Author(s):  
Luis Morís Fernández ◽  
Emiliano Macaluso ◽  
Salvador Soto-Faraco

2021 ◽  
Author(s):  
Talieh Seyed Tabtabae

Automatic Emotion Recognition (AER) is an emerging research area in the field of Human-Computer Interaction (HCI). As computers become more popular every day, the study of the interaction between humans (users) and computers is attracting more attention. To make the interface between humans and computers more natural and friendly, it would be beneficial to give computers the ability to recognize situations the way a human does. Equipped with an emotion recognition system, computers could recognize their users' emotional state and react appropriately. In today's HCI systems, machines can recognize the speaker and the content of the speech using speaker identification and speech recognition techniques. If machines are also equipped with emotion recognition techniques, they can know "how it is said" and react more appropriately, making the interaction more natural. One of the most important human communication channels is the auditory channel, which carries speech and vocal intonation; people can perceive each other's emotional state by the way they talk. In this work, speech signals are therefore analyzed in order to build an automatic system that recognizes the human emotional state. Six discrete emotional states are considered and categorized in this research: anger, happiness, fear, surprise, sadness, and disgust. A set of novel spectral features is proposed in this contribution, and two approaches are applied and compared. In the first approach, all acoustic features are extracted from consecutive frames along the speech signals; the statistical values of these features constitute the feature vectors, and a Support Vector Machine (SVM), a relatively recent approach in the field of machine learning, is used to classify the emotional states. In the second approach, spectral features are extracted from non-overlapping, logarithmically spaced frequency sub-bands, and sequence-discriminant SVMs are adopted to make use of all the extracted information. The empirical results show that the employed techniques are very promising.
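The sub-band feature extraction of the second approach can be sketched roughly as follows. This is a hypothetical reimplementation, not the thesis code: the frame length, hop size, number of bands, and the 50 Hz lower edge are all assumed values for illustration.

```python
import numpy as np

def subband_features(signal, fs, frame_len=400, hop=160, n_bands=8):
    """Per-frame log-energies in logarithmically spaced frequency sub-bands.
    Frame/hop sizes and the 50 Hz lower band edge are illustrative choices."""
    edges = np.geomspace(50.0, fs / 2, n_bands + 1)   # log-spaced band edges
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        psd = np.abs(np.fft.rfft(signal[start:start + frame_len] * window)) ** 2
        feats.append([np.log(psd[(freqs >= lo) & (freqs < hi)].sum() + 1e-12)
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return np.array(feats)  # shape: (n_frames, n_bands)
```

For the first, statistics-based approach, the per-frame features would then be summarized over the utterance (e.g. mean and standard deviation per band) into one fixed-length vector per recording, which an SVM classifier can consume.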


Author(s):  
Panteleimon Chriskos ◽  
Orfeas Tsartsianidis

Human senses enable us to perceive and interact with our environment through a set of sensory systems or organs, each mainly dedicated to one sense. Of the five main human senses, hearing plays a critical role in many aspects of our lives. Hearing allows the perception not only of the immediate, visible environment but also of parts of the environment that are obstructed from view and/or at a significant distance from the individual. One of the most important and sometimes overlooked aspects of hearing is communication, since most human communication is accomplished through speech and hearing. Hearing does not only convey speech; it also conveys more complex messages in the form of music, singing, and storytelling.

