Audiovisual speech perception in infancy: The influence of vowel identity and infants’ productive abilities on sensitivity to (mis)matches between auditory and visual speech cues.

2016 ◽  
Vol 52 (2) ◽  
pp. 191-204 ◽  
Author(s):  
Nicole Altvater-Mackensen ◽  
Nivedita Mani ◽  
Tobias Grossmann


2020 ◽  
Author(s):  
Jonathan E Peelle ◽  
Brent Spehar ◽  
Michael S Jones ◽  
Sarah McConkey ◽  
Joel Myerson ◽  
...  

In everyday conversation, we usually process the talker's face as well as the sound of their voice. Access to visual speech information is particularly useful when the auditory signal is degraded. Here we used fMRI to monitor brain activity while adults (n = 60) were presented with visual-only, auditory-only, and audiovisual words. As expected, audiovisual speech perception recruited both auditory and visual cortex, with a trend towards increased recruitment of premotor cortex in more difficult conditions (for example, in substantial background noise). We then investigated neural connectivity using psychophysiological interaction (PPI) analysis with seed regions in both primary auditory cortex and primary visual cortex. Connectivity between auditory and visual cortices was stronger in audiovisual conditions than in unimodal conditions, including a wide network of regions in posterior temporal cortex and prefrontal cortex. Taken together, our results suggest a prominent role for cross-region synchronization in understanding both visual-only and audiovisual speech.
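The PPI analysis mentioned above regresses a target region's time course on a seed region's time course, a task regressor, and their product; a positive interaction coefficient indicates that coupling between the regions changes with task condition. A minimal sketch on simulated data (all variable names, regressors, and numbers below are illustrative, not the study's actual pipeline):

```python
import random

def ols(X, y):
    # Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    M = [XtX[a] + [Xty[a]] for a in range(p)]  # augmented matrix
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))  # partial pivot
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[a][p] / M[a][a] for a in range(p)]

def ppi_betas(seed, task, target):
    # Design matrix: intercept, task regressor, seed time course,
    # and their product (the psychophysiological interaction term).
    X = [[1.0, t, s, t * s] for s, t in zip(seed, task)]
    return ols(X, target)

random.seed(0)
task = [i % 2 for i in range(200)]                 # boxcar: unimodal vs. audiovisual blocks
seed = [random.gauss(0, 1) for _ in range(200)]    # simulated seed-region signal
# Simulated target-region signal: coupling to the seed doubles during task blocks.
target = [0.5 + 1.0 * s + 2.0 * t * s + random.gauss(0, 0.1)
          for s, t in zip(seed, task)]
b0, b_task, b_seed, b_ppi = ppi_betas(seed, task, target)
```

With these simulated data, the recovered interaction weight `b_ppi` approximates the true task-dependent coupling (2.0), which is the quantity a PPI analysis tests for significance.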


2011 ◽  
Vol 24 (1) ◽  
pp. 67-90 ◽  
Author(s):  
Riikka Möttönen ◽  
Kaisa Tiippana ◽  
Mikko Sams ◽  
Hanna Puharinen

Audiovisual speech perception has been considered to operate independently of sound location, since the McGurk effect (altered auditory speech perception caused by conflicting visual speech) has been shown to be unaffected by whether speech sounds are presented in the same or a different location as the talking face. Here we show that sound location effects arise with manipulation of spatial attention. Sounds were presented from loudspeakers in five locations: the centre (the location of the talking face) and 45°/90° to the left/right. Auditory spatial attention was focused on a location by presenting the majority (90%) of sounds from that location. In Experiment 1, the majority of sounds emanated from the centre, and the McGurk effect was enhanced there. In Experiment 2, the attended location was 90° to the left, causing the McGurk effect to be stronger on the left and centre than on the right. Under control conditions, when sounds were presented with equal probability from all locations, the McGurk effect tended to be stronger for sounds emanating from the centre, but this tendency was not reliable. Additionally, reaction times were shortest for congruent audiovisual stimuli, independent of location. Our main finding is that sound location can modulate audiovisual speech perception, and that spatial attention plays a role in this modulation.


Author(s):  
Dominic W. Massaro ◽  
Alexandra Jesse

This article gives an overview of the main research questions and findings unique to audiovisual speech perception research, and discusses what general questions about speech perception and cognition research in this field can answer. The presence of a second perceptual source in audiovisual speech perception, compared to auditory speech perception alone, immediately raises the question of how information from the different perceptual sources is combined to reach the best overall decision. The article explores how our understanding of speech benefits from having the speaker's face present, and how this benefit makes transparent the nature of speech perception and word recognition. Modern communication methods such as Voice over Internet Protocol find wide acceptance, yet people remain reluctant to forfeit face-to-face communication. The article also considers the role of visual speech as a language-learning tool in multimodal training; information and information processing in audiovisual speech perception; the lexicon and word recognition; facial information for speech perception; and theories of audiovisual speech perception.


2018 ◽  
Vol 31 (1-2) ◽  
pp. 7-18 ◽  
Author(s):  
John MacDonald

In 1976 Harry McGurk and I published a paper in Nature, entitled ‘Hearing Lips and Seeing Voices’. The paper described a new audio–visual illusion we had discovered, showing that the perception of auditorily presented speech could be influenced by the simultaneous presentation of incongruent visual speech. This hitherto unknown effect has since had a profound impact on audiovisual speech perception research. The phenomenon has come to be known as the ‘McGurk effect’, and the original paper has been cited in excess of 4800 times. In this paper I describe the background to the discovery of the effect, the rationale for the generation of the initial stimuli, the construction of the exemplars used, and the serendipitous nature of the finding. The paper also covers the reaction (and non-reaction) to the Nature publication and the growth of research on, and utilizing, the ‘McGurk effect’, and ends with some reflections on the significance of the finding.


Author(s):  
Lawrence D. Rosenblum

Research on visual and audiovisual speech information has profoundly influenced the fields of psycholinguistics, perception psychology, and cognitive neuroscience. Visual speech findings have provided some of the most important human demonstrations of our new conception of the perceptual brain as being supremely multimodal. This “multisensory revolution” has seen tremendous growth in research on how the senses integrate, cross-facilitate, and share their experience with one another. The ubiquity and apparent automaticity of multisensory speech have led many theorists to propose that the speech brain is agnostic with regard to sense modality: it might not know or care from which modality speech information comes. Instead, the speech function may act to extract supramodal informational patterns that are common in form across energy streams. Alternatively, other theorists have argued that any common information existent across the modalities is minimal and rudimentary, so that multisensory perception largely depends on the observer’s associative experience between the streams. From this perspective, the auditory stream is typically considered primary for the speech brain, with visual speech simply appended to its processing. If the utility of multisensory speech is a consequence of supramodal informational coherence, then cross-sensory “integration” may be primarily a consequence of the informational input itself. If true, one would expect to see evidence for integration occurring early in the perceptual process, as well as in a largely complete and automatic/impenetrable manner. Alternatively, if multisensory speech perception is based on associative experience between the modal streams, then no constraints are dictated on how completely or automatically the senses integrate. There is behavioral and neurophysiological research supporting both perspectives. 
Much of this research is based on testing the well-known McGurk effect, in which audiovisual speech information is thought to integrate to the extent that visual information can affect what listeners report hearing. However, there is now good reason to believe that the McGurk effect is not a valid test of multisensory integration. For example, there are clear cases in which responses indicate that the effect fails, while other measures suggest that integration is actually occurring. By mistakenly conflating the McGurk effect with speech integration itself, interpretations of the completeness and automaticity of multisensory integration may be incorrect. Future research should use more sensitive behavioral and neurophysiological measures of cross-modal influence to examine these issues.


Perception ◽  
10.1068/p3316 ◽  
2003 ◽  
Vol 32 (8) ◽  
pp. 921-936 ◽  
Author(s):  
Maxine V McCotter ◽  
Timothy R Jordan

We conducted four experiments to investigate the role of colour and luminance information in visual and audiovisual speech perception. In Experiments 1a (stimuli presented in quiet conditions) and 1b (stimuli presented in auditory noise), face display types comprised naturalistic colour (NC), grey-scale (GS), and luminance inverted (LI) faces. In Experiments 2a (quiet) and 2b (noise), face display types comprised NC, colour inverted (CI), LI, and colour and luminance inverted (CLI) faces. Six syllables and twenty-two words were used to produce auditory and visual speech stimuli. Auditory and visual signals were combined to produce congruent and incongruent audiovisual speech stimuli. Experiments 1a and 1b showed that perception of visual speech, and its influence on identifying the auditory components of congruent and incongruent audiovisual speech, was poorer for LI than for either NC or GS faces, which produced identical results. Experiments 2a and 2b showed that perception of visual speech, and its influence on perception of incongruent auditory speech, was poorer for LI and CLI faces than for NC and CI faces (which produced identical patterns of performance). Our findings for NC and CI faces suggest that colour is not critical for perception of visual and audiovisual speech. The effect of luminance inversion on performance accuracy was relatively small (5%), which suggests that the luminance information preserved in LI faces is important for the processing of visual and audiovisual speech.


Author(s):  
Yi Yuan ◽  
Kelli Meyers ◽  
Kayla Borges ◽  
Yasneli Lleo ◽  
Katarina A. Fiorentino ◽  
...  

Purpose This study investigated the effects of visually presented speech envelope information, with various modulation rates and depths, on audiovisual speech perception in noise. Method Forty adults (21.25 ± 1.45 years) participated in audiovisual sentence recognition measurements in noise. Target speech sentences were presented auditorily in multitalker babble noise at a −3 dB SNR. Acoustic amplitude envelopes of the target signals were extracted through low-pass filters with different cutoff frequencies (4, 10, and 30 Hz) and a fixed modulation depth of 100% (Experiment 1), or with various modulation depths (0%, 25%, 50%, 75%, and 100%) and a fixed 10-Hz modulation rate (Experiment 2). The extracted target envelopes were synchronized with the size of a sphere, which was presented as the visual stimulus. Subjects were instructed to attend to both the auditory and visual stimuli of the target sentences and type their responses. Sentence recognition accuracy was compared between audio-only and audiovisual conditions. Results In Experiment 1, a significant improvement in speech intelligibility was observed when the visual analog (a sphere) was synchronized with the acoustic amplitude envelope modulated at a 10-Hz rate, compared to the audio-only condition. In Experiment 2, the visual analog with 75% modulation depth resulted in better audiovisual speech perception in noise than the other modulation depth conditions. Conclusion An abstract visual analog of acoustic amplitude envelopes can be efficiently delivered by the visual system and integrated online with auditory signals to enhance speech perception in noise, independent of particular articulatory movements.
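The envelope-extraction step described above can be sketched as rectification followed by low-pass smoothing. The toy filter below is a simple one-pole smoother, not necessarily the filter the authors used; the sample rate, carrier, and cutoffs are illustrative only:

```python
import math

def amplitude_envelope(samples, fs, cutoff_hz):
    """Rectify the waveform, then smooth with a one-pole low-pass filter.

    A crude stand-in for envelope extraction; the cutoff controls the
    modulation rate retained (e.g. 4, 10, or 30 Hz as in the study).
    """
    # One-pole low-pass: y[n] = y[n-1] + alpha * (|x[n]| - y[n-1])
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, y = [], 0.0
    for x in samples:
        y += alpha * (abs(x) - y)
        env.append(y)
    return env

fs = 8000
# A 500 Hz carrier, amplitude-modulated at 10 Hz with full (100%) depth.
t = [i / fs for i in range(fs)]
signal = [(0.5 + 0.5 * math.sin(2 * math.pi * 10 * ti)) *
          math.sin(2 * math.pi * 500 * ti) for ti in t]
env = amplitude_envelope(signal, fs, cutoff_hz=30)
```

The resulting `env` rises and falls at the 10-Hz modulation rate while the 500-Hz carrier is smoothed away; in the study's paradigm, a trace like this would drive the size of the visual sphere frame by frame.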


Perception ◽  
10.1068/p5852 ◽  
2007 ◽  
Vol 36 (10) ◽  
pp. 1535-1545 ◽  
Author(s):  
Ian T Everdell ◽  
Heidi Marsh ◽  
Micheal D Yurick ◽  
Kevin G Munhall ◽  
Martin Paré

Speech perception under natural conditions entails integration of auditory and visual information. Understanding how visual and auditory speech information are integrated requires detailed descriptions of the nature and processing of visual speech information. To understand better the process of gathering visual information, we studied the distribution of face-directed fixations of humans performing an audiovisual speech perception task to characterise the degree of asymmetrical viewing and its relationship to speech intelligibility. Participants showed stronger gaze fixation asymmetries while viewing dynamic faces, compared to static faces or face-like objects, especially when gaze was directed to the talkers' eyes. Although speech perception accuracy was significantly enhanced by the viewing of congruent, dynamic faces, we found no correlation between task performance and gaze fixation asymmetry. Most participants preferentially fixated the right side of the faces and their preferences persisted while viewing horizontally mirrored stimuli, different talkers, or static faces. These results suggest that the asymmetrical distributions of gaze fixations reflect the participants' viewing preferences, rather than being a product of asymmetrical faces, but that this behavioural bias does not predict correct audiovisual speech perception.


2021 ◽  
pp. 1-17
Author(s):  
Yuta Ujiie ◽  
Kohske Takahashi

While visual information from facial speech modulates auditory speech perception, it is less influential on audiovisual speech perception among autistic individuals than among typically developed individuals. In this study, we investigated the relationship between autistic traits (Autism-Spectrum Quotient; AQ) and the influence of visual speech on the recognition of Rubin’s vase-type speech stimuli with degraded facial speech information. Participants were 31 university students (13 males and 18 females; mean age: 19.2 years, SD: 1.13) who reported normal (or corrected-to-normal) hearing and vision. All participants completed three speech recognition tasks (visual, auditory, and audiovisual stimuli) and the Japanese version of the AQ. The results showed that speech recognition accuracy for visual (i.e., lip-reading) and auditory stimuli was not significantly related to participants’ AQ. In contrast, audiovisual speech perception was less influenced by facial speech information among individuals with high autistic traits than among those with low autistic traits. This weaker influence of visual information on audiovisual speech perception in autism spectrum disorder (ASD) was robust regardless of the clarity of the visual information, suggesting a difficulty in the process of audiovisual integration rather than in the visual processing of facial speech.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246986
Author(s):  
Alma Lindborg ◽  
Tobias S. Andersen

Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is the McGurk illusion, where an auditory stimulus such as “ba” dubbed onto a visual stimulus such as “ga” produces the illusion of hearing “da”. Bayesian models of multisensory perception suggest that both the enhancement and the illusion can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the reliability of each sensory cue). However, to date no study has accounted for how binding and fusion each contribute to audiovisual speech perception. In this study, we exposed subjects to both congruent and incongruent audiovisual speech, manipulating the binding and fusion stages simultaneously by varying both the temporal offset (binding) and the auditory and visual signal-to-noise ratios (fusion). We fit two Bayesian models to the behavioural data and show that both can account for the enhancement effect in congruent audiovisual speech as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced-fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of both the McGurk illusion and congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
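The fusion stage of such models is commonly formalised as inverse-variance weighting of Gaussian cues, with a binding probability mixing the fused and unfused estimates. A toy sketch under those textbook assumptions (not the specific models fitted in the paper; all values are illustrative positions on an arbitrary phonetic continuum):

```python
def fuse(x_a, var_a, x_v, var_v):
    """Reliability-weighted fusion of an auditory and a visual cue.

    Each cue is a Gaussian likelihood; the fused estimate weights each
    cue by its inverse variance, so the more reliable cue dominates.
    """
    w_a = 1.0 / var_a
    w_v = 1.0 / var_v
    mean = (w_a * x_a + w_v * x_v) / (w_a + w_v)
    var = 1.0 / (w_a + w_v)
    return mean, var

def binding_mixture(x_a, var_a, x_v, var_v, p_bind):
    """Binding stage: with probability p_bind the cues share a common
    cause and are fused; otherwise the auditory cue stands alone
    (model averaging over the two hypotheses)."""
    fused, _ = fuse(x_a, var_a, x_v, var_v)
    return p_bind * fused + (1.0 - p_bind) * x_a

# Illustrative values on a "ba" .. "da" .. "ga" continuum: a reliable
# visual cue (low variance) pulls the percept away from the auditory cue.
estimate = binding_mixture(x_a=0.0, var_a=1.0, x_v=2.0, var_v=0.25, p_bind=0.8)
```

Lowering `p_bind` (e.g. with a large temporal offset) shifts the estimate back toward the auditory cue, which is how binding and fusion make separable predictions.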

