Automatic audiovisual integration in speech perception

2005 ◽  
Vol 167 (1) ◽  
pp. 66-75 ◽  
Author(s):  
Maurizio Gentilucci ◽  
Luigi Cattaneo

2012 ◽  
Author(s):  
Joseph D. W. Stephens ◽  
Julian L. Scrivens ◽  
Amy A. Overman

1995 ◽  
Vol 48 (2) ◽  
pp. 320-333 ◽  
Author(s):  
Eugen Diesch

If a place-of-articulation contrast is created between the auditory and the visual component syllables of videotaped speech, the syllable that listeners report hearing frequently differs phonetically from the auditory component. These "McGurk effects", as they have come to be called, show that speech perception may involve some kind of intermodal process. There are two classes of these phenomena: fusions and combinations. Perception of the syllable /da/ when auditory /ba/ and visual /ga/ are presented provides a clear example of the former, and perception of the string /bga/ after presentation of auditory /ga/ and visual /ba/ an unambiguous instance of the latter. Besides perceptual fusions and combinations, listeners sometimes report hearing the visually presented component syllable, which likewise shows an influence of vision on audition. It is argued that these "visual" responses arise from basically the same underlying processes that yield fusions and combinations, respectively. In the present study, the visual component of audiovisually incongruent CV syllables was presented in either the left or the right visual hemifield. Audiovisual fusion responses showed a left-hemifield advantage, and audiovisual combination responses a right-hemifield advantage. This finding suggests that the process of audiovisual integration differs between audiovisual fusions and combinations and, furthermore, that the two cerebral hemispheres contribute differentially to the two classes of response.


2012 ◽  
Vol 25 (0) ◽  
pp. 105 ◽  
Author(s):  
Tobias Søren Andersen

Seeing the talking face can influence the phoneme perceived from the voice. This facilitates speech perception in the natural case where the face and voice are congruent, and can cause the McGurk illusion when they are not. The classical example of the McGurk illusion is acoustic /aba/ being perceived as /ada/ when dubbed onto a face articulating /aga/. In order to fully understand the underlying process of integrating information across the senses, we need a computational account with predictive power. The Fuzzy Logical Model of Perception is one such computational account of audiovisual integration in speech perception. Here we describe alternative accounts in which integration is based on an early, continuous internal representation into which the phonetic classes fall. We show that these alternative accounts can provide just as good a fit when corrected for the number of free parameters. We also show, using cross-validation, that they have greater, but not great, predictive power. Finally, we show that introducing a regularization term can remedy the lack of predictive power. With regularization, models based on continuous representations have the highest predictive power.
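The Fuzzy Logical Model of Perception mentioned above combines evidence from each modality multiplicatively: each modality assigns every phoneme category a fuzzy truth value (support) in [0, 1], and the response probability for category *i* is a_i·v_i normalized over all categories. The sketch below illustrates this integration rule on made-up support values (the numbers are illustrative, not fitted data from the study):

```python
def flmp_integrate(auditory_support, visual_support):
    """Fuzzy Logical Model of Perception integration rule.

    Both arguments map phoneme labels to support values in [0, 1].
    Returns response probabilities P(i | A, V) = a_i * v_i / sum_j a_j * v_j.
    """
    joint = {k: auditory_support[k] * visual_support[k]
             for k in auditory_support}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

# Illustrative McGurk-style case: the voice mostly supports /aba/,
# the face mostly supports /aga/, and /ada/ receives moderate support
# from both modalities, so the fused percept /ada/ wins after
# multiplicative integration.
auditory = {"aba": 0.8, "ada": 0.6, "aga": 0.1}
visual = {"aba": 0.1, "ada": 0.6, "aga": 0.8}
probs = flmp_integrate(auditory, visual)
print(max(probs, key=probs.get))  # prints "ada"
```

The multiplicative rule is what makes the FLMP a "late integration" account: each modality is categorized independently before the supports are combined. The continuous-representation alternatives discussed in the abstract instead integrate before categorization.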


2007 ◽  
Vol 60 (10) ◽  
pp. 1446-1456 ◽  
Author(s):  
Stefan R. Schweinberger ◽  
David Robertson ◽  
Jürgen M. Kaufmann

While audiovisual integration is well known in speech perception, faces and speech are also informative with respect to speaker recognition. To date, audiovisual integration in the recognition of familiar people has never been demonstrated. Here we show systematic benefits and costs for the recognition of familiar voices when these are combined with time-synchronized articulating faces, of corresponding or noncorresponding speaker identity, respectively. While these effects were strong for familiar voices, they were smaller or nonsignificant for unfamiliar voices, suggesting that the effects depend on the previous creation of a multimodal representation of a person's identity. Moreover, the effects were reduced or eliminated when voices were combined with the same faces presented as static pictures, demonstrating that the effects do not simply reflect the use of facial identity as a “cue” for voice recognition. This is the first direct evidence for audiovisual integration in person recognition.

