Audio Visual Integration with Competing Sources in the Framework of Audio Visual Speech Scene Analysis

[Figure: System architecture of AVSR based on missing feature theory and P-V grouping]

Audio-visual speech recognition (AVSR) is a promising approach to improving the noise robustness of speech recognition in the real world. For AVSR, the auditory and visual units are the phoneme and viseme, respectively. However, these are often misclassified in the real world because of noisy input. To solve this problem, we propose two psychologically-inspired approaches. One is audio-visual integration based on missing feature theory (MFT) to cope with missing or unreliable audio and visual features for recognition. The other is phoneme and viseme grouping based on coarse-to-fine recognition. Preliminary experiments show that these two approaches are effective for audio-visual speech recognition. Integration based on MFT with an appropriate weight improves the recognition performance by −5 dB. This is the case even in a noisy environment, in which most speech recognition systems do not work properly. Phoneme and viseme grouping further improved the AVSR performance, particularly at a low signal-to-noise ratio.

* This work is an extension of our publication “Tomoaki Koiwa et al.: Coarse speech recognition by audio-visual integration based on missing feature theory, IROS 2007, pp.1751-1756, 2007.”
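The abstract does not spell out how the weighted, MFT-based integration is computed, so the following is only a minimal sketch of the general idea, not the authors' system. It assumes hard 0/1 reliability masks (features flagged unreliable are simply skipped) and a single stream weight; the function names, the `audio_weight` parameter, and the toy log-likelihoods are all hypothetical.

```python
def fuse_log_likelihoods(audio_ll, visual_ll, audio_mask, visual_mask, audio_weight):
    """Weighted audio-visual fusion with hard missing-feature masks.

    audio_ll / visual_ll: dict mapping class label -> list of per-feature
        log-likelihoods for that stream (hypothetical inputs).
    audio_mask / visual_mask: lists of 0/1 flags; a 0 marks a feature as
        unreliable, so it is left out of the score.
    audio_weight: stream weight in [0, 1]; the visual stream gets 1 - weight.
    """
    scores = {}
    for label in audio_ll:
        audio_score = sum(ll for ll, keep in zip(audio_ll[label], audio_mask) if keep)
        visual_score = sum(ll for ll, keep in zip(visual_ll[label], visual_mask) if keep)
        scores[label] = audio_weight * audio_score + (1.0 - audio_weight) * visual_score
    return scores


def classify(scores):
    """Return the label with the highest fused score."""
    return max(scores, key=scores.get)


# Toy example: the second audio feature is corrupted by noise.
audio_ll = {"pa": [-1.0, -5.0], "ba": [-2.0, -1.0]}
visual_ll = {"pa": [-0.5], "ba": [-3.0]}

masked = fuse_log_likelihoods(audio_ll, visual_ll, [1, 0], [1], audio_weight=0.5)
unmasked = fuse_log_likelihoods(audio_ll, visual_ll, [1, 1], [1], audio_weight=0.5)
print(classify(masked), classify(unmasked))
```

In this toy case, masking the corrupted audio feature changes which class wins, which is the kind of effect a reliability mask is meant to have. A real system would estimate the masks and the stream weight from the signal rather than set them by hand.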


Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers

Trends in Hearing ◽

10.1177/2331216519837866 ◽

2019 ◽

Vol 23 ◽

pp. 233121651983786 ◽

Cited By ~ 2

Author(s):

Catherine L. Blackburn ◽

Pádraig T. Kitterick ◽

Gary Jones ◽

Christian J. Sumner ◽

Paula C. Stacey

Keyword(s):

Speech Intelligibility ◽

Noise Signal ◽

Sine Wave ◽

Theory Model ◽

Visual Speech ◽

Clear Speech ◽

Visual Integration ◽

Degraded Speech ◽

The Face ◽

Vocoded Speech

Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of a talker. This is of particular value to hearing impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility. How these factors impact on the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech, the visual speech benefit was unaffected by the number of background talkers. For vocoded speech, a larger benefit was found when there was only one background talker. Experiment 2 showed that visual speech benefit depended upon the audio intelligibility of the talker and increased as intelligibility decreased. Degrading the speech by vocoding resulted in even greater benefit from visual speech information. A single “independent noise” signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, similar to audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.
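The abstract names an “independent noise” signal detection theory model without giving its equations. As a rough illustration only, the sketch below uses the textbook form of that idea: if the audio and visual cues carry independent noise, the combined sensitivity is the root sum of squares of the unimodal sensitivities. The conversion to proportion correct assumes an unbiased, equal-variance yes/no observer, which is an assumption made here and may differ from the model used in the study; all function names are hypothetical.

```python
import math
from statistics import NormalDist


def predicted_av_dprime(d_audio, d_visual):
    """Independent-noise prediction: d'_AV = sqrt(d'_A^2 + d'_V^2)."""
    return math.hypot(d_audio, d_visual)


def proportion_correct(d_prime):
    """Proportion correct for an unbiased equal-variance yes/no observer."""
    return NormalDist().cdf(d_prime / 2.0)


def predicted_visual_benefit(d_audio, d_visual):
    """Predicted gain in proportion correct from adding the visual cue."""
    d_av = predicted_av_dprime(d_audio, d_visual)
    return proportion_correct(d_av) - proportion_correct(d_audio)


# Illustrative values only, not data from the study.
print(predicted_av_dprime(3.0, 4.0))
print(predicted_visual_benefit(0.5, 1.0), predicted_visual_benefit(2.0, 1.0))
```

Under these assumptions the predicted benefit from the same visual cue is larger when audio sensitivity is lower, which matches the direction of the Experiment 2 result. The abstract reports that a model of this kind could not predict the different levels of benefit across background and target talkers.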


Audio-visual speech scene analysis: Characterization of the dynamics of unbinding and rebinding the McGurk effect

The Journal of the Acoustical Society of America ◽

10.1121/1.4904536 ◽

2015 ◽

Vol 137 (1) ◽

pp. 362-377 ◽

Cited By ~ 22

Author(s):

Olha Nahorna ◽

Frédéric Berthommier ◽

Jean-Luc Schwartz

Keyword(s):

McGurk Effect ◽

Visual Speech ◽

Scene Analysis


Lipreading and audio-visual speech perception

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.1992.0009 ◽

1992 ◽

Vol 335 (1273) ◽

pp. 71-78 ◽

Cited By ~ 201

Keyword(s):

Speech Perception ◽

Visual Cues ◽

Visual Speech ◽

Speech Signals ◽

Visual Integration ◽

Computer Animations ◽

Visual Speech Perception

This paper reviews progress in understanding the psychology of lipreading and audio-visual speech perception. It considers four questions. What distinguishes better from poorer lipreaders? What are the effects of introducing a delay between the acoustical and optical speech signals? What have attempts to produce computer animations of talking faces contributed to our understanding of the visual cues that distinguish consonants and vowels? Finally, how should the process of audio-visual integration in speech perception be described; that is, how are the sights and sounds of talking faces represented at their conflux?
