The Role of Multisensory Temporal Covariation in Audiovisual Speech Recognition in Noise
Static and dynamic visual speech cues contribute to audiovisual (AV) speech recognition in noise. Static cues (e.g., “lipreading”) provide complementary information that enables perceivers to resolve ambiguous acoustic-phonetic content. The role of dynamic cues is less clear, but one suggestion is that temporal covariation between facial motion trajectories and the speech envelope enables perceivers to recover a more robust representation of the time-varying acoustic signal. Modeling studies show that this is computationally feasible, though it has not been confirmed experimentally. We conducted two experiments to determine whether AV speech recognition depends on the magnitude of cross-sensory (audiovisual) temporal coherence (AVC). In Experiment 1, sentence-keyword recognition in steady-state noise (SSN) was assessed across a range of signal-to-noise ratios (SNRs) for auditory and AV speech. The auditory signal was either unprocessed or filtered to remove 3-7 Hz temporal modulations. Filtering severely reduced AVC (quantified as the magnitude-squared coherence of lip trajectories with cochlear-narrowband speech envelopes), but did not reduce the magnitude of the AV advantage (AV > A; ~4 dB). This outcome did not depend on the presence of static cues, which was manipulated via facial blurring. Experiment 2 assessed AV speech recognition in SSN at a fixed SNR (-10.5 dB) for subsets of the Experiment 1 stimuli with naturally high or low AVC. A small effect (~5% correct; high-AVC > low-AVC) was observed. A computational model of AV speech intelligibility based on AVC yielded good overall predictions of performance, but over-predicted the differential effects of AVC. These results suggest that the role and/or computational characterization of AVC must be re-conceptualized.
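To make the AVC measure concrete, the following is a minimal sketch of a Welch-style magnitude-squared coherence estimate, averaged over the 3-7 Hz modulation band. It is not the analysis pipeline used in the experiments. The function names (`band_msc`, `dft_bin`, `hann`), the sampling rate, the segment length, and the synthetic signals are all illustrative assumptions. In particular, the "lip trajectory" here is white noise, the "envelopes" are noisy copies of it, and no cochlear filterbank is applied.

```python
import cmath
import math
import random


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_bin(segment, k):
    """Single DFT coefficient of `segment` at integer bin k."""
    n = len(segment)
    return sum(v * cmath.exp(-2j * math.pi * k * i / n)
               for i, v in enumerate(segment))


def band_msc(x, y, fs, seg_len, f_lo, f_hi):
    """Magnitude-squared coherence of x and y, averaged over [f_lo, f_hi] Hz.

    Spectra are averaged over non-overlapping, Hann-windowed segments
    (a simplified Welch estimate; hypothetical helper, not the paper's code).
    """
    window = hann(seg_len)
    n_seg = len(x) // seg_len
    bins = [k for k in range(1, seg_len // 2)
            if f_lo <= k * fs / seg_len <= f_hi]
    values = []
    for k in bins:
        sxx = syy = 0.0
        sxy = 0j
        for s in range(n_seg):
            lo = s * seg_len
            xs = [w * v for w, v in zip(window, x[lo:lo + seg_len])]
            ys = [w * v for w, v in zip(window, y[lo:lo + seg_len])]
            X, Y = dft_bin(xs, k), dft_bin(ys, k)
            sxx += abs(X) ** 2          # auto-spectrum of x
            syy += abs(Y) ** 2          # auto-spectrum of y
            sxy += X * Y.conjugate()    # cross-spectrum
        values.append(abs(sxy) ** 2 / (sxx * syy))
    return sum(values) / len(values)


# Synthetic stand-ins (assumptions): a white-noise "lip trajectory" and two
# "envelopes" that share it but differ in the amount of independent noise.
random.seed(0)
FS, SEG, N = 50, 50, 2000   # 50 Hz sampling, 1 s segments, 40 segments
lip = [random.gauss(0, 1) for _ in range(N)]
env_high = [v + 0.5 * random.gauss(0, 1) for v in lip]   # high-AVC pair
env_low = [v + 3.0 * random.gauss(0, 1) for v in lip]    # low-AVC pair

avc_high = band_msc(lip, env_high, FS, SEG, 3.0, 7.0)
avc_low = band_msc(lip, env_low, FS, SEG, 3.0, 7.0)
print(f"high-AVC pair: {avc_high:.2f}, low-AVC pair: {avc_low:.2f}")
```

The estimate is bounded between 0 and 1 and approaches 1 as the two signals become linearly related within the band. Real lip trajectories and narrowband envelopes would replace the synthetic signals, and overlapping segments or a library routine would likely be preferable in practice.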