Integration and Temporal Processing of Asynchronous Audiovisual Speech

2018, Vol 30 (3), pp. 319-337. Author(s): David M. Simon, Mark T. Wallace

Multisensory integration of visual mouth movements with auditory speech is known to offer substantial perceptual benefits, particularly under challenging (i.e., noisy) acoustic conditions. Previous work characterizing this process has found that ERPs to auditory speech are of shorter latency and smaller magnitude in the presence of visual speech. Using EEG, we sought to determine how these effects depend on the temporal relationship between the auditory and visual speech streams. We found that reductions in ERP latency and suppression of ERP amplitude are maximal when the visual signal precedes the auditory signal by a small interval and that increasing amounts of asynchrony reduce these effects in a continuous manner. Time–frequency analysis revealed that these effects are found primarily in the theta (4–8 Hz) and alpha (8–12 Hz) bands, with a central topography consistent with auditory generators. Theta effects also persisted in the lower portion of the band (3.5–5 Hz), and this late activity was more frontally distributed. Importantly, the magnitude of these late theta oscillations not only differed with the temporal characteristics of the stimuli but also predicted participants' task performance. Our analysis thus reveals that suppression of single-trial brain responses by visual speech depends strongly on the temporal concordance of the auditory and visual inputs. It further illustrates that processes in the lower theta band, which we suggest as an index of incongruity processing, may reflect the neural correlates of individual differences in multisensory temporal perception.
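
The band-limited, single-trial power analysis described above can be illustrated with a minimal sketch: convolve one EEG channel with complex Morlet wavelets and average power within the theta and alpha bands. The sampling rate, band edges, number of wavelet cycles, and the synthetic signal below are assumptions for demonstration only, not the authors' pipeline.

```python
# Illustrative sketch: single-trial theta/alpha power from one EEG channel using
# hand-built complex Morlet wavelets. All parameters and data are assumed, not the
# study's actual analysis settings.
import numpy as np
from scipy.signal import fftconvolve

fs = 500.0                                  # assumed sampling rate (Hz)
t = np.arange(0, 3.0, 1 / fs)               # one 3-s "trial"
rng = np.random.default_rng(0)
eeg = np.sin(2 * np.pi * 6 * t) + 0.5 * rng.standard_normal(t.size)  # fake single trial

def morlet(freq, fs, n_cycles=7):
    """Complex Morlet wavelet centered at `freq` with roughly n_cycles cycles."""
    sigma_t = n_cycles / (2 * np.pi * freq)            # temporal width (s)
    tw = np.arange(-4 * sigma_t, 4 * sigma_t, 1 / fs)
    return np.exp(-tw**2 / (2 * sigma_t**2)) * np.exp(2j * np.pi * freq * tw)

def band_power(signal, fs, freqs):
    """Time course of power averaged over wavelet center frequencies (unnormalized)."""
    power = [np.abs(fftconvolve(signal, morlet(f, fs), mode="same"))**2 for f in freqs]
    return np.mean(power, axis=0)

theta = band_power(eeg, fs, np.arange(4.0, 8.5, 0.5))    # 4-8 Hz
alpha = band_power(eeg, fs, np.arange(8.0, 12.5, 0.5))   # 8-12 Hz
print(f"mean theta power: {theta.mean():.2f}, mean alpha power: {alpha.mean():.2f}")
```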

2020, Vol 117 (29), pp. 16920-16927. Author(s): John Plass, David Brang, Satoru Suzuki, Marcia Grabowecky

Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time–frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.
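
The spectral-versus-temporal degradation manipulation lends itself to a simple illustration: smear a spectrogram along the frequency axis (degrading spectral detail while preserving timing) or along the time axis (the reverse). The sketch below is a generic approximation using Gaussian smoothing; the filter widths, window parameters, and placeholder signal are assumptions, not the authors' exact procedure.

```python
# Illustrative sketch: selectively degrading the spectral or temporal dimension of a
# speech spectrogram by blurring along one axis. Smoothing widths are arbitrary
# assumptions; the study's actual degradation method may differ.
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import gaussian_filter1d

fs = 16000
rng = np.random.default_rng(0)
audio = rng.standard_normal(fs * 2)          # placeholder 2-s "speech" signal

f, t, S = spectrogram(audio, fs=fs, nperseg=512, noverlap=384)

# Spectral degradation: blur across frequency bins (axis 0), preserving timing.
S_spectral_degraded = gaussian_filter1d(S, sigma=8, axis=0)

# Temporal degradation: blur across time frames (axis 1), preserving spectra.
S_temporal_degraded = gaussian_filter1d(S, sigma=8, axis=1)

print(S.shape, S_spectral_degraded.shape, S_temporal_degraded.shape)
```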


2019. Author(s): John Plass, David Brang, Satoru Suzuki, Marcia Grabowecky

Visual speech facilitates auditory speech perception, but the visual cues responsible for these effects and the crossmodal information they provide remain unclear. Because visible articulators shape the spectral content of auditory speech, we hypothesized that listeners may be able to extract spectrotemporal information from visual speech to facilitate auditory speech perception. To uncover statistical regularities that could subserve such facilitations, we compared the resonant frequency of the oral cavity to the shape of the oral aperture during speech. We found that the time-frequency dynamics of oral resonances could be recovered with unexpectedly high precision from the shape of the mouth during speech. Because both auditory frequency modulations and visual shape properties are neurally encoded as mid-level perceptual features, we hypothesized that this feature-level correspondence would allow for spectrotemporal information to be recovered from visual speech without reference to higher order (e.g., phonemic) speech representations. Isolating these features from other speech cues, we found that speech-based shape deformations improved sensitivity for corresponding frequency modulations, suggesting that the perceptual system exploits crossmodal correlations in mid-level feature representations to enhance speech perception. To test whether this correspondence could be used to improve comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by crossmodal recovery of auditory speech spectra. Visual speech may therefore facilitate perception by crossmodally restoring degraded spectrotemporal signals in speech.
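
As a toy illustration of recovering formant-like frequency trajectories from oral shape features, one could fit a linear map from frame-by-frame mouth measurements to formant frequencies and evaluate it on held-out frames. The features (aperture height and width), the synthetic data, and the least-squares fit below are hypothetical stand-ins; the authors' actual measurements and estimation procedure are not specified here.

```python
# Toy sketch: linear prediction of formant trajectories from mouth-shape features.
# The synthetic data and feature choices are hypothetical, not the study's measurements.
import numpy as np

rng = np.random.default_rng(1)
n_frames = 500

# Hypothetical per-frame visual features: lip aperture height, width, and their product.
height = rng.uniform(0.0, 3.0, n_frames)
width = rng.uniform(1.0, 5.0, n_frames)
X = np.column_stack([height, width, height * width, np.ones(n_frames)])

# Hypothetical "ground-truth" formants with a noisy linear dependence, so the example
# has structure for the regression to recover.
F1 = 300 + 150 * height + 30 * rng.standard_normal(n_frames)
F2 = 900 + 250 * width + 80 * rng.standard_normal(n_frames)
Y = np.column_stack([F1, F2])

# Fit on the first half of the frames, evaluate on the second half.
train, test = slice(0, 250), slice(250, None)
coefs, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)
pred = X[test] @ coefs

corr_F1 = np.corrcoef(pred[:, 0], Y[test][:, 0])[0, 1]
corr_F2 = np.corrcoef(pred[:, 1], Y[test][:, 1])[0, 1]
print(f"held-out correlation: F1={corr_F1:.2f}, F2={corr_F2:.2f}")
```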


2019, Vol 62 (10), pp. 3860-3875. Author(s): Kaylah Lalonde, Lynne A. Werner

Purpose: This study assessed the extent to which 6- to 8.5-month-old infants and 18- to 30-year-old adults detect and discriminate auditory syllables in noise better in the presence of visual speech than in auditory-only conditions. In addition, we examined whether visual cues to the onset and offset of the auditory signal account for this benefit. Method: Sixty infants and 24 adults were randomly assigned to speech detection or discrimination tasks and were tested using a modified observer-based psychoacoustic procedure. Each participant completed 1–3 of the following conditions: auditory-only, with visual speech, and with a visual signal that cued only the onset and offset of the auditory syllable. Results: Mixed linear modeling indicated that infants and adults benefited from visual speech on both tasks. Adults relied on the onset–offset cue for detection, but the same cue did not improve their discrimination. The onset–offset cue benefited infants for both detection and discrimination. Whereas the onset–offset cue improved detection similarly for infants and adults, the full visual speech signal benefited infants to a lesser extent than adults on the discrimination task. Conclusions: These results suggest that infants' use of visual onset–offset cues is mature, but their ability to use more complex visual speech cues is still developing. Additional research is needed to explore differences in audiovisual enhancement (a) of speech discrimination across speech targets and (b) with increasingly complex tasks and stimuli.
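
A hedged sketch of the kind of mixed linear model mentioned in the Results, using statsmodels with a random intercept per participant; the data frame, column names, simulated scores, and model formula are illustrative assumptions rather than the authors' specification.

```python
# Illustrative sketch: a mixed linear model of audiovisual benefit with a random
# intercept per participant. Column names, formula, and simulated data are assumptions
# for demonstration; they do not reproduce the study's analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
subjects = np.repeat(np.arange(40), 3)
condition = np.tile(["auditory_only", "onset_offset", "visual_speech"], 40)
group = np.repeat(rng.choice(["infant", "adult"], size=40), 3)

# Simulated performance scores with a small audiovisual benefit added.
score = rng.normal(70, 8, subjects.size)
score += np.where(condition == "visual_speech", 6, 0)
score += np.where(condition == "onset_offset", 3, 0)

data = pd.DataFrame({"subject": subjects, "condition": condition,
                     "group": group, "score": score})

# Random intercept for subject; fixed effects of condition, group, and their interaction.
model = smf.mixedlm("score ~ condition * group", data, groups=data["subject"])
result = model.fit()
print(result.summary())
```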


Interpreting, 2000, Vol 5 (2), pp. 95-115. Author(s): Alexandra Jesse, Nick Vrignaud, Michael M. Cohen, Dominic W. Massaro

Language processing is influenced by multiple sources of information. We examined whether performance in simultaneous interpreting would improve when two sources of information were provided, namely the auditory speech together with the corresponding lip movements, compared with presenting the auditory speech alone. Although visible speech improved sentence recognition, there was no difference in performance between the two presentation conditions when bilinguals simultaneously interpreted from English to German or from English to Spanish. One reason why visual speech did not contribute to interpreting performance may be that the auditory signal was presented without noise (Massaro, 1998). This hypothesis should be tested in future work, and it should also be investigated whether an effect of visible speech emerges in other contexts, in which visual information can provide cues to emotion, prosody, or syntax.


2015, Vol 27 (5), pp. 1017-1028. Author(s): Paul Metzner, Titus von der Malsburg, Shravan Vasishth, Frank Rösler

Recent research has shown that brain potentials time-locked to fixations in natural reading can be similar to brain potentials recorded during rapid serial visual presentation (RSVP). We attempted two replications of Hagoort, Hald, Bastiaansen, and Petersson [Hagoort, P., Hald, L., Bastiaansen, M., & Petersson, K. M. Integration of word meaning and world knowledge in language comprehension. Science, 304, 438–441, 2004] to determine whether this correspondence also holds for oscillatory brain responses. Hagoort et al. reported an N400 effect and synchronization in the theta and gamma range following world knowledge violations. Our first experiment (n = 32) used RSVP and replicated both the N400 effect in the ERPs and the power increase in the theta range in the time–frequency domain. In the second experiment (n = 49), participants read the same materials freely while their eye movements and their EEG were monitored. First fixation durations, gaze durations, and regression rates were increased, and the ERP showed an N400 effect. An analysis of time–frequency representations showed synchronization in the delta range (1–3 Hz) and desynchronization in the upper alpha range (11–13 Hz) but no theta or gamma effects. The results suggest that oscillatory EEG changes elicited by world knowledge violations are different in natural reading and RSVP. This may reflect differences in how representations are constructed and retrieved from memory in the two presentation modes.
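
The N400 comparison reported here amounts to averaging EEG epochs per condition and contrasting mean amplitude in a post-stimulus window, typically around 300-500 ms. The sketch below illustrates that computation on synthetic epochs; the window, sampling rate, and simulated deflection are assumed values, not the study's parameters.

```python
# Illustrative sketch: quantifying an N400-style effect as the difference in mean ERP
# amplitude (300-500 ms window) between violation and control epochs. Epochs are
# synthetic; window and sampling rate are assumptions.
import numpy as np

fs = 500                                  # Hz (assumed)
times = np.arange(-0.2, 0.8, 1 / fs)      # epoch from -200 to 800 ms
rng = np.random.default_rng(3)

def make_epochs(n, n400_amp):
    """Synthetic epochs: noise plus a negative deflection peaking near 400 ms."""
    bump = n400_amp * np.exp(-((times - 0.4) ** 2) / (2 * 0.05 ** 2))
    return rng.standard_normal((n, times.size)) - bump

control = make_epochs(60, n400_amp=0.5)
violation = make_epochs(60, n400_amp=1.5)   # larger N400 for world-knowledge violations

window = (times >= 0.3) & (times <= 0.5)
effect = violation[:, window].mean() - control[:, window].mean()
print(f"N400 effect (violation - control) in 300-500 ms window: {effect:.2f} (a.u.)")
```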


2020. Author(s): Jonathan E. Peelle, Brent Spehar, Michael S. Jones, Sarah McConkey, Joel Myerson, ...

In everyday conversation, we usually process the talker's face as well as the sound of their voice. Access to visual speech information is particularly useful when the auditory signal is degraded. Here we used fMRI to monitor brain activity while adults (n = 60) were presented with visual-only, auditory-only, and audiovisual words. As expected, audiovisual speech perception recruited both auditory and visual cortex, with a trend towards increased recruitment of premotor cortex in more difficult conditions (for example, in substantial background noise). We then investigated neural connectivity using psychophysiological interaction (PPI) analysis with seed regions in both primary auditory cortex and primary visual cortex. Connectivity between auditory and visual cortices was stronger in audiovisual conditions than in unimodal conditions and extended to a wide network of regions in posterior temporal cortex and prefrontal cortex. Taken together, our results suggest a prominent role for cross-region synchronization in understanding both visual-only and audiovisual speech.
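
A simplified sketch of how a psychophysiological interaction (PPI) regressor can be formed and entered into a regression: the interaction term is the mean-centered seed time series multiplied by the task regressor. All signals below are synthetic, and the omission of HRF convolution/deconvolution is a deliberate simplification; this is not the authors' pipeline.

```python
# Simplified sketch of a PPI analysis: regress a target region's time series on the task
# regressor, the seed time series, and their interaction (the PPI term). HRF modeling is
# omitted for brevity; all data are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(4)
n_vols = 200

task = np.tile(np.repeat([0.0, 1.0], 10), 10)        # boxcar: audiovisual blocks on/off
seed = rng.standard_normal(n_vols)                    # e.g., primary auditory cortex signal
seed_c = seed - seed.mean()

ppi = seed_c * (task - task.mean())                   # interaction (PPI) regressor

# Synthetic target region whose coupling with the seed increases during task blocks.
target = 0.3 * seed_c + 0.6 * ppi + rng.standard_normal(n_vols)

X = np.column_stack([np.ones(n_vols), task, seed_c, ppi])
betas, *_ = np.linalg.lstsq(X, target, rcond=None)
print(f"PPI beta (task-dependent coupling): {betas[3]:.2f}")
```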


NeuroImage, 2015, Vol 121, pp. 39-50. Author(s): Nicolas Lebar, Pierre-Michel Bernier, Alain Guillaume, Laurence Mouchnino, Jean Blouin

2016, Vol 44 (1), pp. 185-215. Author(s): Susan Jerger, Markus F. Damian, Nancy Tye-Murray, Hervé Abdi

Adults use vision to perceive low-fidelity speech; yet how children acquire this ability is not well understood. The literature indicates that children show reduced sensitivity to visual speech from kindergarten to adolescence. We hypothesized that this pattern reflects the effects of complex tasks and a growth period with harder-to-utilize cognitive resources, not a lack of sensitivity. We investigated sensitivity to visual speech in children via the phonological priming produced by low-fidelity (non-intact onset) auditory speech presented audiovisually (see a dynamic face articulate the consonant/rhyme b/ag; hear the non-intact onset/rhyme –b/ag) vs. auditorily (see a still face; hear exactly the same auditory input). Audiovisual speech produced greater priming from four to fourteen years, indicating that visual speech filled in the non-intact auditory onsets. The influence of visual speech depended uniquely on phonology and speechreading. Children, like adults, perceive speech onsets multimodally. These findings are critical for incorporating visual speech into developmental theories of speech perception.


2002, Vol 112 (5), pp. 2245-2245. Author(s): Paul Bertelson, Jean Vroomen, Beatrice de Gelder

Author(s): Yoram Bonneh

Motion-induced blindness (MIB) is a phenomenon characterized by “visual disappearance” in which relatively small but salient visual objects may disappear from one’s awareness intermittently for several seconds when embedded within a moving pattern. It is a compelling example of multistable perception in which physically invariant stimulation leads to fluctuations in perception. The interest in MIB stems from its potential use in studying visual processing outside the locus of awareness and the neural correlates of consciousness. Current studies of MIB provide evidence against low-level suppression of the visual signal and demonstrate residual processing of the invisible. This chapter explores these and related concepts.

