Vision perceptually restores auditory spectral dynamics in speech

2020 ◽  
Vol 117 (29) ◽  
pp. 16920-16927 ◽  
Author(s):  
John Plass ◽  
David Brang ◽  
Satoru Suzuki ◽  
Marcia Grabowecky

Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time–frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.
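The spectral-versus-temporal degradation manipulation can be sketched in a few lines of Python. The sketch below is illustrative only, not the authors' stimulus pipeline: it smears the magnitude of a short-time Fourier transform along the frequency axis (degrading spectral detail) or along the time axis (degrading temporal detail) before resynthesis; the window length and smoothing width are arbitrary assumptions.

```python
# Minimal sketch of spectral vs. temporal spectrogram degradation (illustrative;
# filter sizes and the resynthesis method are assumptions, not the authors' pipeline).
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import uniform_filter1d

def degrade(audio, fs, axis="spectral", smear=8):
    """Smear the STFT magnitude along frequency ('spectral') or time ('temporal')."""
    f, t, Z = stft(audio, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    # axis 0 = frequency bins, axis 1 = time frames
    mag = uniform_filter1d(mag, size=smear, axis=0 if axis == "spectral" else 1)
    _, degraded = istft(mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return degraded

fs = 16000
audio = np.random.randn(fs)  # stand-in for a recorded sentence
spec_degraded = degrade(audio, fs, axis="spectral")   # spectral detail removed
temp_degraded = degrade(audio, fs, axis="temporal")   # temporal detail removed
```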

2019 ◽  
Author(s):  
John Plass ◽  
David Brang ◽  
Satoru Suzuki ◽  
Marcia Grabowecky

Visual speech facilitates auditory speech perception, but the visual cues responsible for these effects and the crossmodal information they provide remain unclear. Because visible articulators shape the spectral content of auditory speech, we hypothesized that listeners may be able to extract spectrotemporal information from visual speech to facilitate auditory speech perception. To uncover statistical regularities that could subserve such facilitations, we compared the resonant frequency of the oral cavity to the shape of the oral aperture during speech. We found that the time-frequency dynamics of oral resonances could be recovered with unexpectedly high precision from the shape of the mouth during speech. Because both auditory frequency modulations and visual shape properties are neurally encoded as mid-level perceptual features, we hypothesized that this feature-level correspondence would allow for spectrotemporal information to be recovered from visual speech without reference to higher order (e.g., phonemic) speech representations. Isolating these features from other speech cues, we found that speech-based shape deformations improved sensitivity for corresponding frequency modulations, suggesting that the perceptual system exploits crossmodal correlations in mid-level feature representations to enhance speech perception. To test whether this correspondence could be used to improve comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by crossmodal recovery of auditory speech spectra. Visual speech may therefore facilitate perception by crossmodally restoring degraded spectrotemporal signals in speech.
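The reported correspondence between mouth shape and oral resonances amounts to regressing a formant trajectory onto visual shape features. The minimal sketch below uses synthetic stand-in data and a hypothetical set of mouth-shape descriptors (lip height, width, area) to show how such a prediction could be quantified; it is not the authors' analysis.

```python
# Minimal regression sketch: predict a formant (F2) trajectory from mouth-shape
# features. Synthetic data; the features and model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_frames = 500
# Hypothetical mouth-shape descriptors per video frame: lip height, width, area.
mouth = rng.standard_normal((n_frames, 3))
# Hypothetical F2 trajectory that partially depends on mouth shape, plus noise.
true_weights = np.array([180.0, -120.0, 60.0])
f2 = 1500.0 + mouth @ true_weights + 50.0 * rng.standard_normal(n_frames)

# Ordinary least squares fit with an intercept term.
X = np.column_stack([np.ones(n_frames), mouth])
coef, *_ = np.linalg.lstsq(X, f2, rcond=None)
pred = X @ coef

r = np.corrcoef(pred, f2)[0, 1]
print(f"predicted-vs-actual F2 correlation: r = {r:.2f}")
```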


2020 ◽  
Author(s):  
Johannes Rennig ◽  
Michael S Beauchamp

Regions of the human posterior superior temporal gyrus and sulcus (pSTG/S) respond to the visual mouth movements that constitute visual speech and the auditory vocalizations that constitute auditory speech. We hypothesized that these multisensory responses in pSTG/S underlie the observation that comprehension of noisy auditory speech is improved when it is accompanied by visual speech. To test this idea, we presented audiovisual sentences that contained either a clear auditory component or a noisy auditory component while measuring brain activity using BOLD fMRI. Participants reported the intelligibility of the speech on each trial with a button press. Perceptually, adding visual speech to noisy auditory sentences rendered them much more intelligible. Post-hoc trial sorting was used to examine brain activations during noisy sentences that were more or less intelligible, focusing on multisensory speech regions in the pSTG/S identified with an independent visual speech localizer. Univariate analysis showed that less intelligible noisy audiovisual sentences evoked a weaker BOLD response, while more intelligible sentences evoked a stronger BOLD response that was indistinguishable from clear sentences. To better understand these differences, we conducted a multivariate representational similarity analysis. The pattern of response for intelligible noisy audiovisual sentences was more similar to the pattern for clear sentences, while the response pattern for unintelligible noisy sentences was less similar. These results show that, for both univariate and multivariate analyses, successful integration of visual and noisy auditory speech normalizes responses in pSTG/S, providing evidence that multisensory subregions of pSTG/S are responsible for the perceptual benefit of visual speech.

Significance Statement: Enabling social interactions, including the production and perception of speech, is a key function of the human brain. Speech perception is a complex computational problem that the brain solves using both visual information from the talker's facial movements and auditory information from the talker's voice. Visual speech information is particularly important under noisy listening conditions, when auditory speech is difficult or impossible to understand alone. Regions of the human cortex in the posterior superior temporal lobe respond to the visual mouth movements that constitute visual speech and the auditory vocalizations that constitute auditory speech. We show that the pattern of activity in cortex reflects the successful multisensory integration of auditory and visual speech information in the service of perception.
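The multivariate logic described here can be illustrated with a small pattern-similarity sketch: correlate the voxel-wise response pattern for each noisy audiovisual condition with the pattern for clear sentences. The data below are synthetic and the similarity measure (Pearson correlation) is an assumption, not the authors' full representational similarity pipeline.

```python
# Sketch of the pattern-similarity logic: compare multi-voxel response patterns
# for intelligible and unintelligible noisy sentences against clear sentences.
# Synthetic data; illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 200
clear = rng.standard_normal(n_voxels)
intelligible_noisy = clear + 0.5 * rng.standard_normal(n_voxels)   # similar pattern
unintelligible_noisy = rng.standard_normal(n_voxels)               # dissimilar pattern

def pattern_similarity(a, b):
    """Pearson correlation between two response patterns."""
    return np.corrcoef(a, b)[0, 1]

print("intelligible vs. clear:  ", round(pattern_similarity(intelligible_noisy, clear), 2))
print("unintelligible vs. clear:", round(pattern_similarity(unintelligible_noisy, clear), 2))
```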


2020 ◽  
Author(s):  
Aisling E. O’Sullivan ◽  
Michael J. Crosse ◽  
Giovanni M. Di Liberto ◽  
Alain de Cheveigné ◽  
Edmund C. Lalor

Seeing a speaker's face benefits speech comprehension, especially in challenging listening conditions. This perceptual benefit is thought to stem from the neural integration of visual and auditory speech at multiple stages of processing, whereby movement of a speaker's face provides temporal cues to auditory cortex, and articulatory information from the speaker's mouth can aid the recognition of specific linguistic units (e.g., phonemes, syllables). However, it remains unclear how the integration of these cues varies as a function of listening conditions. Here we sought to provide insight into these questions by examining EEG responses to natural audiovisual, audio, and visual speech in quiet and in noise. Specifically, we represented our speech stimuli in terms of their spectrograms and their phonetic features, and then quantified the strength of the encoding of those features in the EEG using canonical correlation analysis. The encoding of both spectrotemporal and phonetic features was shown to be more robust in audiovisual speech responses than would have been expected from the summation of the audio and visual speech responses, consistent with the literature on multisensory integration. Furthermore, the strength of this multisensory enhancement was more pronounced at the level of phonetic processing for speech in noise relative to speech in quiet, indicating that listeners rely more on articulatory details from visual speech in challenging listening conditions. These findings support the notion that the integration of audio and visual speech is a flexible, multistage process that adapts to optimize comprehension based on the current listening conditions.

Significance Statement: During conversation, visual cues impact our perception of speech. Integration of auditory and visual speech is thought to occur at multiple stages of speech processing and to vary flexibly depending on the listening conditions. Here we examine audiovisual integration at two stages of speech processing using the speech spectrogram and a phonetic representation, and test how audiovisual integration adapts to degraded listening conditions. We find significant integration at both of these stages regardless of listening conditions, and when the speech is noisy, we find enhanced integration at the phonetic stage of processing. These findings provide support for the multistage integration framework and demonstrate its flexibility in terms of a greater reliance on visual articulatory information in challenging listening conditions.
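A minimal sketch of the encoding analysis, assuming synthetic data and a simplified setup with no time lags: canonical correlation analysis (CCA) finds linear components of the stimulus features and the EEG that are maximally correlated, and the resulting canonical correlations index how strongly the features are encoded. This is not the authors' pipeline.

```python
# Sketch of quantifying stimulus-feature encoding in EEG with CCA.
# Synthetic data, no time-lag handling; illustrative assumptions throughout.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
n_samples, n_channels, n_features = 5000, 64, 16

# Hypothetical stimulus features (e.g., spectrogram bands or phonetic features).
stim = rng.standard_normal((n_samples, n_features))
# Hypothetical EEG that partly encodes the stimulus, plus noise.
mixing = rng.standard_normal((n_features, n_channels))
eeg = stim @ mixing + 5.0 * rng.standard_normal((n_samples, n_channels))

cca = CCA(n_components=4)
stim_c, eeg_c = cca.fit_transform(stim, eeg)

# Canonical correlations summarize how strongly the features are encoded.
cors = [np.corrcoef(stim_c[:, k], eeg_c[:, k])[0, 1] for k in range(4)]
print("canonical correlations:", np.round(cors, 2))
```

In practice the encoding strength for audiovisual responses would be compared against the sum of the audio-only and visual-only responses; the sketch only shows the core CCA step.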


2018 ◽  
Vol 30 (3) ◽  
pp. 319-337 ◽  
Author(s):  
David M. Simon ◽  
Mark T. Wallace

Multisensory integration of visual mouth movements with auditory speech is known to offer substantial perceptual benefits, particularly under challenging (i.e., noisy) acoustic conditions. Previous work characterizing this process has found that ERPs to auditory speech are of shorter latency and smaller magnitude in the presence of visual speech. We sought to determine the dependency of these effects on the temporal relationship between the auditory and visual speech streams using EEG. We found that reductions in ERP latency and suppression of ERP amplitude are maximal when the visual signal precedes the auditory signal by a small interval and that increasing amounts of asynchrony reduce these effects in a continuous manner. Time–frequency analysis revealed that these effects are found primarily in the theta (4–8 Hz) and alpha (8–12 Hz) bands, with a central topography consistent with auditory generators. Theta effects also persisted in the lower portion of the band (3.5–5 Hz), and this late activity was more frontally distributed. Importantly, the magnitude of these late theta oscillations not only differed with the temporal characteristics of the stimuli but also served to predict participants' task performance. Our analysis thus reveals that suppression of single-trial brain responses by visual speech depends strongly on the temporal concordance of the auditory and visual inputs. It further illustrates that processes in the lower theta band, which we suggest as an index of incongruity processing, might serve to reflect the neural correlates of individual differences in multisensory temporal perception.
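Band-limited single-trial power of the kind analyzed here is commonly obtained by convolving the EEG with complex Morlet wavelets. The sketch below, on a synthetic single channel with an assumed sampling rate and wavelet parameters, extracts theta (4-8 Hz) and alpha (8-12 Hz) power envelopes; it is a generic illustration, not the study's analysis code.

```python
# Sketch of a complex-Morlet time-frequency decomposition for theta and alpha bands.
# Single synthetic channel; sampling rate and wavelet parameters are assumptions.
import numpy as np

fs = 250.0
t = np.arange(0, 2.0, 1.0 / fs)
eeg = np.sin(2 * np.pi * 6 * t) + 0.5 * np.random.randn(t.size)  # synthetic trial

def morlet_power(signal_1d, fs, freq, n_cycles=5.0):
    """Power envelope at one frequency via convolution with a complex Morlet wavelet."""
    sigma_t = n_cycles / (2 * np.pi * freq)
    wt = np.arange(-3 * sigma_t, 3 * sigma_t, 1.0 / fs)
    wavelet = np.exp(2j * np.pi * freq * wt) * np.exp(-wt**2 / (2 * sigma_t**2))
    wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))   # unit-energy normalization
    analytic = np.convolve(signal_1d, wavelet, mode="same")
    return np.abs(analytic) ** 2

theta_power = np.mean([morlet_power(eeg, fs, f) for f in range(4, 9)], axis=0)
alpha_power = np.mean([morlet_power(eeg, fs, f) for f in range(8, 13)], axis=0)
```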


Neurosurgery ◽  
2019 ◽  
Vol 66 (Supplement_1) ◽  
Author(s):  
Patrick J Karas ◽  
John F Magnotti ◽  
Zhengjia Wang ◽  
Brian A Metzger ◽  
Daniel Yoshor ◽  
...  

INTRODUCTION: Speech is multisensory. The addition of visual speech to auditory speech greatly improves comprehension, especially under noisy auditory conditions. However, the neural mechanism for this visual enhancement of auditory speech is poorly understood. We used electrocorticography (ECoG) to study how auditory, visual, and audiovisual speech is processed in the posterior superior temporal gyrus (pSTG), an area of auditory association cortex involved in audiovisual speech integration. We hypothesized that early visual mouth movements modulate audiovisual speech integration through a mechanism of cross-modal suppression, suggesting that the pSTG response to early mouth movements should correlate with the comprehension benefit gained by adding visual speech to auditory speech.

METHODS: Words were presented under auditory-only (AUD), visual-only (VIS), and audiovisual (AV) conditions to epilepsy patients (n = 8) implanted with intracranial electrodes for phase-2 monitoring. We measured high-frequency broadband activity (75-150 Hz), a marker of local neuronal firing, in 28 electrodes over the pSTG.

RESULTS: The early neural response to visual-only words was compared to the reduction in neural response from AUD to AV words, a reduction correlated with the improvement in speech comprehension that occurs with the addition of visual to auditory speech. For words that showed a comprehension benefit with the addition of visual speech, there was a strong early response to visual speech and a correlation between the early visual response and the AUD-AV difference (r = 0.64, P = 10^-4). For words where visual speech did not provide any comprehension benefit, there was a weak early visual response and no correlation (r = 0.18, P = .35).

CONCLUSION: Words with a visual speech comprehension benefit also elicit a strong neural response to early visual speech in pSTG, while words with no comprehension benefit do not cause a strong early response. This suggests that cross-modal suppression of auditory association cortex (pSTG) by early visual speech plays an important role in audiovisual speech perception.
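High-frequency broadband activity of the kind measured here is typically estimated by band-pass filtering each channel in the 75-150 Hz range and taking the amplitude envelope of the analytic signal. The sketch below uses synthetic data, an assumed sampling rate, and assumed filter settings; it is not the authors' exact pipeline.

```python
# Sketch of extracting 75-150 Hz high-frequency broadband (HFB) power from one
# ECoG channel with a band-pass filter and the Hilbert envelope. Illustrative only.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0
t = np.arange(0, 3.0, 1.0 / fs)
ecog = np.random.randn(t.size)  # stand-in for a recorded pSTG channel

def hfb_envelope(x, fs, band=(75.0, 150.0), order=4):
    """Band-pass in the HFB range, then take the analytic-signal amplitude envelope."""
    b, a = butter(order, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    filtered = filtfilt(b, a, x)
    return np.abs(hilbert(filtered))

hfb = hfb_envelope(ecog, fs)
# Percent change from a pre-stimulus baseline (first 500 ms here, as an assumption).
baseline = hfb[: int(0.5 * fs)].mean()
hfb_pct = 100.0 * (hfb - baseline) / baseline
```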


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Patrick J Karas ◽  
John F Magnotti ◽  
Brian A Metzger ◽  
Lin L Zhu ◽  
Kristen B Smith ◽  
...  

Visual information about speech content from the talker's mouth is often available before auditory information from the talker's voice. Here we examined perceptual and neural responses to words with and without this visual head start. For both types of words, perception was enhanced by viewing the talker's face, but the enhancement was significantly greater for words with a head start. Neural responses were measured from electrodes implanted over auditory association cortex in the posterior superior temporal gyrus (pSTG) of epilepsy patients. The presence of visual speech suppressed responses to auditory speech, more so for words with a visual head start. We suggest that the head start inhibits representations of incompatible auditory phonemes, increasing perceptual accuracy and decreasing total neural responses. Together with previous work showing visual cortex modulation (Ozker et al., 2018b), these results from pSTG demonstrate that multisensory interactions are a powerful modulator of activity throughout the speech perception network.


2016 ◽  
Vol 116 (3) ◽  
pp. 1387-1395 ◽  
Author(s):  
Raghavan Gopalakrishnan ◽  
Richard C. Burgess ◽  
Scott F. Lempka ◽  
John T. Gale ◽  
Darlene P. Floden ◽  
...  

Central poststroke pain (CPSP) is characterized by hemianesthesia associated with unrelenting chronic pain. The final pain experience stems from interactions between sensory, affective, and cognitive components of chronic pain. Hence, managing CPSP will require integrated approaches aimed not only at the sensory but also the affective-cognitive spheres. A better understanding of the brain's processing of pain anticipation is critical for the development of novel therapeutic approaches that target affective-cognitive networks and alleviate pain-related disability. We used magnetoencephalography (MEG) to characterize the neural substrates of pain anticipation in patients suffering from intractable CPSP. Simple visual cues evoked anticipation while patients awaited impending painful (PS), nonpainful (NPS), or no stimulus (NOS) to their nonaffected and affected extremities. MEG responses were studied at gradiometer level using event-related fields analysis and time-frequency oscillatory analysis upon source localization. On the nonaffected side, significantly greater responses were recorded during PS. PS (vs. NPS and NOS) exhibited significant parietal and frontal cortical activations in the beta and gamma bands, respectively, whereas NPS (vs. NOS) displayed greater activation in the orbitofrontal cortex. On the affected extremity, PS (vs. NPS) did not show significantly greater responses. These data suggest that anticipatory phenomena can modulate neural activity when painful stimuli are applied to the nonaffected extremity but not the affected extremity in CPSP patients. This dichotomy may stem from the chronic effects of pain on neural networks leading to habituation or saturation. Future clinically effective therapies will likely be associated with partial normalization of the neurophysiological correlates of pain anticipation.
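Once per-subject estimates of anticipatory band power exist, condition differences of the kind reported here can be tested with a paired comparison across subjects. The sketch below uses made-up per-subject beta-band values for the painful (PS) and nonpainful (NPS) cue conditions; it stands in for, and does not reproduce, the authors' source-space statistics.

```python
# Sketch of a paired comparison of anticipatory beta-band power between cue
# conditions. Synthetic per-subject values; illustrative assumptions throughout.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(3)
n_subjects = 10
beta_power_ps = 1.0 + 0.3 * rng.standard_normal(n_subjects)   # painful-stimulus cue
beta_power_nps = 0.8 + 0.3 * rng.standard_normal(n_subjects)  # nonpainful-stimulus cue

t_stat, p_val = ttest_rel(beta_power_ps, beta_power_nps)
print(f"paired t = {t_stat:.2f}, p = {p_val:.3f}")
```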


1997 ◽  
Vol 40 (2) ◽  
pp. 432-443 ◽  
Author(s):  
Karen S. Helfer

Research has shown that speaking in a deliberately clear manner can improve the accuracy of auditory speech recognition. Allowing listeners access to visual speech cues also enhances speech understanding. Whether the nature of information provided by speaking clearly and by using visual speech cues is redundant has not been determined. This study examined how speaking mode (clear vs. conversational) and presentation mode (auditory vs. auditory-visual) influenced the perception of words within nonsense sentences. In Experiment 1, 30 young listeners with normal hearing responded to videotaped stimuli presented audiovisually in the presence of background noise at one of three signal-to-noise ratios. In Experiment 2, 9 participants returned for an additional assessment using auditory-only presentation. Results of these experiments showed significant effects of speaking mode (clear speech was easier to understand than was conversational speech) and presentation mode (auditory-visual presentation led to better performance than did auditory-only presentation). The benefit of clear speech was greater for words occurring in the middle of sentences than for words at either the beginning or end of sentences for both auditory-only and auditory-visual presentation, whereas the greatest benefit from supplying visual cues was for words at the end of sentences spoken both clearly and conversationally. The total benefit from speaking clearly and supplying visual cues was equal to the sum of each of these effects. Overall, the results suggest that speaking clearly and providing visual speech information provide complementary (rather than redundant) information.


2021 ◽  
Vol 2 ◽  
Author(s):  
A. Maneuvrier ◽  
L. M. Decker ◽  
P. Renaud ◽  
G. Ceyte ◽  
H. Ceyte

Field dependence–independence (FDI) is a psychological construct determining an individual's approach to perception–cognition coupling. In the context of virtual reality (VR), several studies suggest that an individual's perceptive style is susceptible to shift toward a more field-independent (FI) mode through down-weighting of conflicting visual cues. The present study investigates the potentially flexible nature of FDI following virtual immersion and assesses whether this flexibility is associated with the subjective experience of VR. Eighty-six participants explored a real-world-like virtual environment for approximately 10 min. FDI levels were measured before and after the VR exposure using the rod-and-frame test. Participants' subjective experience of VR was measured a posteriori (cybersickness and sense of presence) and used to build two experimental groups via a cluster analysis. The results showed that only participants with a poor subjective experience of VR (i.e., a low level of sense of presence associated with a high level of cybersickness) shifted significantly to a more FI mode, which is discussed as a sensory re-weighting mechanism. Practical applications are discussed and future studies are outlined, based on the conclusion that FDI might be more flexible than previously thought, which could shed light on the psychophysiology of VR.
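The grouping step can be sketched as a two-cluster solution over standardized presence and cybersickness scores. The synthetic scores and the choice of k-means below are assumptions; the abstract specifies only that a cluster analysis was used.

```python
# Sketch of forming two experience groups from presence and cybersickness scores.
# Synthetic scores; k-means with k = 2 is an illustrative choice, not the study's method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 86
presence = np.concatenate([rng.normal(5.0, 0.8, n // 2), rng.normal(3.0, 0.8, n - n // 2)])
sickness = np.concatenate([rng.normal(1.0, 0.5, n // 2), rng.normal(3.0, 0.5, n - n // 2)])
scores = StandardScaler().fit_transform(np.column_stack([presence, sickness]))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print("group sizes:", np.bincount(labels))
```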


1970 ◽  
Vol 13 (4) ◽  
pp. 856-860 ◽  
Author(s):  
Henning Birk Nielsen

This study was designed to standardize the measurement of visual speech comprehension as a clinical tool, in order to facilitate the evaluation of a patient's communication handicap, his basic capacity for lipreading, and the benefit received through lipreading training. To this end, a silent color film of four minutes' duration was used to obtain a standardized measurement of the lipreading ability of 1108 hearing-impaired subjects. The film depicts an everyday situation in which two persons, while having coffee together, speak nine sentences of varying difficulty. Item analysis reveals good consistency, and the score obtained by the individual patient corresponds well with the clinical evaluation of his lipreading ability.

