The visual speech head start improves perception and reduces superior temporal cortex responses to auditory speech

eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Patrick J Karas ◽  
John F Magnotti ◽  
Brian A Metzger ◽  
Lin L Zhu ◽  
Kristen B Smith ◽  
...  

Visual information about speech content from the talker’s mouth is often available before auditory information from the talker's voice. Here we examined perceptual and neural responses to words with and without this visual head start. For both types of words, perception was enhanced by viewing the talker's face, but the enhancement was significantly greater for words with a head start. Neural responses were measured from electrodes implanted over auditory association cortex in the posterior superior temporal gyrus (pSTG) of epileptic patients. The presence of visual speech suppressed responses to auditory speech, more so for words with a visual head start. We suggest that the head start inhibits representations of incompatible auditory phonemes, increasing perceptual accuracy and decreasing total neural responses. Together with previous work showing visual cortex modulation (Ozker et al., 2018b) these results from pSTG demonstrate that multisensory interactions are a powerful modulator of activity throughout the speech perception network.

2019 ◽  
Author(s):  
Patrick J. Karas ◽  
John F. Magnotti ◽  
Brian A. Metzger ◽  
Lin L. Zhu ◽  
Kristen B. Smith ◽  
...  

Abstract: Vision provides a perceptual head start for speech perception because most speech is “mouth-leading”: visual information from the talker’s mouth is available before auditory information from the voice. However, some speech is “voice-leading” (auditory before visual). Consistent with a model in which vision modulates subsequent auditory processing, there was a larger perceptual benefit of visual speech for mouth-leading vs. voice-leading words (28% vs. 4%). The neural substrates of this difference were examined by recording broadband high-frequency activity from electrodes implanted over auditory association cortex in the posterior superior temporal gyrus (pSTG) of epileptic patients. Responses were smaller for audiovisual vs. auditory-only mouth-leading words (34% difference) while there was little difference (5%) for voice-leading words. Evidence for cross-modal suppression of auditory cortex complements our previous work showing enhancement of visual cortex (Ozker et al., 2018b) and confirms that multisensory interactions are a powerful modulator of activity throughout the speech perception network.
Impact Statement: Human perception and brain responses differ between words in which mouth movements are visible before the voice is heard and words for which the reverse is true.


2020 ◽  
Author(s):  
Johannes Rennig ◽  
Michael S Beauchamp

Abstract: Regions of the human posterior superior temporal gyrus and sulcus (pSTG/S) respond to the visual mouth movements that constitute visual speech and the auditory vocalizations that constitute auditory speech. We hypothesized that these multisensory responses in pSTG/S underlie the observation that comprehension of noisy auditory speech is improved when it is accompanied by visual speech. To test this idea, we presented audiovisual sentences that contained either a clear auditory component or a noisy auditory component while measuring brain activity using BOLD fMRI. Participants reported the intelligibility of the speech on each trial with a button press. Perceptually, adding visual speech to noisy auditory sentences rendered them much more intelligible. Post-hoc trial sorting was used to examine brain activations during noisy sentences that were more or less intelligible, focusing on multisensory speech regions in the pSTG/S identified with an independent visual speech localizer. Univariate analysis showed that less intelligible noisy audiovisual sentences evoked a weaker BOLD response, while more intelligible sentences evoked a stronger BOLD response that was indistinguishable from clear sentences. To better understand these differences, we conducted a multivariate representational similarity analysis. The pattern of response for intelligible noisy audiovisual sentences was more similar to the pattern for clear sentences, while the response pattern for unintelligible noisy sentences was less similar. These results show that for both univariate and multivariate analyses, successful integration of visual and noisy auditory speech normalizes responses in pSTG/S, providing evidence that multisensory subregions of pSTG/S are responsible for the perceptual benefit of visual speech.
Significance Statement: Enabling social interactions, including the production and perception of speech, is a key function of the human brain. Speech perception is a complex computational problem that the brain solves using both visual information from the talker’s facial movements and auditory information from the talker’s voice. Visual speech information is particularly important under noisy listening conditions when auditory speech is difficult or impossible to understand alone. Regions of the human cortex in posterior superior temporal lobe respond to the visual mouth movements that constitute visual speech and the auditory vocalizations that constitute auditory speech. We show that the pattern of activity in cortex reflects the successful multisensory integration of auditory and visual speech information in the service of perception.
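The pattern comparison at the heart of a representational similarity analysis can be sketched in a few lines. The example below is only an illustration, not the authors' pipeline: it assumes Pearson correlation across voxels as the similarity measure, and the voxel patterns and the names `pearson`, `clear`, `noisy_intelligible` and `noisy_unintelligible` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length response patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical response patterns across five voxels (arbitrary units).
clear = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy_intelligible = [1.2, 1.9, 3.1, 4.2, 4.8]
noisy_unintelligible = [3.0, 1.0, 4.0, 2.0, 3.5]

# Similarity of each noisy-sentence pattern to the clear-sentence pattern.
sim_intelligible = pearson(clear, noisy_intelligible)
sim_unintelligible = pearson(clear, noisy_unintelligible)
```

With these made-up values the intelligible pattern correlates more strongly with the clear pattern than the unintelligible one does, which is the qualitative relationship the abstract reports.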


2020 ◽  
Author(s):  
Brian A. Metzger ◽  
John F. Magnotti ◽  
Zhengjia Wang ◽  
Elizabeth Nesbitt ◽  
Patrick J. Karas ◽  
...  

Abstract: Experimentalists studying multisensory integration compare neural responses to multisensory stimuli with responses to the component modalities presented in isolation. This procedure is problematic for multisensory speech perception since audiovisual speech and auditory-only speech are easily intelligible but visual-only speech is not. To overcome this confound, we developed intracranial encephalography (iEEG) deconvolution. Individual stimuli always contained both auditory and visual speech but jittering the onset asynchrony between modalities allowed for the time course of the unisensory responses and the interaction between them to be independently estimated. We applied this procedure to electrodes implanted in human epilepsy patients (both male and female) over the posterior superior temporal gyrus (pSTG), a brain area known to be important for speech perception. iEEG deconvolution revealed sustained, positive responses to visual-only speech and larger, phasic responses to auditory-only speech. Confirming results from scalp EEG, responses to audiovisual speech were weaker than responses to auditory-only speech, demonstrating a subadditive multisensory neural computation. Leveraging the spatial resolution of iEEG, we extended these results to show that subadditivity is most pronounced in more posterior aspects of the pSTG. Across electrodes, subadditivity correlated with visual responsiveness, supporting a model in which visual speech enhances the efficiency of auditory speech processing in pSTG. The ability to separate neural processes may make iEEG deconvolution useful for studying a variety of complex cognitive and perceptual tasks.
Significance Statement: Understanding speech is one of the most important human abilities. Speech perception uses information from both the auditory and visual modalities. It has been difficult to study neural responses to visual speech because visual-only speech is difficult or impossible to comprehend, unlike auditory-only and audiovisual speech. We used intracranial encephalography (iEEG) deconvolution to overcome this obstacle. We found that visual speech evokes a positive response in the human posterior superior temporal gyrus, enhancing the efficiency of auditory speech processing.
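Why jittering the onset asynchrony makes the unisensory responses separable can be illustrated with a small simulation. The sketch below is not the authors' iEEG deconvolution: it assumes a purely additive model with no interaction term, noiseless data, and hypothetical response shapes and onset times, and every function and variable name is illustrative.

```python
def build_design(n_samples, aud_onsets, vis_onsets, kernel_len):
    """Finite impulse response design: one indicator column per (modality, lag)."""
    design = [[0.0] * (2 * kernel_len) for _ in range(n_samples)]
    for first_col, onsets in ((0, aud_onsets), (kernel_len, vis_onsets)):
        for onset in onsets:
            for lag in range(kernel_len):
                t = onset + lag
                if t < n_samples:
                    design[t][first_col + lag] += 1.0
    return design


def least_squares(design, signal):
    """Solve the normal equations by Gauss-Jordan elimination with partial pivoting."""
    n = len(design[0])
    aug = [
        [sum(row[i] * row[j] for row in design) for j in range(n)]
        + [sum(row[i] * s for row, s in zip(design, signal))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-9:
            raise ValueError("responses cannot be separated: design is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def simulate(design, kernels):
    """Noiseless recording: the sum of all overlapping unisensory responses."""
    return [sum(x * k for x, k in zip(row, kernels)) for row in design]


TRUE_AUD = [0.0, 3.0, 1.0]    # hypothetical phasic auditory response
TRUE_VIS = [1.0, 1.0, 1.0]    # hypothetical sustained visual response
VIS_ONSETS = [0, 10, 20, 30]  # hypothetical visual onsets, in samples
JITTERED_LAGS = [1, 2, 3, 1]  # auditory onset lags that vary across trials
aud_onsets = [v + lag for v, lag in zip(VIS_ONSETS, JITTERED_LAGS)]

design = build_design(40, aud_onsets, VIS_ONSETS, kernel_len=3)
signal = simulate(design, TRUE_AUD + TRUE_VIS)
estimate = least_squares(design, signal)
est_aud, est_vis = estimate[:3], estimate[3:]
```

Every trial contains both modalities, yet the two response time courses are recovered because the varying lag keeps their regressors from coinciding. If the lag is held fixed at one sample on every trial, auditory and visual regressors become identical columns and `least_squares` raises `ValueError`.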


Neurosurgery ◽  
2019 ◽  
Vol 66 (Supplement_1) ◽  
Author(s):  
Patrick J Karas ◽  
John F Magnotti ◽  
Zhengjia Wang ◽  
Brian A Metzger ◽  
Daniel Yoshor ◽  
...  

Abstract
INTRODUCTION: Speech is multisensory. The addition of visual speech to auditory speech greatly improves comprehension, especially under noisy auditory conditions. However, the neural mechanism for this visual enhancement of auditory speech is poorly understood. We used electrocorticography (ECoG) to study how auditory, visual, and audiovisual speech is processed in the posterior superior temporal gyrus (pSTG), an area of auditory association cortex involved in audiovisual speech integration. We hypothesized that early visual mouth movements modulate audiovisual speech integration through a mechanism of cross-modal suppression, suggesting that the pSTG response to early mouth movements should correlate with comprehension benefits gained by the addition of visual speech to auditory speech.
METHODS: Words were presented under auditory-only (AUD), visual-only (VIS), and audiovisual (AV) conditions to epilepsy patients (n = 8) implanted with intracranial electrodes for phase-2 monitoring. We measured high-frequency broadband activity (75-150 Hz), a marker for local neuronal firing, in 28 electrodes over the pSTG.
RESULTS: The early neural response to visual-only words was compared to the reduction in neural response seen from AUD to AV words, a reduction correlated with an improvement in speech comprehension that occurs with the addition of visual to auditory speech. In words that showed a comprehension benefit with the addition of visual speech, there was a strong early response to visual speech and a correlation between early visual response and the AUD-AV difference (r = 0.64, P = 10⁻⁴). In words where visual speech did not provide any comprehension benefit, there was a weak early visual response and no correlation (r = 0.18, P = .35).
CONCLUSION: Words with a visual speech comprehension benefit also elicit a strong neural response to early visual speech in pSTG, while words with no comprehension benefit do not cause a strong early response. This suggests that cross-modal suppression of auditory association cortex (pSTG) by early visual speech plays an important role in audiovisual speech perception.


2017 ◽  
Vol 29 (6) ◽  
pp. 1044-1060 ◽  
Author(s):  
Muge Ozker ◽  
Inga M. Schepers ◽  
John F. Magnotti ◽  
Daniel Yoshor ◽  
Michael S. Beauchamp

Human speech can be comprehended using only auditory information from the talker's voice. However, comprehension is improved if the talker's face is visible, especially if the auditory information is degraded as occurs in noisy environments or with hearing loss. We explored the neural substrates of audiovisual speech perception using electrocorticography, direct recording of neural activity using electrodes implanted on the cortical surface. We observed a double dissociation in the responses to audiovisual speech with clear and noisy auditory components within the superior temporal gyrus (STG), a region long known to be important for speech perception. Anterior STG showed greater neural activity to audiovisual speech with a clear auditory component, whereas posterior STG showed similar or greater neural activity to audiovisual speech in which the speech was replaced with speech-like noise. A distinct border between the two response patterns was observed, demarcated by a landmark corresponding to the posterior margin of Heschl's gyrus. To further investigate the computational roles of both regions, we considered Bayesian models of multisensory integration, which predict that combining the independent sources of information available from different modalities should reduce variability in the neural responses. We tested this prediction by measuring the variability of the neural responses to single audiovisual words. Posterior STG showed smaller variability than anterior STG during presentation of audiovisual speech with a noisy auditory component. Taken together, these results suggest that posterior STG but not anterior STG is important for multisensory integration of noisy auditory and visual speech.
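The variability prediction mentioned above has a familiar textbook form. As an illustration only, assume two independent Gaussian cues with variances \sigma_A^2 (auditory) and \sigma_V^2 (visual), combined with reliability weights; whether the models considered by the authors take exactly this form is not stated in the abstract.

```latex
\hat{s}_{AV} = w_A \hat{s}_A + w_V \hat{s}_V, \qquad
w_A = \frac{\sigma_V^2}{\sigma_A^2 + \sigma_V^2}, \quad
w_V = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_V^2}

\sigma_{AV}^2 = w_A^2 \sigma_A^2 + w_V^2 \sigma_V^2
             = \frac{\sigma_A^2 \, \sigma_V^2}{\sigma_A^2 + \sigma_V^2}
             \le \min\left(\sigma_A^2, \sigma_V^2\right)
```

Under these assumptions the combined estimate is never more variable than the better of the two cues, and the reduction is largest when the two cues are about equally reliable.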


2002 ◽  
Vol 88 (1) ◽  
pp. 540-543 ◽  
Author(s):  
John J. Foxe ◽  
Glenn R. Wylie ◽  
Antigona Martinez ◽  
Charles E. Schroeder ◽  
Daniel C. Javitt ◽  
...  

Using high-field (3 Tesla) functional magnetic resonance imaging (fMRI), we demonstrate that auditory and somatosensory inputs converge in a subregion of human auditory cortex along the superior temporal gyrus. Further, simultaneous stimulation in both sensory modalities resulted in activity exceeding that predicted by summing the responses to the unisensory inputs, thereby showing multisensory integration in this convergence region. Recently, intracranial recordings in macaque monkeys have shown similar auditory-somatosensory convergence in a subregion of auditory cortex directly caudomedial to primary auditory cortex (area CM). The multisensory region identified in the present investigation may be the human homologue of CM. Our finding of auditory-somatosensory convergence in early auditory cortices contributes to mounting evidence for multisensory integration early in the cortical processing hierarchy, in brain regions that were previously assumed to be unisensory.
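The additive criterion described here, a multisensory response compared with the sum of the unisensory responses, can be written as a short check. This is a minimal sketch with hypothetical response amplitudes; the function names and the tolerance are assumptions, and a real analysis would test the difference statistically rather than compare single numbers.

```python
def additivity_index(multisensory, unisensory_a, unisensory_b):
    """Multisensory response minus the sum of the two unisensory responses."""
    return multisensory - (unisensory_a + unisensory_b)


def classify(multisensory, unisensory_a, unisensory_b, tol=1e-9):
    """Label a response relative to the additive prediction."""
    index = additivity_index(multisensory, unisensory_a, unisensory_b)
    if index > tol:
        return "superadditive"
    if index < -tol:
        return "subadditive"
    return "additive"


# Hypothetical response amplitudes in arbitrary units.
label_strong = classify(1.5, 0.5, 0.6)  # exceeds the summed unisensory responses
label_weak = classify(0.8, 0.5, 0.6)    # falls short of the summed responses
```

A response exceeding the sum, as reported in this abstract, falls in the first category; a multisensory response weaker than the sum falls in the second.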


2012 ◽  
Vol 25 (0) ◽  
pp. 148
Author(s):  
Marcia Grabowecky ◽  
Emmanuel Guzman-Martinez ◽  
Laura Ortega ◽  
Satoru Suzuki

Watching moving lips facilitates auditory speech perception when the mouth is attended. However, recent evidence suggests that visual attention and awareness are mediated by separate mechanisms. We investigated whether lip movements suppressed from visual awareness can facilitate speech perception. We used a word categorization task in which participants listened to spoken words and determined as quickly and accurately as possible whether or not each word named a tool. While participants listened to the words they watched a visual display that presented a video clip of the speaker synchronously speaking the auditorily presented words, or the same speaker articulating different words. Critically, the speaker’s face was either visible (the aware trials), or suppressed from awareness using continuous flash suppression. Aware and suppressed trials were randomly intermixed. A secondary probe-detection task ensured that participants attended to the mouth region regardless of whether the face was visible or suppressed. On the aware trials responses to the tool targets were no faster with the synchronous than asynchronous lip movements, perhaps because the visual information was inconsistent with the auditory information on 50% of the trials. However, on the suppressed trials responses to the tool targets were significantly faster with the synchronous than asynchronous lip movements. These results demonstrate that even when a random dynamic mask renders a face invisible, lip movements are processed by the visual system with sufficiently high temporal resolution to facilitate speech perception.


2020 ◽  
Vol 117 (29) ◽  
pp. 16920-16927 ◽  
Author(s):  
John Plass ◽  
David Brang ◽  
Satoru Suzuki ◽  
Marcia Grabowecky

Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time–frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.

