Time course of audio–visual phoneme identification: A cross-modal gating study

2012 · Vol 25 (0) · pp. 194
Author(s): Carolina Sánchez-García, Sonia Kandel, Christophe Savariaux, Nara Ikumi, Salvador Soto-Faraco

When both are present, visual and auditory information are combined to decode the speech signal. Past research has addressed the extent to which visual information helps distinguish confusable speech sounds, but has usually ignored the continuous nature of speech perception. Here we tap into the time course of the contribution of visual and auditory information during speech perception. To this end, we designed an audio–visual gating task using videos recorded with a high-speed camera. Participants were asked to identify gradually longer fragments of pseudowords varying in the central consonant. Spanish consonant phonemes with different degrees of visual and acoustic saliency were included and tested in visual-only, auditory-only and audio–visual trials. The data showed different patterns of contribution of unimodal and bimodal information during identification, depending on the visual saliency of the presented phonemes. In particular, for phonemes that are clearly more salient in one modality than in the other, audio–visual performance equaled that of the best unimodal condition. For phonemes with more balanced saliency, audio–visual performance was better than in either unimodal condition. These results shed new light on the time course of audio–visual speech integration.
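As a concrete illustration of the comparison described in the last two sentences, the sketch below (Python; not the authors' analysis code) tabulates identification accuracy from a gating task and contrasts audio–visual performance with the best unimodal condition. The file name and the column labels (phoneme, modality, gate_ms, correct) are hypothetical.

```python
# Minimal sketch of a gating-accuracy summary; the data layout is assumed.
import pandas as pd

df = pd.read_csv("gating_responses.csv")  # hypothetical long-format trial data

# Mean identification accuracy per phoneme x modality x gate duration
acc = (df.groupby(["phoneme", "modality", "gate_ms"])["correct"]
         .mean()
         .unstack("modality"))            # assumed modality labels: "A", "V", "AV"

# Audio-visual gain over the best unimodal condition at each gate
acc["best_unimodal"] = acc[["A", "V"]].max(axis=1)
acc["av_gain"] = acc["AV"] - acc["best_unimodal"]
print(acc.groupby("phoneme")["av_gain"].mean())
```

Under this scheme, a gain near zero corresponds to the "equals the best unimodal" pattern reported for phonemes that are salient in one modality, and a positive gain to the "better than both unimodal conditions" pattern.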

2019
Author(s): Patrick J. Karas, John F. Magnotti, Brian A. Metzger, Lin L. Zhu, Kristen B. Smith, ...

Abstract
Vision provides a perceptual head start for speech perception because most speech is “mouth-leading”: visual information from the talker’s mouth is available before auditory information from the voice. However, some speech is “voice-leading” (auditory before visual). Consistent with a model in which vision modulates subsequent auditory processing, there was a larger perceptual benefit of visual speech for mouth-leading vs. voice-leading words (28% vs. 4%). The neural substrates of this difference were examined by recording broadband high-frequency activity from electrodes implanted over auditory association cortex in the posterior superior temporal gyrus (pSTG) of epileptic patients. Responses were smaller for audiovisual vs. auditory-only mouth-leading words (34% difference), while there was little difference (5%) for voice-leading words. Evidence for cross-modal suppression of auditory cortex complements our previous work showing enhancement of visual cortex (Ozker et al., 2018b) and confirms that multisensory interactions are a powerful modulator of activity throughout the speech perception network.
Impact Statement
Human perception and brain responses differ between words in which mouth movements are visible before the voice is heard and words for which the reverse is true.
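For readers unfamiliar with the measure, broadband high-frequency activity is typically estimated by band-pass filtering the intracranial signal and taking its Hilbert envelope. The sketch below illustrates that general recipe only; the band edges, filter order, and variable names are assumptions, not the authors' pipeline.

```python
# Illustrative high-gamma envelope extraction; parameters are assumed.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def broadband_high_freq(signal, fs, band=(70.0, 150.0), order=4):
    """Return the amplitude envelope of `signal` within `band` (Hz)."""
    nyq = fs / 2.0
    b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype="bandpass")
    filtered = filtfilt(b, a, signal)      # zero-phase band-pass filter
    envelope = np.abs(hilbert(filtered))   # instantaneous amplitude
    return envelope / envelope.mean()      # express as proportion of mean

# Example on synthetic data: 2 s of noise sampled at 1 kHz
fs = 1000
env = broadband_high_freq(np.random.randn(2 * fs), fs)
```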


2011 · Vol 23 (1) · pp. 221-237
Author(s): Ingo Hertrich, Susanne Dietrich, Hermann Ackermann

During speech communication, visual information may interact with the auditory system at various processing stages. Most noteworthy, recent magnetoencephalography (MEG) data provided first evidence for early and preattentive phonetic/phonological encoding of the visual data stream—prior to its fusion with auditory phonological features [Hertrich, I., Mathiak, K., Lutzenberger, W., & Ackermann, H. Time course of early audiovisual interactions during speech and non-speech central-auditory processing: An MEG study. Journal of Cognitive Neuroscience, 21, 259–274, 2009]. Using functional magnetic resonance imaging, the present follow-up study aims to further elucidate the topographic distribution of visual–phonological operations and audiovisual (AV) interactions during speech perception. Ambiguous acoustic syllables—disambiguated to /pa/ or /ta/ by the visual channel (speaking face)—served as test materials, concomitant with various control conditions (nonspeech AV signals, visual-only and acoustic-only speech, and nonspeech stimuli). (i) Visual speech yielded an AV-subadditive activation of primary auditory cortex and the anterior superior temporal gyrus (STG), whereas the posterior STG responded both to speech and nonspeech motion. (ii) The inferior frontal and the fusiform gyrus of the right hemisphere showed a strong phonetic/phonological impact (differential effects of visual /pa/ vs. /ta/) upon hemodynamic activation during presentation of speaking faces. Taken together with the previous MEG data, these results point at a dual-pathway model of visual speech information processing: On the one hand, access to the auditory system via the anterior supratemporal “what” path may give rise to direct activation of “auditory objects.” On the other hand, visual speech information seems to be represented in a right-hemisphere visual working memory, providing a potential basis for later interactions with auditory information such as the McGurk effect.
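As a worked note on point (i), an audiovisual (AV) response is called sub-additive when it falls below the sum of the unimodal responses (AV < A + V). A minimal sketch of such a test on hypothetical ROI-level response estimates, with invented effect sizes purely for illustration, follows.

```python
# Sub-additivity check on hypothetical per-subject ROI estimates.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
beta_a = rng.normal(1.0, 0.3, size=20)    # hypothetical auditory-only estimates
beta_v = rng.normal(0.6, 0.3, size=20)    # hypothetical visual-only estimates
beta_av = rng.normal(1.2, 0.3, size=20)   # hypothetical audiovisual estimates

# Paired comparison of the AV response against the additive prediction A + V
t, p = ttest_rel(beta_av, beta_a + beta_v)
print(f"AV vs. A + V: t = {t:.2f}, p = {p:.3g} (sub-additive if AV is reliably smaller)")
```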


eLife · 2018 · Vol 7
Author(s): Muge Ozker, Daniel Yoshor, Michael S Beauchamp

Human faces contain multiple sources of information. During speech perception, visual information from the talker’s mouth is integrated with auditory information from the talker's voice. By directly recording neural responses from small populations of neurons in patients implanted with subdural electrodes, we found enhanced visual cortex responses to speech when auditory speech was absent (rendering visual speech especially relevant). Receptive field mapping demonstrated that this enhancement was specific to regions of the visual cortex with retinotopic representations of the mouth of the talker. Connectivity between frontal cortex and other brain regions was measured with trial-by-trial power correlations. Strong connectivity was observed between frontal cortex and mouth regions of visual cortex; connectivity was weaker between frontal cortex and non-mouth regions of visual cortex or auditory cortex. These results suggest that top-down selection of visual information from the talker’s mouth by frontal cortex plays an important role in audiovisual speech perception.
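The connectivity measure described, trial-by-trial power correlation, amounts to correlating per-trial band power between two recording sites. A minimal sketch follows, with synthetic per-trial power values standing in for real electrode data.

```python
# Trial-by-trial power correlation between two (synthetic) electrodes.
import numpy as np
from scipy.stats import pearsonr

def trial_power_connectivity(power_a, power_b):
    """Correlate per-trial band power between two electrodes."""
    return pearsonr(power_a, power_b)      # returns (r, p)

rng = np.random.default_rng(0)
frontal = rng.gamma(2.0, size=200)                       # hypothetical per-trial power
visual_mouth = 0.5 * frontal + rng.gamma(2.0, size=200)  # correlated by construction
r, p = trial_power_connectivity(frontal, visual_mouth)
print(f"r = {r:.2f}, p = {p:.3g}")
```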


2012 · Vol 25 (0) · pp. 148
Author(s): Marcia Grabowecky, Emmanuel Guzman-Martinez, Laura Ortega, Satoru Suzuki

Watching moving lips facilitates auditory speech perception when the mouth is attended. However, recent evidence suggests that visual attention and awareness are mediated by separate mechanisms. We investigated whether lip movements suppressed from visual awareness can facilitate speech perception. We used a word categorization task in which participants listened to spoken words and determined as quickly and accurately as possible whether or not each word named a tool. While participants listened to the words, they watched a visual display that presented a video clip of the speaker synchronously speaking the auditorily presented words, or of the same speaker articulating different words. Critically, the speaker’s face was either visible (aware trials) or suppressed from awareness using continuous flash suppression. Aware and suppressed trials were randomly intermixed. A secondary probe-detection task ensured that participants attended to the mouth region regardless of whether the face was visible or suppressed. On the aware trials, responses to the tool targets were no faster with synchronous than with asynchronous lip movements, perhaps because the visual information was inconsistent with the auditory information on 50% of the trials. However, on the suppressed trials, responses to the tool targets were significantly faster with synchronous than with asynchronous lip movements. These results demonstrate that even when a random dynamic mask renders a face invisible, lip movements are processed by the visual system with sufficiently high temporal resolution to facilitate speech perception.


2019 · Vol 62 (2) · pp. 307-317
Author(s): Jianghua Lei, Huina Gong, Liang Chen

Purpose: The study was designed primarily to determine whether the use of hearing aids (HAs) by individuals with hearing impairment in China affects their speechreading performance. Method: Sixty-seven young adults with hearing impairment who used HAs and 78 young adults with hearing impairment who did not use HAs completed newly developed Chinese speechreading tests targeting 3 linguistic levels (i.e., words, phrases, and sentences). Results: The group with HAs was more accurate at speechreading than the group without HAs across the 3 linguistic levels. For both groups, speechreading accuracy was higher for phrases than for words and sentences, and speechreading speed was slower for sentences than for words and phrases. Furthermore, there was a positive correlation between years of HA use and speechreading accuracy; longer HA use was associated with more accurate speechreading. Conclusions: Young HA users in China show enhanced speechreading performance relative to their peers with hearing impairment who do not use HAs. This result argues against the perceptual dependence hypothesis, which suggests that greater dependence on visual information leads to improvement in visual speech perception.
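A minimal sketch of the two analyses summarized in the Results, accuracy by group and linguistic level plus the correlation between years of HA use and speechreading accuracy, might look like the following; the file and column names are hypothetical.

```python
# Illustrative summary of speechreading scores; the data layout is assumed.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("speechreading_scores.csv")  # hypothetical per-participant rows

# Mean accuracy for each group (HA vs. no-HA) at each linguistic level
print(df.groupby(["group", "level"])["accuracy"].mean().unstack("level"))

# Association between years of HA use and accuracy, HA users only
ha_users = df[df["group"] == "HA"]
r, p = pearsonr(ha_users["years_of_ha_use"], ha_users["accuracy"])
print(f"r = {r:.2f}, p = {p:.3g}")
```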


Author(s): Karthik Ganesan, John Plass, Adriene M. Beltz, Zhongming Liu, Marcia Grabowecky, ...

Abstract
Speech perception is a central component of social communication. While speech perception is primarily driven by sounds, accurate perception in everyday settings is also supported by meaningful information extracted from visual cues (e.g., speech content, timing, and speaker identity). Previous research has shown that visual speech modulates activity in cortical areas subserving auditory speech perception, including the superior temporal gyrus (STG), likely through feedback connections from the multisensory posterior superior temporal sulcus (pSTS). However, it is unknown whether visual modulation of auditory processing in the STG is a unitary phenomenon or, rather, consists of multiple temporally, spatially, or functionally discrete processes. To explore these questions, we examined neural responses to audiovisual speech in electrodes implanted intracranially in the temporal cortex of 21 patients undergoing clinical monitoring for epilepsy. We found that visual speech modulates auditory processes in the STG in multiple ways, eliciting temporally and spatially distinct patterns of activity that differ across theta, beta, and high-gamma frequency bands. Before speech onset, visual information increased high-gamma power in the posterior STG and suppressed beta power in mid-STG regions, suggesting crossmodal prediction of speech signals in these areas. After sound onset, visual speech decreased theta power in the middle and posterior STG, potentially reflecting a decrease in sustained feedforward auditory activity. These results are consistent with models that posit multiple distinct mechanisms supporting audiovisual speech perception.
Significance Statement
Visual speech cues are often needed to disambiguate distorted speech sounds in the natural environment. However, understanding how the brain encodes and transmits visual information for usage by the auditory system remains a challenge. One persistent question is whether visual signals have a unitary effect on auditory processing or elicit multiple distinct effects throughout auditory cortex. To better understand how vision modulates speech processing, we measured neural activity produced by audiovisual speech from electrodes surgically implanted in auditory areas of 21 patients with epilepsy. Group-level statistics using linear mixed-effects models demonstrated distinct patterns of activity across different locations, timepoints, and frequency bands, suggesting the presence of multiple audiovisual mechanisms supporting speech perception processes in auditory cortex.
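The group-level statistics mentioned in the Significance Statement, linear mixed-effects models fit across electrodes nested within patients, could be set up roughly as in the sketch below; the table layout, column names, and model formula are assumptions for illustration, not the authors' model.

```python
# Illustrative mixed-effects setup; column names and formula are assumed.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("stg_band_power.csv")  # hypothetical: one row per electrode x condition

# Fixed effects of condition (audiovisual vs. auditory-only) and STG region,
# with a random intercept per patient for electrodes nested within subjects.
model = smf.mixedlm("high_gamma_power ~ condition * region",
                    data=df, groups=df["patient_id"])
print(model.fit().summary())
```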


Author(s): Grant McGuire, Molly Babel

Abstract
While the role of auditory saliency is well accepted as providing insight into the shaping of phonological systems, the influence of visual saliency on such systems has been neglected. This paper provides evidence for the importance of visual information in historical phonological change and synchronic variation through a series of audio-visual experiments with the /f/∼/θ/ contrast. /θ/ is typologically rare, an atypical target in sound change, acquired comparatively late, and synchronically variable in language inventories. Previous explanations for these patterns have focused on either the articulatory difficulty of an interdental tongue gesture or the perceptual similarity /θ/ shares with labiodental fricatives. We hypothesize that the bias is due to an asymmetry in audio-visual phonetic cues and cue variability within and across talkers. Support for this hypothesis comes from a speech perception study that explored the weighting of audio and visual cues for /f/ and /θ/ identification in CV, VC, and VCV syllabic environments in /i/, /a/, or /u/ vowel contexts in Audio, Visual, and Audio-Visual experimental conditions using stimuli from ten different talkers. The results indicate that /θ/ is more variable than /f/, both in Audio and Visual conditions. We propose that it is this variability which contributes to the unstable nature of /θ/ across time and offers an improved explanation for the observed synchronic and diachronic asymmetries in its patterning.
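One simple way to quantify the across-talker variability claim (/θ/ being more variable than /f/) is to compute, for each condition and target fricative, the spread of per-talker identification accuracy. A minimal sketch with hypothetical column names follows.

```python
# Across-talker variability of /f/ vs. /θ/ identification; layout is assumed.
import pandas as pd

df = pd.read_csv("fricative_id_responses.csv")  # hypothetical trial-level data

# Accuracy per talker for each target phoneme and condition (A, V, AV)
per_talker = (df.groupby(["condition", "target", "talker"])["correct"]
                .mean()
                .reset_index())

# Across-talker standard deviation: larger values indicate less stable cues
variability = (per_talker.groupby(["condition", "target"])["correct"]
                         .std()
                         .unstack("target"))
print(variability)
```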


2018 · Vol 5 (3) · pp. 170909
Author(s): Claudia S. Lüttke, Alexis Pérez-Bellido, Floris P. de Lange

The human brain can quickly adapt to changes in the environment. One example is phonetic recalibration: a speech sound is interpreted differently depending on the accompanying visual speech, and this interpretation persists in the absence of visual information. Here, we examined the mechanisms of phonetic recalibration. Participants categorized the auditory syllables /aba/ and /ada/, which were sometimes preceded by so-called McGurk stimuli (in which an /aba/ sound, due to visual /aga/ input, is often perceived as ‘ada’). We found that a single trial of exposure to the McGurk illusion was sufficient to induce a recalibration effect, i.e. an auditory /aba/ stimulus was subsequently more often perceived as ‘ada’. Furthermore, phonetic recalibration took place only when auditory and visual inputs were integrated into ‘ada’ (McGurk illusion). Moreover, this recalibration depended on the sensory similarity between the preceding and the current auditory stimulus. Finally, a signal detection theory analysis showed that McGurk-induced phonetic recalibration resulted in both a criterion shift towards /ada/ and a reduced sensitivity to distinguish between /aba/ and /ada/ sounds. The current study shows that phonetic recalibration depends on the perceptual integration of audiovisual information and leads to a perceptual shift in phoneme categorization.
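The signal detection quantities referred to here, sensitivity (d′) and criterion (c), can be derived from the rates of ‘ada’ responses to /ada/ versus /aba/ sounds. The sketch below uses hypothetical response counts and a simple log-linear correction to avoid rates of exactly 0 or 1; a more negative criterion indicates a bias toward ‘ada’, and a smaller d′ indicates reduced sensitivity.

```python
# d-prime and criterion from hypothetical 'ada'-response counts.
from scipy.stats import norm

def dprime_criterion(hits, misses, false_alarms, correct_rejections):
    """Sensitivity and criterion, treating 'ada' responses as 'yes'."""
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)                          # log-linear correction
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa             # ability to tell /aba/ and /ada/ apart
    criterion = -0.5 * (z_hit + z_fa)  # negative values = bias toward responding 'ada'
    return d_prime, criterion

# Hypothetical counts after McGurk exposure: more 'ada' responses to /aba/ sounds
print(dprime_criterion(hits=70, misses=30, false_alarms=40, correct_rejections=60))
```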


2018 · Vol 28 (1) · pp. 47-52
Author(s): Karrie E. Godwin, Lucy C. Erickson, Rochelle S. Newman

Many learning tasks that children encounter necessitate the ability to direct and sustain attention to key aspects of the environment while simultaneously tuning out irrelevant features. This is challenging for at least two reasons: (a) The ability to regulate and sustain attention follows a protracted developmental time course, and (b) children spend much of their time in environments not optimized for learning—homes and schools are often chaotic, cluttered, and noisy. Research on these issues is often siloed; that is, researchers tend to examine the relationship among attention, distraction, and learning in only the auditory or the visual domain, but not both together. We provide examples in which auditory and visual aspects of learning each have strong implications for the other. Research examining how visual information and auditory information are distracting can benefit from cross-fertilization. Integrating across research silos informs our understanding of attention and learning, yielding more efficacious guidance for caregivers, educators, developers, and policymakers.


2020 · Vol 21 (1) · pp. 349-358
Author(s): O. Brendel

The article considers a problematic issue that frequently arises in the examination of video and audio recordings: the visual and auditory perception of oral speech, that is, establishing the content of a conversation from the image of the speaker (lip reading). Its purpose is to analyze whether, and under what conditions, visual-auditory perception of oral speech can feasibly be examined within the framework of the examination of video and sound recordings, taking into account the peculiarities of such research, and whether visual information can be used either as an independent object of examination (lip reading) or as a supplement to the auditory analysis of a particular message. The main components of the lip-reading process and the possibility of examining visual and auditory information in order to establish the content of a conversation are considered. Attention is paid to the features of visual and auditory perception of oral speech, and the factors that most strongly determine how informative the overall picture of speech perception from an image can be are analyzed, including active articulation, facial expressions, head movements, visibility of the teeth, gestures, and so on. In addition to image quality, the duration of the speech fragment also affects the perception of oral speech from an image: a fully uttered expression is usually read better than its individual parts. The article also draws attention to the ambiguity of the articulatory images of individual sounds and considers the McGurk effect, a perceptual phenomenon that demonstrates the interaction between hearing and vision during speech perception.

