Rethinking the Mechanisms Underlying the McGurk Illusion

2021 · Vol 15 · Author(s): Mariel G. Gonzales, Kristina C. Backer, Brenna Mandujano, Antoine J. Shahin

The McGurk illusion occurs when listeners hear an illusory percept (e.g., “da”) resulting from mismatched pairings of audiovisual (AV) speech stimuli (e.g., auditory /ba/ paired with visual /ga/). Hearing a third percept—distinct from both the auditory and visual input—has been used as evidence of AV fusion. We examined whether the McGurk illusion is instead driven by visual dominance, whereby the third percept, e.g., “da,” represents a default percept for visemes with an ambiguous place of articulation (POA), like /ga/. Participants watched videos of a talker uttering various consonant–vowels (CVs) with (AV) and without (V-only) the audio of /ba/. Individuals transcribed the CV they saw (V-only) or heard (AV). In the V-only condition, individuals predominantly saw “da”/“ta” when viewing CVs with indiscernible POAs. Likewise, in the AV condition, upon perceiving an illusion, they predominantly heard “da”/“ta” for CVs with indiscernible POAs. The illusion was stronger in individuals who exhibited weak /ba/ auditory encoding (examined using a control auditory-only task). In Experiment 2, we attempted to replicate these findings using stimuli recorded from a different talker. The V-only results were not replicated, but again individuals predominantly heard “da”/“ta”/“tha” as an illusory percept for various AV combinations, and the illusion was stronger in individuals who exhibited weak /ba/ auditory encoding. These results demonstrate that when visual CVs with indiscernible POAs are paired with a weakly encoded auditory /ba/, listeners default to hearing “da”/“ta”/“tha”—thus tempering the AV fusion account and favoring a default mechanism triggered when both AV stimuli are ambiguous.

2004 · Vol 16 (1) · pp. 31-39 · Author(s): Jonas Obleser, Aditi Lahiri, Carsten Eulitz

This study further elucidates determinants of vowel perception in the human auditory cortex. The vowel inventory of a given language can be classified on the basis of phonological features, which are closely linked to acoustic properties. A cortical representation of speech sounds based on these phonological features might explain the surprising contrast between the immense variance in the acoustic signal and the high accuracy of speech recognition. We investigated timing and mapping of the N100m elicited by 42 tokens of seven natural German vowels varying along the phonological features tongue height (corresponding to the frequency of the first formant) and place of articulation (corresponding to the frequency of the second and third formants). Auditory evoked fields were recorded using a 148-channel whole-head magnetometer while subjects performed target vowel detection tasks. Source location differences appeared to be driven by place of articulation: vowels with mutually exclusive place of articulation features, namely coronal and dorsal, elicited separate centers of activation along the posterior-anterior axis. Additionally, the time course of activation as reflected in the N100m peak latency distinguished between vowel categories, especially when the spatial distinctiveness of cortical activation was low. In sum, results suggest that both N100m latency and source location, as well as their interaction, reflect properties of speech stimuli that correspond to abstract phonological features.


2011 · Vol 15 (2) · pp. 255-274 · Author(s): Erin M. Ingvalson, Lori L. Holt, James L. McClelland

Many attempts have been made to teach native Japanese listeners to perceptually differentiate English /r/–/l/ (e.g., rock–lock). Though improvement is evident, in no case does final performance reach native English levels. We focused our training on the third formant (F3) onset frequency, shown to be the most reliable indicator of /r/–/l/ category membership. We first presented listeners with instances of synthetic /r/–/l/ stimuli varying only in F3 onset frequency, in a forced-choice identification training task with feedback. Evidence of learning was limited. The second experiment used an adaptive paradigm beginning with non-speech stimuli consisting only of /r/ and /l/ F3 frequency trajectories and progressing to synthetic speech instances of /ra/–/la/; half of the trainees received feedback. Improvement was shown by some listeners, suggesting some enhancement of /r/–/l/ identification is possible following training with only F3 onset frequency. However, only a subset of these listeners showed signs of generalization of the training effect beyond the trained synthetic context.
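
As an illustration of what stimuli "varying only in F3 onset frequency" could look like in practice, here is a minimal Python sketch that builds a set of schematic F3 tracks differing only in their onset value. The endpoint frequencies, step count, and durations are assumptions chosen for illustration, not values taken from the study.

```python
"""Illustrative sketch (not the authors' code): a set of /r/-/l/ training
stimuli that differ only in F3 onset frequency. Endpoint values are rough
assumptions (~1600 Hz for /r/-like, ~2800 Hz for /l/-like)."""

import numpy as np

def f3_onset_continuum(f3_r=1600.0, f3_l=2800.0, n_steps=8):
    """Evenly spaced F3 onset frequencies from /r/-like to /l/-like."""
    return np.linspace(f3_r, f3_l, n_steps)

def f3_trajectory(onset_hz, steady_hz=2500.0, transition_ms=80, steady_ms=220, frame_rate=100):
    """A schematic F3 track: linear transition from the onset value to a
    steady-state value, sampled in 10-ms frames (frame_rate = 100/s)."""
    n_trans = int(transition_ms * frame_rate / 1000)
    n_steady = int(steady_ms * frame_rate / 1000)
    transition = np.linspace(onset_hz, steady_hz, n_trans)
    steady = np.full(n_steady, steady_hz)
    return np.concatenate([transition, steady])

if __name__ == "__main__":
    for step, onset in enumerate(f3_onset_continuum(), start=1):
        track = f3_trajectory(onset)
        print(f"step {step}: F3 onset = {onset:.0f} Hz, {track.size} frames")
```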


Author(s): Tareq Ibrahim Al-Ziyadat

The study aims to elucidate plosiveness and frication in the “Raa” (the tenth letter of the Arabic alphabet), drawing on what ancient and modern scholars have said on the issue. The core issue of the study is Sibawey's classification of the “Raa” as a tense phoneme in whose articulation the sound repeatedly flows, leaning toward the articulation of “Lam” (the 23rd letter of the Arabic alphabet) and avoiding laxity. Had the sound not repeated, we would not have had the “Raa”. Tensity (plosiveness) and frication are two contradictory features that can never co-occur at the same place of articulation. The sound is articulated in stages, each of which has its own features. After analysis, it was found that the articulation of the “Raa” passes through three stages. In the second stage, in the space between the vocal cords and the top of the tongue, the “Raa” is fricative, while in the third, the closure stage between the top of the tongue and the hard palate, the “Raa” is plosive, though this plosiveness is less intense than that of plosive phonemes. Therefore, the “Raa” can be neither plosive nor fricative, but in between: “medial”.


Author(s): Susanne Fuchs, Peter Birkholz

Consonants are a major class of sounds occurring in all human languages. Typologically, consonant inventories are richer than vowel inventories. Consonants have been classified according to four basic features. Airstream mechanism is one of these features and describes the direction of airflow in or out of the oral cavity. The outgoing airflow is further separated according to its origin, that is, air coming from the lungs (pulmonic) or the oral cavity (non-pulmonic). Consonants are also grouped according to their phonological voicing contrast, which can be manifested phonetically by the presence or absence of vocal fold oscillations during the oral closure/constriction phase and by the duration from an oral closure release to the onset of voicing. Place of articulation is the third feature and refers to the location at which a consonantal constriction or closure is produced in the vocal tract. Finally, manner of articulation reflects different timing and coordinated actions of the articulators closely tied to aerodynamic properties.
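
A minimal sketch of how the four classificatory features just described could be captured as a small record per consonant; the feature labels follow standard IPA terminology and the example entries are my own, not drawn from the text above.

```python
"""Sketch: a record type holding the four basic consonant features."""

from dataclasses import dataclass

@dataclass(frozen=True)
class Consonant:
    symbol: str
    airstream: str   # e.g. "pulmonic egressive", "click", "ejective", "implosive"
    voicing: str     # "voiced" or "voiceless"
    place: str       # e.g. "bilabial", "alveolar", "velar"
    manner: str      # e.g. "plosive", "fricative", "nasal"

# A few illustrative entries
inventory = [
    Consonant("p", "pulmonic egressive", "voiceless", "bilabial", "plosive"),
    Consonant("d", "pulmonic egressive", "voiced", "alveolar", "plosive"),
    Consonant("ŋ", "pulmonic egressive", "voiced", "velar", "nasal"),
]

for c in inventory:
    print(f"/{c.symbol}/: {c.voicing} {c.place} {c.manner} ({c.airstream})")
```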


2016 · Vol 60 (1) · pp. 27-47 · Author(s): Kelly Richardson, Joan E. Sussman

Typically developing children, 4 to 6 years of age, and adults participated in discrimination and identification speech perception tasks using a synthetic consonant–vowel continuum ranging from /da/ to /ga/. The seven-step synthetic /da/–/ga/ continuum was created by adjusting the first 40 ms of the third formant frequency transition. For the discrimination task, listeners participated in a Change/No-Change paradigm with four different stimuli compared to the endpoint /da/ token (step 1). For the identification task, listeners labeled each token along the /da/–/ga/ continuum as either “DA” or “GA.” Results of the discrimination experiment showed that sensitivity to the third-formant transition cue improved for the adult listeners as the stimulus contrast increased, whereas the performance of the children remained poor across all stimulus comparisons. Results of the identification experiment support previous hypotheses of age-related differences in phonetic categorization. Results have implications for normative data on identification and discrimination tasks. These norms provide a metric against which children with auditory-based speech sound disorders can be compared. Furthermore, the results provide some insight into the developmental nature of categorical and non-categorical speech perception.
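
Change/No-Change discrimination data of this kind are commonly scored with the signal-detection sensitivity measure d' = z(hit rate) - z(false-alarm rate). The abstract does not state the exact scoring used, so the short Python sketch below is only an illustration of that general approach, with invented trial counts.

```python
"""Hedged sketch: scoring a Change/No-Change task with d'."""

from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with a standard 1/(2N)
    correction so rates of exactly 0 or 1 stay finite."""
    z = NormalDist().inv_cdf
    n_change = hits + misses
    n_same = false_alarms + correct_rejections
    hit_rate = min(max(hits / n_change, 1 / (2 * n_change)), 1 - 1 / (2 * n_change))
    fa_rate = min(max(false_alarms / n_same, 1 / (2 * n_same)), 1 - 1 / (2 * n_same))
    return z(hit_rate) - z(fa_rate)

# Example: a listener detects 18 of 20 changes and false-alarms on 4 of 20 no-change trials.
print(round(d_prime(18, 2, 4, 16), 2))  # ~2.12
```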


2013 · Vol 56 (3) · pp. 779-791 · Author(s): Catherine Mayo, Fiona Gibbon, Robert A. J. Clark

Purpose: In this study, the authors aimed to investigate how listener training and the presence of intermediate acoustic cues influence transcription variability for conflicting cue speech stimuli. Method: Twenty listeners with training in transcribing disordered speech, and 26 untrained listeners, were asked to make forced-choice labeling decisions for synthetic vowel–consonant–vowel (VCV) sequences “a doe” (/ədo/) and “a go” (/əgo/). Both the VC and CV transitions in these stimuli ranged through intermediate positions, from appropriate for /d/ to appropriate for /g/. Results: Both trained and untrained listeners gave more weight to the CV transitions than to the VC transitions. However, listener behavior was not uniform: the results showed a high level of inter- and intratranscriber inconsistency, with untrained listeners showing a nonsignificant tendency to be more influenced than trained listeners by CV transitions. Conclusions: Listeners do not assign consistent categorical labels to the type of intermediate, conflicting transitional cues that were present in the stimuli used in the current study and that are also present in disordered articulations. Although listener inconsistency in assigning labels to intermediate productions is not increased as a result of phonetic training, neither is it reduced by such training.
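
One simple way to compare how much weight listeners give each transition cue is to see how strongly the probability of a "go" (/g/) label changes as the CV transition is stepped from /d/-like to /g/-like, versus how much it changes across VC steps, averaging over the other cue. The sketch below uses invented labeling proportions purely to illustrate that comparison; it is not the analysis used in the study.

```python
"""Hedged sketch: marginal cue effects from a conflicting-cue labeling grid."""

# p_go[(vc_step, cv_step)] = proportion of "a go" labels; step 0 is most
# /d/-like, step 2 most /g/-like. All values are hypothetical.
p_go = {
    (0, 0): 0.05, (0, 1): 0.40, (0, 2): 0.80,
    (1, 0): 0.10, (1, 1): 0.50, (1, 2): 0.85,
    (2, 0): 0.20, (2, 1): 0.60, (2, 2): 0.90,
}
steps = [0, 1, 2]

def cue_effect(along_cv: bool) -> float:
    """Average change in P(go) from the cue's /d/-like to /g/-like endpoint,
    holding the other cue fixed at each of its levels."""
    diffs = []
    for other in steps:
        if along_cv:
            diffs.append(p_go[(other, steps[-1])] - p_go[(other, steps[0])])
        else:
            diffs.append(p_go[(steps[-1], other)] - p_go[(steps[0], other)])
    return sum(diffs) / len(diffs)

print(f"CV transition effect: {cue_effect(along_cv=True):.2f}")   # larger
print(f"VC transition effect: {cue_effect(along_cv=False):.2f}")  # smaller
# A larger CV effect corresponds to the heavier CV weighting reported above.
```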


2007 · Vol 18 (07) · pp. 590-603 · Author(s): Larry E. Humes

In this review of recent studies from our laboratory at Indiana University, it is argued that audibility is the primary contributor to the speech-understanding difficulties of older adults in unaided listening, but that other factors, especially cognitive factors, emerge when the role of audibility has been minimized. The advantages and disadvantages of three basic approaches used in our laboratory to minimize the role of audibility are examined. The first of these made use of clinical fits of personal amplification devices, but generally failed to make the aided speech stimuli sufficiently audible for the listeners. As a result, hearing loss remained the predominant predictor of performance. The second approach made use of raised and spectrally shaped stimuli with identical shaping applied for all listeners. The third approach used spectrally shaped speech that ensured audibility (at least 10 dB sensation level) of the stimuli up to at least 4000 Hz for each individual listener. With few exceptions, the importance of cognitive factors was revealed once the speech stimuli were made sufficiently audible.
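
The audibility criterion of the third approach can be made concrete with a small sketch: for each frequency band up to 4000 Hz, choose a gain so that the shaped speech level sits at least 10 dB above the listener's threshold (at least 10 dB sensation level). The threshold and speech-level values below are hypothetical and serve only to show the arithmetic.

```python
"""Hedged sketch: per-band gain for a >= 10 dB sensation-level target."""

TARGET_SL_DB = 10  # minimum sensation level required by the third approach

def shaping_gain(speech_band_db, threshold_db, target_sl_db=TARGET_SL_DB):
    """Gain (dB) so that speech_band_db + gain >= threshold_db + target_sl_db."""
    return max(0.0, (threshold_db + target_sl_db) - speech_band_db)

# Hypothetical listener: thresholds and unaided speech band levels (dB SPL)
bands_hz = [250, 500, 1000, 2000, 4000]
thresholds = [30, 35, 45, 60, 70]
speech_levels = [55, 52, 48, 42, 38]

for f, thr, spl in zip(bands_hz, thresholds, speech_levels):
    g = shaping_gain(spl, thr)
    print(f"{f:>4} Hz: threshold {thr} dB, speech {spl} dB -> gain {g:.0f} dB")
```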


2021 · pp. 1-12 · Author(s): Sandhya, Vinay, V. Manchaiah

Purpose: Multimodal sensory integration in audiovisual (AV) speech perception is a naturally occurring phenomenon. Modality-specific responses such as auditory left, auditory right, and visual responses to dichotic incongruent AV speech stimuli help in understanding AV speech processing through each input modality. The distribution of activity in the frontal motor areas involved in speech production has been shown to correlate with how subjects perceive the same syllable differently or perceive different syllables. This study investigated the distribution of modality-specific responses to dichotic incongruent AV speech stimuli by simultaneously presenting consonant–vowel (CV) syllables with different places of articulation to the participant's left and right ears and visually. Design: A dichotic experimental design was adopted. Six stop CV syllables /pa/, /ta/, /ka/, /ba/, /da/, and /ga/ were assembled to create dichotic incongruent AV speech material. Participants included 40 native speakers of Norwegian (20 women, M age = 22.6 years, SD = 2.43 years; 20 men, M age = 23.7 years, SD = 2.08 years). Results: Findings of this study showed that, under dichotic listening conditions, velar CV syllables resulted in the highest scores in the respective ears, which might be explained by the stimulus dominance of velar consonants, as shown in previous studies. However, this study, with dichotic auditory stimuli accompanied by an incongruent video segment, demonstrated that the presentation of a visually distinct video segment possibly draws attention to the video segment in some participants, thereby reducing the overall recognition of the dominant syllable. Furthermore, the findings suggest the possibility of shorter response times to incongruent AV stimuli in females compared with males. Conclusion: The identification of the left audio, right audio, and visual segments in dichotic incongruent AV stimuli depends on place of articulation, stimulus dominance, and voice onset time of the CV syllables.
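
To illustrate how dichotic incongruent AV trials might be assembled from the six stop CV syllables, the sketch below enumerates left-ear/right-ear/visual combinations in which all three syllables differ in place of articulation. The abstract does not give the actual stimulus list, so this enumeration is only an assumption about the general construction.

```python
"""Hedged sketch: enumerating dichotic incongruent AV syllable combinations."""

from itertools import permutations

PLACE = {"pa": "bilabial", "ba": "bilabial",
         "ta": "alveolar", "da": "alveolar",
         "ka": "velar", "ga": "velar"}

trials = [
    {"left_ear": l, "right_ear": r, "visual": v}
    for l, r, v in permutations(PLACE, 3)
    if len({PLACE[l], PLACE[r], PLACE[v]}) == 3  # all three places distinct
]

print(f"{len(trials)} candidate incongruent AV trials, e.g.:")
for t in trials[:3]:
    print(t)
```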


Author(s): Azra N. Ali, Michael Ingleby

Over the last three decades, priming and masking experiments and corpus frequency studies have dominated attempts to find ranking in the decomposability of words containing morphological affixes. Here we establish the feasibility of using another experimental probe based on audiovisually incongruent speech stimuli. In response to such stimuli, a proportion of participants report percepts that differ in place of articulation from either the audio or the visual signal, typically reporting the percept /t/ when receiving audio /p/ dubbed onto visual /k/. We study the systematic variation of this proportion, the McGurk fusion rate, using a small corpus with affixes.
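
The fusion rate described here could be tallied as follows for one incongruent stimulus (audio /p/ dubbed onto visual /k/): a response counts as a fusion if its place of articulation differs from both the audio and the visual consonant. The response data and place-of-articulation table in the Python sketch below are invented for illustration.

```python
"""Hedged sketch: computing a McGurk fusion rate for one AV pairing."""

PLACE = {"p": "bilabial", "b": "bilabial",
         "t": "alveolar", "d": "alveolar",
         "k": "velar", "g": "velar"}

def fusion_rate(responses, audio, visual):
    """Proportion of responses whose place differs from both inputs."""
    fusions = [r for r in responses
               if PLACE[r] not in (PLACE[audio], PLACE[visual])]
    return len(fusions) / len(responses)

# Hypothetical responses from 10 participants to audio /p/ + visual /k/
responses = ["t", "t", "p", "t", "k", "t", "t", "p", "t", "t"]
print(f"fusion rate: {fusion_rate(responses, audio='p', visual='k'):.0%}")  # 70%
```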


2012 · Vol 25 (0) · pp. 9 · Author(s): Elliot D. Freeman, Alberta Ipser

Due to physical and neural delays, the sight and sound of a person speaking cause a cacophony of asynchronous events in the brain. How can we still perceive them as simultaneous? Our converging evidence suggests that, actually, we do not. Patient PH, with midbrain and auditory brainstem lesions, experiences voices leading lip movements by approximately 200 ms. In temporal order judgements (TOJ) he experiences simultaneity only when voices physically lag lips. In contrast, he requires the opposite visual lag (again of about 200 ms) to experience the classic McGurk illusion (e.g., hearing ‘da’ when listening to /ba/ and watching lips say [ga]), consistent with pathological auditory slowing. These delays seem to be specific to speech stimuli. Is PH just an anomaly? Surprisingly, across neurotypical individuals, the temporal tunings of McGurk integration and of TOJ are actually negatively correlated. Thus some people require a small auditory lead for optimal McGurk integration but an auditory lag for subjective simultaneity (like PH, but not as extreme), while others show the opposite pattern. Evidently, any individual can concurrently experience the same external events as happening at different times. These dissociative patterns confirm that distinct mechanisms for audiovisual synchronization versus integration are each subject to different neural delays. To explain the apparent repulsion of their respective timings, we propose that multimodal synchronization is achieved by discounting the average neural event time within each modality. Lesions or individual differences which slow the propagation of neural signals will then attract the average, so that relatively undelayed neural signals will be experienced as occurring relatively early.
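
A toy illustration of the "discount the average neural event time" idea: within a modality, each signal's perceived timing is taken relative to the average delay of all that modality's signals, so slowing one pathway drags the average up and makes an unaffected pathway appear relatively early. The pathway names and delay values in this Python sketch are my own assumptions, not the authors' model parameters.

```python
"""Hedged sketch: perceived timing after discounting the within-modality average delay."""

def perceived_offsets(delays_ms):
    """Each pathway's perceived timing relative to the modality-average delay."""
    avg = sum(delays_ms.values()) / len(delays_ms)
    return {name: d - avg for name, d in delays_ms.items()}

# Hypothetical auditory pathways: one feeding temporal-order judgements (TOJ),
# one feeding audiovisual integration (McGurk).
typical = {"toj_pathway": 50, "integration_pathway": 50}
lesioned = {"toj_pathway": 50, "integration_pathway": 250}  # integration pathway slowed by 200 ms

print("typical :", perceived_offsets(typical))   # both at 0: no relative shift
print("lesioned:", perceived_offsets(lesioned))  # TOJ pathway -100 ms, integration +100 ms
# Slowing one pathway attracts the average, pushing the two measures of
# audiovisual timing in opposite directions, as proposed above.
```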

