Unity assumption between face and voice modulates audiovisual temporal recalibration

2021 ◽  
Author(s):  
Kyuto Uno ◽  
Kazuhiko Yokosawa

Audiovisual temporal recalibration refers to a shift in the point of subjective simultaneity (PSS) between audio and visual signals triggered by prolonged exposure to asynchronies between these signals. Previous research indicated that the spatial proximity of audiovisual signals can be a determinant of which pairs of signals are temporally recalibrated when multiple events compete for recalibration. Here we show that temporal recalibration is modulated by an observer’s assumption that the audiovisual signals originate from the same unitary event (“unity assumption”). Participants were shown alternating face photos and voices of male and female speakers. These stimuli were presented equally spaced in time, and the voices were presented monaurally through headphones, such that no spatiotemporal grouping was implied for these stimuli. There were two conditions for the stimulus sequence in the adaptation phase: one in which a face photo always preceded its corresponding voice within each pairing of audiovisual stimuli (i.e., multiple repetitions of the sequence: female face – female voice – male face – male voice), and another in which each voice always preceded its corresponding face photo. We found that the PSS between these audiovisual signals shifted towards the temporal order presented for same-gender face–voice pairs. The results show that the unity assumption between face photos and voices affects temporal recalibration, indicating that the brain may selectively recalibrate the asynchronies of audiovisual signals that are considered to originate from the same unitary event in a cluttered environment.
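Several of the abstracts in this collection estimate a PSS from simultaneity-judgment data. As a minimal sketch (not any of the authors' analysis code), the proportion of "simultaneous" responses can be fit with a Gaussian whose peak location is taken as the PSS; the SOA values and response proportions below are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(soa, amplitude, pss, width):
    """Proportion of 'simultaneous' responses as a function of SOA (ms).
    Convention here: negative SOA = audio leads, positive = video leads."""
    return amplitude * np.exp(-0.5 * ((soa - pss) / width) ** 2)

soas = np.array([-300, -200, -100, 0, 100, 200, 300])            # ms, hypothetical
p_simultaneous = np.array([0.05, 0.20, 0.60, 0.90, 0.75, 0.30, 0.10])

params, _ = curve_fit(gaussian, soas, p_simultaneous, p0=[1.0, 0.0, 100.0])
amplitude, pss, width = params
# Recalibration is then quantified as the PSS difference between adaptation conditions.
print(f"Estimated PSS: {pss:.1f} ms")
```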

2015 ◽  
Vol 28 (3-4) ◽  
pp. 351-370 ◽  
Author(s):  
Hao Tam Ho ◽  
Emily Orchard-Mills ◽  
...

Following prolonged exposure to audiovisual asynchrony, an observer’s point of subjective simultaneity (PSS) shifts in the direction of the leading modality. It has been debated whether other sensory pairings, such as vision and touch, lead to similar temporal recalibration, and if so, whether the internal timing mechanism underlying visuotactile lag adaptation is centralised or distributed. To address these questions, we adapted observers to vision-leading and tactile-leading visuotactile asynchrony on either their left or right hand side in different blocks. In one test condition, participants performed a simultaneity judgment on the adapted side (unilateral), and in another they performed a simultaneity judgment on the non-adapted side (contralateral). In a third condition, participants adapted concurrently to equal and opposite asynchronies on each side and were tested randomly on either hand (bilateral opposed). Results from the first two conditions show that observers recalibrate to visuotactile asynchronies, and that the recalibration transfers to the non-adapted side. These findings suggest a centralised recalibration mechanism not linked to the adapted side, and predict no recalibration in the bilateral opposed condition, assuming the adaptation effects were equal on each side. This was confirmed in the group of participants that adapted to vision-leading asynchrony on the right hand side and tactile-leading asynchrony on the left. However, the other group (vision-leading on the left and tactile-leading on the right) did show a recalibration effect, suggesting a distributed mechanism. We discuss these findings in terms of a hybrid model that assumes the co-existence of centralised and distributed timing mechanisms.
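A toy illustration (our sketch, not the authors' model) of why a purely centralised mechanism predicts no net recalibration in the bilateral opposed condition: equal and opposite adaptation effects cancel when pooled into one shared mechanism, whereas a distributed, side-specific mechanism preserves them. All values are illustrative.

```python
def predicted_pss_shift(adapt_left_ms, adapt_right_ms, test_side, centralised):
    """Predicted PSS shift (ms) when testing one hand after adaptation.
    Positive values = shift induced by vision-leading adaptation."""
    if centralised:
        # A single shared timing mechanism pools adaptation from both sides.
        return (adapt_left_ms + adapt_right_ms) / 2.0
    # Side-specific mechanisms keep each hand's adaptation separate.
    return adapt_left_ms if test_side == "left" else adapt_right_ms

# Equal and opposite adaptation: e.g., +50 ms (vision-leading) on the left,
# -50 ms (tactile-leading) on the right.
print(predicted_pss_shift(50, -50, "left", centralised=True))   # 0.0  -> no recalibration
print(predicted_pss_shift(50, -50, "left", centralised=False))  # 50.0 -> recalibration persists
```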


2015 ◽  
Vol 282 (1804) ◽  
pp. 20143083 ◽  
Author(s):  
Erik Van der Burg ◽  
Patrick T. Goodbourn

The brain is adaptive. The speed of propagation through air, and of low-level sensory processing, differs markedly between auditory and visual stimuli; yet the brain can adapt to compensate for the resulting cross-modal delays. Studies investigating temporal recalibration to audiovisual speech have used prolonged adaptation procedures, suggesting that adaptation is sluggish. Here, we show that adaptation to asynchronous audiovisual speech occurs rapidly. Participants viewed a brief clip of an actor pronouncing a single syllable. The voice was either advanced or delayed relative to the corresponding lip movements, and participants were asked to make a synchrony judgement. Although we did not use an explicit adaptation procedure, we demonstrate rapid recalibration based on a single audiovisual event. We find that the point of subjective simultaneity on each trial is highly contingent upon the modality order of the preceding trial. We find compelling evidence that rapid recalibration generalizes across different stimuli and different actors. Finally, we demonstrate that rapid recalibration occurs even when auditory and visual events clearly belong to different actors. These results suggest that rapid temporal recalibration to audiovisual speech is primarily mediated by basic temporal factors, rather than by higher-order factors such as perceived simultaneity and source identity.
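A minimal sketch of the inter-trial analysis described above, on entirely fabricated data (so no real shift is expected here): trials are split by the modality order of the preceding trial, and a PSS is estimated within each split; a PSS difference between splits indicates rapid, trial-by-trial recalibration.

```python
import numpy as np

rng = np.random.default_rng(0)
soas = rng.choice([-300, -150, 0, 150, 300], size=1000)  # ms, audio-minus-video; negative = audio first
judged_sync = rng.random(1000) < np.exp(-0.5 * (soas / 150) ** 2)  # fake responses

valid = np.ones(1000, dtype=bool)
valid[0] = False                        # the first trial has no predecessor
prev_audio_led = np.roll(soas, 1) < 0   # modality order of the preceding trial

for label, mask in [("after audio-led trial", prev_audio_led),
                    ("after video-led trial", ~prev_audio_led)]:
    sel = valid & mask & judged_sync
    # Mean judged-synchronous SOA as a crude PSS proxy; a full analysis would
    # fit a psychometric function per split, as in the sketch further above.
    print(f"{label}: PSS proxy = {soas[sel].mean():.1f} ms")
```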


2010 ◽  
Vol 278 (1705) ◽  
pp. 535-538 ◽  
Author(s):  
Derek H. Arnold ◽  
Kielan Yarrow

Our sense of relative timing is malleable. For instance, visual signals can be made to seem synchronous with earlier sounds following prolonged exposure to an environment wherein auditory signals precede visual ones. Similarly, actions can be made to seem to precede their own consequences if an artificial delay is imposed for a period, and then removed. Here, we show that our sense of relative timing for combinations of visual changes is similarly pliant. We find that direction reversals can be made to seem synchronous with unusually early colour changes after prolonged exposure to a stimulus wherein colour changes precede direction changes. The opposite effect is induced by prolonged exposure to colour changes that lag direction changes. Our data are consistent with the proposal that our sense of timing for changes encoded by distinct sensory mechanisms can adjust, at least to some degree, to the prevailing environment. Moreover, they reveal that visual analyses of colour and motion are sufficiently independent for this to occur.


Perception ◽  
1993 ◽  
Vol 22 (8) ◽  
pp. 963-970 ◽  
Author(s):  
Piotr Jaśkowski

Point of subjective simultaneity and simple reaction time were compared for stimuli with different rise times. It was found that these measures behave differently. To explain this result, it is suggested that in temporal-order judgment the subject takes into account not only the stimulus onset but also other events connected with stimulus presentation.


2019 ◽  
Vol 9 (8) ◽  
pp. 1607-1613 ◽  
Author(s):  
Xinxin Sun ◽  
Wenkui Jin

Purpose: To analyze elderly patients' demands regarding the emotional attributes of speech, which differ from those of the general population, and to apply the results to the design of a smart voice assistant for elderly patients. Methods: Four character types, called expert, assistant, family member and pet, were defined according to two evaluation scales: 'friendly–unfriendly' and 'dominant–submissive.' Professional Mandarin Chinese broadcasters were invited to record a neutral sentence. By varying pitch, volume, timbre and sound quality, the character types were simulated and eight voice recordings were generated. Twenty elderly people were selected to grade these eight recordings according to their acceptability. Results: The male test group's mean preference for the male voice was 4.4250, slightly higher than its preference for the female voice (3.9000). However, in an independent-samples t-test the significance of the F test was 0.345 > 0.05, while that of the corresponding t-test was 0.051, slightly above 0.05; hence the male test group showed essentially no preference between voice genders. The female test group's mean preference was 3.9500 for the male voice and 4.9750 for the female voice, so the female test group preferred the female voice. In the independent-samples t-test, the significance of the F test was 0.909 > 0.05, while that of the t-test was 0.000 < 0.05; hence the female test group's voice-gender preference was statistically significant. Regarding the individual voice character types, the male test group's mean grades for the expert, assistant, family and pet types were 4.75, 4.85, 3.90 and 3.15, while the female test group's were 3.65, 4.40, 5.40 and 4.41. Conclusion: Based on the experimental results, a family-member-type female voice is the best choice for voice interaction with elderly patients in the home environment.
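A minimal reproduction sketch of the reported female-group comparison (Levene's test for equality of variances followed by an independent-samples t-test, as in SPSS's "Independent Samples T Test" output). The per-participant ratings below are hypothetical, chosen only so that the group means match the reported 3.9500 and 4.9750; a 1-7 rating scale is assumed.

```python
from scipy import stats

# Hypothetical ratings for 10 female participants (means match the abstract).
female_group_male_voice   = [4.0, 3.5, 4.5, 3.8, 4.2, 3.6, 4.1, 3.9, 3.7, 4.2]   # mean 3.950
female_group_female_voice = [5.0, 4.8, 5.2, 4.9, 5.1, 4.7, 5.3, 4.6, 5.0, 5.15]  # mean 4.975

f_stat, f_p = stats.levene(female_group_male_voice, female_group_female_voice)
t_stat, t_p = stats.ttest_ind(female_group_male_voice, female_group_female_voice,
                              equal_var=f_p > 0.05)  # pool variances only if Levene's test passes
print(f"Levene p = {f_p:.3f}, t-test p = {t_p:.3f}")
```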


Perception ◽  
1991 ◽  
Vol 20 (6) ◽  
pp. 715-726 ◽  
Author(s):  
Piotr Jaśkowski

Temporal-order judgment was investigated for a pair of visual stimuli with different durations in order to check whether offset asynchrony can disturb the perception of the order/simultaneity of onset. In experiment 1 the point of subjective simultaneity was estimated by the method of adjustment. The difference in duration of the two stimuli in the pair was either 0 or 50 ms. It was found that subjects shifted the onset of the shorter stimulus towards the offset of the longer one to obtain a satisfying impression of simultaneity, even though they were asked to ignore events concerning the stimulus offset. In experiments 2 and 3 the method of constant stimuli was applied. Both experiments indicate that subjects, in spite of instructions, take the offset asynchrony into account in their judgments.


2021 ◽  
Author(s):  
Peter Loksa ◽  
Norbert Kopco

Background: The ventriloquism aftereffect (VAE), observed as a shift in the perceived locations of sounds after audiovisual stimulation, requires reference frame (RF) alignment, since hearing and vision encode space in different RFs (head-centered, HC, vs. eye-centered, EC). Experimental studies examining the RF of the VAE have found inconsistent results: a mixture of HC and EC RFs was observed for VAE induced in the central region, while a predominantly HC RF was observed in the periphery. Here, a computational model examines these inconsistencies, as well as a newly observed EC adaptation induced by spatially aligned (AV-aligned) audiovisual stimuli. Methods: The model has two versions, each containing two additively combined components: a saccade-related component characterizing the adaptation in auditory-saccade responses, and an auditory space representation adapted by ventriloquism signals either in the HC RF (HC version) or in a combination of HC and EC RFs (HEC version). Results: The HEC model performed better than the HC model in the main simulation considering all the data, while the HC model was more appropriate when only the AV-aligned adaptation data were simulated. Conclusion: Visual signals in a uniform mixed HC+EC RF are likely used to calibrate the auditory spatial representation, even after the EC-referenced auditory-saccade adaptation is accounted for.
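A schematic sketch of our reading of the model structure described above (not the authors' code): a saccade-related term is combined additively with a ventriloquism term referenced either purely to the head (HC version, hc_weight = 1) or to a head+eye mixture (HEC version, 0 < hc_weight < 1). The adaptation field and all numeric values are illustrative.

```python
def predicted_shift(sound_az_hc, eye_az, saccade_term, adapt_field, hc_weight):
    """Predicted auditory localization shift (deg) for a sound at a
    head-centered azimuth, given the current eye position."""
    hc_component = adapt_field(sound_az_hc)           # lookup in the head-centered RF
    ec_component = adapt_field(sound_az_hc - eye_az)  # lookup in the eye-centered RF
    return saccade_term + hc_weight * hc_component + (1 - hc_weight) * ec_component

# Illustrative adaptation field: ventriloquism shifts learned only near +10 deg.
adapt = lambda az: 3.0 if abs(az - 10) < 15 else 0.0

# Same sound, two fixation positions: the EC part of the HEC version makes the
# predicted shift move with the eyes; a pure HC version (hc_weight=1) would not.
print(predicted_shift(10, eye_az=0,  saccade_term=1.0, adapt_field=adapt, hc_weight=0.5))  # 1.0 + 3.0
print(predicted_shift(10, eye_az=30, saccade_term=1.0, adapt_field=adapt, hc_weight=0.5))  # 1.0 + 1.5
```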


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0253130 ◽  
Author(s):  
Nina Heins ◽  
Jennifer Pomp ◽  
Daniel S. Kluger ◽  
Stefan Vinbrüx ◽  
Ima Trempler ◽  
...  

Auditory and visual percepts are integrated even when they are not perfectly temporally aligned with each other, especially when the visual signal precedes the auditory signal. This window of temporal integration for asynchronous audiovisual stimuli is relatively well examined in the case of speech, while other natural action-induced sounds have been widely neglected. Here, we studied the detection of audiovisual asynchrony in three different whole-body actions with natural action-induced sounds: hurdling, tap dancing and drumming. In Study 1, we examined whether audiovisual asynchrony detection, assessed by a simultaneity judgment task, differs as a function of sound production intentionality. Based on previous findings, we expected auditory and visual signals to be integrated over a wider temporal window for actions creating sounds intentionally (tap dancing) than for actions creating sounds incidentally (hurdling). While the percentages of perceived synchrony differed in the expected way, we identified two further factors, namely high event density and low rhythmicity, that induced higher synchrony ratings as well. Therefore, in Study 2 we systematically varied event density and rhythmicity, this time using drumming stimuli to exert full control over these variables, with the same simultaneity judgment task. Results suggest that high event density leads to a bias to integrate rather than segregate auditory and visual signals, even at relatively large asynchronies. Rhythmicity had a similar, albeit weaker, effect when event density was low. Our findings demonstrate that shorter asynchronies and visual-first asynchronies lead to higher synchrony ratings of whole-body action, pointing to clear parallels with audiovisual integration in speech perception. Overconfidence in the naturally expected synchrony of sound and sight was stronger for intentional (vs. incidental) sound production and for movements with high (vs. low) rhythmicity, presumably because both encourage predictive processes. In contrast, high event density appears to increase synchrony judgments simply because it makes the detection of audiovisual asynchrony more difficult. More studies using real-life audiovisual stimuli with varying event densities and rhythmicities are needed to fully uncover the general mechanisms of audiovisual integration.
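A minimal sketch (hypothetical group means, not the authors' data) of how a temporal integration window can be summarized from simultaneity-judgment data: the range of SOAs at which perceived synchrony exceeds a criterion. The made-up numbers below illustrate both a wider window for tap dancing than for hurdling and the asymmetry toward visual-first asynchronies.

```python
import numpy as np

soas = np.array([-300, -200, -100, 0, 100, 200, 300])  # ms; positive = visual first
p_sync_hurdling   = np.array([0.10, 0.25, 0.55, 0.90, 0.80, 0.50, 0.20])
p_sync_tapdancing = np.array([0.20, 0.55, 0.75, 0.95, 0.90, 0.70, 0.45])

for label, p in [("hurdling", p_sync_hurdling), ("tap dancing", p_sync_tapdancing)]:
    inside = soas[p >= 0.5]  # criterion: at least 50% "synchronous" responses
    print(f"{label}: integration window approx. {inside.min()} to {inside.max()} ms")
```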


2021 ◽  
Vol 10 (2) ◽  
Author(s):  
Claire Luo ◽  
Olivia Yeroushalmi ◽  
Alan Schorn

The original study of the McGurk effect, a perceptual phenomenon in which contradictory audiovisual stimuli fuse to create the illusion of a third sound, was carried out by psychologists McGurk and MacDonald in 1976. Early experiments showed that observers use both auditory and visual signals while being spoken to: auditory signals being the sound waves entering their ears, and visual signals being how the speaker moves his or her face while pronouncing a word. When conflicting signals are given, a third sound is perceived, as the brain is disoriented by the differing signals. The idea that musicians have superior audiovisual cortices has led some to ask whether musicians are as susceptible to the McGurk effect as non-musicians. To investigate musicians' susceptibility to the McGurk effect, the experiment included a total of 40 subjects: 20 musicians and 20 non-musicians. The subjects were played a control video of a speaker saying “ga” and were then presented with four audiovisually incongruent videos, all containing a speaker mouthing the word “ga” with an audio recording of the speaker saying “ba” dubbed on. Two main 2x2 chi-square tests and fifteen secondary 2x2 chi-square tests were run in total. The two main tests, which compared the number of McGurk interpretations to either audio or visual interpretations, both produced a p-value of <.0005. Upon further analysis, 25.7% of musicians reported a McGurk interpretation, as opposed to 52.2% of non-musicians, implying that musicians are less susceptible to the McGurk effect.
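A minimal sketch of one such 2x2 chi-square test in Python (not the study's analysis). The cell counts are hypothetical, chosen only so that the McGurk rates match the reported 25.7% (musicians) and 52.2% (non-musicians).

```python
from scipy.stats import chi2_contingency

#                 McGurk   other (audio or visual) interpretations
musicians     = [   18,      52 ]   # 18/70  = 25.7% McGurk, hypothetical totals
non_musicians = [   47,      43 ]   # 47/90  = 52.2% McGurk, hypothetical totals

chi2, p, dof, expected = chi2_contingency([musicians, non_musicians])
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```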

