The Impact of Temporally Coherent Visual Cues on Speech Perception in Complex Auditory Environments

2021 ◽  
Vol 15 ◽  
Author(s):  
Yi Yuan ◽  
Yasneli Lleo ◽  
Rebecca Daniel ◽  
Alexandra White ◽  
Yonghee Oh

Speech perception often takes place in noisy environments, where multiple auditory signals compete with one another. Adding visual cues such as a talker's face or lip movements to an auditory signal can improve the intelligibility of speech in these suboptimal listening environments; this improvement is referred to as the audiovisual benefit. The current study aimed to delineate the signal-to-noise ratio (SNR) conditions under which visual presentations of the acoustic amplitude envelope have their greatest impact on speech perception. Seventeen adults with normal hearing were recruited. Participants were presented with spoken sentences in babble noise in either auditory-only or auditory-visual conditions at SNRs of −7, −5, −3, −1, and 1 dB. The visual stimulus was a sphere whose size varied in sync with the amplitude envelope of the target speech signal. Participants were asked to transcribe the sentences they heard. A significant improvement in accuracy in the auditory-visual condition over the auditory-only condition was obtained at SNRs of −3 and −1 dB, but not at the other SNRs. These results show that dynamic temporal visual information can benefit speech perception in noise and that the facilitative effect of the visual amplitude envelope is greatest within an intermediate SNR range.
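For readers who want to prototype a comparable stimulus, a minimal sketch is given below: it extracts a smoothed amplitude envelope from a speech waveform (Hilbert magnitude plus low-pass filtering) and maps it onto a sphere radius. The cutoff frequency, radius range, and the synthetic test signal are illustrative assumptions, not details taken from the study.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def amplitude_envelope(speech, fs, cutoff_hz=10.0):
    """Extract a smoothed amplitude envelope from a speech waveform.

    The envelope is the magnitude of the analytic signal, low-pass
    filtered to keep only slow amplitude fluctuations (the cutoff is
    an assumed value, not taken from the study).
    """
    env = np.abs(hilbert(speech))
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, env)

def envelope_to_radius(env, r_min=0.5, r_max=2.0):
    """Map the normalized envelope onto a sphere radius (arbitrary units)."""
    env = (env - env.min()) / (env.max() - env.min() + 1e-12)
    return r_min + env * (r_max - r_min)

# Example: a 1-s synthetic, speech-like amplitude-modulated tone at 16 kHz
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
radii = envelope_to_radius(amplitude_envelope(speech, fs))
```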

Author(s):  
Lorenzo Cangiano ◽  
Sabrina Asteriti

Abstract: In the vertebrate retina, signals generated by cones of different spectral preference and by highly sensitive rod photoreceptors interact at various levels to extract salient visual information. The first opportunity for such interaction is offered by electrical coupling of the photoreceptors themselves, which is mediated by gap junctions located at the contact points of specialised cellular processes: synaptic terminals, telodendria and radial fins. Here, we examine the evolutionary pressures for and against interphotoreceptor coupling, which are likely to have shaped how coupling is deployed in different species. The impact of coupling on signal-to-noise ratio, spatial acuity, contrast sensitivity, absolute and increment threshold, retinal signal flow and colour discrimination is discussed while emphasising available data from a variety of vertebrate models spanning from lampreys to primates. We highlight the many gaps in our knowledge, persistent discrepancies in the literature, and some major unanswered questions on the actual extent and physiological role of cone-cone, rod-cone and rod-rod communication. Lastly, we point toward limited but intriguing evidence suggestive of the ancestral form of coupling among ciliary photoreceptors.


2019 ◽  
Vol 27 (1) ◽  
pp. 47-59 ◽  
Author(s):  
Flavia Gheller ◽  
Elisa Lovo ◽  
Athena Arsie ◽  
Roberto Bovo

The acoustic quality of classrooms is crucial for children's listening skills and, consequently, for their learning. Children's listening abilities are still developing, and an environment with inadequate acoustic characteristics may create additional problems in speech perception and phonetic recognition. Background noise or reverberation may cause auditory processing problems and greater cognitive effort. Other factors, such as learning disabilities, mild to severe hearing loss, or bilingualism, can make listening and understanding in noisy environments even more difficult. It is therefore important to improve the acoustic quality of classrooms, taking into account children's specific needs in terms of signal-to-noise ratio and reverberation time, in order to ensure adequate listening conditions. The aim of this work is to analyse, through a review of previous studies, the impact that classroom acoustics have on children's listening skills and learning activities.
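For context, the reverberation time referred to above is commonly estimated with Sabine's formula from the room volume and the total sound absorption; the short sketch below illustrates that calculation with made-up classroom values (the formula is standard, the numbers are assumptions).

```python
def sabine_rt60(volume_m3, surface_areas_m2, absorption_coeffs):
    """Estimate reverberation time (RT60, seconds) with Sabine's formula:
    RT60 = 0.161 * V / A, where A is the total equivalent absorption area.
    """
    total_absorption = sum(s * a for s, a in zip(surface_areas_m2, absorption_coeffs))
    return 0.161 * volume_m3 / total_absorption

# Example: a 200 m^3 classroom with surfaces of differing absorption (illustrative values)
rt60 = sabine_rt60(200, surface_areas_m2=[120, 50, 50], absorption_coeffs=[0.1, 0.6, 0.05])
```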


2021 ◽  
Author(s):  
Hoyoung Yi ◽  
Ashly Pingsterhaus ◽  
Woonyoung Song

The coronavirus pandemic has resulted in the recommended or required use of face masks in public. Wearing a face mask compromises communication, especially in the presence of competing noise. Measuring the potential adverse effects of face masks on speech intelligibility in contexts with excessive background noise is crucial for developing solutions to this communication challenge. Accordingly, the effects of wearing transparent face masks and of using clear speech to support verbal communication were evaluated here. We measured listener word identification scores across four factors: (1) mask type (no mask, transparent mask, and disposable paper mask), (2) presentation mode (auditory-only and audiovisual), (3) speaking style (conversational speech and clear speech), and (4) background noise type (speech-shaped noise and four-talker babble, both at a −5 dB signal-to-noise ratio). Results showed that in the presence of noise, listeners performed worse when the speaker wore a disposable paper mask or a transparent mask than when the speaker wore no mask. Listeners correctly identified more words in the audiovisual mode when listening to clear speech. The results indicate that the combination of face masks and background noise negatively impacts speech intelligibility for listeners. Transparent masks facilitate the ability to understand target sentences by providing visual information, and clear speech alleviates challenging communication situations, including those with missing visual cues and a reduced acoustic signal.
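For reference, presenting maskers at a fixed signal-to-noise ratio such as the −5 dB used here reduces to scaling the noise relative to the speech RMS level; a minimal sketch, with illustrative function and variable names not taken from the study:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise RMS ratio equals `snr_db`, then mix.

    Assumes `speech` and `noise` are 1-D arrays of equal length.
    """
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    # Desired noise RMS for the target SNR: snr_db = 20 * log10(rms_speech / rms_noise)
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    scaled_noise = noise * (target_noise_rms / rms(noise))
    return speech + scaled_noise

# e.g., the -5 dB SNR condition used in the study:
# mixture = mix_at_snr(speech, babble, snr_db=-5)
```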


2007 ◽  
Vol 44 (5) ◽  
pp. 518-522 ◽  
Author(s):  
Shelley Von Berg ◽  
Douglas McColl ◽  
Tami Brancamp

Objective: This study investigated observers' intelligibility for the spoken output of an individual with Moebius syndrome (MoS) with and without visual cues. Design: An audiovisual recording of the speaker's output was obtained for 50 Speech Intelligibility in Noise sentences consisting of 25 high-predictability and 25 low-predictability sentences. Stimuli were presented to observers under two conditions: audiovisual and audio only. Data were analyzed using a multivariate repeated measures model. Observers: Twenty students and faculty affiliated with the Department of Speech Pathology and Audiology at the University of Nevada, Reno. Results: A mixed-design ANOVA revealed that intelligibility in the audio-only condition was significantly greater than in the audiovisual condition, and accuracy for high-predictability sentences was significantly greater than for low-predictability sentences. Conclusions: The compensatory substitutional placements for phonemes produced by MoS speakers may detract from the intelligibility of speech. This is similar to the McGurk-MacDonald effect, whereby an illusory auditory percept arises when visual information from lip movements does not match the auditory information from speech. It also suggests that observers use contextual cues, more than the acoustic signal alone, to arrive at accurate recognition of the message of speakers with MoS. Therefore, speakers with MoS should be counseled in the top-down approach of auditory closure: when the speech signal is degraded, predictable messages are more easily understood than unpredictable ones. It is also important to confirm the speaking partner's understanding of the topic before proceeding.


2021 ◽  
Vol 42 (03) ◽  
pp. 260-281
Author(s):  
Asger Heidemann Andersen ◽  
Sébastien Santurette ◽  
Michael Syskind Pedersen ◽  
Emina Alickovic ◽  
Lorenz Fiedler ◽  
...  

Abstract: Hearing aids continue to acquire increasingly sophisticated sound-processing features beyond basic amplification. On the one hand, these features have the potential to add user benefit and allow for personalization. On the other hand, if they are to deliver their full potential benefit, they require clinicians to be acquainted with both the underlying technologies and the specific fitting handles made available by the individual hearing aid manufacturers. Ensuring benefit from hearing aids in typical daily listening environments requires that the hearing aids handle sounds that interfere with communication, generically referred to as "noise." With this aim, considerable efforts from both academia and industry have led to increasingly advanced algorithms that handle noise, typically using the principles of directional processing and postfiltering. This article provides an overview of the techniques used for noise reduction in modern hearing aids. First, classical techniques are covered as they are used in modern hearing aids. The discussion then shifts to how deep learning, a subfield of artificial intelligence, provides a radically different way of solving the noise problem. Finally, the results of several experiments are used to showcase the benefits of recent algorithmic advances in terms of signal-to-noise ratio, speech intelligibility, selective attention, and listening effort.
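As a concrete, heavily simplified illustration of the postfiltering principle mentioned above, the sketch below applies a Wiener-style gain to each time-frequency bin of a noisy short-time spectrum. Real hearing-aid algorithms are far more sophisticated; the gain floor and the crude spectral-subtraction SNR estimate are assumptions for the example only.

```python
import numpy as np

def wiener_postfilter(noisy_stft, noise_psd, gain_floor=0.1):
    """Apply a Wiener-style gain G = SNR / (1 + SNR) per time-frequency bin.

    noisy_stft: complex STFT of the noisy signal, shape (freq, frames).
    noise_psd:  estimated noise power per frequency bin, shape (freq,).
    gain_floor: lower limit on the gain to limit musical noise (assumed value).
    """
    noisy_psd = np.abs(noisy_stft) ** 2
    # Crude a priori SNR estimate via spectral subtraction
    snr_est = np.maximum(noisy_psd / noise_psd[:, None] - 1.0, 0.0)
    gain = snr_est / (1.0 + snr_est)
    return np.maximum(gain, gain_floor) * noisy_stft
```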


Author(s):  
Grant McGuire ◽  
Molly Babel

Abstract: While the role of auditory saliency is well accepted as providing insight into the shaping of phonological systems, the influence of visual saliency on such systems has been neglected. This paper provides evidence for the importance of visual information in historical phonological change and synchronic variation through a series of audio-visual experiments with the /f/∼/θ/ contrast. /θ/ is typologically rare, an atypical target in sound change, acquired comparatively late, and synchronically variable in language inventories. Previous explanations for these patterns have focused on either the articulatory difficulty of an interdental tongue gesture or the perceptual similarity /θ/ shares with labiodental fricatives. We hypothesize that the bias is due to an asymmetry in audio-visual phonetic cues and cue variability within and across talkers. Support for this hypothesis comes from a speech perception study that explored the weighting of audio and visual cues for /f/ and /θ/ identification in CV, VC, and VCV syllabic environments in /i/, /a/, or /u/ vowel contexts in Audio, Visual, and Audio-Visual experimental conditions using stimuli from ten different talkers. The results indicate that /θ/ is more variable than /f/, both in Audio and Visual conditions. We propose that it is this variability which contributes to the unstable nature of /θ/ across time and offers an improved explanation for the observed synchronic and diachronic asymmetries in its patterning.


2012 ◽  
Vol 25 (0) ◽  
pp. 112 ◽  
Author(s):  
Lukasz Piwek ◽  
Karin Petrini ◽  
Frank E. Pollick

Multimodal perception of emotions has typically been examined using displays of a solitary character (e.g., the face–voice and/or body–sound of one actor). We extend this investigation to more complex, dyadic point-light displays combined with speech. A motion and voice capture system was used to record twenty actors interacting in couples with happy, angry and neutral emotional expressions. The obtained stimuli were validated in a pilot study and used in the present study to investigate multimodal perception of emotional social interactions. Participants were required to categorize happy and angry expressions displayed visually, auditorily, or using emotionally congruent and incongruent bimodal displays. In a series of cross-validation experiments, we found that sound dominated the visual signal in the perception of emotional social interaction. Although participants' judgments were faster in the bimodal condition, the accuracy of judgments was similar for the bimodal and auditory-only conditions. When participants watched emotionally mismatched bimodal displays, they predominantly oriented their judgments towards the auditory rather than the visual signal. This auditory dominance persisted even when the reliability of the auditory signal was decreased with noise, although visual information had some effect on judgments of emotions when it was combined with a noisy auditory signal. Our results suggest that when judging emotions from an observed social interaction, we rely primarily on vocal cues from the conversation rather than on visual cues from the actors' body movements.


2021 ◽  
Vol 130 ◽  
pp. 02001
Author(s):  
Marion Giroux ◽  
Julien Barra ◽  
Christian Graff ◽  
Michel Guerraz

In virtual reality, users do not receive any visual information from their own body. Avatars are therefore often used, and they can be embodied, which alters the body representation. We suggested that the perception of one's own movements (i.e., kinaesthesia) can be altered as well. We investigated whether visual cues from an avatar can be used for kinaesthesia and to what extent such cues can deviate from natural ones. We used a paradigm in which the participant's left forearm was moved passively, in correlation with the movement of both forearms of the avatar. Such a visuo-proprioceptive combination induces kinaesthetic illusions in the participant's right forearm. The impact of the avatar's morphological similarity (semantic congruency) and visual perspective (spatial congruency) was investigated. Results indicated that the avatar's movements are processed as one's own movements. Morphological similarity and a first-person perspective were not necessary, but they reinforced the illusions. Thus, visual motion cues can deviate strongly from natural ones in morphology and perspective and still contribute to kinaesthesia.


2021 ◽  
Vol 25 ◽  
pp. 233121652110141
Author(s):  
Anja Eichenauer ◽  
Uwe Baumann ◽  
Timo Stöver ◽  
Tobias Weissgerber

Clinical speech perception tests with simple presentation conditions often overestimate the impact of signal preprocessing on speech perception in complex listening environments. A new procedure was developed to assess speech perception in interleaved acoustic environments of different complexity, allowing investigation of the impact of an automatic scene classification (ASC) algorithm on speech perception. The procedure was applied in cohorts of normal-hearing (NH) controls and unilateral and bilateral cochlear implant (CI) users. Speech reception thresholds (SRTs) were measured by means of a matrix sentence test in five acoustic environments that included different noise conditions (amplitude-modulated and continuous), two spatial configurations, and reverberation. The acoustic environments were presented in randomized, mixed order within a single experimental run. The acoustic room simulation was played back over a loudspeaker auralization setup with 128 loudspeakers. Eighteen NH listeners, 16 unilateral CI users, and 16 bilateral CI users participated. SRTs were evaluated for each individual acoustic environment and as a mean SRT. With activated ASC, the mean SRT improved by 2.4 dB signal-to-noise ratio for unilateral and by 1.3 dB for bilateral CI users. Without ASC, the mean SRT of bilateral CI users was 3.7 dB better than that of unilateral CI users. The mean SRTs differed significantly among groups, with the NH group performing best and unilateral CI users performing worst, up to 13 dB poorer than the NH group. The proposed speech test procedure successfully demonstrated that both speech perception and the benefit from ASC depend on the acoustic environment.
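For orientation, matrix sentence tests typically estimate the SRT adaptively by adjusting the SNR after each sentence according to the number of correctly repeated words. The sketch below shows a generic, simplified adaptive track of this kind; the step size, decision rule, and averaging scheme are assumptions and not the exact procedure used in this study.

```python
def adaptive_srt(present_trial, start_snr_db=0.0, step_db=2.0, n_trials=20):
    """Generic adaptive SNR track converging near 50% sentence scoring.

    `present_trial(snr_db)` presents one 5-word matrix sentence at the given
    SNR and returns the number of correctly repeated words (0-5). This is a
    simplified illustration, not the procedure used in the study.
    """
    snr = start_snr_db
    track = []
    for _ in range(n_trials):
        correct = present_trial(snr)
        track.append(snr)
        # Make the task harder (lower SNR) after a mostly correct trial, easier otherwise
        snr += -step_db if correct >= 3 else step_db
    # Estimate the SRT as the mean SNR over the second half of the track
    second_half = track[len(track) // 2:]
    return sum(second_half) / len(second_half)
```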


2018 ◽  
Vol 30 (6) ◽  
pp. 1573-1611 ◽  
Author(s):  
Matias Calderini ◽  
Sophie Zhang ◽  
Nareg Berberian ◽  
Jean-Philippe Thivierge

The neural correlates of decision making have been extensively studied with tasks involving a choice between two alternatives that is guided by visual cues. While a large body of work argues for a role of the lateral intraparietal (LIP) region of cortex in these tasks, this role may be confounded by the interaction between LIP and other regions, including the middle temporal (MT) area of cortex. Here, we describe a simplified linear model of decision making that is adapted to two tasks: a motion discrimination task and a categorization task. We show that the distinct contributions of MT and LIP may indeed be confounded in these tasks. In particular, we argue that the motion discrimination task relies on a straightforward visuomotor mapping, which leads to redundant information between MT and LIP. The categorization task requires a more complex mapping between visual information and decision behavior and therefore does not lead to redundancy between MT and LIP. Going further, the model predicts that noise correlations within LIP should be greater in the categorization task than in the motion discrimination task because of shared inputs from MT. The impact of these correlations on task performance is examined by analytically deriving error estimates of an optimal linear readout for shared and unique inputs. Taken together, the results clarify the contributions of MT and LIP to decision making and help characterize the role of noise correlations in these regions.
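To make the noise-correlation argument concrete, the toy sketch below compares the variance of a simple averaging readout when two input units carry independent versus partially shared noise; it illustrates the general principle only and is not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, signal = 10_000, 1.0

def readout_variance(shared_fraction):
    """Variance of the mean of two noisy units whose noise is partly shared.

    Each unit has total noise variance 1; `shared_fraction` of it is common
    to both units, the rest is private. Shared noise cannot be averaged away.
    """
    shared = rng.normal(0, 1, n_trials) * np.sqrt(shared_fraction)
    unit1 = signal + shared + rng.normal(0, 1, n_trials) * np.sqrt(1 - shared_fraction)
    unit2 = signal + shared + rng.normal(0, 1, n_trials) * np.sqrt(1 - shared_fraction)
    return np.var((unit1 + unit2) / 2)

print(readout_variance(0.0))  # independent noise: readout variance ~ 0.5
print(readout_variance(0.8))  # largely shared noise: ~ 0.9, averaging helps much less
```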

