Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers

2019
Vol 23
pp. 233121651983786
Author(s):  
Catherine L. Blackburn ◽  
Pádraig T. Kitterick ◽  
Gary Jones ◽  
Christian J. Sumner ◽  
Paula C. Stacey

Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of a talker, which is of particular value to hearing-impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility; how these factors affect the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or the intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech, the visual speech benefit was unaffected by the number of background talkers; for vocoded speech, a larger benefit was found when there was only one background talker. Experiment 2 showed that the visual speech benefit depended upon the auditory intelligibility of the talker and increased as intelligibility decreased. Degrading the speech by vocoding resulted in an even greater benefit from visual speech information. A single “independent noise” signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, as with audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.
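The “independent noise” model referenced above is commonly formalised as the optimal combination of two channels carrying independent internal noise, so the predicted audio-visual sensitivity is the quadratic sum of the unimodal sensitivities. Below is a minimal sketch of that prediction; the d' values are illustrative assumptions, not data from the study.

```python
# Minimal sketch of an "independent noise" signal detection theory prediction:
# with independent internal noise in the auditory and visual channels, the
# optimally combined sensitivity is d'_AV = sqrt(d'_A**2 + d'_V**2).
# The d' values below are illustrative, not data from the study.
import numpy as np

def predicted_av_dprime(dp_audio, dp_visual):
    """Independent-noise (optimal combination) prediction."""
    return np.hypot(dp_audio, dp_visual)

dp_a, dp_v = 1.0, 0.6                     # hypothetical unimodal sensitivities
dp_av = predicted_av_dprime(dp_a, dp_v)   # ~1.17
print(f"predicted AV d': {dp_av:.2f}, benefit over audio alone: {dp_av - dp_a:.2f}")
```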

1996
Vol 39 (6)
pp. 1159-1170
Author(s):  
Lawrence D. Rosenblum ◽  
Jennifer A. Johnson ◽  
Helena M. Saldaña

Seeing a talker's face can improve the perception of speech in noise. Little is known, however, about which characteristics of the face are useful for enhancing the degraded signal. In this study, a point-light technique was employed to help isolate the salient kinematic aspects of a visible articulating face. In this technique, fluorescent dots were arranged on the lips, teeth, tongue, cheeks, and jaw of an actor. The actor was videotaped speaking in the dark, so that when shown to observers, only the moving dots were seen. To test whether these reduced images could contribute to the perception of degraded speech, noise-embedded sentences were dubbed with the point-light images at various signal-to-noise ratios. It was found that these images could significantly improve comprehension for adults with normal hearing and that the images became more effective as participants gained experience with the stimuli. These results have implications for uncovering salient visual speech information as well as for the development of telecommunication systems for listeners who are hearing impaired.
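For readers unfamiliar with how sentences are embedded in noise at a fixed signal-to-noise ratio, the sketch below shows one common approach: scale the noise so that the speech-to-noise power ratio matches the target SNR before mixing. The arrays and sample rate are placeholders, not the stimuli used in the study.

```python
# Minimal sketch of embedding a speech signal in noise at a target SNR (dB).
# `speech` and `noise` are placeholder mono float arrays at the same sample rate.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech/noise power ratio equals `snr_db`, then mix."""
    noise = noise[:len(speech)]                       # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled_noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a 1-s sentence at 16 kHz
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=-6.0)
```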


2020
Author(s):  
Anthony Trotter ◽  
Briony Banks ◽  
Patti Adank

The ability to quickly adapt to distorted speech signals, such as noise-vocoded speech, is one of the mechanisms listeners employ to understand one another in challenging listening conditions. Listeners can also exploit information offered by visual aspects of speech: being able to see the speaker's face while perceiving distorted speech improves perception of, and adaptation to, these distorted signals. However, it is unclear how important viewing specific parts of the speaker's face is to the successful use of visual speech information. In particular, does looking at the speaker's mouth specifically improve recognition of noise-vocoded speech, or is it equally effective to view the speaker's entire face? This study aimed to establish whether viewing specific parts of the speaker's face (eyes or mouth), compared to viewing the whole face, affected perception of and adaptation to distorted sentences. A secondary aim was to establish whether results on the processing of noise-vocoded speech from lab-based experiments could be replicated in an online setting. We monitored speech recognition accuracy online while participants listened to noise-vocoded sentences in a between-subjects design with five groups. We first established whether participants could reliably perceive and adapt to audiovisual noise-vocoded sentences when the speaker's whole face was visible (AV Full). Four further groups were tested: one in which participants could only view the moving lower part of the speaker's face, i.e., the mouth (AV Mouth); one in which they could only see the moving upper part of the face (AV Eyes); one in which they could see neither the moving lower nor the moving upper face (AV Blocked); and one in which they were presented with an image of a still face (AV Still). Participants repeated around 40% of key words correctly for the noise-vocoded sentences and adapted over the course of the experiment, but only when the moving mouth was visible (AV Full and AV Mouth). In contrast, performance was at floor level and no adaptation took place when the moving mouth was not visible (AV Blocked, AV Eyes, and AV Still). Our results show the importance of being able to observe relevant visual speech information from the speaker's mouth region, but not the eyes/upper-face region, when listening and adapting to speech under challenging conditions online. They also demonstrate that it is feasible to run speech perception and adaptation studies online, although not all findings reported for lab studies necessarily replicate.
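As a rough illustration of how keyword accuracy and adaptation over the course of an experiment can be quantified, the sketch below scores keyword recall per sentence and fits a linear slope to per-trial accuracy. The keywords, responses, and accuracy values are invented for illustration and are not the study's data.

```python
# Sketch of keyword scoring and a crude adaptation index (accuracy slope over
# trials). Keywords, responses, and the trial data are illustrative only.
import numpy as np

def keyword_score(response, keywords):
    """Proportion of keywords reported, ignoring case and extra words."""
    said = set(response.lower().split())
    return sum(k.lower() in said for k in keywords) / len(keywords)

print(keyword_score("the CAT sat on a mat", ["cat", "sat", "mat"]))  # 1.0

# Per-trial accuracies for one hypothetical participant (proportions correct).
accuracy = np.array([0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.45, 0.50])
trial = np.arange(1, len(accuracy) + 1)

# Simple linear fit: the slope indexes adaptation over the session.
slope, intercept = np.polyfit(trial, accuracy, deg=1)
print(f"adaptation slope: {slope:.3f} proportion correct per trial")
```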


2020
Vol 31 (1)
pp. 591-602
Author(s):  
Qingqing Meng ◽  
Yiwen Li Hegner ◽  
Iain Giblin ◽  
Catherine McMahon ◽  
Blake W Johnson

Human cortical activity measured with magnetoencephalography (MEG) has been shown to track the temporal regularity of linguistic information in connected speech. In the current study, we investigate the underlying neural sources of these responses and test the hypothesis that they can be directly modulated by changes in speech intelligibility. MEG responses were measured to natural and spectrally degraded (noise-vocoded) speech in 19 normal-hearing participants. Results showed that cortical coherence to “abstract” linguistic units with no accompanying acoustic cues (phrases and sentences) was lateralized to the left hemisphere and changed parametrically with the intelligibility of the speech. In contrast, responses coherent with words/syllables, which are accompanied by acoustic onsets, were bilateral and insensitive to intelligibility changes. This dissociation suggests that cerebral responses to linguistic information are directly affected by intelligibility but are also powerfully shaped by physical cues in speech. It explains why previous studies have reported widely inconsistent effects of speech intelligibility on cortical entrainment and, within a single experiment, provides clear support for conclusions about language lateralization derived from a large number of separately conducted neuroimaging studies. Since noise-vocoded speech resembles the signal provided by a cochlear implant device, the current methodology has potential clinical utility for the assessment of cochlear implant performance.
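The coherence analysis described above belongs to the “frequency tagging” family, in which responses are examined at the presentation rates of syllables/words, phrases, and sentences. The sketch below illustrates the idea on a simulated signal with assumed rates of 4, 2, and 1 Hz; the actual stimulus rates and the coherence measure used in the study may differ.

```python
# Sketch of a frequency-tagging analysis: look for spectral peaks in a
# simulated "cortical" response at assumed linguistic rates (4 Hz syllables,
# 2 Hz phrases, 1 Hz sentences). Rates and signal are illustrative only.
import numpy as np

fs = 200.0                                   # sampling rate in Hz
t = np.arange(0, 60, 1 / fs)                 # 60 s of simulated response
rng = np.random.default_rng(1)
signal = (np.sin(2 * np.pi * 4 * t)          # syllable-rate component
          + 0.5 * np.sin(2 * np.pi * 2 * t)  # phrase-rate component
          + 0.3 * np.sin(2 * np.pi * 1 * t)  # sentence-rate component
          + rng.standard_normal(t.size))     # background noise

spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

for rate in (1.0, 2.0, 4.0):
    idx = np.argmin(np.abs(freqs - rate))
    neighbours = spectrum[max(idx - 6, 0):idx + 7]
    # Crude peak index: power at the tagged rate relative to nearby bins.
    print(rate, "Hz peak ratio:", spectrum[idx] / np.mean(neighbours))
```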


Author(s):  
Faizah Mushtaq ◽  
Ian M. Wiggins ◽  
Pádraig T. Kitterick ◽  
Carly A. Anderson ◽  
Douglas E. H. Hartley

Whilst functional neuroimaging has been used to investigate cortical processing of degraded speech in adults, much less is known about how these signals are processed in children. An enhanced understanding of the cortical correlates of poor speech perception in children would be highly valuable to oral communication applications, including hearing devices. We utilised vocoded speech stimuli to investigate brain responses to degraded speech in 29 normally hearing children aged 6–12 years. Intelligibility of the speech stimuli was altered in two ways: by (i) reducing the number of spectral channels and (ii) reducing the amplitude modulation depth of the signal. A total of five different noise-vocoded conditions (with zero, partial or high intelligibility) were presented in an event-related format whilst participants underwent functional near-infrared spectroscopy (fNIRS) neuroimaging. Participants completed a word recognition task during imaging, as well as a separate behavioural speech perception assessment. fNIRS recordings revealed statistically significant sensitivity to stimulus intelligibility across several brain regions. More intelligible stimuli elicited stronger responses in temporal regions, predominantly within the left hemisphere, while right inferior parietal regions showed the opposite, negative relationship. Although there was some evidence that partially intelligible stimuli elicited the strongest responses in the left inferior frontal cortex, a region previous studies have associated with effortful listening in adults, this effect did not reach statistical significance. These results further our understanding of the cortical mechanisms underlying successful speech perception in children. Furthermore, fNIRS holds promise as a clinical technique for assessing speech intelligibility in paediatric populations.
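A minimal noise-vocoder sketch with the two manipulations described above (number of spectral channels and envelope modulation depth) is shown below. The band edges, filter order, frequency range, and noise carrier are assumptions for illustration, not the parameters used with these stimuli.

```python
# Minimal noise-vocoder sketch: band-pass analysis, envelope extraction,
# optional reduction of modulation depth, and re-synthesis on a noise carrier.
# Band edges, filter order, and the 100-7000 Hz range are assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_channels=8, mod_depth=1.0, f_lo=100.0, f_hi=7000.0):
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)      # log-spaced band edges
    carrier = np.random.default_rng(0).standard_normal(len(x))  # broadband noise
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                          # envelope extraction
        env = env.mean() + mod_depth * (env - env.mean())    # reduce modulation depth
        out += sosfiltfilt(sos, carrier * np.clip(env, 0, None))
    return out

fs = 16000
speech = np.random.default_rng(2).standard_normal(fs)  # stand-in for 1 s of speech
degraded = noise_vocode(speech, fs, n_channels=4, mod_depth=0.5)
```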


2021
Vol 11 (11)
pp. 4829
Author(s):  
Vojtech Chmelík ◽  
Daniel Urbán ◽  
Lukáš Zelem ◽  
Monika Rychtáriková

In this paper, with the aim of assessing the deterioration of speech intelligibility caused by a speaker wearing a mask, different face masks (surgical masks, an FFP2 mask, homemade textile-based protection, and two kinds of plastic shield) are compared in terms of their acoustic filtering effect, measured by placing each mask on an artificial head/mouth simulator. To investigate additional effects on the speaker's vocal output, speech was also recorded while people read a text with and without a mask. In order to discriminate between the acoustic filtering effect of the mask and mask-induced changes in vocal output, the latter were monitored by measuring vibrations at the suprasternal notch with an attached accelerometer. It was found that when wearing a mask, people tend to slightly increase their voice level, whereas when wearing a plastic face shield they reduce their vocal power. Unlike in the Lombard effect, no significant change was found in the spectral content of the speech. All face masks and face shields attenuate frequencies above 1–2 kHz. In addition, plastic shields also increase frequency components around 800 Hz, due to resonances occurring between the face and the shield. Finally, special attention was given to Slavic languages, in particular Slovak, which contain a large variety of sibilants. Male and female speech, as well as texts with and without sibilants, were compared.
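One way to quantify the acoustic filtering effect described above is to compare spectra measured with and without the mask on the artificial mouth; the sketch below computes a frequency-wise level difference from two placeholder recordings. The recordings, FFT settings, and 1 kHz cut-off are illustrative assumptions, not the study's measurement procedure.

```python
# Sketch of estimating a mask's acoustic attenuation as the level difference
# between recordings made without and with the mask on an artificial mouth.
# `no_mask` and `with_mask` are random placeholders, not real measurements.
import numpy as np
from scipy.signal import welch

fs = 44100
rng = np.random.default_rng(3)
no_mask = rng.standard_normal(fs * 5)        # 5 s placeholder recordings
with_mask = rng.standard_normal(fs * 5)

f, p_ref = welch(no_mask, fs=fs, nperseg=4096)
_, p_mask = welch(with_mask, fs=fs, nperseg=4096)

attenuation_db = 10 * np.log10(p_ref / p_mask)   # positive = mask attenuates

# Report mean attenuation above 1 kHz, the region where masks mainly act.
hi = f >= 1000
print(f"mean attenuation above 1 kHz: {attenuation_db[hi].mean():.1f} dB")
```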


2017
Vol 92
pp. 114-124
Author(s):  
Sarah E. Fenwick ◽  
Catherine T. Best ◽  
Chris Davis ◽  
Michael D. Tyler

1997
Vol 40 (2)
pp. 432-443
Author(s):  
Karen S. Helfer

Research has shown that speaking in a deliberately clear manner can improve the accuracy of auditory speech recognition. Allowing listeners access to visual speech cues also enhances speech understanding. Whether the information provided by speaking clearly and by visual speech cues is redundant has not been determined. This study examined how speaking mode (clear vs. conversational) and presentation mode (auditory vs. auditory-visual) influenced the perception of words within nonsense sentences. In Experiment 1, 30 young listeners with normal hearing responded to videotaped stimuli presented audiovisually in the presence of background noise at one of three signal-to-noise ratios. In Experiment 2, 9 participants returned for an additional assessment using auditory-only presentation. Results of these experiments showed significant effects of speaking mode (clear speech was easier to understand than conversational speech) and presentation mode (auditory-visual presentation led to better performance than auditory-only presentation). The benefit of clear speech was greater for words occurring in the middle of sentences than for words at either the beginning or end of sentences for both auditory-only and auditory-visual presentation, whereas the greatest benefit from supplying visual cues was for words at the end of sentences spoken both clearly and conversationally. The total benefit from speaking clearly and supplying visual cues was equal to the sum of these two effects. Overall, the results suggest that speaking clearly and providing visual speech information supply complementary (rather than redundant) information.
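The additivity finding can be illustrated with a small numeric sketch: compute the clear-speech and visual benefits separately and check that their sum predicts the clear, auditory-visual score. All scores below are invented percentages, not the study's data.

```python
# Numeric sketch of the additivity claim: the combined gain from clear speech
# and visual cues approximately equals the sum of the separate gains.
# All scores are illustrative percentages, not data from the study.
scores = {
    ("conversational", "audio"): 40.0,
    ("clear", "audio"): 55.0,
    ("conversational", "audiovisual"): 58.0,
}

clear_benefit = scores[("clear", "audio")] - scores[("conversational", "audio")]
visual_benefit = (scores[("conversational", "audiovisual")]
                  - scores[("conversational", "audio")])

predicted_clear_av = (scores[("conversational", "audio")]
                      + clear_benefit + visual_benefit)
print(f"predicted clear + audiovisual score: {predicted_clear_av:.0f}%")  # 73%
```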


2018
Vol 37 (2)
pp. 159
Author(s):  
Fatemeh Vakhshiteh ◽  
Farshad Almasganj ◽  
Ahmad Nickabadi

Lip-reading is the visual interpretation of a speaker's lip movements during speech. Experiments over many years have revealed that speech intelligibility increases when visual facial information is available, and this effect becomes more apparent in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, the diversity of features, and their inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system has yet to be presented. This paper seeks a lip-reading model with an efficient arrangement of processing blocks for extracting highly discriminative visual features, and highlights the application of a properly structured Deep Belief Network (DBN)-based recognizer. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition work.
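As a rough, simplified stand-in for a DBN-based visual speech recognizer (not the authors' CUAVE pipeline), the sketch below stacks a single restricted Boltzmann machine feature layer on top of placeholder per-frame lip features and classifies them with logistic regression using scikit-learn. A full DBN would stack several such layers and fine-tune them jointly; this sketch only illustrates the general idea of unsupervised feature learning followed by a discriminative classifier.

```python
# Simplified stand-in for a DBN-style visual speech recognizer: one RBM
# feature layer followed by a logistic-regression classifier over class
# labels. The random "lip features" and labels are placeholders; this is
# not the authors' CUAVE pipeline.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
X = rng.random((500, 64))           # placeholder per-frame visual features
y = rng.integers(0, 10, size=500)   # placeholder class labels (e.g., digits)

model = Pipeline([
    ("scale", MinMaxScaler()),      # RBM expects inputs in [0, 1]
    ("rbm", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20,
                         random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```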

