Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers

2019
Vol 23
pp. 233121651983786
Author(s):  
Catherine L. Blackburn ◽  
Pádraig T. Kitterick ◽  
Gary Jones ◽  
Christian J. Sumner ◽  
Paula C. Stacey

Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of a talker, which is of particular value to hearing-impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility; how these factors affect the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or the intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech, the visual speech benefit was unaffected by the number of background talkers; for vocoded speech, a larger benefit was found when there was only one background talker. Experiment 2 showed that the visual speech benefit depended upon the auditory intelligibility of the talker and increased as intelligibility decreased. Degrading the speech by vocoding resulted in an even greater benefit from visual speech information. A single “independent noise” signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, as with audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.
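The “independent noise” model referenced above is commonly formalised as the optimal combination of two channels carrying independent internal noise, so the predicted audio-visual sensitivity is the quadratic sum of the unimodal sensitivities. Below is a minimal sketch of that prediction; the d' values are illustrative assumptions, not data from the study.

```python
# Minimal sketch of an "independent noise" signal detection theory prediction:
# with independent internal noise in the auditory and visual channels, the
# optimally combined sensitivity is d'_AV = sqrt(d'_A**2 + d'_V**2).
# The d' values below are illustrative, not data from the study.
import numpy as np

def predicted_av_dprime(dp_audio, dp_visual):
    """Independent-noise (optimal combination) prediction."""
    return np.hypot(dp_audio, dp_visual)

dp_a, dp_v = 1.0, 0.6                     # hypothetical unimodal sensitivities
dp_av = predicted_av_dprime(dp_a, dp_v)   # ~1.17
print(f"predicted AV d': {dp_av:.2f}, benefit over audio alone: {dp_av - dp_a:.2f}")
```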

1996
Vol 39 (6)
pp. 1159-1170
Author(s):  
Lawrence D. Rosenblum ◽  
Jennifer A. Johnson ◽  
Helena M. Saldaña

Seeing a talker's face can improve the perception of speech in noise. Little is known, however, about which characteristics of the face are useful for enhancing the degraded signal. In this study, a point-light technique was employed to help isolate the salient kinematic aspects of a visible articulating face. In this technique, fluorescent dots were arranged on the lips, teeth, tongue, cheeks, and jaw of an actor. The actor was videotaped speaking in the dark, so that when shown to observers, only the moving dots were seen. To test whether these reduced images could contribute to the perception of degraded speech, noise-embedded sentences were dubbed with the point-light images at various signal-to-noise ratios. It was found that these images could significantly improve comprehension for adults with normal hearing and that the images became more effective as participants gained experience with the stimuli. These results have implications for uncovering salient visual speech information as well as for the development of telecommunication systems for listeners who are hearing impaired.
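For readers unfamiliar with how sentences are embedded in noise at a fixed signal-to-noise ratio, the sketch below shows one common approach: scale the noise so that the speech-to-noise power ratio matches the target SNR before mixing. The arrays and sample rate are placeholders, not the stimuli used in the study.

```python
# Minimal sketch of embedding a speech signal in noise at a target SNR (dB).
# `speech` and `noise` are placeholder mono float arrays at the same sample rate.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech/noise power ratio equals `snr_db`, then mix."""
    noise = noise[:len(speech)]                       # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled_noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a 1-s sentence at 16 kHz
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=-6.0)
```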


2020
Author(s):  
Anthony Trotter ◽  
Briony Banks ◽  
Patti Adank

The ability to quickly adapt to distorted speech signals, such as noise-vocoded speech, is one of the mechanisms listeners employ to understand one another in challenging listening conditions. Listeners can also exploit information offered by visual aspects of speech: being able to see the speaker's face while perceiving distorted speech improves perception of, and adaptation to, these distorted signals. However, it is unclear how important viewing specific parts of the speaker's face is to the successful use of visual speech information. In particular, does looking at the speaker's mouth specifically improve recognition of noise-vocoded speech, or is it equally effective to view the speaker's entire face? This study aimed to establish whether viewing specific parts of the speaker's face (eyes or mouth), compared to viewing the whole face, affected perception of and adaptation to distorted sentences. A secondary aim was to establish whether results on the processing of noise-vocoded speech from lab-based experiments could be replicated in an online setting. We monitored speech recognition accuracy online while participants listened to noise-vocoded sentences in a between-subjects design with five groups. We first established whether participants could reliably perceive and adapt to audiovisual noise-vocoded sentences when the speaker's whole face was visible (AV Full). Four further groups were tested: one in which participants could only view the moving lower part of the speaker's face, i.e., the mouth (AV Mouth); one in which they could only see the moving upper part of the face (AV Eyes); one in which they could see neither the moving lower nor the moving upper face (AV Blocked); and one in which they were presented with an image of a still face (AV Still). Participants repeated around 40% of key words correctly for the noise-vocoded sentences and adapted over the course of the experiment, but only when the moving mouth was visible (AV Full and AV Mouth). In contrast, performance was at floor level and no adaptation took place when the moving mouth was not visible (AV Blocked, AV Eyes, and AV Still). Our results show the importance of being able to observe relevant visual speech information from the speaker's mouth region, but not the eyes/upper-face region, when listening and adapting to speech under challenging conditions online. They also demonstrate that it is feasible to run speech perception and adaptation studies online, although not all findings reported for lab studies necessarily replicate.
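As a rough illustration of how keyword accuracy and adaptation over the course of an experiment can be quantified, the sketch below scores keyword recall per sentence and fits a linear slope to per-trial accuracy. The keywords, responses, and accuracy values are invented for illustration and are not the study's data.

```python
# Sketch of keyword scoring and a crude adaptation index (accuracy slope over
# trials). Keywords, responses, and the trial data are illustrative only.
import numpy as np

def keyword_score(response, keywords):
    """Proportion of keywords reported, ignoring case and extra words."""
    said = set(response.lower().split())
    return sum(k.lower() in said for k in keywords) / len(keywords)

print(keyword_score("the CAT sat on a mat", ["cat", "sat", "mat"]))  # 1.0

# Per-trial accuracies for one hypothetical participant (proportions correct).
accuracy = np.array([0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.45, 0.50])
trial = np.arange(1, len(accuracy) + 1)

# Simple linear fit: the slope indexes adaptation over the session.
slope, intercept = np.polyfit(trial, accuracy, deg=1)
print(f"adaptation slope: {slope:.3f} proportion correct per trial")
```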


2020
Vol 31 (1)
pp. 591-602
Author(s):  
Qingqing Meng ◽  
Yiwen Li Hegner ◽  
Iain Giblin ◽  
Catherine McMahon ◽  
Blake W Johnson

Human cortical activity measured with magnetoencephalography (MEG) has been shown to track the temporal regularity of linguistic information in connected speech. In the current study, we investigate the underlying neural sources of these responses and test the hypothesis that they can be directly modulated by changes in speech intelligibility. MEG responses were measured to natural and spectrally degraded (noise-vocoded) speech in 19 normal-hearing participants. Results showed that cortical coherence to “abstract” linguistic units with no accompanying acoustic cues (phrases and sentences) was lateralized to the left hemisphere and changed parametrically with the intelligibility of the speech. In contrast, responses coherent with words/syllables, which are accompanied by acoustic onsets, were bilateral and insensitive to intelligibility changes. This dissociation suggests that cerebral responses to linguistic information are directly affected by intelligibility but are also powerfully shaped by physical cues in speech. It explains why previous studies have reported widely inconsistent effects of speech intelligibility on cortical entrainment and, within a single experiment, provides clear support for conclusions about language lateralization derived from a large number of separately conducted neuroimaging studies. Since noise-vocoded speech resembles the signal provided by a cochlear implant device, the current methodology has potential clinical utility for the assessment of cochlear implant performance.
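The coherence analysis described above belongs to the “frequency tagging” family, in which responses are examined at the presentation rates of syllables/words, phrases, and sentences. The sketch below illustrates the idea on a simulated signal with assumed rates of 4, 2, and 1 Hz; the actual stimulus rates and the coherence measure used in the study may differ.

```python
# Sketch of a frequency-tagging analysis: look for spectral peaks in a
# simulated "cortical" response at assumed linguistic rates (4 Hz syllables,
# 2 Hz phrases, 1 Hz sentences). Rates and signal are illustrative only.
import numpy as np

fs = 200.0                                   # sampling rate in Hz
t = np.arange(0, 60, 1 / fs)                 # 60 s of simulated response
rng = np.random.default_rng(1)
signal = (np.sin(2 * np.pi * 4 * t)          # syllable-rate component
          + 0.5 * np.sin(2 * np.pi * 2 * t)  # phrase-rate component
          + 0.3 * np.sin(2 * np.pi * 1 * t)  # sentence-rate component
          + rng.standard_normal(t.size))     # background noise

spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

for rate in (1.0, 2.0, 4.0):
    idx = np.argmin(np.abs(freqs - rate))
    neighbours = spectrum[max(idx - 6, 0):idx + 7]
    # Crude peak index: power at the tagged rate relative to nearby bins.
    print(rate, "Hz peak ratio:", spectrum[idx] / np.mean(neighbours))
```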


Author(s):  
Faizah Mushtaq ◽  
Ian M. Wiggins ◽  
Pádraig T. Kitterick ◽  
Carly A. Anderson ◽  
Douglas E. H. Hartley

Whilst functional neuroimaging has been used to investigate cortical processing of degraded speech in adults, much less is known about how these signals are processed in children. An enhanced understanding of the cortical correlates of poor speech perception in children would be highly valuable to oral communication applications, including hearing devices. We utilised vocoded speech stimuli to investigate brain responses to degraded speech in 29 normally hearing children aged 6–12 years. Intelligibility of the speech stimuli was altered in two ways: by (i) reducing the number of spectral channels and (ii) reducing the amplitude modulation depth of the signal. A total of five different noise-vocoded conditions (with zero, partial or high intelligibility) were presented in an event-related format whilst participants underwent functional near-infrared spectroscopy (fNIRS) neuroimaging. Participants completed a word recognition task during imaging, as well as a separate behavioural speech perception assessment. fNIRS recordings revealed statistically significant sensitivity to stimulus intelligibility across several brain regions. More intelligible stimuli elicited stronger responses in temporal regions, predominantly within the left hemisphere, while right inferior parietal regions showed the opposite, negative relationship. Although there was some evidence that partially intelligible stimuli elicited the strongest responses in the left inferior frontal cortex, a region previous studies have associated with effortful listening in adults, this effect did not reach statistical significance. These results further our understanding of the cortical mechanisms underlying successful speech perception in children. Furthermore, fNIRS holds promise as a clinical technique for assessing speech intelligibility in paediatric populations.
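A minimal noise-vocoder sketch with the two manipulations described above (number of spectral channels and envelope modulation depth) is shown below. The band edges, filter order, frequency range, and noise carrier are assumptions for illustration, not the parameters used with these stimuli.

```python
# Minimal noise-vocoder sketch: band-pass analysis, envelope extraction,
# optional reduction of modulation depth, and re-synthesis on a noise carrier.
# Band edges, filter order, and the 100-7000 Hz range are assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_channels=8, mod_depth=1.0, f_lo=100.0, f_hi=7000.0):
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)      # log-spaced band edges
    carrier = np.random.default_rng(0).standard_normal(len(x))  # broadband noise
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                          # envelope extraction
        env = env.mean() + mod_depth * (env - env.mean())    # reduce modulation depth
        out += sosfiltfilt(sos, carrier * np.clip(env, 0, None))
    return out

fs = 16000
speech = np.random.default_rng(2).standard_normal(fs)  # stand-in for 1 s of speech
degraded = noise_vocode(speech, fs, n_channels=4, mod_depth=0.5)
```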


2021
Vol 11 (11)
pp. 4829
Author(s):  
Vojtech Chmelík ◽  
Daniel Urbán ◽  
Lukáš Zelem ◽  
Monika Rychtáriková

In this paper, with the aim of assessing the deterioration of speech intelligibility caused by a speaker wearing a mask, different face masks (surgical masks, an FFP2 mask, homemade textile-based protection, and two kinds of plastic shield) are compared in terms of their acoustic filtering effect, measured by placing each mask on an artificial head/mouth simulator. To investigate additional effects on the speaker's vocal output, speech was also recorded while people read a text with and without a mask. In order to discriminate between the acoustic filtering effect of the mask and mask-induced changes in vocal output, the latter were monitored by measuring vibrations at the suprasternal notch with an attached accelerometer. It was found that when wearing a mask, people tend to slightly increase their voice level, whereas when wearing a plastic face shield they reduce their vocal power. Unlike in the Lombard effect, no significant change was found in the spectral content of the speech. All face masks and face shields attenuate frequencies above 1–2 kHz. In addition, plastic shields also increase frequency components around 800 Hz, due to resonances occurring between the face and the shield. Finally, special attention was given to Slavic languages, in particular Slovak, which contain a large variety of sibilants. Male and female speech, as well as texts with and without sibilants, were compared.
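One way to quantify the acoustic filtering effect described above is to compare spectra measured with and without the mask on the artificial mouth; the sketch below computes a frequency-wise level difference from two placeholder recordings. The recordings, FFT settings, and 1 kHz cut-off are illustrative assumptions, not the study's measurement procedure.

```python
# Sketch of estimating a mask's acoustic attenuation as the level difference
# between recordings made without and with the mask on an artificial mouth.
# `no_mask` and `with_mask` are random placeholders, not real measurements.
import numpy as np
from scipy.signal import welch

fs = 44100
rng = np.random.default_rng(3)
no_mask = rng.standard_normal(fs * 5)        # 5 s placeholder recordings
with_mask = rng.standard_normal(fs * 5)

f, p_ref = welch(no_mask, fs=fs, nperseg=4096)
_, p_mask = welch(with_mask, fs=fs, nperseg=4096)

attenuation_db = 10 * np.log10(p_ref / p_mask)   # positive = mask attenuates

# Report mean attenuation above 1 kHz, the region where masks mainly act.
hi = f >= 1000
print(f"mean attenuation above 1 kHz: {attenuation_db[hi].mean():.1f} dB")
```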


2017
Vol 92
pp. 114-124
Author(s):  
Sarah E. Fenwick ◽  
Catherine T. Best ◽  
Chris Davis ◽  
Michael D. Tyler

1997
Vol 40 (2)
pp. 432-443
Author(s):  
Karen S. Helfer

Research has shown that speaking in a deliberately clear manner can improve the accuracy of auditory speech recognition. Allowing listeners access to visual speech cues also enhances speech understanding. Whether the information provided by speaking clearly and by visual speech cues is redundant has not been determined. This study examined how speaking mode (clear vs. conversational) and presentation mode (auditory vs. auditory-visual) influenced the perception of words within nonsense sentences. In Experiment 1, 30 young listeners with normal hearing responded to videotaped stimuli presented audiovisually in the presence of background noise at one of three signal-to-noise ratios. In Experiment 2, 9 participants returned for an additional assessment using auditory-only presentation. Results of these experiments showed significant effects of speaking mode (clear speech was easier to understand than conversational speech) and presentation mode (auditory-visual presentation led to better performance than auditory-only presentation). The benefit of clear speech was greater for words occurring in the middle of sentences than for words at either the beginning or end of sentences for both auditory-only and auditory-visual presentation, whereas the greatest benefit from supplying visual cues was for words at the end of sentences spoken both clearly and conversationally. The total benefit from speaking clearly and supplying visual cues was equal to the sum of these two effects. Overall, the results suggest that speaking clearly and providing visual speech information supply complementary (rather than redundant) information.
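The additivity finding can be illustrated with a small numeric sketch: compute the clear-speech and visual benefits separately and check that their sum predicts the clear, auditory-visual score. All scores below are invented percentages, not the study's data.

```python
# Numeric sketch of the additivity claim: the combined gain from clear speech
# and visual cues approximately equals the sum of the separate gains.
# All scores are illustrative percentages, not data from the study.
scores = {
    ("conversational", "audio"): 40.0,
    ("clear", "audio"): 55.0,
    ("conversational", "audiovisual"): 58.0,
}

clear_benefit = scores[("clear", "audio")] - scores[("conversational", "audio")]
visual_benefit = (scores[("conversational", "audiovisual")]
                  - scores[("conversational", "audio")])

predicted_clear_av = (scores[("conversational", "audio")]
                      + clear_benefit + visual_benefit)
print(f"predicted clear + audiovisual score: {predicted_clear_av:.0f}%")  # 73%
```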


2018
Vol 37 (2)
pp. 159
Author(s):  
Fatemeh Vakhshiteh ◽  
Farshad Almasganj ◽  
Ahmad Nickabadi

Lip-reading is the visual interpretation of a speaker's lip movements during speech. Experiments over many years have revealed that speech intelligibility increases when visual facial information is available, and this effect becomes more apparent in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, the diversity of features, and their inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system has yet to be presented. This paper seeks a lip-reading model with an efficient arrangement of processing blocks for extracting highly discriminative visual features, and highlights the application of a properly structured Deep Belief Network (DBN)-based recognizer. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition work.
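As a rough, simplified stand-in for a DBN-based visual speech recognizer (not the authors' CUAVE pipeline), the sketch below stacks a single restricted Boltzmann machine feature layer on top of placeholder per-frame lip features and classifies them with logistic regression using scikit-learn. A full DBN would stack several such layers and fine-tune them jointly; this sketch only illustrates the general idea of unsupervised feature learning followed by a discriminative classifier.

```python
# Simplified stand-in for a DBN-style visual speech recognizer: one RBM
# feature layer followed by a logistic-regression classifier over class
# labels. The random "lip features" and labels are placeholders; this is
# not the authors' CUAVE pipeline.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
X = rng.random((500, 64))           # placeholder per-frame visual features
y = rng.integers(0, 10, size=500)   # placeholder class labels (e.g., digits)

model = Pipeline([
    ("scale", MinMaxScaler()),      # RBM expects inputs in [0, 1]
    ("rbm", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20,
                         random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```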

