The Role of Multisensory Temporal Covariation in Audiovisual Speech Recognition in Noise

2019
Author(s): Jonathan Henry Venezia, Robert Sandlin, Leon Wojno, Anthony Duc Tran, Gregory Hickok, ...

Static and dynamic visual speech cues contribute to audiovisual (AV) speech recognition in noise. Static cues (e.g., “lipreading”) provide complementary information that enables perceivers to disambiguate ambiguous acoustic-phonetic content. The role of dynamic cues is less clear, but one suggestion is that temporal covariation between facial motion trajectories and the speech envelope enables perceivers to recover a more robust representation of the time-varying acoustic signal. Modeling studies show this is computationally feasible, though it has not been confirmed experimentally. We conducted two experiments to determine whether AV speech recognition depends on the magnitude of cross-sensory temporal coherence (AVC). In Experiment 1, sentence-keyword recognition in steady-state noise (SSN) was assessed across a range of signal-to-noise ratios (SNRs) for auditory and AV speech. The auditory signal was either unprocessed or filtered to remove 3-7 Hz temporal modulations. Filtering severely reduced AVC (the magnitude-squared coherence of lip trajectories with cochlear-narrowband speech envelopes) but did not reduce the magnitude of the AV advantage (AV > A; ~4 dB). This result did not depend on the availability of static cues, which was manipulated via facial blurring. Experiment 2 assessed AV speech recognition in SSN at a fixed SNR (-10.5 dB) for subsets of the Experiment 1 stimuli with naturally high or low AVC. A small effect (~5% correct; high-AVC > low-AVC) was observed. A computational model of AV speech intelligibility based on AVC yielded good overall predictions of performance but over-predicted the differential effects of AVC. These results suggest that the role and/or computational characterization of AVC must be re-conceptualized.
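To make the coherence measure concrete, the short Python sketch below estimates AVC as the magnitude-squared coherence between a lip-aperture trajectory and the envelope of a single cochlear-narrowband speech channel, averaged over the 3-7 Hz modulation range targeted by the filtering manipulation. The band edges, sampling rates, and window length are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): estimate audiovisual temporal
# coherence (AVC) as the magnitude-squared coherence between a lip-aperture
# trajectory and a cochlear-band speech envelope, averaged over 3-7 Hz.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample, coherence

def band_envelope(speech, fs, lo=300.0, hi=700.0):
    """Amplitude envelope of one 'cochlear' band (Butterworth bandpass + Hilbert)."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, speech)))

def avc(lip_trajectory, speech, fs_video, fs_audio):
    """Mean magnitude-squared coherence of lip motion and band envelope, 3-7 Hz."""
    env = band_envelope(speech, fs_audio)
    # Resample the envelope to the video frame rate so both signals align in time.
    env = resample(env, int(len(env) * fs_video / fs_audio))
    n = min(len(env), len(lip_trajectory))
    f, msc = coherence(lip_trajectory[:n], env[:n], fs=fs_video, nperseg=128)
    keep = (f >= 3.0) & (f <= 7.0)
    return float(np.mean(msc[keep]))
```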

Author(s): Julie Beadle, Jeesun Kim, Chris Davis

Purpose: Listeners understand significantly more speech in noise when the talker's face can be seen (visual speech) than in an auditory-only baseline condition (a visual speech benefit). This study investigated whether the visual speech benefit is reduced when the correspondence between auditory and visual speech is uncertain, and whether any reduction depends on listener age (older vs. younger) and on how severely the auditory signal is masked. Method: Older and younger adults completed a speech recognition in noise task that included an auditory-only condition and four auditory–visual (AV) conditions in which one, two, four, or six silent talking-face videos were presented. One face always matched the auditory signal; the other face(s) did not. Auditory speech was presented in noise at −6 and −1 dB signal-to-noise ratio (SNR). Results: When the SNR was −6 dB, the standard-sized visual speech benefit shrank for both age groups as more talking faces were presented. When the SNR was −1 dB, younger adults received the standard-sized visual speech benefit even when two talking faces were presented, whereas older adults did not. Conclusions: The visual speech benefit obtained by older adults was always smaller when AV correspondence was uncertain; this was not the case for younger adults. Difficulty establishing AV correspondence may be a factor that limits older adults' speech recognition in noisy AV environments. Supplemental Material: https://doi.org/10.23641/asha.16879549


2014
Vol 22 (4), pp. 1048-1053
Author(s): Nancy Tye-Murray, Brent P. Spehar, Joel Myerson, Sandra Hale, Mitchell S. Sommers

2018
Vol 144 (3), pp. 1799-1799
Author(s): Madeline Petrich, Macie Petrich, Chao-Yang Lee, Seth Wiener, Margaret Harrison, ...

Author(s): Brandi Jett, Emily Buss, Virginia Best, Jacob Oleson, Lauren Calandruccio

Purpose Three experiments were conducted to better understand the role of between-word coarticulation in masked speech recognition. Specifically, we explored whether naturally coarticulated sentences supported better masked speech recognition than sentences derived from individually spoken, concatenated words. We hypothesized that sentence recognition thresholds (SRTs) would be similar for coarticulated and concatenated sentences in a noise masker but better for coarticulated sentences in a speech masker. Method Sixty young adults participated (n = 20 per experiment). An adaptive tracking procedure was used to estimate SRTs in the presence of noise or two-talker speech maskers. Targets in Experiments 1 and 2 were matrix-style sentences, while targets in Experiment 3 were semantically meaningful sentences. All experiments included coarticulated and concatenated targets; Experiments 2 and 3 included a third target type, concatenated keyword-intensity-matched (KIM) sentences, in which the words were concatenated but individually scaled to replicate the intensity contours of the coarticulated sentences. Results Regression analyses evaluated the main effects of target type, masker type, and their interaction. Across all three experiments, effects of target type were small (< 2 dB). In Experiment 1, SRTs were slightly poorer for coarticulated than for concatenated sentences. In Experiment 2, coarticulation facilitated speech recognition relative to the concatenated KIM condition. When listeners had access to semantic context (Experiment 3), a coarticulation benefit was observed in noise but not in the speech masker. Conclusions Overall, differences between SRTs for sentences with and without between-word coarticulation were small. Beneficial effects of coarticulation were observed only relative to the concatenated KIM targets; for unscaled concatenated targets, consistent audibility across the sentence appeared to offset any benefit of coarticulation. Contrary to our hypothesis, effects of coarticulation generally were not more pronounced in speech maskers than in noise maskers.
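As background on how SRTs of this kind are typically estimated, here is a minimal sketch of a 1-down/1-up adaptive track that converges near 50% sentences correct; the step size, stopping rule, and scoring are illustrative assumptions rather than the specific tracking rules used in these experiments.

```python
# Illustrative sketch only: a simple 1-down/1-up adaptive track of the kind
# commonly used to estimate a sentence recognition threshold (SRT). Step size,
# reversal count, and scoring rule are assumptions, not the authors' procedure.
import numpy as np

def adaptive_srt(present_trial, start_snr_db=0.0, step_db=2.0, n_reversals=8):
    """present_trial(snr_db) must return True if the sentence was scored correct."""
    snr = start_snr_db
    last_direction = None
    reversal_snrs = []
    while len(reversal_snrs) < n_reversals:
        correct = present_trial(snr)
        direction = "down" if correct else "up"   # harder after a hit, easier after a miss
        if last_direction is not None and direction != last_direction:
            reversal_snrs.append(snr)             # record SNR at each reversal
        last_direction = direction
        snr += -step_db if correct else step_db
    return float(np.mean(reversal_snrs))          # 1-down/1-up converges near 50% correct
```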


2021
Vol 15
Author(s): Luuk P. H. van de Rijt, A. John van Opstal, Marc M. van Wanrooij

The cochlear implant (CI) allows profoundly deaf individuals to partially recover hearing. Still, due to the coarse acoustic information provided by the implant, CI users have considerable difficulty recognizing speech, especially in noisy environments. CI users therefore rely more heavily on visual cues to augment speech recognition than normal-hearing individuals do. However, it is unknown what role attention to one (focused) or both (divided) modalities plays in multisensory speech recognition. Here we show that unisensory speech listening and speechreading were negatively impacted in divided-attention tasks for CI users, but not for normal-hearing individuals. Our psychophysical experiments revealed that, as expected, listening thresholds were consistently better for normal-hearing listeners, while lipreading thresholds were largely similar for the two groups. Moreover, audiovisual speech recognition for normal-hearing individuals was well described by probabilistic summation of auditory and visual speech recognition, whereas CI users were better integrators than expected from statistical facilitation alone. Our results suggest that this integration benefit comes at a cost: unisensory speech recognition is degraded for CI users when attention must be divided across modalities. We conjecture that CI users exhibit an integration-attention trade-off: they focus solely on a single modality during focused-attention tasks but must divide their limited attentional resources in situations with uncertainty about the upcoming stimulus modality. We argue that, in order to determine the benefit of a CI for speech recognition, situational factors need to be discounted by presenting speech in realistic or complex audiovisual environments.
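The probabilistic-summation benchmark referred to above has a standard independence form, p_av = p_a + p_v - p_a * p_v; the sketch below assumes that form, which may differ in detail from the model actually fitted to the data.

```python
# Sketch of the probabilistic-summation benchmark in its standard independence
# form; whether the study's model adds further assumptions is not specified here.
def probabilistic_summation(p_auditory: float, p_visual: float) -> float:
    """Predicted AV proportion correct if either modality alone can yield a hit."""
    return p_auditory + p_visual - p_auditory * p_visual

# Example: 40% auditory-only and 30% lipreading-only predict 58% audiovisual;
# scores reliably above this suggest integration beyond statistical facilitation.
print(probabilistic_summation(0.40, 0.30))  # 0.58
```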


2021
Vol 7
Author(s): Anna Warzybok, Jan Rennies, Birger Kollmeier

Masking noise and reverberation strongly influence speech intelligibility and decrease listening comfort. To optimize acoustics for ensuring a comfortable environment, it is crucial to understand the respective contributions of bottom-up, signal-driven cues and top-down, linguistic-semantic cues to speech recognition in noise and reverberation. Since the relevance of these cues differs across speech test materials and the training status of the listeners, we investigate the influence of speech material type on speech recognition in noise, reverberation, and combinations of noise and reverberation. We also examine the influence of training on performance for a subset of measurement conditions. Speech recognition is measured with an open-set, everyday Plomp-type sentence test and compared to recognition scores for a closed-set Matrix-type test consisting of syntactically fixed and semantically unpredictable sentences (cf. data by Rennies et al., J. Acoust. Soc. Am., 2014, 136, 2642–2653). While both tests yield approximately the same recognition threshold in noise for trained normal-hearing listeners, performance may differ as a result of cognitive factors, i.e., the closed-set test is more sensitive to training effects while the open-set test is more affected by language familiarity. All experimental data were obtained at a fixed signal-to-noise ratio (SNR) and/or reverberation time chosen to obtain speech transmission index (STI) values of 0.17, 0.30, and 0.43, respectively, thus linking the data to STI predictions as a measure of purely low-level acoustic effects. The results confirm the difference in robustness to reverberation between the Matrix-type and Plomp-type sentences reported in the literature, especially at poor and medium speech intelligibility. The robustness of the closed-set Matrix-type sentences against reverberation disappeared when listeners had no a priori knowledge of the speech material (sentence structure and words used), demonstrating the influence of higher-level lexical-semantic cues in speech recognition. In addition, the consistent difference between reverberation- and noise-induced recognition scores for everyday sentences at medium and high STI, together with the differences between Matrix-type and Plomp-type sentence scores, clearly demonstrates the limited utility of the STI for predicting speech recognition in noise and reverberation.
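For readers unfamiliar with how a fixed SNR and reverberation time translate into an STI value, the sketch below applies the standard modulation-transfer-function relations in a simplified, unweighted form; it illustrates the principle rather than reproducing the exact computation behind the study's 0.17, 0.30, and 0.43 conditions.

```python
# Simplified, unweighted STI sketch: maps a chosen SNR and reverberation time
# to an STI via the modulation transfer function (MTF). Octave-band weighting
# and masking corrections of the full IEC 60268-16 procedure are omitted.
import numpy as np

MOD_FREQS = np.array([0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                      3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5])  # modulation freqs, Hz

def sti(snr_db: float, rt60_s: float) -> float:
    m_noise = 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))                    # stationary noise
    m_reverb = 1.0 / np.sqrt(1.0 + (2 * np.pi * MOD_FREQS * rt60_s / 13.8) ** 2)  # decay
    m = m_noise * m_reverb                                              # combined MTF
    snr_eff = np.clip(10.0 * np.log10(m / (1.0 - m)), -15.0, 15.0)      # apparent SNR
    return float(np.mean((snr_eff + 15.0) / 30.0))                      # index in [0, 1]
```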


2019
Vol 62 (10), pp. 3860-3875
Author(s): Kaylah Lalonde, Lynne A. Werner

Purpose This study assessed the extent to which 6- to 8.5-month-old infants and 18- to 30-year-old adults detect and discriminate auditory syllables in noise better in the presence of visual speech than in auditory-only conditions. In addition, we examined whether visual cues to the onset and offset of the auditory signal account for this benefit. Method Sixty infants and 24 adults were randomly assigned to speech detection or discrimination tasks and were tested using a modified observer-based psychoacoustic procedure. Each participant completed 1–3 conditions: auditory-only, with visual speech, and with a visual signal that only cued the onset and offset of the auditory syllable. Results Mixed linear modeling indicated that infants and adults benefited from visual speech on both tasks. Adults relied on the onset–offset cue for detection, but the same cue did not improve their discrimination. The onset–offset cue benefited infants for both detection and discrimination. Whereas the onset–offset cue improved detection similarly for infants and adults, the full visual speech signal benefited infants to a lesser extent than adults on the discrimination task. Conclusions These results suggest that infants' use of visual onset–offset cues is mature, but their ability to use more complex visual speech cues is still developing. Additional research is needed to explore differences in audiovisual enhancement (a) of speech discrimination across speech targets and (b) with increasingly complex tasks and stimuli.


2002
Vol 116 (S28), pp. 47-51
Author(s): Sunil N. Dutt, Ann-Louise McDermott, Stuart P. Burrell, Huw R. Cooper, Andrew P. Reid, ...

The Birmingham bone-anchored hearing aid (BAHA) programme has, since its inception in 1988, fitted more than 300 patients with unilateral bone-anchored hearing aids. Recently, some of the patients who benefited greatly from unilateral aids applied for bilateral amplification. To date, 15 patients have been fitted with bilateral BAHAs. The benefits of bilateral amplification were compared with those of unilateral amplification in 11 of these patients, each of whom had used their second BAHA for 12 months or longer. Following a subjective analysis in the form of comprehensive questionnaires, objective testing was undertaken to assess specific issues such as ‘speech recognition in quiet’, ‘speech recognition in noise’, and a modified ‘speech-in-simulated-party-noise’ (Plomp) test. ‘Speech in quiet’ testing revealed a 100 per cent score with both unilateral and bilateral BAHAs. For ‘speech in noise’, all 11 patients scored marginally better with bilateral aids than with their best unilateral responses. The modified Plomp test demonstrated that bilateral BAHAs provided maximum flexibility when the origin of the noise cannot be controlled, as in day-to-day situations. In this small case series the results are positive and comparable to the experience of the Nijmegen BAHA group.

