The role of visual speech cues in reducing energetic and informational masking

2005 ◽  
Vol 117 (2) ◽  
pp. 842-849 ◽  
Author(s):  
Karen S. Helfer ◽  
Richard L. Freyman

2019 ◽
Author(s):  
Jonathan Henry Venezia ◽  
Robert Sandlin ◽  
Leon Wojno ◽  
Anthony Duc Tran ◽  
Gregory Hickok ◽  
...  

Static and dynamic visual speech cues contribute to audiovisual (AV) speech recognition in noise. Static cues (e.g., “lipreading”) provide complementary information that enables perceivers to ascertain ambiguous acoustic-phonetic content. The role of dynamic cues is less clear, but one suggestion is that temporal covariation between facial motion trajectories and the speech envelope enables perceivers to recover a more robust representation of the time-varying acoustic signal. Modeling studies show this is computationally feasible, though it has not been confirmed experimentally. We conducted two experiments to determine whether AV speech recognition depends on the magnitude of cross-sensory temporal coherence (AVC). In Experiment 1, sentence-keyword recognition in steady-state noise (SSN) was assessed across a range of signal-to-noise ratios (SNRs) for auditory and AV speech. The auditory signal was either unprocessed or filtered to remove 3-7 Hz temporal modulations. Filtering severely reduced AVC (magnitude-squared coherence of lip trajectories with cochlear-narrowband speech envelopes) but did not reduce the magnitude of the AV advantage (AV > A; ~4 dB). This result did not depend on the presence of static cues, which were manipulated via facial blurring. Experiment 2 assessed AV speech recognition in SSN at a fixed SNR (-10.5 dB) for subsets of the Experiment 1 stimuli with naturally high or low AVC. A small effect (~5% correct; high-AVC > low-AVC) was observed. A computational model of AV speech intelligibility based on AVC yielded good overall predictions of performance but over-predicted the differential effects of AVC. These results suggest the role and/or computational characterization of AVC must be re-conceptualized.
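As a concrete illustration of the coherence measure described above (magnitude-squared coherence between lip trajectories and cochlear-narrowband speech envelopes), the following is a minimal sketch assuming generic signal names, band edges, and frame rates; it is not the authors' exact analysis pipeline.

```python
# Hedged sketch of an AVC-style measure: magnitude-squared coherence between
# a lip-aperture trajectory and a narrowband speech envelope, averaged over a
# low temporal-modulation band. All rates and band edges are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, coherence, resample

fs_audio = 16000   # audio sampling rate in Hz (assumed)
fs_video = 100     # lip-trajectory frame rate in Hz (assumed)

def narrowband_envelope(speech, low_hz=500.0, high_hz=1000.0):
    """Envelope of one cochlear-like band: band-pass filter + Hilbert magnitude."""
    b, a = butter(4, [low_hz, high_hz], btype="band", fs=fs_audio)
    band = filtfilt(b, a, speech)
    return np.abs(hilbert(band))

def avc(speech, lip_aperture, mod_band=(1.0, 7.0)):
    """Mean magnitude-squared coherence in a low temporal-modulation band."""
    env = narrowband_envelope(speech)
    # Resample the envelope to the lip-trajectory frame rate so lengths match.
    env_ds = resample(env, len(lip_aperture))
    f, cxy = coherence(env_ds, lip_aperture, fs=fs_video, nperseg=128)
    in_band = (f >= mod_band[0]) & (f <= mod_band[1])
    return cxy[in_band].mean()

# Example with synthetic signals sharing a 4 Hz modulation.
t_a = np.arange(0, 10, 1 / fs_audio)
t_v = np.arange(0, 10, 1 / fs_video)
speech = (1 + 0.5 * np.sin(2 * np.pi * 4 * t_a)) * np.random.randn(len(t_a))
lips = np.sin(2 * np.pi * 4 * t_v) + 0.2 * np.random.randn(len(t_v))
print(f"AVC estimate: {avc(speech, lips):.2f}")
```

Filtering out the shared 3-7 Hz modulations from the audio, as in Experiment 1, would drive a measure of this kind toward zero even though intelligibility cues may remain.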


1997 ◽  
Vol 40 (2) ◽  
pp. 432-443 ◽  
Author(s):  
Karen S. Helfer

Research has shown that speaking in a deliberately clear manner can improve the accuracy of auditory speech recognition. Allowing listeners access to visual speech cues also enhances speech understanding. Whether the nature of information provided by speaking clearly and by using visual speech cues is redundant has not been determined. This study examined how speaking mode (clear vs. conversational) and presentation mode (auditory vs. auditory-visual) influenced the perception of words within nonsense sentences. In Experiment 1, 30 young listeners with normal hearing responded to videotaped stimuli presented audiovisually in the presence of background noise at one of three signal-to-noise ratios. In Experiment 2, 9 participants returned for an additional assessment using auditory-only presentation. Results of these experiments showed significant effects of speaking mode (clear speech was easier to understand than was conversational speech) and presentation mode (auditory-visual presentation led to better performance than did auditory-only presentation). The benefit of clear speech was greater for words occurring in the middle of sentences than for words at either the beginning or end of sentences for both auditory-only and auditory-visual presentation, whereas the greatest benefit from supplying visual cues was for words at the end of sentences spoken both clearly and conversationally. The total benefit from speaking clearly and supplying visual cues was equal to the sum of each of these effects. Overall, the results suggest that speaking clearly and providing visual speech information provide complementary (rather than redundant) information.
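For context on the signal-to-noise-ratio manipulation described above, here is a minimal sketch of mixing speech and noise at a target SNR; the RMS-based SNR definition and the particular SNR values are illustrative assumptions, not the levels used in the study.

```python
# Hedged sketch: mixing a speech token with background noise at a target
# signal-to-noise ratio, defined here (by assumption) on RMS levels.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 20*log10(rms_speech / rms_noise) == snr_db."""
    rms_speech = np.sqrt(np.mean(speech ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2))
    target_rms_noise = rms_speech / (10 ** (snr_db / 20))
    return speech + noise * (target_rms_noise / rms_noise)

# Example: three SNR conditions (values illustrative, not from the study).
fs = 16000
speech = np.random.randn(2 * fs)   # stand-in for a recorded sentence
noise = np.random.randn(2 * fs)    # stand-in for background noise
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (0, -4, -8)}
```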


2017 ◽  
Vol 88 (6) ◽  
pp. 2043-2059 ◽  
Author(s):  
Mélanie Havy ◽  
Afra Foroud ◽  
Laurel Fais ◽  
Janet F. Werker

2018 ◽  
Vol 144 (3) ◽  
pp. 1799-1799
Author(s):  
Madeline Petrich ◽  
Macie Petrich ◽  
Chao-Yang Lee ◽  
Seth Wiener ◽  
Margaret Harrison ◽  
...  

1996 ◽  
Vol 39 (2) ◽  
pp. 228-238 ◽  
Author(s):  
Ken W. Grant ◽  
Brian E. Walden

Prosodic speech cues for rhythm, stress, and intonation are related primarily to variations in intensity, duration, and fundamental frequency. Because these cues make use of temporal properties of the speech waveform, they are likely to be represented broadly across the speech spectrum. To determine the relative importance of different frequency regions for the recognition of prosodic cues, identification of four prosodic features (syllable number, syllabic stress, sentence intonation, and phrase boundary location) was evaluated under six filter conditions spanning the range from 200–6100 Hz. Each filter condition had an equal articulation index (AI) weight, AI ≈ 0.10; p(C) for isolated words ≈ 0.40. Results obtained with normally hearing subjects showed an interaction between filter condition and the identification of specific prosodic features. For example, information from high-frequency regions of speech was particularly useful in the identification of syllable number and stress, whereas information from low-frequency regions was helpful in identifying intonation patterns. In spite of these spectral differences, listeners performed remarkably well overall in identifying prosodic patterns, although individual differences were apparent; some subjects achieved equivalent levels of performance across all six filter conditions. These results are discussed in relation to auditory and auditory-visual speech recognition.
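The central manipulation in this abstract, restricting speech to frequency regions carrying equal articulation-index weight, can be sketched as a band-pass filtering step. The band edges and sampling rate below are illustrative assumptions; the abstract specifies only the overall 200–6100 Hz range and an AI weight of about 0.10 per band.

```python
# Hedged sketch: limiting speech to one filter condition via zero-phase
# band-pass filtering. Band edges and sampling rate are assumptions, not
# the specific bands used by Grant and Walden (1996).
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000  # sampling rate in Hz (assumed)

def bandpass_speech(x, low_hz, high_hz, order=6):
    """Zero-phase band-pass filter a speech waveform to [low_hz, high_hz]."""
    sos = butter(order, [low_hz, high_hz], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

# Example: a hypothetical low-frequency band, which the abstract suggests
# is most useful for identifying intonation patterns.
speech = np.random.randn(2 * fs)              # 2-s stand-in for a sentence
low_band = bandpass_speech(speech, 200, 570)  # illustrative band edges
```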

