The role of combined consonant duration and amplitude processing on speech intelligibility in noise

2008 ◽ Vol 123 (5) ◽ pp. 3865-3865
Author(s): Jeffrey J. DiGiovanni, Jessica A. Wolfanger

2021
Author(s): Vibha Viswanathan, Barbara G. Shinn-Cunningham, Michael G. Heinz

To understand the mechanisms of speech perception in everyday listening environments, it is important to elucidate the relative contributions of different acoustic cues in transmitting phonetic content. Previous studies suggest that the energy envelopes of speech convey most speech content, while the temporal fine structure (TFS) can aid in segregating target speech from background noise. Despite the vast literature on TFS and speech intelligibility, the role of TFS in conveying speech content beyond what envelopes convey in complex acoustic scenes is poorly understood. The present study addresses this question using online psychophysical experiments to measure consonant identification in multi-talker babble for intelligibility-matched intact and 64-channel envelope-vocoded stimuli. Consonant confusion patterns revealed that listeners in the vocoded (versus intact) condition were more biased towards reporting that they heard an unvoiced consonant, despite envelope and place cues being largely preserved. This result was replicated when babble instances were varied across independent experiments, suggesting that TFS conveys important voicing cues beyond what envelopes convey in multi-talker babble, a masker that is ubiquitous in everyday environments. This finding has implications for assistive listening devices, such as cochlear implants, that do not currently provide TFS cues.
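
The envelope-vocoding manipulation referenced above can be illustrated with a short sketch. This is a minimal noise-excited envelope vocoder, not the authors' stimulus-generation code; the filter design, log-spaced channel edges, and envelope cutoff are assumptions chosen for illustration.

```python
# Minimal N-channel noise-excited envelope vocoder (illustrative sketch).
# Not the study's stimulus code; filter design and cutoffs are assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_vocode(x, fs, n_channels=64, f_lo=80.0, f_hi=7000.0,
                    env_cutoff=300.0):
    """Discard each band's temporal fine structure, keeping only the
    band envelopes. Requires fs > 2 * f_hi."""
    # Channel edges equally spaced on a log-frequency axis.
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    out = np.zeros(len(x), dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))               # Hilbert envelope of the band
        sos_env = butter(4, min(env_cutoff, hi - lo),
                         btype="lowpass", fs=fs, output="sos")
        env = sosfiltfilt(sos_env, env)           # smooth the envelope
        carrier = np.random.randn(len(x))         # noise carrier: TFS discarded
        carrier = sosfiltfilt(sos, carrier)       # confine carrier to the band
        out += env * carrier
    return out / np.max(np.abs(out))              # normalize to avoid clipping
```

With a high channel count such as 64, place (spectral) and envelope cues are largely preserved while the original fine structure is replaced by noise, which is what lets the confusion analysis isolate what TFS contributes.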


2019 ◽ Vol 23 ◽ pp. 233121651985459
Author(s): Jan Rennies, Virginia Best, Elin Roverud, Gerald Kidd

Speech perception in complex sound fields can greatly benefit from different unmasking cues to segregate the target from interfering voices. This study investigated the role of three unmasking cues (spatial separation, gender differences, and masker time reversal) on speech intelligibility and perceived listening effort in normal-hearing listeners. Speech intelligibility and categorically scaled listening effort were measured for a female target talker masked by two competing talkers, with either no unmasking cues or one to three unmasking cues. In addition to natural stimuli, all measurements were also conducted with glimpsed speech, created by removing the time–frequency tiles of the speech mixture in which the maskers dominated, to estimate the relative amounts of informational and energetic masking as well as the effort associated with source segregation. The results showed that all unmasking cues, as well as glimpsing, improved intelligibility and reduced listening effort, and that providing more than one cue was beneficial in overcoming informational masking. The reduction in listening effort due to glimpsing corresponded to increases in signal-to-noise ratio of 8 to 18 dB, indicating that a significant amount of listening effort was devoted to segregating the target from the maskers. Furthermore, the benefit in listening effort for all unmasking cues extended well into the range of positive signal-to-noise ratios at which speech intelligibility was at ceiling, suggesting that listening effort is a useful tool for evaluating speech-on-speech masking conditions at typical conversational levels.
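
The glimpsing manipulation described here amounts to applying an ideal binary mask in the short-time Fourier domain: time–frequency tiles where the maskers dominate the target are zeroed out of the mixture. A minimal sketch of that idea follows; the STFT settings and the 0 dB local-SNR criterion are assumptions for illustration, not the study's exact parameters.

```python
# Illustrative "glimpsed speech" via an ideal binary mask (IBM).
# STFT settings and the 0 dB criterion are assumptions, not the study's.
import numpy as np
from scipy.signal import stft, istft

def glimpse(target, maskers, fs, lc_db=0.0, nperseg=512):
    """Keep only time-frequency tiles of the mixture in which the target
    dominates the maskers. target and maskers must be equal length."""
    _, _, T = stft(target, fs, nperseg=nperseg)
    _, _, M = stft(maskers, fs, nperseg=nperseg)
    local_snr = 20.0 * np.log10(np.abs(T) / (np.abs(M) + 1e-12))
    mask = local_snr > lc_db                   # True = target-dominated tile
    _, _, X = stft(target + maskers, fs, nperseg=nperseg)
    _, glimpsed = istft(X * mask, fs, nperseg=nperseg)
    return glimpsed
```

Because the mask removes exactly the masker-dominated tiles, performance with glimpsed stimuli approximates listening with only energetic masking, so the gap between glimpsed and natural conditions indexes informational masking and segregation effort.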


2011 ◽ Vol 53 (3) ◽ pp. 327-339
Author(s): Kuldip Paliwal, Belinda Schwerin, Kamil Wójcicki

2020 ◽ Vol 10 (11) ◽ pp. 810
Author(s): Stanley Shen, Jess R. Kerlin, Heather Bortfeld, Antoine J. Shahin

The efficacy of audiovisual (AV) integration is reflected in the degree of cross-modal suppression of the auditory event-related potentials (ERPs, P1-N1-P2), while stronger semantic encoding is reflected in enhanced late ERP negativities (e.g., N450). We hypothesized that increasing visual stimulus reliability should lead to more robust AV integration and enhanced semantic prediction, reflected in suppression of auditory ERPs and an enhanced N450, respectively. EEG was acquired while individuals watched and listened to clear and blurred videos of a speaker uttering intact or highly intelligible degraded (vocoded) words and made binary judgments about word meaning (animate or inanimate). We found that intact speech evoked a larger negativity between 280 and 527 ms than vocoded speech, suggestive of more robust semantic prediction for the intact signal. For visual reliability, we found that greater cross-modal ERP suppression occurred for clear than blurred videos prior to sound onset and for the P2 ERP. Additionally, the later semantic-related negativity tended to be larger for clear than blurred videos. These results suggest that the cross-modal effect is largely confined to suppression of early auditory networks, with a weak effect on networks associated with semantic prediction. However, the semantic-related visual effect on the late negativity may have been tempered by the vocoded signal's high reliability.
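
The ERP measures described above reduce to epoching the continuous EEG around sound onset, baseline-correcting, averaging across trials, and taking the mean amplitude in a latency window (e.g., the 280 to 527 ms negativity). A minimal NumPy sketch follows; the sampling rate, baseline window, and single-channel handling are assumptions, not the study's pipeline.

```python
# Minimal ERP sketch: epoch, baseline-correct, average across trials, and
# measure mean amplitude in a latency window. Parameters are assumptions.
import numpy as np

def erp_window_amplitude(eeg, onsets, fs, tmin=-0.2, tmax=0.8,
                         win=(0.280, 0.527)):
    """eeg: 1-D trace for one channel; onsets: stimulus onsets in samples.
    Assumes every onset leaves room for the full epoch within eeg."""
    n_pre, n_post = int(-tmin * fs), int(tmax * fs)
    epochs = np.stack([eeg[o - n_pre : o + n_post]
                       for o in onsets]).astype(float)
    baseline = epochs[:, :n_pre].mean(axis=1, keepdims=True)
    epochs -= baseline                         # baseline-correct each trial
    erp = epochs.mean(axis=0)                  # trial-averaged ERP
    i0, i1 = (int((w - tmin) * fs) for w in win)
    return erp[i0:i1].mean()                   # mean amplitude in the window
```

Comparing this windowed amplitude across conditions (intact vs. vocoded, clear vs. blurred video) is the kind of contrast that yields the suppression and late-negativity effects reported here.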


2017 ◽ Vol 141 (2) ◽ pp. EL170-EL176
Author(s): Rachel L. Ellinger, Kasey M. Jakien, Frederick J. Gallun
