The role of combined consonant duration and amplitude processing on speech intelligibility in noise

2008 ◽ Vol 123 (5) ◽ pp. 3865-3865
Author(s): Jeffrey J. DiGiovanni, Jessica A. Wolfanger

2021
Author(s): Vibha Viswanathan, Barbara G. Shinn-Cunningham, Michael G. Heinz

To understand the mechanisms of speech perception in everyday listening environments, it is important to elucidate the relative contributions of different acoustic cues in transmitting phonetic content. Previous studies suggest that the energy envelopes of speech convey most speech content, while the temporal fine structure (TFS) can aid in segregating target speech from background noise. Despite the vast literature on TFS and speech intelligibility, the role of TFS in conveying speech content beyond what envelopes convey in complex acoustic scenes is poorly understood. The present study addresses this question using online psychophysical experiments to measure consonant identification in multi-talker babble for intelligibility-matched intact and 64-channel envelope-vocoded stimuli. Consonant confusion patterns revealed that listeners in the vocoded (versus intact) condition were more biased towards reporting that they heard an unvoiced consonant, despite envelope and place cues being largely preserved. This result was replicated when babble instances were varied across independent experiments, suggesting that TFS conveys important voicing cues beyond what envelopes convey in multi-talker babble, a masker that is ubiquitous in everyday environments. This finding has implications for assistive listening devices, such as cochlear implants, that do not currently provide TFS cues.
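
The envelope-vocoding manipulation referenced above can be illustrated with a short sketch. This is a minimal noise-excited envelope vocoder, not the authors' stimulus-generation code; the filter design, log-spaced channel edges, and envelope cutoff are assumptions chosen for illustration.

```python
# Minimal N-channel noise-excited envelope vocoder (illustrative sketch).
# Not the study's stimulus code; filter design and cutoffs are assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_vocode(x, fs, n_channels=64, f_lo=80.0, f_hi=7000.0,
                    env_cutoff=300.0):
    """Discard each band's temporal fine structure, keeping only the
    band envelopes. Requires fs > 2 * f_hi."""
    # Channel edges equally spaced on a log-frequency axis.
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    out = np.zeros(len(x), dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))               # Hilbert envelope of the band
        sos_env = butter(4, min(env_cutoff, hi - lo),
                         btype="lowpass", fs=fs, output="sos")
        env = sosfiltfilt(sos_env, env)           # smooth the envelope
        carrier = np.random.randn(len(x))         # noise carrier: TFS discarded
        carrier = sosfiltfilt(sos, carrier)       # confine carrier to the band
        out += env * carrier
    return out / np.max(np.abs(out))              # normalize to avoid clipping
```

With a high channel count such as 64, place (spectral) and envelope cues are largely preserved while the original fine structure is replaced by noise, which is what lets the confusion analysis isolate what TFS contributes.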


2019 ◽ Vol 23 ◽ pp. 233121651985459
Author(s): Jan Rennies, Virginia Best, Elin Roverud, Gerald Kidd

Speech perception in complex sound fields can greatly benefit from different unmasking cues to segregate the target from interfering voices. This study investigated the role of three unmasking cues (spatial separation, gender differences, and masker time reversal) on speech intelligibility and perceived listening effort in normal-hearing listeners. Speech intelligibility and categorically scaled listening effort were measured for a female target talker masked by two competing talkers, with either no unmasking cues or one to three unmasking cues. In addition to natural stimuli, all measurements were also conducted with glimpsed speech, created by removing the time–frequency tiles of the speech mixture in which the maskers dominated, to estimate the relative amounts of informational and energetic masking as well as the effort associated with source segregation. The results showed that all unmasking cues, as well as glimpsing, improved intelligibility and reduced listening effort, and that providing more than one cue was beneficial in overcoming informational masking. The reduction in listening effort due to glimpsing corresponded to increases in signal-to-noise ratio of 8 to 18 dB, indicating that a significant amount of listening effort was devoted to segregating the target from the maskers. Furthermore, the benefit in listening effort for all unmasking cues extended well into the range of positive signal-to-noise ratios at which speech intelligibility was at ceiling, suggesting that listening effort is a useful tool for evaluating speech-on-speech masking conditions at typical conversational levels.
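
The glimpsing manipulation described here amounts to applying an ideal binary mask in the short-time Fourier domain: time–frequency tiles where the maskers dominate the target are zeroed out of the mixture. A minimal sketch of that idea follows; the STFT settings and the 0 dB local-SNR criterion are assumptions for illustration, not the study's exact parameters.

```python
# Illustrative "glimpsed speech" via an ideal binary mask (IBM).
# STFT settings and the 0 dB criterion are assumptions, not the study's.
import numpy as np
from scipy.signal import stft, istft

def glimpse(target, maskers, fs, lc_db=0.0, nperseg=512):
    """Keep only time-frequency tiles of the mixture in which the target
    dominates the maskers. target and maskers must be equal length."""
    _, _, T = stft(target, fs, nperseg=nperseg)
    _, _, M = stft(maskers, fs, nperseg=nperseg)
    local_snr = 20.0 * np.log10(np.abs(T) / (np.abs(M) + 1e-12))
    mask = local_snr > lc_db                   # True = target-dominated tile
    _, _, X = stft(target + maskers, fs, nperseg=nperseg)
    _, glimpsed = istft(X * mask, fs, nperseg=nperseg)
    return glimpsed
```

Because the mask removes exactly the masker-dominated tiles, performance with glimpsed stimuli approximates listening with only energetic masking, so the gap between glimpsed and natural conditions indexes informational masking and segregation effort.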


2011 ◽ Vol 53 (3) ◽ pp. 327-339
Author(s): Kuldip Paliwal, Belinda Schwerin, Kamil Wójcicki

2020 ◽ Vol 10 (11) ◽ pp. 810
Author(s): Stanley Shen, Jess R. Kerlin, Heather Bortfeld, Antoine J. Shahin

The efficacy of audiovisual (AV) integration is reflected in the degree of cross-modal suppression of the auditory event-related potentials (ERPs, P1-N1-P2), while stronger semantic encoding is reflected in enhanced late ERP negativities (e.g., N450). We hypothesized that increasing visual stimulus reliability should lead to more robust AV integration and enhanced semantic prediction, reflected in suppression of auditory ERPs and an enhanced N450, respectively. EEG was acquired while individuals watched and listened to clear and blurred videos of a speaker uttering intact or highly intelligible degraded (vocoded) words and made binary judgments about word meaning (animate or inanimate). We found that intact speech evoked a larger negativity between 280 and 527 ms than vocoded speech, suggestive of more robust semantic prediction for the intact signal. For visual reliability, we found that greater cross-modal ERP suppression occurred for clear than blurred videos prior to sound onset and for the P2 ERP. Additionally, the later semantic-related negativity tended to be larger for clear than blurred videos. These results suggest that the cross-modal effect is largely confined to suppression of early auditory networks, with a weak effect on networks associated with semantic prediction. However, the semantic-related visual effect on the late negativity may have been tempered by the vocoded signal's high reliability.
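
The ERP measures described above reduce to epoching the continuous EEG around sound onset, baseline-correcting, averaging across trials, and taking the mean amplitude in a latency window (e.g., the 280 to 527 ms negativity). A minimal NumPy sketch follows; the sampling rate, baseline window, and single-channel handling are assumptions, not the study's pipeline.

```python
# Minimal ERP sketch: epoch, baseline-correct, average across trials, and
# measure mean amplitude in a latency window. Parameters are assumptions.
import numpy as np

def erp_window_amplitude(eeg, onsets, fs, tmin=-0.2, tmax=0.8,
                         win=(0.280, 0.527)):
    """eeg: 1-D trace for one channel; onsets: stimulus onsets in samples.
    Assumes every onset leaves room for the full epoch within eeg."""
    n_pre, n_post = int(-tmin * fs), int(tmax * fs)
    epochs = np.stack([eeg[o - n_pre : o + n_post]
                       for o in onsets]).astype(float)
    baseline = epochs[:, :n_pre].mean(axis=1, keepdims=True)
    epochs -= baseline                         # baseline-correct each trial
    erp = epochs.mean(axis=0)                  # trial-averaged ERP
    i0, i1 = (int((w - tmin) * fs) for w in win)
    return erp[i0:i1].mean()                   # mean amplitude in the window
```

Comparing this windowed amplitude across conditions (intact vs. vocoded, clear vs. blurred video) is the kind of contrast that yields the suppression and late-negativity effects reported here.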


2017 ◽ Vol 141 (2) ◽ pp. EL170-EL176
Author(s): Rachel L. Ellinger, Kasey M. Jakien, Frederick J. Gallun
