Monaural multi-talker speech recognition using factorial speech processing models

2018, Vol. 98, pp. 1-16
Author(s): Mahdi Khademian, Mohammad Mehdi Homayounpour

2017, Vol. 60 (9), pp. 2394-2405
Author(s): Lionel Fontan, Isabelle Ferrané, Jérôme Farinas, Julien Pinquier, Julien Tardieu, ...

Purpose: To assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated by an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids.
Method: Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and one comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to match human performance.
Results: Strong, significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance.
Conclusion: Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performance of an ASR-based system. It remains to be determined whether the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.
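As a rough illustration of the human-versus-ASR comparison reported above, the sketch below correlates paired intelligibility scores with a standard Pearson test. The score values are invented for illustration only; they are not the study's data, and this is not the study's analysis pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical intelligibility scores (% words correct) for speech
# degraded at increasing levels of simulated ARHL severity.
human = np.array([92.0, 85.0, 71.0, 55.0, 34.0, 18.0])
asr   = np.array([88.0, 79.0, 63.0, 41.0, 22.0,  9.0])

r, p = pearsonr(human, asr)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A strong positive r (the study reports coefficients up to .99)
# would support using ASR scores as a proxy for human performance.
```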


2005, Vol. 114 (11), pp. 886-893
Author(s): Li Xu, Teresa A. Zwolan, Catherine S. Thompson, Bryan E. Pfingst

Objectives: The present study was performed to evaluate the efficacy and clinical feasibility of using monopolar stimulation with the Clarion Simultaneous Analog Stimulation (SAS) strategy in patients with cochlear implants.
Methods: Speech recognition by 10 Clarion cochlear implant users was evaluated with 4 speech processing strategy/electrode configuration combinations: the SAS and Continuous Interleaved Sampling (CIS) strategies were each used with monopolar (MP) and bipolar (BP) electrode configurations. The test measures included consonants, vowels, consonant-nucleus-consonant words, and Hearing in Noise Test sentences at a +10 dB signal-to-noise ratio. Additionally, subjective judgments of sound quality were obtained for each strategy/configuration combination.
Results: All subjects but 1 demonstrated open-set speech recognition with the SAS/MP combination. The group mean Hearing in Noise Test sentence score for the SAS/MP combination was 31.6% correct (range, 0% to 92%), as compared to 25.0%, 46.7%, and 37.8% correct for the CIS/BP, CIS/MP, and SAS/BP combinations, respectively. Intersubject variability was high, and there were no significant differences in mean speech recognition scores or mean preference ratings among the 4 strategy/configuration combinations tested. Individually, the best speech recognition performance was with the subject's everyday strategy/configuration combination in 72% of the applicable cases. When the everyday strategy was excluded from the analysis, subjects performed best with the SAS/MP combination in 37.5% of the remaining cases.
Conclusions: The SAS processing strategy with an MP electrode configuration gave reasonable speech recognition in most subjects, even though subjects had minimal previous experience with this strategy/configuration combination. The SAS/MP combination might be particularly appropriate for patients for whom a full dynamic range of electrical hearing cannot be achieved with a BP configuration.


Author(s): Tim Arnold, Helen J. A. Fuller

Automatic speech recognition (ASR) systems and speech interfaces are becoming increasingly prevalent, including expanded use of these technologies to support work in health care. Computer-based speech processing has been studied and developed extensively over decades, and speech processing tools have been fine-tuned through the work of speech and language researchers. Researchers have described, and continue to describe, speech processing errors in medicine. This paper proposes an ergonomic framework for speech recognition that expands this view of speech processing in supporting clinical work. With this end in mind, we hope to build on previous work, emphasize the need for increased human factors involvement in this area, and facilitate the discussion of speech recognition in contexts already explored in the human factors domain. Human factors expertise can contribute by proactively describing and designing these critical, interconnected socio-technical systems with error tolerance in mind.


Author(s): Vincent Wan

This chapter describes the adaptation and application of kernel methods for speech processing. It is divided into two sections dealing with speaker verification and isolated-word speech recognition applications. Significant advances in kernel methods have been realised in the field of speaker verification, particularly relating to the direct scoring of variable-length speech utterances by sequence kernel SVMs. The improvements are so substantial that most state-of-the-art speaker recognition systems now incorporate SVMs. We describe the architecture of some of these sequence kernels. Speech recognition presents additional challenges to kernel methods and their application in this area is not as straightforward as for speaker verification. We describe a sequence kernel that uses dynamic time warping to capture temporal information within the kernel directly. The formulation also extends the standard dynamic time-warping algorithm by enabling the dynamic alignment to be computed in a high-dimensional space induced by a kernel function. This kernel is shown to work well in an application for recognising low-intelligibility speech of severely dysarthric individuals.
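To make the kernelized dynamic time warping idea concrete, here is a minimal sketch, not Wan's exact formulation: frame distances are computed in the space induced by an RBF kernel via d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y), a standard DTW recursion aligns the two variable-length sequences, and the alignment cost is exponentiated into a sequence-level similarity usable by an SVM. All function names and parameters are illustrative assumptions, and the resulting similarity is not guaranteed to be positive semi-definite.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Frame-level RBF kernel inducing the high-dimensional space."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_dtw(X, Y, k=rbf):
    """DTW alignment cost between two sequences of feature frames (rows),
    with frame distances computed in the space induced by kernel k."""
    n, m = len(X), len(Y)
    # Pairwise kernel-induced frame distances.
    D = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            d2 = k(X[i], X[i]) - 2.0 * k(X[i], Y[j]) + k(Y[j], Y[j])
            D[i, j] = np.sqrt(max(d2, 0.0))  # guard tiny negative values
    # Standard DTW dynamic programme over the distance matrix.
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j],      # insertion
                                            C[i, j - 1],      # deletion
                                            C[i - 1, j - 1])  # match
    return C[n, m]

def sequence_kernel(X, Y, gamma=0.1):
    """Turn the alignment cost into an SVM-style similarity.
    NB: exp(-gamma * DTW) is not guaranteed positive semi-definite."""
    return np.exp(-gamma * kernel_dtw(X, Y))

# Two toy "utterances" of different lengths (frames x features).
rng = np.random.default_rng(0)
a, b = rng.normal(size=(12, 13)), rng.normal(size=(9, 13))
print(sequence_kernel(a, b))
```

Because the alignment itself operates on kernel-induced distances, the dynamic programme effectively runs in the high-dimensional feature space without ever materialising it, which is the extension the chapter describes.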


Proceedings, 2019, Vol. 31 (1), p. 54
Author(s): Benítez-Guijarro, Callejas, Noguera, Benghazi

Devices with oral interfaces are enabling new interaction scenarios in ambient intelligence settings. Using several such devices in the same environment makes it possible to compare the inputs gathered by each of them and perform more accurate recognition and processing of user speech. However, combining multiple devices presents coordination challenges: processing one voice signal with different speech processing units may produce conflicting outputs, and it is necessary to decide which source is the most reliable. This paper presents an approach to ranking several sources of spoken input in multi-device environments in order to give preference to the input with the highest estimated quality. The voice signals received by the multiple devices are assessed in terms of their calculated acoustic quality and the reliability of the speech recognition hypotheses produced. After this assessment, each input is assigned a single score by which the audio sources are ranked, and the best is picked for processing by the system. To validate this approach, we performed an evaluation using a corpus of 4608 audios recorded in a two-room intelligent environment with 24 microphones. The experimental results show that our ranking approach successfully orchestrates an increasing number of acoustic inputs, obtaining better recognition rates than any single input, in both clear and noisy settings.
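A minimal sketch of the ranking step described above, assuming each device reports its captured samples plus an ASR hypothesis confidence. The quality proxy (a crude energy-based SNR estimate) and the weighted combination are hypothetical stand-ins for the paper's actual measures.

```python
import numpy as np

def snr_estimate(signal, noise_floor=1e-3):
    """Crude acoustic-quality proxy: frame energy relative to an assumed
    noise floor, in dB (hypothetical stand-in for the paper's measure)."""
    energy = np.mean(signal ** 2)
    return 10.0 * np.log10(max(energy, noise_floor) / noise_floor)

def rank_sources(captures, w_quality=0.5, w_confidence=0.5):
    """captures: list of (device_id, samples, asr_confidence), where
    asr_confidence in [0, 1] is the recogniser's hypothesis confidence.
    Returns device ids sorted best-first by a weighted score."""
    snrs = [snr_estimate(samples) for _, samples, _ in captures]
    lo, hi = min(snrs), max(snrs)
    scored = []
    for (dev, _, conf), snr in zip(captures, snrs):
        q = (snr - lo) / (hi - lo) if hi > lo else 1.0  # normalise to [0, 1]
        scored.append((w_quality * q + w_confidence * conf, dev))
    return [dev for _, dev in sorted(scored, reverse=True)]

# Three toy microphone captures with differing levels and confidences.
rng = np.random.default_rng(1)
mics = [("kitchen-1", 0.50 * rng.normal(size=16000), 0.62),
        ("kitchen-2", 0.05 * rng.normal(size=16000), 0.31),
        ("livingroom", 0.90 * rng.normal(size=16000), 0.88)]
print(rank_sources(mics))  # best source first
```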

