Monaural multi-talker speech recognition using factorial speech processing models

2018, Vol. 98, pp. 1-16
Author(s): Mahdi Khademian, Mohammad Mehdi Homayounpour

2017, Vol. 60 (9), pp. 2394-2405
Author(s): Lionel Fontan, Isabelle Ferrané, Jérôme Farinas, Julien Pinquier, Julien Tardieu, ...

Purpose: To assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated by an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids.
Method: Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and one comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to match human performance.
Results: Strong, significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance.
Conclusion: Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performance of an ASR-based system. It remains to be determined whether the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.
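As a rough illustration of the human-versus-ASR comparison reported above, the sketch below correlates paired intelligibility scores with a standard Pearson test. The score values are invented for illustration only; they are not the study's data, and this is not the study's analysis pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical intelligibility scores (% words correct) for speech
# degraded at increasing levels of simulated ARHL severity.
human = np.array([92.0, 85.0, 71.0, 55.0, 34.0, 18.0])
asr   = np.array([88.0, 79.0, 63.0, 41.0, 22.0,  9.0])

r, p = pearsonr(human, asr)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A strong positive r (the study reports coefficients up to .99)
# would support using ASR scores as a proxy for human performance.
```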


2005, Vol. 114 (11), pp. 886-893
Author(s): Li Xu, Teresa A. Zwolan, Catherine S. Thompson, Bryan E. Pfingst

Objectives: The present study was performed to evaluate the efficacy and clinical feasibility of using monopolar stimulation with the Clarion Simultaneous Analog Stimulation (SAS) strategy in patients with cochlear implants.
Methods: Speech recognition by 10 Clarion cochlear implant users was evaluated with 4 speech processing strategy/electrode configuration combinations: the SAS and Continuous Interleaved Sampling (CIS) strategies were each used with monopolar (MP) and bipolar (BP) electrode configurations. The test measures included consonants, vowels, consonant-nucleus-consonant words, and Hearing in Noise Test sentences at a +10 dB signal-to-noise ratio. Additionally, subjective judgments of sound quality were obtained for each strategy/configuration combination.
Results: All subjects but 1 demonstrated open-set speech recognition with the SAS/MP combination. The group mean Hearing in Noise Test sentence score for the SAS/MP combination was 31.6% correct (range, 0% to 92%), as compared to 25.0%, 46.7%, and 37.8% correct for the CIS/BP, CIS/MP, and SAS/BP combinations, respectively. Intersubject variability was high, and there were no significant differences in mean speech recognition scores or mean preference ratings among the 4 strategy/configuration combinations tested. Individually, the best speech recognition performance was with the subject's everyday strategy/configuration combination in 72% of the applicable cases. When the everyday strategy was excluded from the analysis, subjects performed best with the SAS/MP combination in 37.5% of the remaining cases.
Conclusions: The SAS processing strategy with an MP electrode configuration gave reasonable speech recognition in most subjects, even though subjects had minimal previous experience with this strategy/configuration combination. The SAS/MP combination might be particularly appropriate for patients for whom a full dynamic range of electrical hearing cannot be achieved with a BP configuration.


Author(s): Tim Arnold, Helen J. A. Fuller

Automatic speech recognition (ASR) systems and speech interfaces are becoming increasingly prevalent, including expanded use of these technologies to support work in health care. Computer-based speech processing has been studied and developed extensively over decades, and speech processing tools have been fine-tuned through the work of speech and language researchers. Researchers have described, and continue to describe, speech processing errors in medicine. This paper proposes an ergonomic framework for speech recognition that expands this view of speech processing in supporting clinical work. With this end in mind, we hope to build on previous work, emphasize the need for increased human factors involvement in this area, and facilitate the discussion of speech recognition in contexts already explored in the human factors domain. Human factors expertise can contribute by proactively describing and designing these critical, interconnected socio-technical systems with error tolerance in mind.


Author(s): Vincent Wan

This chapter describes the adaptation and application of kernel methods for speech processing. It is divided into two sections dealing with speaker verification and isolated-word speech recognition applications. Significant advances in kernel methods have been realised in the field of speaker verification, particularly relating to the direct scoring of variable-length speech utterances by sequence kernel SVMs. The improvements are so substantial that most state-of-the-art speaker recognition systems now incorporate SVMs. We describe the architecture of some of these sequence kernels. Speech recognition presents additional challenges to kernel methods and their application in this area is not as straightforward as for speaker verification. We describe a sequence kernel that uses dynamic time warping to capture temporal information within the kernel directly. The formulation also extends the standard dynamic time-warping algorithm by enabling the dynamic alignment to be computed in a high-dimensional space induced by a kernel function. This kernel is shown to work well in an application for recognising low-intelligibility speech of severely dysarthric individuals.
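To make the kernelized dynamic time warping idea concrete, here is a minimal sketch, not Wan's exact formulation: frame distances are computed in the space induced by an RBF kernel via d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y), a standard DTW recursion aligns the two variable-length sequences, and the alignment cost is exponentiated into a sequence-level similarity usable by an SVM. All function names and parameters are illustrative assumptions, and the resulting similarity is not guaranteed to be positive semi-definite.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Frame-level RBF kernel inducing the high-dimensional space."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_dtw(X, Y, k=rbf):
    """DTW alignment cost between two sequences of feature frames (rows),
    with frame distances computed in the space induced by kernel k."""
    n, m = len(X), len(Y)
    # Pairwise kernel-induced frame distances.
    D = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            d2 = k(X[i], X[i]) - 2.0 * k(X[i], Y[j]) + k(Y[j], Y[j])
            D[i, j] = np.sqrt(max(d2, 0.0))  # guard tiny negative values
    # Standard DTW dynamic programme over the distance matrix.
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j],      # insertion
                                            C[i, j - 1],      # deletion
                                            C[i - 1, j - 1])  # match
    return C[n, m]

def sequence_kernel(X, Y, gamma=0.1):
    """Turn the alignment cost into an SVM-style similarity.
    NB: exp(-gamma * DTW) is not guaranteed positive semi-definite."""
    return np.exp(-gamma * kernel_dtw(X, Y))

# Two toy "utterances" of different lengths (frames x features).
rng = np.random.default_rng(0)
a, b = rng.normal(size=(12, 13)), rng.normal(size=(9, 13))
print(sequence_kernel(a, b))
```

Because the alignment itself operates on kernel-induced distances, the dynamic programme effectively runs in the high-dimensional feature space without ever materialising it, which is the extension the chapter describes.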


Proceedings, 2019, Vol. 31 (1), p. 54
Author(s): Benítez-Guijarro, Callejas, Noguera, Benghazi

Devices with oral interfaces are enabling new interaction scenarios in ambient intelligence settings. Using several such devices in the same environment makes it possible to compare the inputs gathered by each of them and perform more accurate recognition and processing of user speech. However, combining multiple devices presents coordination challenges: processing one voice signal with different speech processing units may produce conflicting outputs, and it is necessary to decide which source is the most reliable. This paper presents an approach to ranking several sources of spoken input in multi-device environments in order to give preference to the input with the highest estimated quality. The voice signals received by the multiple devices are assessed in terms of their calculated acoustic quality and the reliability of the speech recognition hypotheses produced. After this assessment, each input is assigned a single score by which the audio sources are ranked, and the best is picked for processing by the system. To validate this approach, we performed an evaluation using a corpus of 4608 audios recorded in a two-room intelligent environment with 24 microphones. The experimental results show that our ranking approach successfully orchestrates an increasing number of acoustic inputs, obtaining better recognition rates than any single input, in both clear and noisy settings.
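A minimal sketch of the ranking step described above, assuming each device reports its captured samples plus an ASR hypothesis confidence. The quality proxy (a crude energy-based SNR estimate) and the weighted combination are hypothetical stand-ins for the paper's actual measures.

```python
import numpy as np

def snr_estimate(signal, noise_floor=1e-3):
    """Crude acoustic-quality proxy: frame energy relative to an assumed
    noise floor, in dB (hypothetical stand-in for the paper's measure)."""
    energy = np.mean(signal ** 2)
    return 10.0 * np.log10(max(energy, noise_floor) / noise_floor)

def rank_sources(captures, w_quality=0.5, w_confidence=0.5):
    """captures: list of (device_id, samples, asr_confidence), where
    asr_confidence in [0, 1] is the recogniser's hypothesis confidence.
    Returns device ids sorted best-first by a weighted score."""
    snrs = [snr_estimate(samples) for _, samples, _ in captures]
    lo, hi = min(snrs), max(snrs)
    scored = []
    for (dev, _, conf), snr in zip(captures, snrs):
        q = (snr - lo) / (hi - lo) if hi > lo else 1.0  # normalise to [0, 1]
        scored.append((w_quality * q + w_confidence * conf, dev))
    return [dev for _, dev in sorted(scored, reverse=True)]

# Three toy microphone captures with differing levels and confidences.
rng = np.random.default_rng(1)
mics = [("kitchen-1", 0.50 * rng.normal(size=16000), 0.62),
        ("kitchen-2", 0.05 * rng.normal(size=16000), 0.31),
        ("livingroom", 0.90 * rng.normal(size=16000), 0.88)]
print(rank_sources(mics))  # best source first
```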

