Training Wideband Acoustic Models in the Cepstral Domain Using Mixed-Bandwidth Training Data for Speech Recognition

Michael L. Seltzer; Alejandro Acero

doi:10.1121/1.3625676

Limited Training Data Robust Speech Recognition Using Kernel-Based Acoustic Models

2006 IEEE International Conference on Acoustics Speed and Signal Processing Proceedings ◽

10.1109/icassp.2006.1660226 ◽

2006 ◽

Cited By ~ 1

Author(s):

M. Schaffoner ◽

S.E. Kruger ◽

E. Andelic ◽

M. Katz ◽

A. Wendemuth

Keyword(s):

Speech Recognition ◽

Training Data ◽

Robust Speech Recognition ◽

Acoustic Models

Download Full-text

Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition

Language Resources and Evaluation ◽

10.1007/s10579-020-09505-5 ◽

2020 ◽

Vol 54 (4) ◽

pp. 975-998

Author(s):

Eiman Alsharhan ◽

Allan Ramsay

Keyword(s):

Speech Recognition ◽

Arabic Language ◽

Learning Curves ◽

Training Data ◽

Language Resources ◽

Acoustic Models ◽

Pronunciation Variation ◽

Register Variation ◽

Fine Grained ◽

Dialectal Arabic

Abstract Research in Arabic automatic speech recognition (ASR) is constrained by datasets of limited size, and of highly variable content and quality. Arabic-language resources vary in the attributes that affect language resources in other languages (noise, channel, speaker, genre), but also vary significantly in the dialect and level of formality of the spoken Arabic they capture. Many languages suffer similar levels of cross-dialect and cross-register acoustic variability, but these effects have been under-studied. This paper is an experimental analysis of the interaction between classical ASR corpus-compensation methods (feature selection, data selection, gender-dependent acoustic models) and the dialect-dependent/register-dependent variation among Arabic ASR corpora. The first interaction studied in this paper is that between acoustic recording quality and discrete pronunciation variation. Discrete pronunciation variation can be compensated by using grapheme-based instead of phone-based acoustic models, and by filtering out speakers with insufficient training data; the latter technique also helps to compensate for poor recording quality, which is further compensated by eliminating delta-delta acoustic features. All three techniques, together, reduce Word Error Rate (WER) by between 3.24% and 5.35%. The second aspect of dialect and register variation to be considered is variation in the fine-grained acoustic pronunciations of each phoneme in the language. Experimental results prove that gender and dialect are the principal components of variation in speech, therefore, building gender and dialect-specific models leads to substantial decreases in WER. In order to further explore the degree of acoustic differences between phone models required for each of the dialects of Arabic, cross-dialect experiments are conducted to measure how far apart Arabic dialects are acoustically in order to make a better decision about the minimal number of recognition systems needed to cover all dialectal Arabic. Finally, the research addresses an important question: how much training data is needed for building efficient speaker-independent ASR systems? This includes developing some learning curves to find out how large must the training set be to achieve acceptable performance.

Download Full-text

An Investigation of Multilingual TDNN-BLSTM Acoustic Modeling for Hindi Speech Recognition

International Journal of Sensors Wireless Communications and Control ◽

10.2174/2210327911666210118143758 ◽

2021 ◽

Vol 11 ◽

Author(s):

Ankit Kumar ◽

Rajesh Kumar Aggarwal

Keyword(s):

Neural Network ◽

Speech Recognition ◽

High Accuracy ◽

Training Data ◽

Acoustic Modeling ◽

Training Dataset ◽

Acoustic Model ◽

Indian Languages ◽

Acoustic Models ◽

Asr System

Background: In India, thousands of languages or dialects are in use. Most Indian dialects are low asset dialects. A well-performing Automatic Speech Recognition (ASR) system for Indian languages is unavailable due to a lack of resources. Hindi is one of them as large vocabulary Hindi speech datasets are not freely available. We have only a few hours of transcribed Hindi speech dataset. There is a lot of time and money involved in creating a well-transcribed speech dataset. Thus, developing a real-time ASR system with a few hours of the training dataset is the most challenging task. The different techniques like data augmentation, semi-supervised training, multilingual architecture, and transfer learning, have been reported in the past to tackle the fewer speech data issues. In this paper, we examine the effect of multilingual acoustic modeling in ASR systems for the Hindi language. Objective: This article’s objective is to develop a high accuracy Hindi ASR system with a reasonable computational load and high accuracy using a few hours of training data. Method: To achieve this goal we used Multilingual training with Time Delay Neural Network- Bidirectional Long Short Term Memory (TDNN-BLSTM) acoustic modeling. Multilingual acoustic modeling has significantly improved the ASR system's performance for low and limited resource languages. The common practice is to train the acoustic model by merging data from similar languages. In this work, we use three Indian languages, namely Hindi, Marathi, and Bengali. Hindi with 2.5 hours of training data and Marathi with 5.5 hours of training data and Bengali with 28.5 hours of transcribed data, was used in this work to train the proposed model. Results: The Kaldi toolkit was used to perform all the experiments. The paper is investigated over three main points. First, we present the monolingual ASR system using various Neural Network (NN) based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling helps to improve the ASR performance further. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER) (absolute 2% WER reduction for Hindi and 3% for the Marathi language). In all the three languages, the proposed TDNN-BLSTM-A multilingual acoustic models help to get the lowest WER. Conclusion: The multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system. The best WER of 8.65% was recorded for Hindi ASR. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model reports the best WER of 30.40% and 10.85%.

Download Full-text

Training Wideband Acoustic Models Using Mixed-Bandwidth Training Data for Speech Recognition

IEEE Transactions on Audio Speech and Language Processing ◽

10.1109/tasl.2006.876774 ◽

2007 ◽

Vol 15 (1) ◽

pp. 235-245 ◽

Cited By ~ 6

Author(s):

Michael L. Seltzer ◽

Alex Acero

Keyword(s):

Speech Recognition ◽

Training Data ◽

Acoustic Models

Download Full-text

CTC Training of Multi-Phone Acoustic Models for Speech Recognition

10.21437/interspeech.2017-505 ◽

2017 ◽

Cited By ~ 3

Author(s):

Olivier Siohan

Keyword(s):

Speech Recognition ◽

Acoustic Models

Download Full-text

Phonetic Variation Modeling and a Language Model Adaptation for Korean English Code-Switching Speech Recognition

Applied Sciences ◽

10.3390/app11062866 ◽

2021 ◽

Vol 11 (6) ◽

pp. 2866

Author(s):

Damheo Lee ◽

Donghyun Kim ◽

Seung Yun ◽

Sanghun Kim

Keyword(s):

Speech Recognition ◽

Language Model ◽

Reduction Rate ◽

Code Switching ◽

Training Data ◽

Target Domain ◽

Phonetic Variation ◽

Language Model Adaptation ◽

Imbalanced Training Data ◽

Lm Adaptation

In this paper, we propose a new method for code-switching (CS) automatic speech recognition (ASR) in Korean. First, the phonetic variations in English pronunciation spoken by Korean speakers should be considered. Thus, we tried to find a unified pronunciation model based on phonetic knowledge and deep learning. Second, we extracted the CS sentences semantically similar to the target domain and then applied the language model (LM) adaptation to solve the biased modeling toward Korean due to the imbalanced training data. In this experiment, training data were AI Hub (1033 h) in Korean and Librispeech (960 h) in English. As a result, when compared to the baseline, the proposed method improved the error reduction rate (ERR) by up to 11.6% with phonetic variant modeling and by 17.3% when semantically similar sentences were applied to the LM adaptation. If we considered only English words, the word correction rate improved up to 24.2% compared to that of the baseline. The proposed method seems to be very effective in CS speech recognition.

Download Full-text