Auditory model based optimization of MFCCs improves automatic speech recognition performance

Punjabi language is a tonal language belonging to an Indo-Aryan language family and has a number of speakers all around the world. Punjabi language has gained acceptability in the media & communication and therefore deserves to have a place in the growing field of automatic speech recognition which has been explored already for a number of other Indian and foreign languages successfully. Some work has been done in the field of isolated word speech recognition for Punjabi language, but only using whole word based acoustic models. A phone based approach has yet to be applied for Punjabi language speech recognition. This paper describes an automatic speech recognizer that recognizes isolated word speech and connected word speech using a triphone based acoustic model on the HTK 3.4.1 speech Engine and compares the performance with acoustic whole word model based ASR system. Word recognition accuracy of isolated word speech was 92.05% for acoustic whole word model based system and 97.14% for acoustic triphone model based system whereas word recognition accuracy of connected word speech was 87.75% for acoustic whole word model based system and 91.62% for acoustic triphone model based system.

Download Full-text

Speech Feature Evaluation for Bangla Automatic Speech Recognition

Technical Challenges and Design Issues in Bangla Language Processing ◽

10.4018/978-1-4666-3970-6.ch009 ◽

2013 ◽

pp. 169-208 ◽

Cited By ~ 1

Author(s):

Mohammed Rokibul Alam Kotwal ◽

Foyzul Hassan ◽

Mohammad Nurul Huda

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition Performance ◽

Dynamic Parameters ◽

Mel Frequency Cepstral Coefficients ◽

Phoneme Recognition ◽

Hybrid Features ◽

Feature Evaluation ◽

Speech Feature ◽

Speech Features

This chapter presents Bangla (widely known as Bengali) Automatic Speech Recognition (ASR) techniques by evaluating the different speech features, such as Mel Frequency Cepstral Coefficients (MFCCs), Local Features (LFs), phoneme probabilities extracted by time delay artificial neural networks of different architectures. Moreover, canonicalization of speech features is also performed for Gender-Independent (GI) ASR. In the canonicalization process, the authors have designed three classifiers by male, female, and GI speakers, and extracted the output probabilities from these classifiers for measuring the maximum. The maximization of output probabilities for each speech file provides higher correctness and accuracies for GI speech recognition. Besides, dynamic parameters (velocity and acceleration coefficients) are also used in the experiments for obtaining higher accuracy in phoneme recognition. From the experiments, it is also shown that dynamic parameters with hybrid features also increase the phoneme recognition performance in a certain extent. These parameters not only increase the accuracy of the ASR system, but also reduce the computation complexity of Hidden Markov Model (HMM)-based classifiers with fewer mixture components.

Download Full-text