Comparison on Neural Network based acoustic model in Mongolian speech recognition

Author(s): Hongwei Zhang, Feilong Bao, Guanglai Gao, Hui Zhang
Author(s): Ankit Kumar, Rajesh Kumar Aggarwal

Background: In India, thousands of languages and dialects are in use, and most of them are low-resource. A well-performing Automatic Speech Recognition (ASR) system for Indian languages is unavailable due to this lack of resources. Hindi is one such language: large-vocabulary Hindi speech datasets are not freely available, and only a few hours of transcribed Hindi speech exist. Creating a well-transcribed speech dataset takes considerable time and money, so developing a real-time ASR system from a few hours of training data is a challenging task. Techniques such as data augmentation, semi-supervised training, multilingual architectures, and transfer learning have been reported in the past to tackle this data-scarcity issue. In this paper, we examine the effect of multilingual acoustic modeling on ASR systems for the Hindi language.

Objective: The objective of this article is to develop a high-accuracy Hindi ASR system with a reasonable computational load using only a few hours of training data.

Method: To achieve this goal, we use multilingual training with a Time Delay Neural Network-Bidirectional Long Short-Term Memory (TDNN-BLSTM) acoustic model. Multilingual acoustic modeling has significantly improved ASR performance for low- and limited-resource languages; the common practice is to train the acoustic model on data merged from similar languages. In this work, we use three Indian languages, namely Hindi, Marathi, and Bengali, with 2.5 hours, 5.5 hours, and 28.5 hours of transcribed training data, respectively.

Results: The Kaldi toolkit was used to perform all the experiments. The paper investigates three main points. First, we present monolingual ASR systems using various Neural Network (NN) based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling further improves ASR performance. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER): an absolute 2% WER reduction for Hindi and 3% for Marathi. In all three languages, the proposed TDNN-BLSTM-A multilingual acoustic model yields the lowest WER.

Conclusion: The multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system, with a best WER of 8.65% for Hindi. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model reports best WERs of 30.40% and 10.85%, respectively.
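The WER and relative-improvement figures above follow the standard definitions; as a minimal sketch (not the authors' scoring scripts, which would be Kaldi's own tools), WER can be computed as word-level edit distance divided by reference length:

```python
# Minimal sketch of the standard WER metric used in the abstract.
# WER = (substitutions + deletions + insertions) / reference word count.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, as in the '13.67% relative' figure."""
    return (baseline_wer - new_wer) / baseline_wer
```

For instance, a 13.67% relative improvement down to 8.65% WER implies a monolingual baseline of roughly 8.65 / (1 - 0.1367) ≈ 10.0% WER.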


2020, Vol 10 (12), pp. 4091
Author(s): Yoo Rhee Oh, Kiyoung Park, Jeon Gyu Park

This paper aims to design an online, low-latency, high-performance speech recognition system using a bidirectional long short-term memory (BLSTM) acoustic model. To achieve this, we adopt a server-client model and a context-sensitive-chunk-based approach. The speech recognition server manages a main thread and a decoder thread for each client, plus one worker thread. The main thread communicates with the connected client, extracts speech features, and buffers them. The decoder thread performs speech recognition, including the proposed multichannel parallel acoustic score computation of a BLSTM acoustic model, the proposed deep neural network-based voice activity detector, and Viterbi decoding. The proposed acoustic score computation method estimates the acoustic scores of a context-sensitive-chunk BLSTM acoustic model for the batched speech features from concurrent clients, using the worker thread. The proposed deep neural network-based voice activity detector detects short pauses in the utterance to reduce response latency while the user utters long sentences. In experiments on Korean speech recognition, the proposed acoustic score computation increases the number of concurrent clients from 22 to 44. When combined with the frame-skipping method, the number is further increased to 59 clients with a small degradation in accuracy. Moreover, the proposed deep neural network-based voice activity detector reduces the average user-perceived latency from 11.71 s to 3.09–5.41 s.
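The batching idea described above, where per-client decoder threads hand feature chunks to one shared worker thread that scores them together, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `BatchScorer` class and `score_batch` callback are hypothetical names, and the acoustic model (in the paper, a context-sensitive-chunk BLSTM forward pass) is stubbed out.

```python
# Sketch: many client decoder threads submit feature chunks to a shared
# queue; a single worker thread drains the queue and scores all pending
# chunks in one batch, amortizing the acoustic-model computation.
import queue
import threading

class BatchScorer:
    def __init__(self, score_batch):
        self._q = queue.Queue()
        self._score_batch = score_batch  # e.g. one batched BLSTM forward pass
        threading.Thread(target=self._worker, daemon=True).start()

    def score(self, chunk):
        """Called from a client's decoder thread; blocks until scored."""
        done = threading.Event()
        result = {}
        self._q.put((chunk, done, result))
        done.wait()
        return result["scores"]

    def _worker(self):
        while True:
            # Block for at least one request, then drain everything
            # currently queued into a single batch.
            items = [self._q.get()]
            while True:
                try:
                    items.append(self._q.get_nowait())
                except queue.Empty:
                    break
            scores = self._score_batch([chunk for chunk, _, _ in items])
            for (_, done, result), s in zip(items, scores):
                result["scores"] = s
                done.set()
```

In the paper's setting, the batch at each step would contain context-sensitive chunks from up to tens of concurrent clients, which is what allows the reported increase from 22 to 44 clients per server.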

