A De Novo Divide-and-Merge Paradigm for Acoustic Model Optimization in Automatic Speech Recognition

Author(s):  
Conghui Tan ◽  
Di Jiang ◽  
Jinhua Peng ◽  
Xueyang Wu ◽  
Qian Xu ◽  
...  

Due to rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model on the complete data as before. In this paper, we propose a novel Divide-and-Merge paradigm to address this problem. In the Divide phase, multiple acoustic models are trained on different subsets of the complete speech data; in the Merge phase, two novel algorithms are used to generate a high-quality acoustic model from those trained on the subsets. We first propose the Genetic Merge Algorithm (GMA), a highly specialized algorithm for optimizing acoustic models that nevertheless suffers from low efficiency. We then propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA while maintaining superior performance. Extensive experiments on public data show that the proposed methods significantly outperform the state-of-the-art.
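To make the Merge phase concrete, here is a minimal sketch of merging sub-models by learning convex interpolation weights over their parameters with SGD-style updates on a development loss. It assumes all sub-models share one architecture; the function names, the softmax parameterization, and the finite-difference gradient are illustrative assumptions, not the paper's actual SOMA formulation.

```python
import numpy as np

def merge_models(param_sets, weights):
    """Interpolate the parameters of K sub-models with convex weights.

    param_sets: list of dicts mapping parameter name -> np.ndarray,
    one dict per sub-model trained on its own data subset."""
    w = np.exp(weights) / np.exp(weights).sum()  # softmax keeps weights convex
    return {name: sum(w[k] * params[name] for k, params in enumerate(param_sets))
            for name in param_sets[0]}

def merge_weight_step(param_sets, weights, dev_loss, lr=0.1, eps=1e-3):
    """One SGD-style step on the merge weights; finite differences stand in
    for whatever gradient the actual algorithm computes (an assumption)."""
    base = dev_loss(merge_models(param_sets, weights))
    grad = np.zeros_like(weights)
    for k in range(len(weights)):
        probe = weights.copy()
        probe[k] += eps
        grad[k] = (dev_loss(merge_models(param_sets, probe)) - base) / eps
    return weights - lr * grad
```

In practice one would start from uniform weights (e.g. `np.zeros(len(param_sets))`), iterate `merge_weight_step` until the development loss stops improving, and call `merge_models` once with the final weights.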

Author(s):  
Askars Salimbajevs

Automatic Speech Recognition (ASR) requires huge amounts of real user speech data to reach state-of-the-art performance. However, speech data conveys sensitive speaker attributes, such as identity, that can be inferred and exploited for malicious purposes. There is therefore an interest in collecting anonymized speech data that has been processed by a voice conversion method. In this paper, we evaluate one such voice conversion method on Latvian speech data and also investigate whether privacy-transformed data can be used to improve ASR acoustic models. Results show the effectiveness of voice conversion against state-of-the-art speaker verification models on Latvian speech, and the effectiveness of using privacy-transformed data in ASR training.
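As a rough illustration of the training setup described above, the sketch below pools privacy-transformed utterances into an ASR training list. The `voice_convert` callable is a hypothetical stand-in for the actual voice conversion model, and the `keep_original` flag is an assumption about how original and transformed data might be mixed.

```python
from pathlib import Path

def build_training_list(wav_dir, voice_convert, keep_original=False):
    """Collect ASR training audio where each utterance is replaced
    (or optionally augmented) by its privacy-transformed version.

    voice_convert: callable mapping a source wav path to the path of an
    anonymized wav -- a placeholder for the voice conversion model."""
    training = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        training.append(voice_convert(wav))  # anonymized copy of the utterance
        if keep_original:                    # optionally mix in the raw audio
            training.append(wav)
    return training
```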


Accent is one of the key challenges for speech recognition systems: Automatic Speech Recognition systems must yield high performance across different dialects. In this work, a Neutral Kannada Automatic Speech Recognition system is implemented using the Kaldi toolkit with monophone and triphone modelling. The acoustic models are constructed using monophone, triphone1, triphone2, and triphone3 techniques; in triphone modelling, similar context-dependent phones are grouped together. Feature extraction is performed with Mel Frequency Cepstral Coefficients. System performance is analysed by measuring the Word Error Rate under the different acoustic models. To assess the robustness of the Neutral Kannada system across Kannada dialects, it is also tested on the North Kannada accent. The Neutral Kannada system achieves a sentence accuracy of about 90%, which degrades to around 77% when tested on the North Kannada accent. The degradation is due to the increasing mismatch between the training and testing data sets, since the system is trained only on a neutral Kannada acoustic model and does not include a North Kannada acoustic model. An interactive Kannada voice response system is implemented to recognize continuous Kannada speech sentences.
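Since the evaluation above is driven by Word Error Rate, here is the standard Levenshtein-based WER computation for reference; it is generic and not tied to the authors' Kaldi setup.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution against a four-word reference gives WER 0.25.
print(word_error_rate("open the front door", "open a front door"))
```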


2015 ◽  
Vol 40 (2) ◽  
pp. 191-195 ◽  
Author(s):  
Łukasz Brocki ◽  
Krzysztof Marasek

Abstract This paper describes a hybrid of a Deep Belief Neural Network (DBNN) and a Bidirectional Long Short-Term Memory (BLSTM) network used as an acoustic model for speech recognition. Many independent researchers have demonstrated that DBNNs outperform other known machine learning frameworks in terms of speech recognition accuracy; their superiority comes from the fact that these are deep learning networks. However, a trained DBNN is simply a feed-forward network with no internal memory, unlike Recurrent Neural Networks (RNNs), which are Turing complete and do possess internal memory, allowing them to make use of longer context. In this paper, an experiment is performed in which a DBNN is combined with an advanced bidirectional RNN that processes its output. Results show that using the new DBNN-BLSTM hybrid as the acoustic model for Large Vocabulary Continuous Speech Recognition (LVCSR) increases word recognition accuracy. However, the new model has many parameters and in some cases may suffer performance issues in real-time applications.
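For readers who want a concrete picture of such a hybrid, below is a minimal PyTorch sketch: a feed-forward stack standing in for the trained DBNN, followed by a bidirectional LSTM and a per-frame output layer. Layer sizes, activations, and the output dimension are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DBNNBLSTMHybrid(nn.Module):
    """Sketch of the hybrid: a feed-forward stack (stand-in for the
    pretrained DBNN) whose per-frame outputs feed a bidirectional LSTM
    that adds temporal context, then a linear layer scoring HMM states."""
    def __init__(self, n_feats=39, n_states=2000, hidden=512):
        super().__init__()
        self.dbnn = nn.Sequential(                # feed-forward, no memory
            nn.Linear(n_feats, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True,
                             bidirectional=True)  # context in both directions
        self.out = nn.Linear(2 * hidden, n_states)

    def forward(self, x):                         # x: (batch, frames, n_feats)
        h = self.dbnn(x)
        h, _ = self.blstm(h)
        return self.out(h)                        # per-frame state scores

# Shape check with a dummy batch of 100 frames of 39-dim features.
scores = DBNNBLSTMHybrid()(torch.randn(2, 100, 39))
print(scores.shape)  # torch.Size([2, 100, 2000])
```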


Ingeniería ◽  
2017 ◽  
Vol 22 (3) ◽  
pp. 362 ◽  
Author(s):  
Juan David Celis Nuñez ◽  
Rodrigo Andres Llanos Castro ◽  
Byron Medina Delgado ◽  
Sergio Basilio Sepúlveda Mora ◽  
Sergio Alexander Castro Casadiego

Context: Automatic speech recognition requires the development of language and acoustic models for the different existing dialects. The purpose of this research is to train an acoustic model, a statistical language model, and a grammar language model for Spanish, specifically for the dialect of the city of San Jose de Cucuta, Colombia, for use in a command control system. Existing models for Spanish have problems recognizing the fundamental frequency and spectral content, the accent, pronunciation, and tone of Cucuta's dialect, or simply lack a language model for it.

Method: in this project we used a Raspberry Pi B+ embedded system running the Raspbian operating system (a Linux distribution) and two open-source tools, namely the CMU-Cambridge Statistical Language Modeling Toolkit from the University of Cambridge and CMU Sphinx from Carnegie Mellon University; both tools are based on Hidden Markov Models for the calculation of voice parameters. In addition, we used 1913 audio recordings of the voices of people from San Jose de Cucuta and the Norte de Santander department. These audios were used for training and testing the automatic speech recognition system.

Results: we obtained a language model consisting of two files, a statistical language model (.lm) and a JSGF grammar model (.jsgf). On the acoustic side, two models were trained; the improved version reached a 100% accuracy rate on the training results and an 83% accuracy rate in the audio tests for command recognition. Finally, we produced a manual for creating acoustic and language models with the CMU Sphinx software.

Conclusions: the number of participants in the training process of the language and acoustic models has a significant influence on the quality of the recognizer's voice processing. Using a large dictionary for training and a short dictionary containing only the command words for deployment is important to get a better response from the automatic speech recognition system. Given the accuracy rate above 80% in the voice recognition tests, the proposed models are suitable for applications oriented to assisting people with visual or motor impairments.
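To illustrate how such trained models plug into a command control system, here is a sketch using the classic pocketsphinx Python bindings (the API differs in newer releases); the model file names are placeholders for the trained Cucuta-dialect models described above.

```python
from pocketsphinx import Decoder

# Point the decoder at the trained models; paths below are placeholders.
config = Decoder.default_config()
config.set_string('-hmm', 'cucuta-es/acoustic')        # acoustic model directory
config.set_string('-lm', 'cucuta-es/commands.lm')      # statistical language model
config.set_string('-dict', 'cucuta-es/commands.dict')  # short command dictionary
decoder = Decoder(config)

# Decode one utterance of raw audio (16 kHz, 16-bit mono PCM assumed).
decoder.start_utt()
with open('command.raw', 'rb') as audio:
    while True:
        buf = audio.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)                        # recognized command
```

Restricting the deployment dictionary to the command words, as the conclusions recommend, shrinks the decoder's search space and is what keeps recognition responsive on the Raspberry Pi.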


Author(s):  
Ankit Kumar ◽  
Rajesh Kumar Aggarwal

Background: In India, thousands of languages and dialects are in use, and most of them are low-resource languages. A well-performing Automatic Speech Recognition (ASR) system for Indian languages is unavailable due to this lack of resources. Hindi is one such case: large-vocabulary Hindi speech datasets are not freely available, and only a few hours of transcribed Hindi speech exist. Creating a well-transcribed speech dataset takes considerable time and money, so developing a real-time ASR system from a few hours of training data is a challenging task. Techniques such as data augmentation, semi-supervised training, multilingual architectures, and transfer learning have been reported in the past to tackle the scarcity of speech data. In this paper, we examine the effect of multilingual acoustic modeling on ASR systems for the Hindi language.

Objective: to develop a Hindi ASR system with high accuracy and a reasonable computational load using only a few hours of training data.

Method: to achieve this goal, we used multilingual training with a Time Delay Neural Network-Bidirectional Long Short-Term Memory (TDNN-BLSTM) acoustic model. Multilingual acoustic modeling has significantly improved ASR performance for low- and limited-resource languages; the common practice is to train the acoustic model on merged data from similar languages. In this work we use three Indian languages: Hindi with 2.5 hours of training data, Marathi with 5.5 hours, and Bengali with 28.5 hours of transcribed data.

Results: the Kaldi toolkit was used to perform all the experiments. The investigation covers three main points. First, we present monolingual ASR systems using various Neural Network (NN) based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling further improves ASR performance. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER): an absolute 2% WER reduction for Hindi and 3% for Marathi. In all three languages, the proposed TDNN-BLSTM-A multilingual acoustic model yields the lowest WER.

Conclusion: the multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system. The best WER recorded for Hindi ASR was 8.65%. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model achieves best WERs of 30.40% and 10.85%, respectively.
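Because the results above quote both absolute and relative WER reductions, the small helper below makes the distinction explicit; the example numbers are illustrative, not a reproduction of the paper's baselines.

```python
def wer_improvement(baseline_wer, new_wer):
    """Absolute reduction is measured in percentage points; relative
    reduction is the absolute drop as a fraction of the baseline."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * (baseline_wer - new_wer) / baseline_wer
    return absolute, relative

# Illustrative only: a 10.0% baseline improved to 8.65% is a 1.35-point
# absolute reduction and a 13.5% relative reduction.
print(wer_improvement(10.0, 8.65))
```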

