User-Friendly Automatic Transcription of Low-Resource Languages: Plugging ESPnet into Elpis

This paper reports on progress integrating the speech recognition toolkit ESPnet into Elpis, a web front-end originally designed to provide access to the Kaldi automatic speech recognition toolkit. The goal of this work is to make end-to-end speech recognition models available to language workers via a user-friendly graphical interface. Encouraging results are reported on (i) development of an ESPnet recipe for use in Elpis, with preliminary results on data sets previously used for training acoustic models with the Persephone toolkit along with a new data set that had not previously been used in speech recognition, and (ii) incorporating ESPnet into Elpis along with UI enhancements and a CUDA-supported Dockerfile.

Download Full-text

Triphone Model Based Novel Kannada Continuous Speech Recognition System using Kaldi Tool

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i7210.079920 ◽

2020 ◽

Vol 9 (9) ◽

pp. 452-458

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition System ◽

Speech Recognition System ◽

Acoustic Model ◽

Mel Frequency Cepstral Coefficients ◽

Automatic Speech Recognition System ◽

Data Set ◽

Acoustic Models ◽

Recognition Systems

Accent is one of the issue for speech recognition systems. Automatic Speech Recognition systems must yield high performance for different dialects. In this work, Neutral Kannada Automatic Speech Recognition is implemented using Kaldi software for monophone modelling and triphone modeling. The acoustic models are constructed using the techniques such as monophone, triphone1, triphone2, triphone3. In triphone modeling, grouping of interphones is performed. Feature extraction is performed by Mel Frequency Cepstral Coefficients. The system performance is analysed by measuring Word Error Rate using different acoustic models. To know the robustness and performance of the Neutral Kannada Automatic Speech Recognition system for different dialects in Kannada, the system is tested for North Kannada accent. Better sentence accuracy is obtained for Neutral Kannada Automatic Speech Recognition system and is about 90%. The performance is degraded, when tested for North Kannada accent and the accuracy obtained is around 77%. The performance is degraded due to the increasing mismatch between the training and testing data set, as the Neutral Kannada Automatic Speech Recognition system is trained only for neutral Kannada acoustic model and doesn't include north Kannada acoustic model. Interactive Kannada voice response system is implemented to identify continuous Kannada speech sentences.

Download Full-text

A Cross-Entropy-Guided (CEG) Measure for Speech Enhancement Front-End Assessing Performances of Back-End Automatic Speech Recognition

10.21437/interspeech.2019-2511 ◽

2019 ◽

Author(s):

Li Chai ◽

Jun Du ◽

Chin-Hui Lee

Keyword(s):

Speech Recognition ◽

Speech Enhancement ◽

Automatic Speech Recognition ◽

Cross Entropy ◽

Front End

Download Full-text

Comparing Front-End Enhancement Techniques and Multiconditioned Training for Robust Automatic Speech Recognition

Text, Speech, and Dialogue - Lecture Notes in Computer Science ◽

10.1007/978-3-030-27947-9_28 ◽

2019 ◽

pp. 329-340 ◽

Cited By ~ 1

Author(s):

Meet H. Soni ◽

Sonal Joshi ◽

Ashish Panda

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Front End

Download Full-text

A De Novo Divide-and-Merge Paradigm for Acoustic Model Optimization in Automatic Speech Recognition

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/513 ◽

2020 ◽

Author(s):

Conghui Tan ◽

Di Jiang ◽

Jinhua Peng ◽

Xueyang Wu ◽

Qian Xu ◽

...

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

De Novo ◽

Superior Performance ◽

Acoustic Model ◽

Acoustic Models ◽

Public Data ◽

Speech Data ◽

Low Efficiency ◽

Novel Algorithms

Due to the rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model with complete data as before. In this paper, we propose a novel Divide-and-Merge paradigm to solve salient problems plaguing the ASR field. In the Divide phase, multiple acoustic models are trained based upon different subsets of the complete speech data, while in the Merge phase two novel algorithms are utilized to generate a high-quality acoustic model based upon those trained on data subsets. We first propose the Genetic Merge Algorithm (GMA), which is a highly specialized algorithm for optimizing acoustic models but suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior performance. Extensive experiments on public data show that the proposed methods can significantly outperform the state-of-the-art.

Download Full-text

The Application of Probabilistic Neural Network in Speech Recognition Based on Partition Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.263-266.2173 ◽

2012 ◽

Vol 263-266 ◽

pp. 2173-2178

Author(s):

Xin Guang Li ◽

Min Feng Yao ◽

Li Rui Jian ◽

Zhen Jiang Li

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Clustering Algorithm ◽

Probabilistic Neural Network ◽

Back Propagation ◽

Back Propagation Neural Network ◽

Data Sets ◽

Data Set ◽

Proposed Model ◽

Partition Clustering

A probabilistic neural network (PNN) speech recognition model based on the partition clustering algorithm is proposed in this paper. The most important advantage of PNN is that training is easy and instantaneous. Therefore, PNN is capable of dealing with real time speech recognition. Besides, in order to increase the performance of PNN, the selection of data set is one of the most important issues. In this paper, using the partition clustering algorithm to select data is proposed. The proposed model is tested on two data sets from the field of spoken Arabic numbers, with promising results. The performance of the proposed model is compared to single back propagation neural network and integrated back propagation neural network. The final comparison result shows that the proposed model performs better than the other two neural networks, and has an accuracy rate of 92.41%.

Download Full-text