Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

Author(s):  
Zhong Meng ◽  
Yu Wu ◽  
Naoyuki Kanda ◽  
Liang Lu ◽  
Xie Chen ◽  
...  
Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 3063
Author(s):  
Aleksandr Laptev ◽  
Andrei Andrusenko ◽  
Ivan Podluzhny ◽  
Anton Mitrofanov ◽  
Ivan Medennikov ◽  
...  

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to run directly on the device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems, as they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly amounts to handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the tokens' contexts and to regularize their distribution for the model's recognition of unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer achieved a competitive result of 22.2% character error rate (CER) and 38.9% WER, close to the best published multilingual system.
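
The core mechanism here, BPE-dropout, can be illustrated with a short sketch. Below is a minimal example assuming a pre-trained SentencePiece BPE model; the model path, sample utterance, and dropout value alpha=0.1 are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of dynamic acoustic-unit augmentation via BPE-dropout,
# assuming a pre-trained SentencePiece BPE model (the path is hypothetical).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="turkish_bpe.model")  # hypothetical model

utterance = "merhaba dünya"

# Deterministic segmentation (standard BPE): always the same tokenization.
print(sp.encode(utterance, out_type=str))

# Non-deterministic segmentation: with a BPE model, enable_sampling=True
# applies BPE-dropout, where alpha is the probability of skipping a merge.
# Each call may yield a different tokenization of the same utterance,
# which is what regularizes the token distribution during training.
for _ in range(3):
    print(sp.encode(utterance, out_type=str, enable_sampling=True, alpha=0.1))
```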


Author(s):  
Vincent Elbert Budiman ◽  
Andreas Widjaja

Here, the development of an acoustic and language model is presented. A low Word Error Rate is an early sign of a good language and acoustic model. Although there are parameters other than Word Error Rate, our work focused on building a Bahasa Indonesia model with approximately 2000 common words, and it achieved the minimum threshold of 25% Word Error Rate. Several experiments were conducted with different cases, training data, and testing data, with Word Error Rate and Testing Ratio as the main comparison. The language and acoustic models were built using Sphinx4 from Carnegie Mellon University, with a Hidden Markov Model for the acoustic model and an ARPA model for the language model. The model configurations, namely Beam Width and Force Alignment, directly correlate with Word Error Rate. The configurations were set to 1e-80 for Beam Width and 1e-60 for Force Alignment to prevent underfitting or overfitting of the acoustic model. The goals of this research are to build continuous speech recognition in Bahasa Indonesia with a low Word Error Rate and to determine the optimal amounts of training and testing data that minimize the Word Error Rate.
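
Since Word Error Rate is the central metric throughout this work, a minimal reference implementation may be useful. The following is the standard Levenshtein-distance formulation of WER, written from the common definition rather than taken from the paper.

```python
# Standard Word Error Rate: word-level Levenshtein distance between the
# reference and the hypothesis, normalized by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word out of four: 0.25, i.e. exactly the 25% WER threshold.
print(wer("saya makan nasi goreng", "saya makan nasi"))
```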


Author(s):  
Jinxi Guo ◽  
Gautam Tiwari ◽  
Jasha Droppo ◽  
Maarten Van Segbroeck ◽  
Che-Wei Huang ◽  
...  

2019 ◽  
Vol 8 (2S11) ◽  
pp. 2350-2352

The dissimilarity between recognized word sequences and their ground truth across different channels can be measured with Word Error Rate, the standard evaluation metric for Automatic Speech Recognition. In the 1ch track, the model is trained without any preprocessing, and studies on multichannel end-to-end Automatic Speech Recognition have shown that the multichannel function can be integrated into a Deep Neural Network based system, yielding multiple experimental results. Moreover, since the Word Error Rate (WER) is not directly differentiable, it is pertinent to adopt a sequence-level objective for the Encoder-Decoder model, as demonstrated in the CHiME-4 system. In this study, we argue that a sequence-level evaluation metric is a fair choice for optimizing the Encoder-Decoder model, for which many training algorithms are designed to reduce sequence-level error. The study incorporates the scoring of multiple hypotheses in the decoding stage to push the decoding result toward the optimum. In this way, the mismatch between the training objective and the evaluation metric is reduced as far as feasible. Hence, the study obtains voice recognition results that are most effective for adaptation.
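
The sequence-level objective alluded to here is commonly realized as minimum word error rate (MWER) training over an n-best list of hypotheses. A minimal PyTorch sketch of that idea follows, assuming the hypothesis log-probabilities and per-hypothesis word-error counts are already computed; all names and values are illustrative.

```python
# Minimal sketch of minimum word error rate (MWER) training over an n-best
# list from an encoder-decoder model; inputs are assumed precomputed.
import torch

def mwer_loss(hyp_log_probs: torch.Tensor, word_errors: torch.Tensor) -> torch.Tensor:
    """hyp_log_probs: (N,) sequence log-probabilities of N hypotheses.
    word_errors: (N,) word-error counts of each hypothesis vs. the reference."""
    # Renormalize over the n-best list so hypotheses compete with each other.
    probs = torch.softmax(hyp_log_probs, dim=0)
    # Subtracting the mean error is a common variance-reduction baseline.
    relative_errors = word_errors - word_errors.mean()
    # Expected relative word errors; minimizing this shifts probability
    # mass toward hypotheses with fewer word errors.
    return torch.sum(probs * relative_errors)

# Example: 3 hypotheses with 1, 0, and 2 word errors respectively.
log_p = torch.tensor([-3.2, -3.5, -4.0], requires_grad=True)
errs = torch.tensor([1.0, 0.0, 2.0])
loss = mwer_loss(log_p, errs)
loss.backward()  # gradients favor the error-free hypothesis
```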


2021 ◽  
pp. 1-13
Author(s):  
Hamzah A. Alsayadi ◽  
Abdelaziz A. Abdelhamid ◽  
Islam Hegazy ◽  
Zaki T. Fayed

The Arabic language has a set of sound marks called diacritics, and these diacritics play an essential role in the meaning of words and their articulation. A change in some diacritics leads to a change in the context of the sentence. However, the existence of these marks in the corpus transcription affects the accuracy of speech recognition. In this paper, we investigate the effect of diacritics on Arabic speech recognition based on end-to-end deep learning. The applied end-to-end approach includes a CNN-LSTM and an attention-based technique implemented in the state-of-the-art framework Espresso, using PyTorch. In addition, to the best of our knowledge, the CNN-LSTM with attention approach has not previously been used for Arabic automatic speech recognition (ASR). To fill this gap, this paper proposes a new approach based on a CNN-LSTM with attention for Arabic ASR. The language model in this approach is trained using RNN-LM and LSTM-LM on the non-diacritized transcription of the speech corpus. The Standard Arabic Single Speaker Corpus (SASSC), after omitting the diacritics, is used to train and test the deep learning model. Experimental results show that the removal of diacritics decreased the out-of-vocabulary rate and the perplexity of the language model. In addition, the word error rate (WER) is significantly improved when compared to diacritized data. The achieved average reduction in WER is 13.52%.
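
The diacritics-removal preprocessing step can be sketched in a few lines. The snippet below strips the common Arabic diacritics (tashkeel) using their Unicode range; the exact character set used by the authors is not specified in the abstract, so this range is an assumption.

```python
# Sketch of removing Arabic diacritics (tashkeel) from transcriptions.
# U+064B..U+0652 covers the common harakat (fathatan .. sukun); U+0670 is
# the superscript alef. The paper's exact set is unknown; this is assumed.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def remove_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

print(remove_diacritics("كَتَبَ"))  # fully diacritized "kataba" -> "كتب"
```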


Author(s):  
Zhong Meng ◽  
Sarangarajan Parthasarathy ◽  
Eric Sun ◽  
Yashesh Gaur ◽  
Naoyuki Kanda ◽  
...  

2021 ◽  
Vol 336 ◽  
pp. 06016
Author(s):  
Taiben Suan ◽  
Rangzhuoma Cai ◽  
Zhijie Cai ◽  
Ba Zu ◽  
Baojia Gong

We built a language model based on the Transformer network architecture, using attention mechanisms to dispense with recurrence and convolutions entirely. Through the transliteration of Tibetan into the International Phonetic Alphabet (IPA), the language model was trained using the syllables and phonemes of Tibetan words as modeling units, predicting the corresponding Tibetan sentences from the contextual semantics of the IPA. Combined with an acoustic model, the resulting Tibetan speech recognition system was compared with end-to-end Tibetan speech recognition.
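
A compact sketch of such an attention-only language model is given below, using PyTorch's built-in Transformer encoder with a causal mask. All dimensions and the vocabulary size are illustrative; the paper's actual modeling units (Tibetan syllables and IPA phonemes) would supply the token inventory.

```python
# Compact sketch of an attention-only Transformer language model of the
# kind described above; all sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    def __init__(self, vocab_size=2000, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        # Causal mask (True = blocked) so each position sees only earlier tokens.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device),
            diagonal=1,
        )
        h = self.encoder(x, mask=causal)
        return self.out(h)  # (batch, seq_len, vocab_size) next-token logits

lm = TransformerLM()
logits = lm(torch.randint(0, 2000, (2, 16)))  # e.g. batch of 2, 16 tokens each
```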


2019 ◽  
Vol 9 (21) ◽  
pp. 4639 ◽  
Author(s):  
Long Wu ◽  
Ta Li ◽  
Li Wang ◽  
Yonghong Yan

As demonstrated in the hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective at solving the misalignment problem in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to the hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism into the CTC branch. "Time-restricted" means that the attention mechanism operates on a limited window of frames to the left and right. In this study, we first explore time-restricted location-aware attention CTC/Attention, establishing the proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce the time-restricted self-attention CTC/Attention, which can better model long-range dependencies among the frames. Experiments on the Wall Street Journal (WSJ), Augmented Multi-party Interaction (AMI), and Switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we jointly train a neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend on the CHiME-4 dataset. The reduction in word error rate (WER) and the increase in perceptual evaluation of speech quality (PESQ) confirm the effectiveness of this framework.
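
The "time-restricted" idea reduces to a banded attention mask: each frame attends only to frames within a fixed window on either side. The following sketch shows that masking in plain scaled dot-product attention; the shapes and window size are illustrative, not the paper's configuration.

```python
# Sketch of a time-restricted (banded) self-attention mask: each frame may
# attend only to frames within `window` steps on either side. Sizes are
# illustrative; this shows the masking idea, not the paper's exact model.
import torch
import torch.nn.functional as F

def time_restricted_attention(q, k, v, window: int):
    """q, k, v: (batch, frames, dim); attention limited to +/- window frames."""
    t = q.size(1)
    idx = torch.arange(t, device=q.device)
    # allowed[i, j] is True iff |i - j| <= window (a band around the diagonal).
    allowed = (idx[None, :] - idx[:, None]).abs() <= window
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (batch, t, t)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 32)
out = time_restricted_attention(q, k, v, window=3)  # (2, 10, 32)
```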

