Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition

Author(s):  
Jinxi Guo ◽  
Gautam Tiwari ◽  
Jasha Droppo ◽  
Maarten Van Segbroeck ◽  
Che-Wei Huang ◽  
...  


Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 3063
Author(s):  
Aleksandr Laptev ◽  
Andrei Andrusenko ◽  
Ivan Podluzhny ◽  
Anton Mitrofanov ◽  
Ivan Medennikov ◽  
...  

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to run directly on devices has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems, as they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly means handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the tokens' contexts and to regularize their distribution for the model's recognition of unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, close to the best published multilingual system.
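The core idea of BPE-dropout, non-deterministic tokenization that exposes the model to multiple segmentations of the same word, can be illustrated with a toy tokenizer. This is a minimal sketch, assuming a hypothetical ordered merge table; the real method operates on a learned BPE vocabulary:

```python
import random

def bpe_dropout_encode(word, merges, p_drop=0.1, rng=None):
    """Tokenize a word with BPE, skipping each eligible merge with
    probability p_drop (the BPE-dropout regularization).  `merges` is an
    ordered list of symbol pairs, highest priority first."""
    rng = rng or random.Random()
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Find the highest-priority adjacent pair that survives dropout.
        best = None
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            r = rank.get(pair)
            if r is not None and rng.random() >= p_drop:
                if best is None or r < best[0]:
                    best = (r, i)
        if best is None:
            return symbols
        _, i = best
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
```

With `p_drop=0` this reduces to ordinary deterministic BPE; with `p_drop>0` repeated calls yield different segmentations of the same word, which is exactly the context-extending regularization the abstract describes.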


2021 ◽  
Author(s):  
Zhong Meng ◽  
Yu Wu ◽  
Naoyuki Kanda ◽  
Liang Lu ◽  
Xie Chen ◽  
...  

2019 ◽  
Vol 8 (2S11) ◽  
pp. 2350-2352

The dissimilarity between recognized word sequences and their ground truth in different channels can be absorbed by an automatic speech recognition system, for which the standard evaluation metric is the word error rate (WER). In the 1ch track, the model is trained without any preprocessing; studies on multichannel end-to-end automatic speech recognition have shown that the enhancement function can be integrated into a deep neural network (DNN)-based system, leading to multiple experimental results. Moreover, because WER is not directly differentiable, it is pertinent to adopt an encoder-decoder gradient objective function, as demonstrated in the CHiME-4 system. In this study, we examine whether a sequence-level evaluation metric is a fair choice for optimizing an encoder-decoder model, for which many training algorithms are designed to reduce sequence-level error. The study incorporates the scoring of multiple hypotheses in the decoding stage to improve the decoding result toward the optimum. In this way, the mismatch between the training and evaluation objectives is reduced as far as feasible. Hence, the study finds that the resulting voice recognition is most effective for adaptation.
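The word error rate referred to throughout these abstracts is the word-level Levenshtein distance (substitutions + deletions + insertions) normalized by the reference length. A minimal reference implementation; the function name is illustrative:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that the minimum over the three edit operations is what makes the metric non-differentiable, which is why sequence-level training objectives approximate it rather than optimize it directly.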


2019 ◽  
Vol 9 (21) ◽  
pp. 4639 ◽  
Author(s):  
Long Wu ◽  
Ta Li ◽  
Li Wang ◽  
Yonghong Yan

As demonstrated in the hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective at solving the misalignment problem in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to the hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism into the CTC branch. "Time-restricted" means that the attention mechanism is conducted over a limited window of frames to the left and right. In this study, we first explore time-restricted location-aware attention CTC/Attention, establishing the proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce the time-restricted self-attention CTC/Attention, which can better model long-range dependencies among frames. Experiments on the Wall Street Journal (WSJ), Augmented Multi-party Interaction (AMI), and Switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we jointly train a neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend on the CHiME-4 dataset. The reduction in word error rate (WER) and the increase in perceptual evaluation of speech quality (PESQ) confirm the effectiveness of this framework.
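The "time-restricted" mechanism described above limits each frame's attention to a fixed window of neighboring frames. A minimal pure-Python sketch of scaled dot-product attention with such a window; the vector dimensions and default window size are illustrative, not the paper's configuration:

```python
import math

def time_restricted_attention(queries, keys, values, window=2):
    """Self-attention where frame t attends only to frames in
    [t - window, t + window] (the 'time-restricted' window).
    queries/keys/values: lists of equal-length float vectors."""
    dim = len(keys[0])
    out = []
    for t, q in enumerate(queries):
        lo, hi = max(0, t - window), min(len(keys), t + window + 1)
        # Scaled dot-product scores against keys inside the window only.
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
                  for s in range(lo, hi)]
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Context vector: weighted sum of the windowed values.
        ctx = [sum(w * values[lo + s][d] for s, w in enumerate(weights))
               for d in range(dim)]
        out.append(ctx)
    return out
```

With `window=0` each frame attends only to itself and the output equals the value sequence; enlarging the window trades locality (the CTC-friendly hard alignment) for longer-range context.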


Author(s):  
Chu-Xiong Qin ◽  
Wen-Lin Zhang ◽  
Dan Qu

Abstract A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model can impose additional restrictions on alignments. To better explore end-to-end models, we propose improvements to the feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attention and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) only 0.2% worse in absolute value than the best referenced method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
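The joint CTC-attention objective mentioned in this abstract and the previous one is conventionally a linear interpolation of the two losses, with the same interpolation reused to rank hypotheses at decoding time. A minimal sketch; the weight `lam` is an assumed hyperparameter, not a value reported here:

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.3):
    """Multi-task objective of hybrid CTC/attention training:
    L = lam * L_ctc + (1 - lam) * L_att, with lam in [0, 1].
    The CTC term restricts alignments to be monotonic."""
    assert 0.0 <= lam <= 1.0
    return lam * ctc_loss + (1.0 - lam) * attention_loss

def joint_decoding_score(ctc_logprob, att_logprob, lam=0.3):
    """At decoding time the same interpolation (over log-probabilities)
    scores partial hypotheses during beam search."""
    return lam * ctc_logprob + (1.0 - lam) * att_logprob
```

Setting `lam=0` recovers a pure attention model, `lam=1` a pure CTC model; the hybrid sits in between, which is how the extra CTC loss "forces restrictions on alignments" without discarding the attention decoder.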


Author(s):  
Olov Engwall ◽  
José Lopes ◽  
Ronald Cumbal

Abstract The large majority of previous work on human-robot conversations in a second language has been performed with a human Wizard-of-Oz. The reasons are that automatic speech recognition of non-native conversational speech is considered unreliable, and that the dialogue management task of selecting robot utterances that are adequate at a given turn is complex in social conversations. This study therefore investigates whether robot-led conversation practice in a second language with pairs of adult learners could potentially be managed by an autonomous robot. We first investigate how correct and understandable transcriptions of second-language learner utterances are when made by a state-of-the-art speech recogniser. We find both a relatively high word error rate (41%) and that a substantial share (42%) of the utterances are judged incomprehensible or only partially understandable by a human reader. We then evaluate how adequate the robot utterance selection is when performed manually based on the speech recognition transcriptions, or autonomously using (a) predefined sequences of robot utterances, (b) a general state-of-the-art language model that selects utterances based on learner input or the preceding robot utterance, or (c) a custom-made statistical method trained on observations of the wizard's choices in previous conversations. It is shown that adequate or at least acceptable robot utterances are selected by the human wizard in most cases (96%), even though the ASR transcriptions have a high word error rate. Further, the custom-made statistical method performs as well as manual selection of robot utterances based on ASR transcriptions. It was also found that the interaction strategy that the robot employed, which differed in how much the robot maintained the initiative in the conversation and whether the focus of the conversation was on the robot or the learners, had marginal effects on the word error rate and understandability of the transcriptions but larger effects on the adequacy of the utterance selection. Autonomous robot-led conversations may hence work better with some robot interaction strategies.


2021 ◽  
Author(s):  
Kehinde Lydia Ajayi ◽  
Victor Azeta ◽  
Isaac Odun-Ayo ◽  
Ambrose Azeta ◽  
Ajayi Peter Taiwo ◽  
...  

Abstract Speech recognition, which aids the recognition of speech signals through computer applications, is a current research area. In this research paper, the Acoustic Nudging (AN) model is used to reformulate persistent automatic speech recognition (ASR) errors involving users' irrational acoustic behavior, which alters speech recognition accuracy. A Gaussian Mixture Model (GMM) helped address the low-resourced attributes of the Yorùbá language to achieve better accuracy and system performance. From the simulated results, it is observed that the proposed Acoustic Nudging-based Gaussian Mixture Model (ANGM) improves accuracy and system performance, evaluated by Word Recognition Rate (WRR) and Word Error Rate (WER) across validation, testing, and training accuracy. Compared to existing models, the mean WRR achieved by the ANGM model is 95.277% and the mean WER is 4.723%. This approach thereby reduces the error rate by 1.1%, 0.5%, 0.8%, 0.3%, and 1.4% compared with other models. This work therefore lays a foundation for advancing the current understanding of under-resourced languages while developing an accurate and precise model for speech recognition.
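The reported WRR and WER figures are complements of one another (95.277% + 4.723% = 100%); a one-line consistency check under that convention, with the function name being illustrative:

```python
def wrr_from_wer(wer_percent):
    """Word Recognition Rate as the complement of Word Error Rate (in %)."""
    return 100.0 - wer_percent

# The abstract's figures are mutually consistent:
assert abs(wrr_from_wer(4.723) - 95.277) < 1e-9
```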


2020 ◽  
Vol 2 (2) ◽  
pp. 7-13
Author(s):  
Andi Nasri

With the continued development of speech recognition technology, various software aimed at making it easier for deaf people to communicate with others has been developed. Such systems translate spoken utterances into sign language or, conversely, translate sign language into speech. They have been developed for various languages, including English, Arabic, Spanish, Mexican Spanish, and Indonesian. For Indonesian in particular, researchers have begun attempting to build such systems, but the systems built so far are limited by their automatic speech recognition (ASR) component, which has a restricted vocabulary. This study aims to develop a system that translates spoken Indonesian into the Indonesian Sign Language System (SIBI) using a larger corpus and continuous speech recognition to improve system accuracy. Testing shows an average accuracy of 90.50% and a Word Error Rate (WER) of 9.50%. This accuracy is higher than that of the second study (48.75%) and the first study (66.67%). The system can also recognize continuously spoken words, i.e., full sentences. Performance testing shows 0.83 seconds for speech-to-text and 8.25 seconds for speech-to-sign.

