On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
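As a rough illustration of the non-deterministic tokenization described above, the following minimal sketch applies BPE merges to a single word while randomly skipping each candidate merge with probability `dropout`. The function name `bpe_encode`, the toy merge table `TOY_MERGES`, and the example word are assumptions made for illustration; they are not taken from the paper, whose exact implementation may differ.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = learned first).
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability `dropout`."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge (or everything was dropped)
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    print(bpe_encode("lower", TOY_MERGES))  # deterministic BPE
    print(bpe_encode("lower", TOY_MERGES, dropout=0.3, rng=random.Random(0)))
```

With `dropout=0.0` the segmentation is the usual deterministic BPE result; with `dropout=1.0` the word stays split into characters. Intermediate values yield varying segmentations of the same word across calls, which is the source of the augmentation effect.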


Measuring the acceptable word error rate of machine-generated webcast transcripts

10.21437/interspeech.2006-40 ◽

2006 ◽

Cited By ~ 1

Author(s):

Cosmin Munteanu ◽

Gerald Penn ◽

Ron Baecker ◽

Elaine Toms ◽

David James

Keyword(s):

Error Rate ◽

Word Error Rate


Improvements to the LIUM French ASR system based on CMU sphinx: what helps to significantly reduce the word error rate?

10.21437/interspeech.2009-607 ◽

2009 ◽

Author(s):

Paul Deléglise ◽

Yannick Estève ◽

Sylvain Meignier ◽

Teva Merlin

Keyword(s):

Error Rate ◽

Word Error Rate ◽

Asr System


Attention-Based Fully Gated CNN-BGRU for Russian Handwritten Text

Journal of Imaging ◽

10.3390/jimaging6120141 ◽

2020 ◽

Vol 6 (12) ◽

pp. 141

Author(s):

Abdelrahman Abdallah ◽

Mohamed Hamada ◽

Daniyar Nurseitov

Keyword(s):

Error Rate ◽

Handwriting Recognition ◽

Text Recognition ◽

P Value ◽

Word Error Rate ◽

Test Dataset ◽

Handwritten Text ◽

Proposed Model ◽

Handwritten Text Recognition ◽

Gated Recurrent Unit

This article considers the task of handwritten text recognition using attention-based encoder–decoder networks trained in the Kazakh and Russian languages. We have developed a novel deep neural network model based on a fully gated CNN, supported by multiple bidirectional gated recurrent unit (BGRU) and attention mechanisms to manipulate sophisticated features that achieve 0.045 Character Error Rate (CER), 0.192 Word Error Rate (WER), and 0.253 Sequence Error Rate (SER) for the first test dataset and 0.064 CER, 0.24 WER and 0.361 SER for the second test dataset. Our proposed model is the first work to handle handwriting recognition models in Kazakh and Russian languages. Our results confirm the importance of our proposed Attention-Gated-CNN-BGRU approach for training handwriting text recognition and indicate that it can lead to statistically significant improvements (p-value < 0.05) in the sensitivity (recall) over the test dataset. The proposed method’s performance was evaluated using handwritten text databases of three languages: English, Russian, and Kazakh. It demonstrates better results on the Handwritten Kazakh and Russian (HKR) dataset than the other well-known models.


Closed-Form Word Error Rate Analysis for Successive Interference Cancellation Decoders

IEEE Transactions on Wireless Communications ◽

10.1109/twc.2018.2875699 ◽

2018 ◽

Vol 17 (12) ◽

pp. 8256-8267 ◽

Cited By ~ 2

Author(s):

Jinming Wen ◽

Keyu Wu ◽

Chintha Tellambura ◽

Pingzhi Fan

Keyword(s):

Closed Form ◽

Error Rate ◽

Interference Cancellation ◽

Successive Interference Cancellation ◽

Word Error Rate ◽

Rate Analysis


Towards Automatic Error Analysis of Machine Translation Output

Computational Linguistics ◽

10.1162/coli_a_00072 ◽

2011 ◽

Vol 37 (4) ◽

pp. 657-688 ◽

Cited By ~ 26

Author(s):

Maja Popović ◽

Hermann Ney

Keyword(s):

Error Analysis ◽

Machine Translation ◽

Error Rate ◽

Human Error ◽

Translation System ◽

Specific Information ◽

Error Type ◽

Word Error Rate ◽

Advantages And Disadvantages ◽

Automatic Error

Evaluation and error analysis of machine translation output are important but difficult tasks. In this article, we propose a framework for automatic error analysis and classification based on the identification of actual erroneous words using the algorithms for computation of Word Error Rate (WER) and Position-independent word Error Rate (PER), which is just a very first step towards development of automatic evaluation measures that provide more specific information of certain translation problems. The proposed approach enables the use of various types of linguistic knowledge in order to classify translation errors in many different ways. This work focuses on one possible set-up, namely, on five error categories: inflectional errors, errors due to wrong word order, missing words, extra words, and incorrect lexical choices. For each of the categories, we analyze the contribution of various POS classes. We compared the results of automatic error analysis with the results of human error analysis in order to investigate two possible applications: estimating the contribution of each error type in a given translation output in order to identify the main sources of errors for a given translation system, and comparing different translation outputs using the introduced error categories in order to obtain more information about advantages and disadvantages of different systems and possibilities for improvements, as well as about advantages and disadvantages of applied methods for improvements. We used Arabic–English Newswire and Broadcast News and Chinese–English Newswire outputs created in the framework of the GALE project, several Spanish and English European Parliament outputs generated during the TC-Star project, and three German–English outputs generated in the framework of the fourth Machine Translation Workshop. We show that our results correlate very well with the results of a human error analysis, and that all our metrics except the extra words reflect well the differences between different versions of the same translation system as well as the differences between different translation systems.
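To make the two metrics named in this abstract concrete, here is a minimal sketch of WER (word-level edit distance divided by reference length) and of one common formulation of PER (a bag-of-words comparison that ignores word order). The function names and example sentences are assumptions for illustration, a non-empty reference is assumed, and the article may use a different PER variant.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent Error Rate (one common formulation)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Multiset intersection counts words matched regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "on the mat the cat sat"
    print(f"WER = {wer(ref, hyp):.3f}, PER = {per(ref, hyp):.3f}")
```

Because PER ignores positions, a hypothesis containing the right words in the wrong order scores a PER of zero while its WER stays above zero. That gap is the kind of signal that can point to word-order errors.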


Is a Wizard-of-Oz Required for Robot-Led Conversation Practice in a Second Language?

International Journal of Social Robotics ◽

10.1007/s12369-021-00849-8 ◽

2022 ◽

Author(s):

Olov Engwall ◽

José Lopes ◽

Ronald Cumbal

Keyword(s):

Second Language ◽

Speech Recognition ◽

Statistical Method ◽

Error Rate ◽

State Of The Art ◽

Autonomous Robot ◽

Language Learner ◽

Word Error Rate ◽

Wizard Of Oz ◽

Custom Made

The large majority of previous work on human-robot conversations in a second language has been performed with a human wizard-of-Oz. The reasons are that automatic speech recognition of non-native conversational speech is considered to be unreliable and that the dialogue management task of selecting robot utterances that are adequate at a given turn is complex in social conversations. This study therefore investigates if robot-led conversation practice in a second language with pairs of adult learners could potentially be managed by an autonomous robot. We first investigate how correct and understandable transcriptions of second language learner utterances are when made by a state-of-the-art speech recogniser. We find both a relatively high word error rate (41%) and that a substantial share (42%) of the utterances are judged to be incomprehensible or only partially understandable by a human reader. We then evaluate how adequate the robot utterance selection is, when performed manually based on the speech recognition transcriptions or autonomously using (a) predefined sequences of robot utterances, (b) a general state-of-the-art language model that selects utterances based on learner input or the preceding robot utterance, or (c) a custom-made statistical method that is trained on observations of the wizard’s choices in previous conversations. It is shown that adequate or at least acceptable robot utterances are selected by the human wizard in most cases (96%), even though the ASR transcriptions have a high word error rate. Further, the custom-made statistical method performs as well as manual selection of robot utterances based on ASR transcriptions. It was also found that the interaction strategy that the robot employed, which differed regarding how much the robot maintained the initiative in the conversation and if the focus of the conversation was on the robot or the learners, had marginal effects on the word error rate and understandability of the transcriptions but larger effects on the adequateness of the utterance selection. Autonomous robot-led conversations may hence work better with some robot interaction strategies.


Indigenuous Vocabulary Reformulation for Continuous Yorùbá Speech Recognition In M-Commerce Using Acoustic Nudging-Based Gaussian Mixture Model

10.21203/rs.3.rs-211622/v1 ◽

2021 ◽

Author(s):

Kehinde Lydia Ajayi ◽

Victor Azeta ◽

Isaac Odun-Ayo ◽

Ambrose Azeta ◽

Ajayi Peter Taiwo ◽

...

Keyword(s):

Speech Recognition ◽

Gaussian Mixture Model ◽

Mixture Model ◽

Error Rate ◽

System Performance ◽

Recognition Rate ◽

Gaussian Mixture ◽

Computer Applications ◽

Word Error Rate ◽

The Mean

One current research area is speech recognition, in which computer applications aid the recognition of speech signals. In this research paper, the Acoustic Nudging (AN) model is used to reformulate persistent automatic speech recognition (ASR) errors that involve the user’s acoustic irrational behavior, which alters speech recognition accuracy. A Gaussian Mixture Model (GMM) helped in addressing the low-resourced nature of the Yorùbá language to achieve better accuracy and system performance. From the simulated results, it is observed that the proposed Acoustic Nudging-based Gaussian Mixture Model (ANGM) improves accuracy and system performance, which are evaluated based on Word Recognition Rate (WRR) and Word Error Rate (WER) given by validation accuracy, testing accuracy, and training accuracy. The mean WRR achieved by the ANGM model is 95.277% and the mean WER is 4.723% when compared to existing models. This approach thereby reduces the error rate by 1.1%, 0.5%, 0.8%, 0.3%, and 1.4% when compared with other models. Therefore, this work establishes a foundation for advancing the current understanding of under-resourced languages and, at the same time, for developing an accurate and precise model for speech recognition.
