Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 3063
Author(s):  
Aleksandr Laptev ◽  
Andrei Andrusenko ◽  
Ivan Podluzhny ◽  
Anton Mitrofanov ◽  
Ivan Medennikov ◽  
...  

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to run directly on devices has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems, as they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly means handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the tokens’ contexts and to regularize their distribution, helping the model recognize unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to BPE-dropout, our monolingual Turkish Conformer achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, close to the best published multilingual system.
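The core mechanism is easy to sketch. Below is a minimal, self-contained illustration of BPE-dropout segmentation in Python; the merge table, dropout rate, and the `bpe_dropout_tokenize` helper are hypothetical stand-ins for the paper's actual pipeline, which applies this on the fly during training rather than as a standalone function.

```python
import random

def bpe_dropout_tokenize(word, merges, p_drop=0.1, rng=random):
    """Segment `word` with BPE while randomly skipping merges.

    `merges` is the usual BPE merge table: (left, right) pairs in
    training-priority order. Each applicable merge occurrence is
    dropped with probability `p_drop`, so repeated calls yield
    different subword segmentations of the same word -- the
    regularization effect described in the abstract. p_drop=0
    recovers deterministic BPE.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from characters
    while len(tokens) > 1:
        # collect applicable merges; each survives dropout independently
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= p_drop
        ]
        if not candidates:
            break  # nothing survived this step: keep the finer split
        _, i = min(candidates)  # apply the highest-priority survivor
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
for _ in range(3):
    print(bpe_dropout_tokenize("lower", merges, p_drop=0.3))
# e.g. ['lower'], ['low', 'er'], ['lo', 'w', 'e', 'r'] across calls
```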

2021 ◽  
Author(s):  
Zhong Meng ◽  
Yu Wu ◽  
Naoyuki Kanda ◽  
Liang Lu ◽  
Xie Chen ◽  
...  

Author(s):  
Jinxi Guo ◽  
Gautam Tiwari ◽  
Jasha Droppo ◽  
Maarten Van Segbroeck ◽  
Che-Wei Huang ◽  
...  

Author(s):  
Siqing Qin ◽  
Longbiao Wang ◽  
Sheng Li ◽  
Jianwu Dang ◽  
Lixin Pan

Abstract Conventional automatic speech recognition (ASR) and emerging end-to-end (E2E) speech recognition have achieved promising results when provided with sufficient resources. However, for low-resource languages, ASR remains challenging. The Lhasa dialect is the most widespread Tibetan dialect and has a wealth of speakers and transcriptions. Hence, it is meaningful to apply ASR techniques to the Lhasa dialect for historical heritage protection and cultural exchange. Previous work on Tibetan speech recognition focused on selecting phone-level acoustic modeling units and incorporating tonal information but underestimated the influence of limited data. The purpose of this paper is to improve the speech recognition performance of the low-resource Lhasa dialect by adopting multilingual speech recognition technology on the E2E structure within a transfer learning framework. Using transfer learning, we first establish monolingual E2E ASR systems for the Lhasa dialect initialized from different source languages, to compare the positive effect of each source language on the Tibetan ASR model. We further propose a multilingual E2E ASR system that combines initialization strategies with different source languages and multilevel units, a combination proposed here for the first time. Our experiments show that the proposed ASR system outperforms the E2E baseline system. The proposed method effectively models the low-resource Lhasa dialect and achieves a 14.2% relative improvement in character error rate (CER) over DNN-HMM systems. Moreover, from the best monolingual E2E model to the best multilingual E2E model of the Lhasa dialect, CER improved by a further 8.4%.
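The transfer-learning initialization described here (pretrain on a source language, carry the shared parameters over, re-initialize the language-specific output layer) can be sketched in a few lines of PyTorch. The model class, layer sizes, checkpoint name, and vocabulary sizes below are hypothetical placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class E2EASR(nn.Module):
    """Toy encoder + output head standing in for an E2E ASR model."""
    def __init__(self, feat_dim, hidden, vocab_size):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               batch_first=True)
        self.output = nn.Linear(hidden, vocab_size)

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        return self.output(enc)

# hypothetical checkpoint trained on a resource-rich source language
source = E2EASR(feat_dim=80, hidden=512, vocab_size=5000)
# source.load_state_dict(torch.load("source_lang.pt"))  # if available

# target model for the Lhasa dialect with its own (multilevel) unit set
target = E2EASR(feat_dim=80, hidden=512, vocab_size=800)

# transfer: copy every parameter whose shape matches (the shared
# encoder), leaving the language-specific output layer freshly
# initialized for the new unit inventory
src_state = source.state_dict()
tgt_state = target.state_dict()
transferred = {k: v for k, v in src_state.items()
               if k in tgt_state and v.shape == tgt_state[k].shape}
tgt_state.update(transferred)
target.load_state_dict(tgt_state)
print(f"transferred {len(transferred)}/{len(tgt_state)} tensors")
```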


2019 ◽  
Vol 8 (2S11) ◽  
pp. 2350-2352

Word error rate (WER) is the standard evaluation metric for automatic speech recognition (ASR), capturing the dissimilarity between recognized word sequences and their ground truth across channels. In the 1ch track, the model is trained without any front-end preprocessing; studies on multichannel end-to-end ASR have shown that this front-end function can be integrated into a deep neural network (DNN)-based system, yielding a range of experimental results. Because WER is not directly differentiable, the encoder-decoder model must be trained with a differentiable gradient objective, as demonstrated in the CHiME-4 system. In this study, we argue that a sequence-level evaluation metric is a fair choice for optimizing encoder-decoder models, for which many training algorithms are designed to reduce sequence-level error. The study incorporates the scoring of multiple hypotheses at the decoding stage to improve the decoding result. This reduces the mismatch between the training objective and the evaluation metric as far as is feasible. The resulting voice recognition system proves most effective for adaptation.
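The sequence-level objective this abstract alludes to is commonly realized as minimum word error rate (MWER) training over an N-best list: since WER itself is not differentiable, one minimizes the expected error under the model's renormalized hypothesis distribution. A minimal PyTorch sketch under that reading, with made-up scores and error counts:

```python
import torch

def mwer_loss(hyp_scores, hyp_errors):
    """Expected word-error loss over an N-best list (MWER training).

    hyp_scores: model log-scores for N hypotheses, shape (N,)
    hyp_errors: word-error counts of each hypothesis vs. the reference
    The loss is the expected (mean-subtracted) error under the
    renormalized hypothesis distribution; gradients flow through the
    scores, pushing probability mass toward low-error hypotheses.
    """
    probs = torch.softmax(hyp_scores, dim=0)
    errors = torch.as_tensor(hyp_errors, dtype=probs.dtype)
    errors = errors - errors.mean()   # baseline for variance reduction
    return torch.sum(probs * errors)

# toy usage: 4 hypotheses with their edit-distance error counts
scores = torch.tensor([2.1, 1.7, 0.3, -0.5], requires_grad=True)
loss = mwer_loss(scores, [0, 1, 2, 3])
loss.backward()
print(loss.item(), scores.grad)
```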


2019 ◽  
Vol 9 (21) ◽  
pp. 4639 ◽  
Author(s):  
Long Wu ◽  
Ta Li ◽  
Li Wang ◽  
Yonghong Yan

As demonstrated in the hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective at solving the misalignment problem in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to a hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism into the CTC branch. “Time-restricted” means that attention is computed over a limited window of frames to the left and right. In this study, we first explore time-restricted location-aware attention CTC/Attention, establishing the proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce time-restricted self-attention CTC/Attention, which better models long-range dependencies among frames. Experiments on the Wall Street Journal (WSJ), Augmented Multi-party Interaction (AMI), and Switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we combine a trained neural beamformer front end with the time-restricted attention CTC/Attention ASR back end on the CHiME-4 dataset. The reduction in word error rate (WER) and the increase in perceptual evaluation of speech quality (PESQ) confirm the effectiveness of this framework.
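Mechanically, time-restricted self-attention reduces to masking the attention score matrix outside a band of ±w frames around each position. A single-head, projection-free sketch follows; the window size and feature dimensions are illustrative, not the paper's tuned values.

```python
import torch

def time_restricted_self_attention(x, w=8):
    """Self-attention where each frame attends only to frames within
    `w` steps to its left and right (a banded attention mask).
    x: (T, d) encoder frames; one head, no learned projections,
    purely to show the masking."""
    T, d = x.shape
    q, k, v = x, x, x
    scores = q @ k.t() / d ** 0.5              # (T, T) raw scores
    idx = torch.arange(T)
    mask = (idx[:, None] - idx[None, :]).abs() > w
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v   # banded context vectors

frames = torch.randn(100, 64)   # 100 frames of 64-dim features
out = time_restricted_self_attention(frames, w=8)
print(out.shape)                # torch.Size([100, 64])
```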


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 179 ◽  
Author(s):  
Chongchong Yu ◽  
Yunbing Chen ◽  
Yueqiao Li ◽  
Meng Kang ◽  
Shixuan Xu ◽  
...  

To rescue and preserve an endangered language, this paper studied an end-to-end speech recognition model based on sample transfer learning for the low-resource Tujia language. At the Tujia-language International Phonetic Alphabet (IPA) label layer, using a Chinese corpus as an extension of the Tujia language effectively solves the problem of the insufficient Tujia corpus, yielding a cross-language corpus and an IPA dictionary unified between the Chinese and Tujia languages. A convolutional neural network (CNN) and a bi-directional long short-term memory (BiLSTM) network were used to extract cross-language acoustic features and train shared hidden-layer weights on the Tujia and Chinese phonetic corpora. In addition, the automatic speech recognition function for the Tujia language was realized using an end-to-end method consisting of symmetric encoding and decoding. Furthermore, transfer learning was used to establish the cross-language end-to-end Tujia language recognition system. The experimental results showed that the recognition error rate of the proposed model is 46.19%, which is 2.11% lower than that of a model trained only on Tujia language data. Therefore, this approach is feasible and effective.
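The shared-hidden-layer idea can be sketched as a single CNN + BiLSTM acoustic model trained on the pooled Chinese and Tujia data, with both corpora mapped to the unified IPA inventory so that one output layer serves both languages. All sizes below (feature dimension, IPA inventory, layer widths) are hypothetical, and the CTC output head is a stand-in for the paper's symmetric encoder-decoder.

```python
import torch
import torch.nn as nn

class SharedAcousticModel(nn.Module):
    """CNN + BiLSTM front end shared across Chinese and Tujia,
    emitting a unified IPA symbol set (hypothetical sizes)."""
    def __init__(self, feat_dim=40, n_ipa=120):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                # halve the feature axis
        )
        self.blstm = nn.LSTM(32 * (feat_dim // 2), 256, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.ipa_out = nn.Linear(512, n_ipa + 1)  # +1 for the CTC blank

    def forward(self, feats):                     # feats: (B, T, F)
        x = self.cnn(feats.unsqueeze(1))          # (B, 32, T, F/2)
        B, C, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * F)
        x, _ = self.blstm(x)                      # (B, T, 512)
        return self.ipa_out(x).log_softmax(-1)    # ready for CTC loss

feats = torch.randn(4, 200, 40)   # a pooled Chinese+Tujia batch
log_probs = SharedAcousticModel()(feats)
print(log_probs.shape)            # torch.Size([4, 200, 121])
```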


Author(s):  
Chu-Xiong Qin ◽  
Wen-Lin Zhang ◽  
Dan Qu

Abstract A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. This hybrid end-to-end architecture adds an extra CTC loss to the attention-based model, imposing additional restrictions on alignments. To better exploit end-to-end models, we propose improvements to the feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attention and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) only 0.2% absolute worse than that of the best referenced method, which is trained on a much larger dataset, and it beats all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable to the state-of-the-art end-to-end system in WER.
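One plausible reading of the hybrid attention mechanism (multi-head attention whose scores are calculated over multi-level outputs) is a decoder that attends to two encoder depths and merges the resulting contexts, as sketched below. How the paper actually combines the levels is an assumption here, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiLevelAttention(nn.Module):
    """Decoder attention computed over two encoder levels and merged,
    loosely following the abstract's multi-level, multi-head idea."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn_low = nn.MultiheadAttention(d_model, n_heads,
                                              batch_first=True)
        self.attn_high = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, query, enc_low, enc_high):
        # attend separately to an intermediate and the final encoder
        # layer, then fuse the two context vectors
        ctx_low, _ = self.attn_low(query, enc_low, enc_low)
        ctx_high, _ = self.attn_high(query, enc_high, enc_high)
        return self.merge(torch.cat([ctx_low, ctx_high], dim=-1))

dec_state = torch.randn(2, 1, 256)   # (batch, 1 decoding step, d_model)
enc_l4 = torch.randn(2, 120, 256)    # intermediate encoder layer output
enc_l8 = torch.randn(2, 120, 256)    # final encoder layer output
ctx = MultiLevelAttention()(dec_state, enc_l4, enc_l8)
print(ctx.shape)                     # torch.Size([2, 1, 256])
```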

