Extremely Low Footprint End-to-End ASR System for Smart Device

Author(s): Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, ...
Sensors, 2021, Vol. 21 (9), pp. 3063
Author(s): Aleksandr Laptev, Andrei Andrusenko, Ivan Podluzhny, Anton Mitrofanov, Ivan Medennikov, ...

With the rapid development of speech assistants, adapting server-oriented automatic speech recognition (ASR) solutions for direct on-device use has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems because they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly involves handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the tokens' contexts and to regularize their distribution, helping the model recognize unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer achieves a competitive result of 22.2% character error rate (CER) and 38.9% WER, close to the best published multilingual system.
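The core idea of BPE-dropout is that each applicable merge in ordinary BPE tokenization is skipped with some probability p, so the same word is segmented differently across training passes. A minimal toy sketch (the merge table below is invented for illustration, not taken from the paper):

```python
import random

# Toy BPE-dropout: each applicable merge is skipped with probability p,
# so one word yields varied subword segmentations across epochs.
# This hypothetical merge table covers the example word "lower".
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_dropout_tokenize(word, p=0.1, rng=random):
    tokens = list(word)  # start from characters
    for left, right in MERGES:
        i = 0
        while i < len(tokens) - 1:
            # apply the merge unless it is dropped with probability p
            if tokens[i] == left and tokens[i + 1] == right and rng.random() >= p:
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

# With p=0 this reduces to deterministic BPE:
print(bpe_dropout_tokenize("lower", p=0.0))  # ['lower']
# With p>0 segmentations vary, e.g. ['low', 'e', 'r'] or ['lo', 'w', 'er'],
# which regularizes the subword distribution as described above.
```

With p=1 no merges apply and the word falls back to characters, which is why the technique also softens the dependence on vocabulary size.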


2020
Author(s): Abhinav Garg, Gowtham P. Vadisetti, Dhananjaya Gowda, Sichen Jin, Aditya Jayasimha, ...
Keyword(s):  

Sensors, 2020, Vol. 20 (7), pp. 1809
Author(s): Long Zhang, Ziping Zhao, Chunmei Ma, Linlin Shan, Huazhi Sun, ...

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then applied to the APED task for Mandarin, with good results. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that in terms of accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and performs better on the F-measure metric, which is especially suitable for the requirements of the APED task.
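The hybrid CTC/attention objective interpolates the two branch losses with a weight λ, i.e. L = λ·L_CTC + (1−λ)·L_att. A minimal sketch of that interpolation, plus one possible adaptive weighting rule (the adaptive scheme below is an assumption for illustration, not necessarily the paper's exact rule):

```python
# Sketch of the hybrid CTC/attention objective: the two branch losses
# are interpolated with a weight lam. All names here are illustrative.

def hybrid_loss(ctc_loss, att_loss, lam=0.3):
    """Fixed-weight interpolation: L = lam * L_ctc + (1 - lam) * L_att."""
    return lam * ctc_loss + (1.0 - lam) * att_loss

def adaptive_lambda(ctc_loss, att_loss):
    """One possible adaptive scheme (an assumption, not the paper's exact
    rule): weight each branch by the other's share of the total loss, so
    the currently better-performing branch contributes more."""
    total = ctc_loss + att_loss
    return att_loss / total if total > 0 else 0.5

ctc, att = 2.0, 6.0
lam = adaptive_lambda(ctc, att)    # 0.75: attention is worse, trust CTC more
print(hybrid_loss(ctc, att, lam))  # 0.75*2.0 + 0.25*6.0 = 3.0
```

In practice λ is often a fixed hyperparameter (commonly around 0.2–0.3); an adaptive λ simply replaces that constant per step.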


2021
Author(s): Ekaterina Egorova, Hari Krishna Vydana, Lukáš Burget, Jan Černocký

Author(s): Hitoshi Ito, Aiko Hagiwara, Manon Ichiki, Takeshi Kobayakawa, Takeshi Mishima, ...

Author(s): Siqing Qin, Longbiao Wang, Sheng Li, Jianwu Dang, Lixin Pan

Conventional automatic speech recognition (ASR) and emerging end-to-end (E2E) speech recognition have achieved promising results when provided with sufficient resources. However, for low-resource languages, ASR remains challenging. The Lhasa dialect is the most widespread Tibetan dialect and has a wealth of speakers and transcriptions. Hence, it is meaningful to apply the ASR technique to the Lhasa dialect for historical heritage protection and cultural exchange. Previous work on Tibetan speech recognition focused on selecting phone-level acoustic modeling units and incorporating tonal information but underestimated the influence of limited data. The purpose of this paper is to improve the speech recognition performance of the low-resource Lhasa dialect by adopting multilingual speech recognition technology on the E2E structure within a transfer learning framework. Using transfer learning, we first establish monolingual E2E ASR systems for the Lhasa dialect, initializing the ASR model from different source languages to compare their positive effects on the Tibetan ASR model. We further propose a multilingual E2E ASR system that combines initialization strategies from different source languages with multilevel units, a combination proposed here for the first time. Our experiments show that the proposed ASR system outperforms the E2E baseline ASR system. Our proposed method effectively models the low-resource Lhasa dialect and achieves a relative 14.2% improvement in character error rate (CER) compared to DNN-HMM systems. Moreover, from the best monolingual E2E model to the best multilingual E2E model of the Lhasa dialect, the system's performance improved by 8.4% in CER.
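The transfer-learning initialization described above amounts to reusing parameters trained on a high-resource source language and re-initializing only the parts tied to the target vocabulary. A simplified, parameter-dictionary sketch (all parameter names are illustrative, not from the paper):

```python
# Simplified sketch of cross-lingual transfer for an E2E ASR model:
# encoder weights are copied from a model trained on a high-resource
# source language, while the output projection is re-initialized
# because the target (e.g. Lhasa-dialect) token vocabulary differs.
# Parameter names below are hypothetical.

def transfer_init(source_params, target_vocab_size):
    target = {}
    for name, value in source_params.items():
        if name.startswith("encoder."):
            target[name] = value                      # reuse acoustic encoder
        elif name == "output.proj":
            target[name] = [0.0] * target_vocab_size  # fresh output layer
        else:
            target[name] = value                      # other params: reuse
    return target

src = {"encoder.layer0": [0.1, 0.2], "output.proj": [0.5] * 100}
tgt = transfer_init(src, target_vocab_size=60)
print(len(tgt["output.proj"]))  # 60: sized for the target vocabulary
print(tgt["encoder.layer0"])    # [0.1, 0.2]: copied from the source model
```

After initialization, the whole model is fine-tuned on the low-resource target data; comparing different source languages is then just a matter of swapping which `source_params` is loaded.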

