Knowledge Distillation for End-to-End Monaural Multi-Talker ASR System

With the rapid development of speech assistants, adapting server-oriented automatic speech recognition (ASR) solutions to run directly on the device has become crucial. For on-device speech recognition, researchers and industry prefer end-to-end ASR systems because they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly means handling out-of-vocabulary (OOV) words, is another challenge associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the tokens' contexts and to regularize their distribution, which helps the model recognize unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer achieved a competitive result of 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
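
The non-deterministic tokenization described in this abstract can be illustrated with a minimal sketch of BPE-dropout. It is not the authors' implementation: the function name `bpe_dropout_tokenize`, the toy merge table `TOY_MERGES`, and the default dropout value are assumptions made for illustration. At each step every applicable merge is skipped with probability `p`, and the highest-priority surviving merge is applied; segmentation stops when no merge survives.

```python
import random

# Hypothetical toy merge table, ordered from highest to lowest priority.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]


def bpe_dropout_tokenize(word, merges, p=0.1, rng=None):
    """Segment `word` with BPE, dropping each candidate merge with probability p.

    p = 0.0 gives ordinary deterministic BPE; p = 1.0 leaves single characters.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        # Collect applicable merges as (priority, position), skipping dropped ones.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            return tokens
        # Apply the surviving merge with the best (lowest) rank.
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        # The same word may be segmented differently on each call.
        print(bpe_dropout_tokenize("lower", TOY_MERGES, p=0.5, rng=rng))
```

Because the segmentation is resampled every time an utterance is seen, the model is exposed to several subword decompositions of the same transcript, which is the augmentation effect the abstract refers to.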

Boosting End-to-end Multi-Object Tracking and Person Search via Knowledge Distillation

10.1145/3474085.3481546 ◽ 2021

Author(s): Wei Zhang ◽ Lingxiao He ◽ Peng Chen ◽ Xingyu Liao ◽ Wu Liu ◽ ...

Keyword(s): Object Tracking ◽ Person Search ◽ Knowledge Distillation ◽ End To End

End-to-end spoofing speech detection and knowledge distillation under noisy conditions

10.1109/ijcnn52387.2021.9534312 ◽ 2021

Author(s): Pengfei Liu ◽ Zhenchuan Zhang ◽ Yingchun Yang

Keyword(s): Speech Detection ◽ Knowledge Distillation ◽ Noisy Conditions ◽ End To End

Improved Knowledge Distillation from Bi-Directional to Uni-Directional LSTM CTC for End-to-End Speech Recognition

2018 IEEE Spoken Language Technology Workshop (SLT) ◽ 10.1109/slt.2018.8639629 ◽ 2018 ◽ Cited By ~ 5

Author(s): Gakuto Kurata ◽ Kartik Audhkhasi

Keyword(s): Speech Recognition ◽ Knowledge Distillation ◽ End To End

End-To-End Voice Conversion Via Cross-Modal Knowledge Distillation for Dysarthric Speech Reconstruction

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽ 10.1109/icassp40776.2020.9054596 ◽ 2020 ◽ Cited By ~ 1

Author(s): Disong Wang ◽ Jianwei Yu ◽ Xixin Wu ◽ Songxiang Liu ◽ Lifa Sun ◽ ...

Keyword(s): Voice Conversion ◽ Knowledge Distillation ◽ Dysarthric Speech ◽ End To End ◽ Speech Reconstruction

Knowledge Distillation Using Output Errors for Self-attention End-to-end Models

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽ 10.1109/icassp.2019.8682775 ◽ 2019 ◽ Cited By ~ 5

Author(s): Ho-Gyeong Kim ◽ Hwidong Na ◽ Hoshik Lee ◽ Jihyun Lee ◽ Tae Gyoon Kang ◽ ...

Keyword(s): Knowledge Distillation ◽ End To End

Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing

10.21437/interspeech.2020-3172 ◽ 2020

Author(s): Abhinav Garg ◽ Gowtham P. Vadisetti ◽ Dhananjaya Gowda ◽ Sichen Jin ◽ Aditya Jayasimha ◽ ...

Keyword(s): End To End ◽ ASR System

End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture

Sensors ◽ 10.3390/s20071809 ◽ 2020 ◽ Vol 20 (7) ◽ pp. 1809

Author(s): Long Zhang ◽ Ziping Zhao ◽ Chunmei Ma ◽ Linlin Shan ◽ Huazhi Sun ◽ ...

Keyword(s): Neural Network ◽ Error Detection ◽ Deep Neural Network ◽ State Of The Art ◽ Language Model ◽ Computer Assisted ◽ Learning Technology ◽ End To End ◽ ASR System ◽ Connectionist Temporal Classification

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then applied to the Mandarin APED task, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that, with regard to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and performs better on the F-measure, which is especially suited to the requirements of the APED task.
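
For readers unfamiliar with the hybrid CTC/attention architecture, the usual way the two branches are combined in training is a weighted sum of their losses. The sketch below shows only that standard interpolation with a fixed weight; the abstract does not say how the paper's adaptive parameter is computed, so the function name `hybrid_ctc_attention_loss`, the argument `ctc_weight`, and its default of 0.3 are assumptions for illustration.

```python
def hybrid_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Interpolate CTC and attention losses: w * L_ctc + (1 - w) * L_att.

    `ctc_weight` is a fixed placeholder here (assumption); an adaptive scheme
    would recompute it during training instead of keeping it constant.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    # Equal weighting of two hypothetical per-batch loss values.
    print(hybrid_ctc_attention_loss(2.0, 1.0, ctc_weight=0.5))
```

Setting the weight to 0 trains the attention decoder alone and setting it to 1 trains the CTC branch alone; intermediate values let the monotonic alignment learned by CTC regularize the attention model.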
