Combination of End-to-End and Hybrid Models for Speech Recognition

Problem: Classical automatic speech recognition systems are traditionally built from an acoustic model based on hidden Markov models and a statistical language model. Such systems achieve high recognition accuracy but consist of several independent, complex parts, which can cause problems when building models. Recently, an end-to-end recognition method using deep artificial neural networks has become widespread. This approach makes it easy to implement models using just one neural network. End-to-end models often demonstrate better performance in terms of speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their tuning, and their comparison with baseline hybrid models in terms of recognition accuracy and computational characteristics, such as training and decoding speed. Methods: Creating an encoder-decoder speech recognition model with an attention mechanism; applying stabilization and regularization techniques for neural networks; augmenting the training data; using parts of words as the output units of the neural network. Results: An encoder-decoder model with an attention mechanism was obtained for recognizing continuous Russian speech without extracting features or using a language model. As elements of the output sequence, we used parts of words from the training set. The resulting model could not surpass the baseline hybrid models, but it surpassed the other baseline end-to-end models in both recognition accuracy and decoding/training speed. The word error rate was 24.17% and the decoding speed was 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models can work without language models for the Russian language while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for the Russian language the hybrid type of attention mechanism gives the best result compared to location-based or context-based attention mechanisms. Practical relevance: The resulting models require less memory and less decoding time than traditional hybrid models. This can allow them to be used locally on mobile devices without relying on computation on remote servers.
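
The two figures reported in the abstract, the word-level recognition error and the decoding speed relative to real time, can be illustrated with a minimal Python sketch. It assumes the error is the usual word-level edit distance divided by the reference length, and that "0.3 of real time" means decoding time divided by audio duration (a real-time factor). Both readings are assumptions, as the abstract does not define either metric, and the function names and example sentences are hypothetical, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(decoding_seconds: float, audio_seconds: float) -> float:
    """Decoding time as a fraction of audio duration (assumed meaning of '0.3 of real time')."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decoding_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical example: 3 seconds of decoding for 10 seconds of audio.
    print(real_time_factor(3.0, 10.0))
```

Under this reading, a real-time factor below 1.0 means the decoder runs faster than the audio plays, which is what makes on-device use plausible.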

Selective Adaptation of End-to-End Speech Recognition using Hybrid CTC/Attention Architecture for Noise Robustness

2020 28th European Signal Processing Conference (EUSIPCO) ◽

10.23919/eusipco47968.2020.9287836 ◽

2021 ◽

Author(s):

Cong-Thanh Do ◽

Shucong Zhang ◽

Thomas Hain

Keyword(s):

Speech Recognition ◽

Selective Adaptation ◽

Noise Robustness ◽

End To End

Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition

10.21437/interspeech.2018-1030 ◽

2018 ◽

Cited By ~ 11

Author(s):

Chao Weng ◽

Jia Cui ◽

Guangsen Wang ◽

Jun Wang ◽

Chengzhu Yu ◽

...

Keyword(s):

Speech Recognition ◽

Conversational Speech ◽

End To End

Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

10.21437/interspeech.2020-1930 ◽

2020 ◽

Author(s):

Ryo Masumura ◽

Naoki Makishima ◽

Mana Ihori ◽

Akihiko Takashima ◽

Tomohiro Tanaka ◽

...

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Large Scale ◽

End To End

Active Learning Methods for Low Resource End-to-End Speech Recognition

10.21437/interspeech.2019-2316 ◽

2019 ◽

Cited By ~ 2

Author(s):

Karan Malhotra ◽

Shubham Bansal ◽

Sriram Ganapathy

Keyword(s):

Speech Recognition ◽

Active Learning ◽

Learning Methods ◽

Low Resource ◽

End To End

Large Margin Training for Attention Based End-to-End Speech Recognition

10.21437/interspeech.2019-1680 ◽

2019 ◽

Author(s):

Peidong Wang ◽

Jia Cui ◽

Chao Weng ◽

Dong Yu

Keyword(s):

Speech Recognition ◽

Large Margin ◽

End To End

Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition

IEEE/ACM Transactions on Audio Speech and Language Processing ◽

10.1109/taslp.2020.3039600 ◽

2021 ◽

Vol 29 ◽

pp. 198-209

Author(s):

Cunhang Fan ◽

Jiangyan Yi ◽

Jianhua Tao ◽

Zhengkun Tian ◽

Bin Liu ◽

...

Keyword(s):

Speech Recognition ◽

Joint Training ◽

End To End

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

2021 IEEE Spoken Language Technology Workshop (SLT) ◽

10.1109/slt48900.2021.9383515 ◽

2021 ◽

Author(s):

Zhong Meng ◽

Sarangarajan Parthasarathy ◽

Eric Sun ◽

Yashesh Gaur ◽

Naoyuki Kanda ◽

...

Keyword(s):

Speech Recognition ◽

Language Model ◽

Model Estimation ◽

End To End

Transformer-Based End-to-End Speech Recognition with Local Dense Synthesizer Attention

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp39728.2021.9414353 ◽

2021 ◽

Author(s):

Menglong Xu ◽

Shengqiang Li ◽

Xiao-Lei Zhang

Keyword(s):

Speech Recognition ◽

End To End

Fine-Tuning of Pre-Trained End-to-End Speech Recognition with Generative Adversarial Networks

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp39728.2021.9413703 ◽

2021 ◽

Author(s):

Md. Akmal Haidar ◽

Mehdi Rezagholizadeh

Keyword(s):

Speech Recognition ◽

Fine Tuning ◽

Generative Adversarial Networks ◽

Adversarial Networks ◽

End To End