Encoder-decoder models for recognition of Russian speech

Problem: Classical automatic speech recognition systems are traditionally built from an acoustic model based on hidden Markov models and a statistical language model. Such systems demonstrate high recognition accuracy, but they consist of several independent, complex parts, which can cause problems when building models. Recently, an end-to-end recognition method using deep artificial neural networks has become widespread. This approach makes it easy to implement models using just one neural network. End-to-end models often demonstrate better performance in terms of speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their tuning, and their comparison with baseline hybrid models in terms of recognition accuracy and computational characteristics, such as the speed of training and decoding. Methods: Creating an encoder-decoder speech recognition model with an attention mechanism; applying stabilization and regularization techniques for neural networks; augmenting the training data; using parts of words as the output of the neural network. Results: An encoder-decoder model with an attention mechanism was obtained for recognizing continuous Russian speech without extracting features or using a language model. Parts of words from the training set were used as elements of the output sequence. The resulting model could not surpass the baseline hybrid models, but it surpassed the other baseline end-to-end models in both recognition accuracy and decoding/training speed. The word error rate was 24.17% and the decoding speed was 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models can work without language models for the Russian language while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for the Russian language the hybrid type of attention mechanism gives the best result compared to location-based or context-based attention mechanisms. Practical relevance: The resulting models require less memory and less decoding time than the traditional hybrid models. This can allow them to be used locally on mobile devices without relying on computation on remote servers.
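
The two figures reported in the abstract above, the word error rate and the decoding speed as a fraction of real time, can be illustrated with a minimal sketch. It assumes the conventional definitions (word-level edit distance divided by the number of reference words, and decoding time divided by audio duration); the abstract does not state the exact scoring procedure, and the function names `word_error_rate` and `real_time_factor` are hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit cost for substitutions,
    insertions, and deletions. Real scoring tools may normalize text first.
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(decoding_seconds: float, audio_seconds: float) -> float:
    """Decoding time as a fraction of the audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decoding_seconds / audio_seconds
```

Under these definitions, a decoder that needs 3 seconds to process 10 seconds of audio has a real-time factor of 0.3, and a hypothesis with one substituted and one deleted word against a four-word reference has a word error rate of 50%.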

Combination of End-to-End and Hybrid Models for Speech Recognition

2020. DOI: 10.21437/interspeech.2020-2141
Author(s): Jeremy H.M. Wong, Yashesh Gaur, Rui Zhao, Liang Lu, Eric Sun, ...
Keyword(s): Speech Recognition, Hybrid Models, End To End

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

2021 IEEE Spoken Language Technology Workshop (SLT), 2021. DOI: 10.1109/slt48900.2021.9383515
Author(s): Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, ...
Keyword(s): Speech Recognition, Language Model, Model Estimation, End To End

Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019. DOI: 10.1109/icassp.2019.8683602
Cited By ~ 5
Author(s): Alexander H. Liu, Hung-yi Lee, Lin-shan Lee
Keyword(s): Speech Recognition, Language Model, Adversarial Training, End To End

A language model for Amdo Tibetan speech recognition

MATEC Web of Conferences, 2021, Vol 336, pp. 06016. DOI: 10.1051/matecconf/202133606016
Author(s): Taiben Suan, Rangzhuoma Cai, Zhijie Cai, Ba Zu, Baojia Gong
Keyword(s): Speech Recognition, Network Architecture, Language Model, Acoustic Model, End To End

We built a language model based on the Transformer network architecture, which uses attention mechanisms and dispenses with recurrence and convolutions entirely. By transliterating Tibetan into the International Phonetic Alphabet (IPA), the language model was trained using the syllables and phonemes of Tibetan words as modeling units to predict the corresponding Tibetan sentences from the contextual semantics of the IPA sequence. Combined with the acoustic model, it formed a Tibetan speech recognition system that was compared with end-to-end Tibetan speech recognition.

How does language model size effects speech recognition accuracy for the Turkish language?

Pamukkale University Journal of Engineering Sciences, 2016, Vol 22 (2), pp. 100-105. DOI: 10.5505/pajes.2016.03371
Author(s): Behnam Asefisaray, Erhan Mengüşoğlu, Murat Hacıömeroğlu, Hayri Sever
Keyword(s): Speech Recognition, Size Effects, Recognition Accuracy, Language Model, Model Size, Turkish Language

VECTOR REPRESENTATION OF WORDS OF THE RUSSIAN LANGUAGE WITH THE USE OF NEURAL NETWORK MODELS OF CONVOLUTIONAL AUTOENCODER

Современные наукоемкие технологии (Modern High Technologies), 2021, Vol 1 (№12 2021), pp. 52-59. DOI: 10.17513/snt.38954
Author(s): A.Yu. Likhachev, A.B. Trubyanov
Keyword(s): Neural Network, Network Models, Russian Language, Vector Representation, Neural Network Models, Convolutional Autoencoder, The Russian Language

Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition

Applied Sciences, 2019, Vol 9 (21), pp. 4639. DOI: 10.3390/app9214639
Cited By ~ 3
Author(s): Long Wu, Ta Li, Li Wang, Yonghong Yan
Keyword(s): Speech Recognition, Proper Time, Window Size, Attention Mechanism, Wall Street, Word Error Rate, Perceptual Evaluation, Left And Right, End To End, Multiparty Interaction

As demonstrated by the hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective at solving the misalignment problem in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to the hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism with the CTC branch. “Time-restricted” means that the attention mechanism operates on a limited window of frames to the left and right. In this study, we first explore time-restricted location-aware attention CTC/Attention, establishing the proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce the time-restricted self-attention CTC/Attention, which can better model the long-range dependencies among the frames. Experiments on the Wall Street Journal (WSJ), Augmented Multiparty Interaction (AMI), and Switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we jointly train a neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend on the CHiME-4 dataset. The reduction in word error rate (WER) and the increase in perceptual evaluation of speech quality (PESQ) confirm the effectiveness of this framework.
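
The idea of attention restricted to a limited window of frames to the left and right can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses scaled dot-product scoring (the abstract does not specify the scoring function), assumes one query per frame, and the names `window_bounds` and `time_restricted_attention` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def window_bounds(t: int, num_frames: int, left: int, right: int) -> Tuple[int, int]:
    """Half-open frame range [lo, hi) that frame t may attend to."""
    return max(0, t - left), min(num_frames, t + right + 1)


def time_restricted_attention(
    queries: Sequence[Vector],
    keys: Sequence[Vector],
    values: Sequence[Vector],
    left: int,
    right: int,
) -> List[List[float]]:
    """For each frame t, attend only to frames t-left .. t+right.

    Assumption: scaled dot-product scores followed by a softmax over
    the window; frames outside the window receive zero weight.
    """
    if not (len(queries) == len(keys) == len(values)):
        raise ValueError("queries, keys, and values must have one entry per frame")

    num_frames = len(keys)
    outputs: List[List[float]] = []
    for t, query in enumerate(queries):
        lo, hi = window_bounds(t, num_frames, left, right)
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / scale
            for s in range(lo, hi)
        ]
        # Numerically stable softmax restricted to the window.
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors inside the window.
        dim = len(values[0])
        outputs.append(
            [sum(w * values[lo + k][d] for k, w in enumerate(weights)) for d in range(dim)]
        )
    return outputs
```

With `left=0` and `right=0` each frame attends only to itself, so the output equals the values; widening the window lets each frame draw on more context, which is the trade-off the window-size experiments in the abstract refer to.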
