Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Author(s):  
Shigeki Karita ◽  
Nelson Enrique Yalta Soplin ◽  
Shinji Watanabe ◽  
Marc Delcroix ◽  
Atsunori Ogawa ◽  
...  
Author(s):  
Zhong Meng ◽  
Sarangarajan Parthasarathy ◽  
Eric Sun ◽  
Yashesh Gaur ◽  
Naoyuki Kanda ◽  
...  

2021 ◽  
Vol 336 ◽  
pp. 06016
Author(s):  
Taiben Suan ◽  
Rangzhuoma Cai ◽  
Zhijie Cai ◽  
Ba Zu ◽  
Baojia Gong

We built a language model based on the Transformer network architecture, which uses attention mechanisms to dispense with recurrence and convolutions entirely. Through the transliteration of Tibetan into the International Phonetic Alphabet (IPA), the language model was trained using the syllables and phonemes of Tibetan words as modeling units, predicting the corresponding Tibetan sentences from the contextual semantics of the IPA sequences. The language model was then combined with an acoustic model, and the resulting Tibetan speech recognition system was compared with an end-to-end Tibetan speech recognition system.
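
A minimal sketch of the kind of phoneme-level Transformer language model the abstract describes, assuming PyTorch; the class name, vocabulary size, and all hyperparameters are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class PhonemeTransformerLM(nn.Module):
    """Causal Transformer LM over phoneme/syllable tokens (hypothetical)."""
    def __init__(self, vocab_size=128, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)       # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (batch, time)
        t = tokens.size(1)
        # Upper-triangular -inf mask makes self-attention causal.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        h = self.encoder(x, mask=mask)
        return self.out(h)                              # next-token logits

lm = PhonemeTransformerLM()
logits = lm(torch.randint(0, 128, (2, 16)))             # shape: (2, 16, 128)
```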


2021 ◽  
Vol 27 (6) ◽  
pp. 255-262
Author(s):  
Hyungbae Jeon ◽  
Byung Ok Kang ◽  
Hoon Chung ◽  
Yoo Rhee Oh ◽  
Yun Kyung Lee ◽  
...  

Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 1809
Author(s):  
Long Zhang ◽  
Ziping Zhao ◽  
Chunmei Ma ◽  
Linlin Shan ◽  
Huazhi Sun ◽  
...  

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR has gradually matured and achieved positive practical results, which provides a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then used in the APED task for Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that with regard to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and performs better on the F-measure metrics, which are especially suited to the requirements of the APED task.
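
A hedged sketch of the hybrid CTC/attention training objective this abstract refers to, conventionally written as L = λ·L_CTC + (1−λ)·L_attention. PyTorch is assumed; here the weight `lam` is a plain hyperparameter, whereas the paper adapts it during training, and all tensor shapes are illustrative:

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                              target_lengths, att_logits, att_targets, lam=0.3):
    # CTC branch: log-probs shaped (time, batch, vocab), blank token id 0.
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths,
                     target_lengths, blank=0)
    # Attention (seq2seq) branch: cross-entropy over decoder outputs;
    # logits shaped (batch, length, vocab), label -1 marks padding.
    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                          ignore_index=-1)
    # Interpolate the two objectives; lam trades CTC's monotonic alignment
    # against the attention decoder's flexibility.
    return lam * ctc + (1.0 - lam) * att
```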


2021 ◽  
Author(s):  
Zhong Meng ◽  
Yu Wu ◽  
Naoyuki Kanda ◽  
Liang Lu ◽  
Xie Chen ◽  
...  

2021 ◽  
Vol 27 (2) ◽  
pp. 132-138
Author(s):  
V. Ya. Dmitriev ◽  
T. A. Ignat'eva ◽  
V. P. Pilyavskiy

Aim. To analyze the concept of "artificial intelligence" and to substantiate the effectiveness of using artificial intelligence technologies. Tasks. To study the conceptual apparatus; to propose and justify the authors' definition of the concept of "artificial intelligence"; to describe speech recognition technology based on artificial intelligence. Methodology. The authors used such general scientific methods of cognition as comparison, deduction and induction, analysis, generalization, and systematization. Results. A comparative analysis of the existing conceptual apparatus leads to the conclusion that there is no single concept of "artificial intelligence": each author puts his or her own vision into it. In this regard, the authors formulate their own definition of the concept. Speech recognition technology is identified as an important area of application of artificial intelligence across various fields of activity. The first commercially successful speech recognition prototypes had already appeared by the 1990s, and since the beginning of the 21st century there has been great interest in "end-to-end" automatic speech recognition. While traditional phonetic approaches require separate pronunciation, acoustic, and language model data, end-to-end models consider all components of speech recognition jointly, thereby simplifying self-learning and development. A significant increase in the "mental" capabilities of computer technology and the development of new algorithms have led to new achievements in this direction, driven by the growing demand for speech recognition. Conclusions. According to the authors, artificial intelligence is a complex of computer programs that duplicate the functions of the human brain, opening up the possibility of informal learning based on big data processing and making it possible to solve problems of pattern recognition (text, image, speech) and the formation of management decisions. Currently, the active development of information and communication technologies and artificial intelligence concepts has led to wide practical application of intelligent technologies, especially in control systems. The impact of these systems can be found in mobile phones and expert systems, in forecasting, and in other areas. Among the obstacles to the development of this technology is the insufficient accuracy of speech and voice recognition systems under the acoustic interference that is always present in real environments. However, recent advances are overcoming this disadvantage.


Information ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 62 ◽  
Author(s):  
Eshete Derb Emiru ◽  
Shengwu Xiong ◽  
Yaxing Li ◽  
Awet Fesseha ◽  
Moussa Diallo

Out-of-vocabulary (OOV) words are among the most challenging problems in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition system (AASR) using phoneme-based subword units. The syllabification algorithm inserts the epenthetic vowel እ [ɨ], which is not produced by our grapheme-to-phoneme (G2P) conversion algorithm developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subword units, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords yield more accurate speech recognition than their character, phoneme, and character-based-subword counterparts. Further improvement was obtained by combining the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to a character-based acoustic model with a word-based recurrent neural network language model (RNNLM) baseline. These phoneme-based subword models are also useful for improving machine translation and speech translation tasks.
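
A minimal, self-contained sketch of byte-pair-encoding (BPE) merge learning over phoneme sequences, in the spirit of the subword segmentation described above; the toy corpus and merge count are illustrative, not from the paper:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """corpus: list of words, each a tuple of phoneme symbols."""
    words = Counter(corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the most frequent pair fused into one unit.
        merged = {}
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

# Toy phoneme-level corpus (hypothetical CV-style syllables, not real Amharic data).
corpus = [("s", "ɨ", "l", "a", "m"), ("s", "ɨ", "l", "k"), ("l", "a", "m")]
print(learn_bpe(corpus, num_merges=3))   # e.g. [('s', 'ɨ'), ('sɨ', 'l'), ...]
```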


Author(s):  
Nikita Markovnikov ◽  
Irina Kipyatkova

Problem: Classical automatic speech recognition systems are traditionally built from an acoustic model based on hidden Markov models and a statistical language model. Such systems demonstrate high recognition accuracy but consist of several independent, complex parts, which can cause problems when building the models. Recently, an end-to-end recognition approach using deep artificial neural networks has become widespread. This approach makes it possible to implement a model as a single neural network, and end-to-end models often perform better in terms of both speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their tuning, and their comparison with hybrid baseline models in terms of recognition accuracy and computational characteristics such as training and decoding speed. Methods: Creating an encoder-decoder speech recognition model with an attention mechanism; applying neural network stabilization and regularization techniques; augmenting the training data; using parts of words as the output units of the neural network. Results: An encoder-decoder model with an attention mechanism was obtained for recognizing continuous Russian speech without explicit feature extraction or a language model. Parts of words from the training set were used as elements of the output sequence. The resulting model could not surpass the baseline hybrid models, but it surpassed the other baseline end-to-end models in both recognition accuracy and decoding/training speed. The word error rate was 24.17% and the decoding speed was 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models can work without language models for the Russian language while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for the Russian language a hybrid attention mechanism gives the best result compared to purely location-based or content-based attention mechanisms. Practical relevance: The resulting models require less memory and less speech decoding time than traditional hybrid models, which makes it possible to use them locally on mobile devices without computations on remote servers.
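
A hedged sketch of the hybrid (content-based plus location-based) attention the authors found to work best, in the style of Chorowski-type location-aware attention; PyTorch is assumed, and the class name and all dimensions are illustrative rather than the authors' configuration:

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Scores each encoder frame from both content and the previous alignment."""
    def __init__(self, enc_dim=256, dec_dim=256, att_dim=128, conv_ch=10, kernel=15):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)   # content: decoder state
        self.V = nn.Linear(enc_dim, att_dim, bias=False)   # content: encoder frames
        self.conv = nn.Conv1d(1, conv_ch, kernel, padding=kernel // 2)
        self.U = nn.Linear(conv_ch, att_dim, bias=False)   # location: previous alignment
        self.w = nn.Linear(att_dim, 1, bias=False)

    def forward(self, dec_state, enc_out, prev_align):
        # dec_state: (B, dec_dim), enc_out: (B, T, enc_dim), prev_align: (B, T)
        loc = self.conv(prev_align.unsqueeze(1)).transpose(1, 2)    # (B, T, conv_ch)
        e = self.w(torch.tanh(self.W(dec_state).unsqueeze(1)
                              + self.V(enc_out) + self.U(loc))).squeeze(-1)
        align = torch.softmax(e, dim=-1)                            # (B, T)
        context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1) # (B, enc_dim)
        return context, align
```

Dropping the `U(loc)` term recovers a purely content-based mechanism, and dropping `V(enc_out)` a purely location-based one, which is the comparison the abstract reports.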

