Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Author(s):  
Shigeki Karita ◽  
Nelson Enrique Yalta Soplin ◽  
Shinji Watanabe ◽  
Marc Delcroix ◽  
Atsunori Ogawa ◽  
...  
Author(s):  
Zhong Meng ◽  
Sarangarajan Parthasarathy ◽  
Eric Sun ◽  
Yashesh Gaur ◽  
Naoyuki Kanda ◽  
...  

2021 ◽  
Vol 336 ◽  
pp. 06016
Author(s):  
Taiben Suan ◽  
Rangzhuoma Cai ◽  
Zhijie Cai ◽  
Ba Zu ◽  
Baojia Gong

We built a language model based on the Transformer network architecture, which uses attention mechanisms to dispense with recurrence and convolutions entirely. Through the transliteration of Tibetan into the International Phonetic Alphabet (IPA), the language model was trained using the syllables and phonemes of Tibetan words as modeling units, predicting the corresponding Tibetan sentences from the contextual semantics of the IPA sequences. The language model was then combined with an acoustic model, and the resulting Tibetan speech recognition system was compared with an end-to-end Tibetan speech recognition system.
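
A minimal sketch of the kind of phoneme-level Transformer language model the abstract describes, assuming PyTorch; the class name, vocabulary size, and all hyperparameters are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class PhonemeTransformerLM(nn.Module):
    """Causal Transformer LM over phoneme/syllable tokens (hypothetical)."""
    def __init__(self, vocab_size=128, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)       # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (batch, time)
        t = tokens.size(1)
        # Upper-triangular -inf mask makes self-attention causal.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        h = self.encoder(x, mask=mask)
        return self.out(h)                              # next-token logits

lm = PhonemeTransformerLM()
logits = lm(torch.randint(0, 128, (2, 16)))             # shape: (2, 16, 128)
```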


2021 ◽  
Vol 27 (6) ◽  
pp. 255-262
Author(s):  
Hyungbae Jeon ◽  
Byung Ok Kang ◽  
Hoon Chung ◽  
Yoo Rhee Oh ◽  
Yun Kyung Lee ◽  
...  

Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 1809
Author(s):  
Long Zhang ◽  
Ziping Zhao ◽  
Chunmei Ma ◽  
Linlin Shan ◽  
Huazhi Sun ◽  
...  

Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR has gradually matured and achieved positive practical results, which provides a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then used in the APED task for Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that with regard to accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and performs better on the F-measure metrics, which are especially suited to the requirements of the APED task.
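
A hedged sketch of the hybrid CTC/attention training objective this abstract refers to, conventionally written as L = λ·L_CTC + (1−λ)·L_attention. PyTorch is assumed; here the weight `lam` is a plain hyperparameter, whereas the paper adapts it during training, and all tensor shapes are illustrative:

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                              target_lengths, att_logits, att_targets, lam=0.3):
    # CTC branch: log-probs shaped (time, batch, vocab), blank token id 0.
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths,
                     target_lengths, blank=0)
    # Attention (seq2seq) branch: cross-entropy over decoder outputs;
    # logits shaped (batch, length, vocab), label -1 marks padding.
    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                          ignore_index=-1)
    # Interpolate the two objectives; lam trades CTC's monotonic alignment
    # against the attention decoder's flexibility.
    return lam * ctc + (1.0 - lam) * att
```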


2021 ◽  
Author(s):  
Zhong Meng ◽  
Yu Wu ◽  
Naoyuki Kanda ◽  
Liang Lu ◽  
Xie Chen ◽  
...  

2021 ◽  
Vol 27 (2) ◽  
pp. 132-138
Author(s):  
V. Ya. Dmitriev ◽  
T. A. Ignat'eva ◽  
V. P. Pilyavskiy

Aim. To analyze the concept of "artificial intelligence" and to substantiate the effectiveness of using artificial intelligence technologies. Tasks. To study the conceptual apparatus; to propose and justify the authors' definition of the concept of "artificial intelligence"; to describe speech recognition technology based on artificial intelligence. Methodology. The authors used such general scientific methods of cognition as comparison, deduction and induction, analysis, generalization, and systematization. Results. A comparative analysis of the existing conceptual apparatus leads to the conclusion that there is no single concept of "artificial intelligence": each author puts his or her own vision into it. In this regard, the authors formulate their own definition of the concept. Speech recognition technology is identified as an important area of application of artificial intelligence across various fields of activity. The first commercially successful speech recognition prototypes had already appeared by the 1990s, and since the beginning of the 21st century there has been great interest in "end-to-end" automatic speech recognition. While traditional phonetic approaches require separate pronunciation, acoustic, and language model data, end-to-end models consider all components of speech recognition jointly, thereby simplifying self-learning and development. A significant increase in the "mental" capabilities of computer technology and the development of new algorithms have led to new achievements in this direction, driven by the growing demand for speech recognition. Conclusions. According to the authors, artificial intelligence is a complex of computer programs that duplicate the functions of the human brain, opening up the possibility of informal learning based on big data processing and making it possible to solve problems of pattern recognition (text, image, speech) and the formation of management decisions. Currently, the active development of information and communication technologies and artificial intelligence concepts has led to wide practical application of intelligent technologies, especially in control systems. The impact of these systems can be found in mobile phones and expert systems, in forecasting, and in other areas. Among the obstacles to the development of this technology is the insufficient accuracy of speech and voice recognition systems under the acoustic interference that is always present in real environments. However, recent advances are overcoming this disadvantage.


Information ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 62 ◽  
Author(s):  
Eshete Derb Emiru ◽  
Shengwu Xiong ◽  
Yaxing Li ◽  
Awet Fesseha ◽  
Moussa Diallo

Out-of-vocabulary (OOV) words are among the most challenging problems in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition system (AASR) using phoneme-based subword units. The syllabification algorithm inserts the epenthetic vowel እ [ɨ], which is not produced by our grapheme-to-phoneme (G2P) conversion algorithm developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subword units, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords yield more accurate speech recognition than their character, phoneme, and character-based-subword counterparts. Further improvement was obtained by combining the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to a character-based acoustic model with a word-based recurrent neural network language model (RNNLM) baseline. These phoneme-based subword models are also useful for improving machine translation and speech translation tasks.
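
A minimal, self-contained sketch of byte-pair-encoding (BPE) merge learning over phoneme sequences, in the spirit of the subword segmentation described above; the toy corpus and merge count are illustrative, not from the paper:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """corpus: list of words, each a tuple of phoneme symbols."""
    words = Counter(corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the most frequent pair fused into one unit.
        merged = {}
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

# Toy phoneme-level corpus (hypothetical CV-style syllables, not real Amharic data).
corpus = [("s", "ɨ", "l", "a", "m"), ("s", "ɨ", "l", "k"), ("l", "a", "m")]
print(learn_bpe(corpus, num_merges=3))   # e.g. [('s', 'ɨ'), ('sɨ', 'l'), ...]
```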


Author(s):  
Nikita Markovnikov ◽  
Irina Kipyatkova

Problem: Classical automatic speech recognition systems are traditionally built from an acoustic model based on hidden Markov models and a statistical language model. Such systems demonstrate high recognition accuracy but consist of several independent, complex parts, which can cause problems when building the models. Recently, an end-to-end recognition approach using deep artificial neural networks has become widespread. This approach makes it possible to implement a model as a single neural network, and end-to-end models often perform better in terms of both speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their tuning, and their comparison with hybrid baseline models in terms of recognition accuracy and computational characteristics such as training and decoding speed. Methods: Creating an encoder-decoder speech recognition model with an attention mechanism; applying neural network stabilization and regularization techniques; augmenting the training data; using parts of words as the output units of the neural network. Results: An encoder-decoder model with an attention mechanism was obtained for recognizing continuous Russian speech without explicit feature extraction or a language model. Parts of words from the training set were used as elements of the output sequence. The resulting model could not surpass the baseline hybrid models, but it surpassed the other baseline end-to-end models in both recognition accuracy and decoding/training speed. The word error rate was 24.17% and the decoding speed was 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models can work without language models for the Russian language while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for the Russian language a hybrid attention mechanism gives the best result compared to purely location-based or content-based attention mechanisms. Practical relevance: The resulting models require less memory and less speech decoding time than traditional hybrid models, which makes it possible to use them locally on mobile devices without computations on remote servers.
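
A hedged sketch of the hybrid (content-based plus location-based) attention the authors found to work best, in the style of Chorowski-type location-aware attention; PyTorch is assumed, and the class name and all dimensions are illustrative rather than the authors' configuration:

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Scores each encoder frame from both content and the previous alignment."""
    def __init__(self, enc_dim=256, dec_dim=256, att_dim=128, conv_ch=10, kernel=15):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)   # content: decoder state
        self.V = nn.Linear(enc_dim, att_dim, bias=False)   # content: encoder frames
        self.conv = nn.Conv1d(1, conv_ch, kernel, padding=kernel // 2)
        self.U = nn.Linear(conv_ch, att_dim, bias=False)   # location: previous alignment
        self.w = nn.Linear(att_dim, 1, bias=False)

    def forward(self, dec_state, enc_out, prev_align):
        # dec_state: (B, dec_dim), enc_out: (B, T, enc_dim), prev_align: (B, T)
        loc = self.conv(prev_align.unsqueeze(1)).transpose(1, 2)    # (B, T, conv_ch)
        e = self.w(torch.tanh(self.W(dec_state).unsqueeze(1)
                              + self.V(enc_out) + self.U(loc))).squeeze(-1)
        align = torch.softmax(e, dim=-1)                            # (B, T)
        context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1) # (B, enc_dim)
        return context, align
```

Dropping the `U(loc)` term recovers a purely content-based mechanism, and dropping `V(enc_out)` a purely location-based one, which is the comparison the abstract reports.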

