Research on the Language Model according to the Recognition Unit for End-to-End Speech Recognition

We built a language model based on the Transformer network architecture, which uses attention mechanisms and dispenses with recurrence and convolutions entirely. Tibetan text was transliterated into the International Phonetic Alphabet (IPA), and the language model was trained with the syllables and phonemes of Tibetan words as modeling units to predict the corresponding Tibetan sentences from the contextual semantics of the IPA sequence. Combined with an acoustic model, the resulting Tibetan speech recognition system was compared with end-to-end Tibetan speech recognition.
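As an illustration of the attention mechanism this abstract refers to, the following is a minimal sketch of scaled dot-product attention in plain Python. It is not the authors' model: the function names and the toy vectors are assumptions made for the example, and a real Transformer adds learned projections, multiple heads, and positional information.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """For each query, return a weighted average of the value vectors.

    The weights come from a softmax over query-key dot products scaled by
    sqrt(d_k). No recurrence or convolution is involved.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[c] for w, v in zip(weights, values))
                for c in range(len(values[0]))
            ]
        )
    return outputs


# Toy input (hypothetical): both keys are identical, so the single query
# attends to the two value vectors equally.
toy_output = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```

Because every position is compared with every other position in one step, the whole context of a unit sequence is available at once, which is the property the abstract relies on.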

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

10.21437/interspeech.2021-2075 ◽

2021 ◽

Author(s):

Zhong Meng ◽

Yu Wu ◽

Naoyuki Kanda ◽

Liang Lu ◽

Xie Chen ◽

...

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Language Model ◽

Word Error Rate ◽

Model Fusion ◽

End To End

Development of Artificial Intelligence and Prospects for Its Application

Economics and Management ◽

10.35854/1998-1627-2021-2-132-138 ◽

2021 ◽

Vol 27 (2) ◽

pp. 132-138

Author(s):

V. Ya. Dmitriev ◽

T. A. Ignat'eva ◽

V. P. Pilyavskiy

Keyword(s):

Artificial Intelligence ◽

Speech Recognition ◽

Information And Communication Technologies ◽

Language Model ◽

Communication Technologies ◽

Conceptual Apparatus ◽

Information And Communication ◽

End To End ◽

Definition Of ◽

The Impact

Aim. To analyze the concept of “artificial intelligence” and to justify the effectiveness of using artificial intelligence technologies. Tasks. To study the conceptual apparatus; to propose and justify the author’s definition of the “artificial intelligence” concept; to describe the technology of speech recognition using artificial intelligence. Methodology. The authors used such general scientific methods of cognition as comparison, deduction and induction, analysis, generalization and systematization. Results. Based on a comparative analysis of the existing conceptual apparatus, it is concluded that there is no single concept of “artificial intelligence”; each author puts his own vision into it. In this regard, the author’s definition of the “artificial intelligence” concept is formulated. It is determined that speech recognition is an important area of application for artificial intelligence technologies in various fields of activity. It is shown that the first commercially successful speech recognition prototypes had appeared by the 1990s, and that since the beginning of the 21st century great interest in “end-to-end” automatic speech recognition has become evident. While traditional phonetic approaches required pronunciation, acoustic, and language model data, end-to-end models consider all components of speech recognition simultaneously, thereby facilitating the stages of self-learning and development. It is established that a significant increase in the “mental” capabilities of computer technology and the development of new algorithms have led to new achievements in this direction. These advances are driven by the growing demand for speech recognition.
Conclusions. According to the authors, artificial intelligence is a complex of computer programs that duplicate the functions of the human brain, opening up the possibility of informal learning based on big data processing and making it possible to solve problems of pattern recognition (text, image, speech) and to form management decisions. Currently, the active development of information and communication technologies and artificial intelligence concepts has led to wide practical application of intelligent technologies, especially in control systems. The impact of these systems can be found in the operation of mobile phones and expert systems, in forecasting, and in other areas. Among the obstacles to the development of this technology is the lack of accuracy of speech and voice recognition systems under sound interference, which is always present in the external environment. However, recent advances overcome this disadvantage.

A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) ◽

10.1109/asru46091.2019.9003790 ◽

2019 ◽

Author(s):

Erik McDermott ◽

Hasim Sak ◽

Ehsan Variani

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Language Model ◽

Density Ratio ◽

Model Fusion ◽

Ratio Approach ◽

End To End

Encoder-decoder models for recognition of Russian speech

Information and Control Systems ◽

10.31799/1684-8853-2019-4-45-53 ◽

2019 ◽

pp. 45-53

Author(s):

Nikita Markovnikov ◽

Irina Kipyatkova

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Recognition Accuracy ◽

Language Model ◽

Hybrid Models ◽

Attention Mechanism ◽

Russian Language ◽

End To End ◽

The Russian Language ◽

Decoding Speed

Problem: Classical systems of automatic speech recognition are traditionally built using an acoustic model based on hidden Markov models and a statistical language model. Such systems demonstrate high recognition accuracy, but consist of several independent complex parts, which can cause problems when building models. Recently, an end-to-end recognition method using deep artificial neural networks has become widespread. This approach makes it easy to implement models using just one neural network. End-to-end models often demonstrate better performance in terms of speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their adjustment, and comparison with hybrid baseline models in terms of recognition accuracy and computational characteristics, such as the speed of learning and decoding. Methods: Creating an encoder-decoder model of speech recognition using an attention mechanism; applying techniques of stabilization and regularization of neural networks; augmentation of data for training; using parts of words as the output of a neural network. Results: An encoder-decoder model was obtained using an attention mechanism for recognizing continuous Russian speech without extracting features or using a language model. As elements of the output sequence, we used parts of words from the training set. The resulting model could not surpass the baseline hybrid models, but surpassed the other baseline end-to-end models, both in recognition accuracy and in decoding/learning speed. The word recognition error was 24.17% and the decoding speed was 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models could work without language models for the Russian language, while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for the Russian language the hybrid type of attention mechanism gives the best result compared to location-based or context-based attention mechanisms. Practical relevance: The resulting models require less memory and less speech decoding time than the traditional hybrid models. This can allow them to be used locally on mobile devices without relying on computation on remote servers.
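The two figures quoted in this abstract, the word recognition error and the decoding speed as a fraction of real time, are commonly computed as a word error rate (word-level edit distance divided by reference length) and a real-time factor (decoding time divided by audio duration). The sketch below assumes those standard definitions; the abstract does not state the exact scoring procedure used, and the function names and sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(decoding_seconds: float, audio_seconds: float) -> float:
    """Decoding time as a fraction of audio duration (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decoding_seconds / audio_seconds


# Illustrative values only: one substitution and one deletion against a
# six-word reference, and 3 s of decoding for 10 s of audio.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rtf = real_time_factor(3.0, 10.0)
```

Under these definitions, a decoding speed of 0.3 of real time means that ten seconds of audio are decoded in about three seconds.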

Location-Based End-to-End Speech Recognition with Multiple Language Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019975 ◽

2019 ◽

Vol 33 ◽

pp. 9975-9976

Author(s):

Zhijie Lin ◽

Kaiyang Lin ◽

Shiling Chen ◽

Linlin Li ◽

Zhou Zhao

Keyword(s):

Deep Learning ◽

Speech Recognition ◽

Error Correction ◽

Automatic Speech Recognition ◽

Language Model ◽

Language Models ◽

Learning Approaches ◽

Semantic Error ◽

End To End

End-to-end deep learning approaches to Automatic Speech Recognition (ASR) have become a new trend. In those approaches, which are becoming active in many areas, a language model can be considered an important and effective method for semantic error correction. Many existing systems use a single language model. In this paper, however, multiple language models (LMs) are applied in decoding. One LM is used to select appropriate answers, and the others, which consider both context and grammar, are used for the further decision. Experiments on a general location-based dataset show the effectiveness of our method.
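One way to read the two-stage use of language models described in this abstract is as n-best rescoring: a first LM shortlists candidates, and the remaining LMs make the final decision. The sketch below follows that reading under stated assumptions. The function names, the `keep` and `lm_weight` parameters, the word-counting toy LMs, and the sample hypotheses are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Scorer = Callable[[str], float]   # returns a score; higher is better
Hypothesis = Tuple[str, float]    # (text, acoustic log-score)


def make_toy_lm(preferred_words: Iterable[str], bonus: float = 1.0) -> Scorer:
    """Hypothetical stand-in for a trained LM: rewards words it prefers."""
    preferred = set(preferred_words)

    def score(text: str) -> float:
        return bonus * sum(1 for word in text.split() if word in preferred)

    return score


def two_stage_rescore(
    nbest: Sequence[Hypothesis],
    selector_lm: Scorer,
    decision_lms: Sequence[Scorer],
    keep: int = 2,
    lm_weight: float = 0.5,
) -> str:
    """Shortlist with one LM, then decide with the combined remaining LMs."""
    # Stage 1: the selector LM keeps the most plausible candidates.
    shortlisted: List[Hypothesis] = sorted(
        nbest,
        key=lambda item: item[1] + lm_weight * selector_lm(item[0]),
        reverse=True,
    )[:keep]

    # Stage 2: the other LMs (for example context and grammar) vote jointly.
    def final_score(item: Hypothesis) -> float:
        text, acoustic = item
        return acoustic + lm_weight * sum(lm(text) for lm in decision_lms)

    return max(shortlisted, key=final_score)[0]


# Hypothetical n-best list for a location-style utterance.
nbest = [
    ("turn left at main street", -4.0),
    ("turn left at mane street", -3.8),
    ("tern left at main street", -4.1),
]
selector = make_toy_lm(["turn", "left", "street"])
context_lm = make_toy_lm(["main", "street"])
grammar_lm = make_toy_lm(["turn", "at"])
best = two_stage_rescore(nbest, selector, [context_lm, grammar_lm])
```

Splitting selection from the final decision keeps the more expensive or more specialised LMs off the full candidate list, at the cost of possibly discarding a good hypothesis in the first stage.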
