End-to-End Mandarin Speech Recognition Combining CNN and BLSTM

Symmetry ◽  
2019 ◽  
Vol 11 (5) ◽  
pp. 644 ◽  
Author(s):  
Dong Wang ◽  
Xiaodong Wang ◽  
Shaohe Lv

Since conventional Automatic Speech Recognition (ASR) systems often contain many modules and draw on varied expertise, such models are hard to build and train. Recent research shows that end-to-end ASR can significantly simplify the speech recognition pipeline and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they use specific language models and in-house training databases that are not freely available. This is especially common in Mandarin speech recognition. In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR system. It uses a Convolutional Neural Network (CNN) to learn local speech features, a Bidirectional Long Short-Term Memory (BLSTM) network to learn past and future contextual information, and Connectionist Temporal Classification (CTC) for decoding. Our model is trained entirely on the by-far-largest open-source Mandarin speech corpus, AISHELL-1, using neither in-house databases nor external language models. Experiments show that our CNN+BLSTM+CTC model achieves a WER of 19.2%, outperforming the existing best work. Because all the corpora we used are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.
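The CNN+BLSTM+CTC architecture described in the abstract can be sketched in PyTorch as follows; the layer sizes, strides, and vocabulary size are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class CnnBlstmCtc(nn.Module):
    """Hypothetical sketch of a CNN+BLSTM+CTC acoustic model."""

    def __init__(self, n_mels=80, n_classes=4233):  # n_classes: chars + CTC blank (assumed)
        super().__init__()
        # CNN front-end: learns local time-frequency patterns,
        # downsampling time and frequency by 4x each
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        feat_dim = 32 * (n_mels // 4)
        # BLSTM: models both past and future context at each frame
        self.blstm = nn.LSTM(feat_dim, 512, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Linear projection to per-frame log-posteriors for CTC
        self.fc = nn.Linear(2 * 512, n_classes)

    def forward(self, x):           # x: (batch, time, n_mels)
        x = x.unsqueeze(1)          # (batch, 1, time, n_mels)
        x = self.cnn(x)             # (batch, 32, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.blstm(x)        # (batch, time/4, 1024)
        return self.fc(x).log_softmax(-1)  # CTC expects log-probabilities
```

Training would pair this with `nn.CTCLoss`, which consumes the log-probabilities transposed to `(time, batch, classes)` along with the target character sequences.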

2021 ◽  
Author(s):  
Jianwei Sun ◽  
Zhiyuan Tang ◽  
Hengxin Yin ◽  
Wei Wang ◽  
Xi Zhao ◽  
...  

2020 ◽  
Vol 10 (19) ◽  
pp. 6936 ◽  
Author(s):  
Jeong-Uk Bang ◽  
Seung Yun ◽  
Seung-Hi Kim ◽  
Mu-Yeol Choi ◽  
Min-Kyu Lee ◽  
...  

This paper introduces a large-scale spontaneous speech corpus of Korean, named KsponSpeech. The corpus contains 969 h of general open-domain dialog utterances, spoken by about 2000 native Korean speakers in a clean environment. All data were constructed by recording two people freely conversing on a variety of topics and manually transcribing the utterances. The transcription is dual, consisting of orthography and pronunciation, with disfluency tags for spontaneous-speech phenomena such as filler words, repeated words, and word fragments. This paper also presents the baseline performance of an end-to-end speech recognition model trained on KsponSpeech. In addition, we investigated the performance of standard end-to-end architectures, the number of sub-word units suitable for Korean, and issues that should be considered in spontaneous Korean speech recognition. KsponSpeech is publicly available on an open data hub site of the Korean government.


Author(s):  
Zhijie Lin ◽  
Kaiyang Lin ◽  
Shiling Chen ◽  
Linlin Li ◽  
Zhou Zhao

End-to-end deep learning approaches to Automatic Speech Recognition (ASR), now active in many areas, have become a new trend. In these approaches, the language model can be considered an important and effective means of semantic error correction. Many existing systems use a single language model. In this paper, however, multiple language models (LMs) are applied during decoding: one LM is used to select candidate answers, and the others, which consider both context and grammar, make the final decision. Experiments on a general location-based dataset show the effectiveness of our method.
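As a toy illustration of the multi-LM decoding idea, and not the paper's actual system, the sketch below rescores an ASR n-best list by interpolating the acoustic score with scores from several hypothetical LMs; the LMs and weights are invented for the example:

```python
def rescore(nbest, lms, weights):
    """Pick the best hypothesis from an n-best list.

    nbest:   list of (hypothesis, acoustic_score) pairs
    lms:     callables mapping a hypothesis string to a score
    weights: interpolation weight for each LM
    """
    best, best_score = None, float("-inf")
    for hyp, am_score in nbest:
        # Combine the acoustic score with every LM's weighted score
        total = am_score + sum(w * lm(hyp) for lm, w in zip(lms, weights))
        if total > best_score:
            best, best_score = hyp, total
    return best

# Toy LMs: one rewards a contextually plausible keyword,
# one penalizes longer (less grammatical) hypotheses
lm_context = lambda h: 1.0 if "北京" in h else 0.0
lm_grammar = lambda h: -0.1 * len(h.split())

nbest = [("go to 北京 station", -5.0),
         ("go to bay jing station", -4.8)]
print(rescore(nbest, [lm_context, lm_grammar], [2.0, 1.0]))
# prints: go to 北京 station
```

Even though the second hypothesis has a better acoustic score, the context LM's reward for the plausible place name flips the decision, which is the kind of semantic error correction the abstract describes.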


Author(s):  
Danny Henry Galatang ◽  
Suyanto Suyanto
Syllable-based automatic speech recognition (ASR) systems commonly perform better than phoneme-based ones. This paper focuses on developing an Indonesian monosyllable-based ASR (MSASR) system using an ASR engine called SPRAAK and comparing it to a phoneme-based one. The Mozilla DeepSpeech-based end-to-end ASR (MDS-E2EASR), one of the state-of-the-art character-based models (similar to the phoneme-based model), is also investigated to confirm the result. Besides, a novel Kaituoxu SpeechTransformer (KST) E2EASR is also examined. Testing on an Indonesian speech corpus of 5,439 words shows that the proposed MSASR produces much higher word accuracy (76.57%) than the monophone-based one (63.36%). Its performance is comparable to the character-based MDS-E2EASR, which produces 76.90%, and the character-based KST-E2EASR (78.00%). In the future, this monosyllable-based ASR could be extended to a bisyllable-based one to achieve higher word accuracy. Nevertheless, the extensive bisyllable acoustic models must be handled using an advanced method.

