Development of End – to – End Encoder - Decoder Model Applying Voice Recognition System in Different Channels

the dissimilarity in recognizing the word sequence and their ground truth in different channels can be absorbed by implementing Automatic Speech Recognition which is the standard evaluation metric and is encountered with the phenomena of Word Error Rate for various measures. In the model of 1ch, the track is trained without any preprocessing and study on multichannel end-to-end Automatic Speech Recognition envisaged that the function can be integrated into (Deep Neural network) – based system and lead to multiple experimental results. More so, when the Word Error Rate (WER) is not directly differentiable, it is pertinent to adopt Encoder – Decoder gradient objective function which has been clear in CHiME-4 system. In this study, we examine that the sequence level evaluation metric is a fair choice for optimizing Encoder – Decoder model for which many training algorithms is designed to reduce sequence level error. The study incorporates the scoring of multiple hypotheses in decoding stage for improving the decoding result to optimum. By this, the mismatch between the objectives is resulted in a feasible form to the maxim. Hence, the study finds the result of voice recognition which is most effective for adaptation.

Download Full-text

Langkah Praktis Membangun Sistem Pengenalan Suara dengan HTK

JSAI (Journal Scientific and Applied Informatics) ◽

10.36085/jsai.v2i2.314 ◽

2019 ◽

Vol 2 (2) ◽

pp. 149-153

Author(s):

Zulkarnaen Hatala

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Principle Component Analysis ◽

Hidden Markov ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System ◽

Word Error Rate ◽

Bahasa Indonesia

Dipaparkan prosedur untuk mengembangkan Sistem Pengenalan Suara otomatis, Automatic Speech Recognition System (ASR) untuk kasus online recognition. Prosedur ini secara cepat dan efisien membangun ASR menggunakan Hidden Markov Toolkit (HTK). Langkah-langkah praktis ini dipaparkan secara jelas untuk mengimplementasikan ASR dengan daftar kata sedikit (Small Vocabulary) dalam contoh kasus pengenalan digit Bahasa Indonesia. Dijelaskan beberapa teknik meningkatkan performansi seperti cara mengatasi noise, pengejaan ganda dan penerapan Principle Component Analysis. Hasil akhir berupa Word Error Rate

Download Full-text

Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Sensors ◽

10.3390/s21093063 ◽

2021 ◽

Vol 21 (9) ◽

pp. 3063

Author(s):

Aleksandr Laptev ◽

Andrei Andrusenko ◽

Ivan Podluzhny ◽

Anton Mitrofanov ◽

Ivan Medennikov ◽

...

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Rapid Development ◽

Computational Cost ◽

Vocabulary Size ◽

Word Error Rate ◽

Low Resource ◽

Steady Improvement ◽

End To End ◽

Asr System

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.

Download Full-text

Speech Vision: An End-to-End Deep Learning-based Dysarthric Automatic Speech Recognition System

IEEE Transactions on Neural Systems and Rehabilitation Engineering ◽

10.1109/tnsre.2021.3076778 ◽

2021 ◽

pp. 1-1

Author(s):

Seyed Reza Shahamiri

Keyword(s):

Deep Learning ◽

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System ◽

End To End

Download Full-text

Konversi Suara Ucapan Bahasa Indonesia Ke Sistem Bahasa Isyarat Indonesia (Sibi)

Ainet : Jurnal Informatika ◽

10.26618/ainet.v2i2.4025 ◽

2020 ◽

Vol 2 (2) ◽

pp. 7-13

Author(s):

Andi Nasri

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Word Error Rate ◽

Bahasa Indonesia

Dengan semakin berkembangnya teknologi speech recognition, berbagai software yang bertujuan untuk memudahkan orang tunarungu dalam berkomunikasi dengan yang lainnya telah dikembangkan. Sistem tersebut menterjemahkan suara ucapan menjadi bahasa isyarat atau sebaliknya bahasa isyarat diterjemahkan ke suara ucapan. Sistem tersebut sudah dikembangkan dalam berbagai bahasa seperti bahasa Inggris, Arab, Spanyol, Meksiko, Indonesia dan lain-lain. Khusus untuk bahasa Indonesia mulai juga sudah yang mencoba melakukan penelitian untuk membuat system seperti tersebut. Namun system yang dibuat masih terbatas pada Automatic Speech Recognition (ASR) yang digunakan dimana mempunyai kosa-kata yang terbatas. Dalam penelitian ini bertujuan untuk mengembangkan sistem penterjemah suara ucapan bahasa Indonesia ke Sistem Bahasa Isyarat Indonesia (SIBI) dengan data korpus yang lebih besar dan meggunkanan continue speech recognition untuk meningkatkan akurasi system.Dari hasil pengujian system menunjukan diperoleh hasil akurasi sebesar rata-rata 90,50 % dan Word Error Rate (WER) 9,50%. Hasil akurasi lebih tinggi dibandingkan penelitian kedua 48,75% dan penelitan pertama 66,67%. Disamping itu system juga dapat mengenali kata yang diucapkan secara kontinyu atau pengucapan kalimat. Kemudian hasil pengujian kinerja system mencapai 0,83 detik untuk Speech to Text dan 8,25 detik untuk speech to sign.

Download Full-text

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

10.21437/interspeech.2021-2075 ◽

2021 ◽

Author(s):

Zhong Meng ◽

Yu Wu ◽

Naoyuki Kanda ◽

Liang Lu ◽

Xie Chen ◽

...

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Language Model ◽

Word Error Rate ◽

Model Fusion ◽

End To End

Download Full-text

Đánh giá các hệ thống nhận dạng giọng nói tiếng việt (vais, viettel, zalo, fpt và google) trong bản tin

Journal of Technical Education Science ◽

10.54644/jte.63.2021.46 ◽

2021 ◽

pp. 28-35

Author(s):

Nguyen Thi My Thanh ◽

Phan Xuan Dung ◽

Nguyen Ngoc Hay ◽

Le Ngoc Bich ◽

Dao Xuan Quy

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Viet Nam ◽

Word Error Rate

Bài báo này giới thiệu kết quả đánh giá các hệ thống nhận dạng giọng nói tiếng Việt (VASP-Vietnamese Automatic Speech Recognition) trong bản tin từ các công ty hàng đầu của Việt Nam như Vais (Vietnam AI System), Viettel, Zalo, Fpt và công ty hàng đầu thế giới Google. Để đánh giá các hệ thống nhận dạng giọng nói, chúng tôi sử dụng hệ số Word Error Rate (WER) với đầu vào là văn bản thu được từ các hệ thống Vais VASP, Viettel VASP, Zalo VASP, Fpt VASP và Google VASP. Ở đây, chúng tôi sử dụng tập tin âm thanh là các bản tin và API từ các hệ thống Vais VASP, Viettel VASP, Zalo VASP, Fpt VASP và Google VASP để đưa ra văn bản được nhận dạng tương ứng. Kết quả so sánh WER từ Vais, Viettel, Zalo, Fpt và Google cho thấy hệ thống nhận dạng tiếng nói tiếng Việt trong các bản tin từ Viettel, Zalo, Fpt và Google đều có kết quả tốt, trong đó Vais cho kết quả vượt trội hơn.

Download Full-text

Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings

Information ◽

10.3390/info12020062 ◽

2021 ◽

Vol 12 (2) ◽

pp. 62 ◽

Cited By ~ 1

Author(s):

Eshete Derb Emiru ◽

Shengwu Xiong ◽

Yaxing Li ◽

Awet Fesseha ◽

Moussa Diallo

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Data Augmentation ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System ◽

Attention Model ◽

Recognition Systems ◽

End To End ◽

Connectionist Temporal Classification

Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems are performed at word and character levels of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes hybrid connectionist temporal classification with attention end-to-end architecture and a syllabification algorithm for Amharic automatic speech recognition system (AASR) using its phoneme-based subword units. This algorithm helps to insert the epithetic vowel እ[ɨ], which is not included in our Grapheme-to-Phoneme (G2P) conversion algorithm developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained in various Amharic subwords, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords tend to result in more accurate speech recognition systems than the character-based, phoneme-based, and character-based subword counterparts. Further improvement was also obtained in proposed phoneme-based subwords with the syllabification algorithm and SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to character-based acoustic modeling with the word-based recurrent neural network language modeling (RNNLM) baseline. These phoneme-based subword models are also useful to improve machine and speech translation tasks.

Download Full-text