Word error rate improvement and complexity reduction in Automatic Speech Recognition by analyzing acoustic model uncertainty and confusion

Konversi Suara Ucapan Bahasa Indonesia Ke Sistem Bahasa Isyarat Indonesia (Sibi)

Ainet : Jurnal Informatika ◽

10.26618/ainet.v2i2.4025 ◽

2020 ◽

Vol 2 (2) ◽

pp. 7-13

Author(s):

Andi Nasri

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Word Error Rate ◽

Bahasa Indonesia

Dengan semakin berkembangnya teknologi speech recognition, berbagai software yang bertujuan untuk memudahkan orang tunarungu dalam berkomunikasi dengan yang lainnya telah dikembangkan. Sistem tersebut menterjemahkan suara ucapan menjadi bahasa isyarat atau sebaliknya bahasa isyarat diterjemahkan ke suara ucapan. Sistem tersebut sudah dikembangkan dalam berbagai bahasa seperti bahasa Inggris, Arab, Spanyol, Meksiko, Indonesia dan lain-lain. Khusus untuk bahasa Indonesia mulai juga sudah yang mencoba melakukan penelitian untuk membuat system seperti tersebut. Namun system yang dibuat masih terbatas pada Automatic Speech Recognition (ASR) yang digunakan dimana mempunyai kosa-kata yang terbatas. Dalam penelitian ini bertujuan untuk mengembangkan sistem penterjemah suara ucapan bahasa Indonesia ke Sistem Bahasa Isyarat Indonesia (SIBI) dengan data korpus yang lebih besar dan meggunkanan continue speech recognition untuk meningkatkan akurasi system.Dari hasil pengujian system menunjukan diperoleh hasil akurasi sebesar rata-rata 90,50 % dan Word Error Rate (WER) 9,50%. Hasil akurasi lebih tinggi dibandingkan penelitian kedua 48,75% dan penelitan pertama 66,67%. Disamping itu system juga dapat mengenali kata yang diucapkan secara kontinyu atau pengucapan kalimat. Kemudian hasil pengujian kinerja system mencapai 0,83 detik untuk Speech to Text dan 8,25 detik untuk speech to sign.

Download Full-text

Building Acoustic and Language Model for Continuous Speech Recognition in Bahasa Indonesia

Jurnal Teknik Informatika dan Sistem Informasi ◽

10.28932/jutisi.v6i2.2684 ◽

2020 ◽

Vol 6 (2) ◽

Author(s):

Vincent Elbert Budiman ◽

Andreas Widjaja

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Language Model ◽

Beam Width ◽

Acoustic Model ◽

Continuous Speech ◽

Continuous Speech Recognition ◽

Word Error Rate ◽

Testing Data ◽

Bahasa Indonesia

Here a development of an Acoustic and Language Model is presented. Low Word Error Rate is an early good sign of a good Language and Acoustic Model. Although there are still parameters other than Words Error Rate, our work focused on building Bahasa Indonesia with approximately 2000 common words and achieved the minimum threshold of 25% Word Error Rate. There were several experiments consist of different cases, training data, and testing data with Word Error Rate and Testing Ratio as the main comparison. The language and acoustic model were built using Sphinx4 from Carnegie Mellon University using Hidden Markov Model for the acoustic model and ARPA Model for the language model. The models configurations, which are Beam Width and Force Alignment, directly correlates with Word Error Rate. The configurations were set to 1e-80 for Beam Width and 1e-60 for Force Alignment to prevent underfitting or overfitting of the acoustic model. The goals of this research are to build continuous speech recognition in Bahasa Indonesia which has low Word Error Rate and to determine the optimum numbers of training and testing data which minimize the Word Error Rate.

Download Full-text

Đánh giá các hệ thống nhận dạng giọng nói tiếng việt (vais, viettel, zalo, fpt và google) trong bản tin

Journal of Technical Education Science ◽

10.54644/jte.63.2021.46 ◽

2021 ◽

pp. 28-35

Author(s):

Nguyen Thi My Thanh ◽

Phan Xuan Dung ◽

Nguyen Ngoc Hay ◽

Le Ngoc Bich ◽

Dao Xuan Quy

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Viet Nam ◽

Word Error Rate

Bài báo này giới thiệu kết quả đánh giá các hệ thống nhận dạng giọng nói tiếng Việt (VASP-Vietnamese Automatic Speech Recognition) trong bản tin từ các công ty hàng đầu của Việt Nam như Vais (Vietnam AI System), Viettel, Zalo, Fpt và công ty hàng đầu thế giới Google. Để đánh giá các hệ thống nhận dạng giọng nói, chúng tôi sử dụng hệ số Word Error Rate (WER) với đầu vào là văn bản thu được từ các hệ thống Vais VASP, Viettel VASP, Zalo VASP, Fpt VASP và Google VASP. Ở đây, chúng tôi sử dụng tập tin âm thanh là các bản tin và API từ các hệ thống Vais VASP, Viettel VASP, Zalo VASP, Fpt VASP và Google VASP để đưa ra văn bản được nhận dạng tương ứng. Kết quả so sánh WER từ Vais, Viettel, Zalo, Fpt và Google cho thấy hệ thống nhận dạng tiếng nói tiếng Việt trong các bản tin từ Viettel, Zalo, Fpt và Google đều có kết quả tốt, trong đó Vais cho kết quả vượt trội hơn.

Download Full-text

Langkah Praktis Membangun Sistem Pengenalan Suara dengan HTK

JSAI (Journal Scientific and Applied Informatics) ◽

10.36085/jsai.v2i2.314 ◽

2019 ◽

Vol 2 (2) ◽

pp. 149-153

Author(s):

Zulkarnaen Hatala

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Principle Component Analysis ◽

Hidden Markov ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System ◽

Word Error Rate ◽

Bahasa Indonesia

Dipaparkan prosedur untuk mengembangkan Sistem Pengenalan Suara otomatis, Automatic Speech Recognition System (ASR) untuk kasus online recognition. Prosedur ini secara cepat dan efisien membangun ASR menggunakan Hidden Markov Toolkit (HTK). Langkah-langkah praktis ini dipaparkan secara jelas untuk mengimplementasikan ASR dengan daftar kata sedikit (Small Vocabulary) dalam contoh kasus pengenalan digit Bahasa Indonesia. Dijelaskan beberapa teknik meningkatkan performansi seperti cara mengatasi noise, pengejaan ganda dan penerapan Principle Component Analysis. Hasil akhir berupa Word Error Rate

Download Full-text

Development of End – to – End Encoder - Decoder Model Applying Voice Recognition System in Different Channels

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1267.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 2350-2352

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Voice Recognition ◽

Ground Truth ◽

Recognition System ◽

Training Algorithms ◽

Word Error Rate ◽

End To End ◽

Evaluation Metric

the dissimilarity in recognizing the word sequence and their ground truth in different channels can be absorbed by implementing Automatic Speech Recognition which is the standard evaluation metric and is encountered with the phenomena of Word Error Rate for various measures. In the model of 1ch, the track is trained without any preprocessing and study on multichannel end-to-end Automatic Speech Recognition envisaged that the function can be integrated into (Deep Neural network) – based system and lead to multiple experimental results. More so, when the Word Error Rate (WER) is not directly differentiable, it is pertinent to adopt Encoder – Decoder gradient objective function which has been clear in CHiME-4 system. In this study, we examine that the sequence level evaluation metric is a fair choice for optimizing Encoder – Decoder model for which many training algorithms is designed to reduce sequence level error. The study incorporates the scoring of multiple hypotheses in decoding stage for improving the decoding result to optimum. By this, the mismatch between the objectives is resulted in a feasible form to the maxim. Hence, the study finds the result of voice recognition which is most effective for adaptation.

Download Full-text

Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Sensors ◽

10.3390/s21093063 ◽

2021 ◽

Vol 21 (9) ◽

pp. 3063

Author(s):

Aleksandr Laptev ◽

Andrei Andrusenko ◽

Ivan Podluzhny ◽

Anton Mitrofanov ◽

Ivan Medennikov ◽

...

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Rapid Development ◽

Computational Cost ◽

Vocabulary Size ◽

Word Error Rate ◽

Low Resource ◽

Steady Improvement ◽

End To End ◽

Asr System

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.

Download Full-text

Joint training of speech separation, filterbank and acoustic model for robust automatic speech recognition

10.21437/interspeech.2015-597 ◽

2015 ◽

Author(s):

Zhong-Qiu Wang ◽

DeLiang Wang

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Acoustic Model ◽

Speech Separation ◽

Joint Training

Download Full-text

A study on model-based error rate estimation for automatic speech recognition

IEEE Transactions on Speech and Audio Processing ◽

10.1109/tsa.2003.818030 ◽

2003 ◽

Vol 11 (6) ◽

pp. 581-589 ◽

Cited By ~ 7

Author(s):

Chao-Shih Huang ◽

Hsiao-Chuan Wang ◽

Chin-Hui Lee

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Error Rate ◽

Rate Estimation ◽

Model Based

Download Full-text

A De Novo Divide-and-Merge Paradigm for Acoustic Model Optimization in Automatic Speech Recognition

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/513 ◽

2020 ◽

Author(s):

Conghui Tan ◽

Di Jiang ◽

Jinhua Peng ◽

Xueyang Wu ◽

Qian Xu ◽

...

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

De Novo ◽

Superior Performance ◽

Acoustic Model ◽

Acoustic Models ◽

Public Data ◽

Speech Data ◽

Low Efficiency ◽

Novel Algorithms

Due to the rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model with complete data as before. In this paper, we propose a novel Divide-and-Merge paradigm to solve salient problems plaguing the ASR field. In the Divide phase, multiple acoustic models are trained based upon different subsets of the complete speech data, while in the Merge phase two novel algorithms are utilized to generate a high-quality acoustic model based upon those trained on data subsets. We first propose the Genetic Merge Algorithm (GMA), which is a highly specialized algorithm for optimizing acoustic models but suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior performance. Extensive experiments on public data show that the proposed methods can significantly outperform the state-of-the-art.

Download Full-text

Text Corpus and Acoustic Model Addition for Indonesian-Arabic Code-switching in Automatic Speech Recognition System

2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA) ◽

10.1109/icaicta.2019.8904183 ◽

2019 ◽

Author(s):

Rizky Elzandi Barik ◽

Dessi Puji Lestari

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition System ◽

Code Switching ◽

Speech Recognition System ◽

Acoustic Model ◽

Automatic Speech Recognition System ◽

Text Corpus

Download Full-text