An Effective Learning Method for Automatic Speech Recognition in Korean CI Patients’ Speech

Electronics ◽  
2021 ◽  
Vol 10 (7) ◽  
pp. 807
Author(s):  
Jiho Jeong ◽  
S. I. M. M. Raton Mondol ◽  
Yeon Wook Kim ◽  
Sangmin Lee

Automatic speech recognition (ASR) models usually require a large amount of training data; models trained on small datasets perform considerably worse. Applying ASR to non-standard speech, such as that of cochlear implant (CI) patients, is difficult because such data is hard to obtain owing to privacy concerns and limited access. In this paper, an effective fine-tuning and augmentation method for ASR is proposed. Experiments compare the character error rate (CER) of the ASR model trained with the baseline method and with the proposed method. The proposed method achieved a CER of 36.03% on the CI patients’ speech test set using only 2 h and 30 min of training data, a 62% improvement over the baseline method.
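The CER metric reported above is the character-level edit distance between the recognized text and the reference transcript, normalized by the reference length. A minimal stand-alone sketch of the computation (not the paper's code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(round(cer("speech", "spech"), 3))  # 0.167: one deletion over six characters
```

The same dynamic program, applied to word sequences instead of character sequences, yields the word error rate used by several of the other papers on this page.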

Author(s):  
Aye Nyein Mon ◽  
Win Pa Pa ◽  
Ye Kyaw Thu

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research is conducted worldwide to advance language technologies, and speech corpora are essential for developing ASR systems, especially for low-resourced languages. Myanmar can be regarded as a low-resourced language because few pre-built resources exist for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) was created for Myanmar ASR research. The corpus covers two domains, news and daily conversations, and totals over 42 h of speech: 25 h of web news and 17 h of recorded conversational data. The news data was collected from 177 female and 84 male speakers, and the conversational data from 42 female and 4 male speakers. The corpus was used as training data for developing Myanmar ASR. Three types of acoustic models were built and compared: Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN). Experiments were conducted on different data sizes and evaluated on two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The Myanmar ASR systems trained on this corpus performed satisfactorily on both test sets, reaching word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.


2021 ◽  
Vol 13 (0) ◽  
pp. 1-5
Author(s):  
Mantas Tamulionis

Methods based on artificial neural networks (ANNs) are widely used in various audio signal processing tasks, providing opportunities to optimize processes and save computational resources. One of the main objects needed to numerically capture the acoustics of a room is the room impulse response (RIR). Increasingly, researchers choose not to record these impulse responses in a real room but to generate them using ANNs, since this gives them the freedom to prepare training datasets of unlimited size. Neural networks are also used to augment generated impulse responses to make them resemble actually recorded ones. The widest use of ANNs so far is observed in the evaluation of generated results, for example, in automatic speech recognition (ASR) tasks. This review also describes the datasets of recorded RIRs commonly found in the literature that are used as training data for neural networks.
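As a toy illustration of what a generated RIR looks like (a deliberately simplistic statistical model, not any of the ANN methods the review surveys), an impulse response can be approximated as white noise under an exponential decay envelope whose rate is set by a target reverberation time, then convolved with a dry signal to simulate the room:

```python
import math
import random

def synthetic_rir(rt60=0.4, fs=16000, length_s=0.5, seed=0):
    """Toy RIR: white noise under an exponential decay envelope.

    The decay rate is chosen so the envelope falls by 60 dB after
    rt60 seconds (the usual reverberation-time definition). This
    ignores early reflections and room geometry entirely.
    """
    rng = random.Random(seed)
    n = int(fs * length_s)
    # 60 dB amplitude decay -> factor of 10**(-3) at t = rt60
    tau = rt60 / (3 * math.log(10))
    return [rng.gauss(0, 1) * math.exp(-i / (fs * tau)) for i in range(n)]

def convolve(signal, rir):
    """Direct convolution: apply the room response to a dry signal."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

rir = synthetic_rir()
wet = convolve([1.0, 0.5, 0.25], rir)  # reverberant version of a tiny "signal"
```

Convolving clean speech with such impulse responses is the standard way reverberant training data is produced for ASR augmentation, which is why RIR quality matters for the downstream evaluations the review describes.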


Author(s):  
Lakshika Kavmini ◽  
Thilini Dinushika ◽  
Uthayasanker Thayasivam ◽  
Sanath Jayasena

The recent advancements in conversational Artificial Intelligence (AI) are rapidly being integrated into every realm of human life. Conversational agents that can learn, understand human languages, and mimic the human thinking process have already revolutionized human lifestyles. Understanding a speaker's intention from natural speech is a significant step in conversational AI. A major challenge that hinders the efficacy of this process is the lack of language resources. In this research, we address this issue and develop a domain-specific speech command classification system for Sinhala, a low-resourced language. An effective speech command classification system can be utilized in several value-added applications such as speech dialog systems. Our system is developed by integrating Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). The ASR engine, implemented using a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), converts a Sinhala speech command into a corresponding text representation. The text classifier, implemented as an ensemble of several classifiers, predicts the intent of the speaker from that text output. In this paper, we discuss and evaluate various algorithms and techniques for optimizing the performance of both the ASR engine and the text classifier. We also present our novel Sinhala speech data corpus of 4.15 h, based on the banking domain. As the final outcome, our system reports a Sinhala speech command classification accuracy of 91.03%, outperforming the state-of-the-art speech-to-intent mapping systems developed for the Sinhala language.
The individual evaluation of the ASR system reports a 9.91% Word Error Rate and a 19.95% Sentence Error Rate, suggesting the applicability of advanced speech recognition techniques despite the limited language resources. Finally, our findings deliver useful insights for further research on speech command classification in low-resourced settings.
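The ensemble text classifier described above can be illustrated with a minimal majority-voting sketch. The keyword rules and intent labels below are hypothetical stand-ins for the trained member classifiers (the paper's banking-domain models are not public), so this shows only the voting mechanism:

```python
from collections import Counter

def kw_classifier(keyword_map):
    """Build a trivial keyword-rule classifier; a real ensemble would
    combine trained models (e.g., SVM, naive Bayes, logistic regression)."""
    def classify(text):
        for intent, words in keyword_map.items():
            if any(w in text.lower() for w in words):
                return intent
        return "unknown"
    return classify

# Three hypothetical members with slightly different rules.
classifiers = [
    kw_classifier({"check_balance": ["balance"], "transfer": ["transfer", "send"]}),
    kw_classifier({"check_balance": ["balance", "account"], "transfer": ["send"]}),
    kw_classifier({"transfer": ["transfer"], "check_balance": ["balance"]}),
]

def ensemble_predict(text):
    """Majority vote over member predictions; ties broken by first seen."""
    votes = Counter(c(text) for c in classifiers)
    return votes.most_common(1)[0][0]

print(ensemble_predict("please transfer money"))  # transfer
```

Feeding the ASR output text into such an ensemble, rather than a single classifier, is what gives the pipeline some robustness to individual-model errors on noisy recognized text.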


2007 ◽  
Author(s):  
Waldo Nogueira ◽  
Tamás Harczos ◽  
Bernd Edler ◽  
Jörn Ostermann ◽  
Andreas Büchner

2020 ◽  
Vol 2 (2) ◽  
pp. 7-13
Author(s):  
Andi Nasri

With the continuing development of speech recognition technology, various software aimed at helping deaf people communicate with others has been developed. Such systems translate spoken utterances into sign language, or conversely translate sign language into speech. Systems of this kind have been developed for various languages such as English, Arabic, Spanish, Mexican Spanish, Indonesian, and others. For Indonesian in particular, researchers have begun attempting to build such systems. However, the systems built so far have been limited by the Automatic Speech Recognition (ASR) engines used, which have restricted vocabularies. This research aims to develop a system that translates spoken Indonesian into the Indonesian Sign Language System (SIBI) using a larger corpus and continuous speech recognition to improve system accuracy. Testing showed an average accuracy of 90.50% and a Word Error Rate (WER) of 9.50%. This accuracy is higher than that of the second previous study (48.75%) and the first (66.67%). In addition, the system can recognize continuously spoken words, i.e., full sentences. Performance tests showed processing times of 0.83 s for speech to text and 8.25 s for speech to sign.
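A speech-to-sign pipeline of this kind typically maps the words recognized by the ASR engine to sign glosses through a lexicon. The sketch below uses a tiny hypothetical SIBI lexicon and a letter-by-letter fingerspelling fallback for out-of-vocabulary words (a common strategy in sign synthesis); it is an illustration, not the system described in the paper:

```python
# Hypothetical SIBI gloss lexicon; a real system would map each gloss
# to a sign animation or video clip rather than a string label.
SIBI_LEXICON = {"saya": "SAYA", "makan": "MAKAN", "nasi": "NASI"}

def text_to_sibi(text):
    """Map recognized Indonesian text to a SIBI gloss sequence."""
    glosses = []
    for word in text.lower().split():
        if word in SIBI_LEXICON:
            glosses.append(SIBI_LEXICON[word])
        else:
            glosses.extend(list(word.upper()))  # fingerspell OOV words
    return glosses

print(text_to_sibi("saya makan roti"))  # ['SAYA', 'MAKAN', 'R', 'O', 'T', 'I']
```

The coverage of the lexicon is exactly where the vocabulary limits of earlier systems show up: every out-of-vocabulary word degrades to slow fingerspelling, which is why a larger corpus and continuous-speech ASR matter here.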


Author(s):  
Nguyen Thi My Thanh ◽  
Phan Xuan Dung ◽  
Nguyen Ngoc Hay ◽  
Le Ngoc Bich ◽  
Dao Xuan Quy

This paper presents an evaluation of Vietnamese automatic speech recognition systems (VASP, Vietnamese Automatic Speech Recognition) on news broadcasts, covering systems from leading Vietnamese companies such as Vais (Vietnam AI System), Viettel, Zalo, and Fpt, as well as from the world-leading company Google. To evaluate the systems, we compute the Word Error Rate (WER) on the transcripts produced by the Vais VASP, Viettel VASP, Zalo VASP, Fpt VASP, and Google VASP systems. Here, we feed news-broadcast audio files to the APIs of these systems to obtain the corresponding recognized text. The WER comparison across Vais, Viettel, Zalo, Fpt, and Google shows that the Vietnamese speech recognition systems from Viettel, Zalo, Fpt, and Google all perform well on news broadcasts, with Vais giving the best results.
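The evaluation procedure (scoring each system's transcript against a reference by WER, then ranking) can be sketched as follows. The transcripts and system names here are hypothetical placeholders, not outputs of the actual vendor APIs:

```python
def word_error_rate(ref, hyp):
    """WER via word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

# Hypothetical transcripts standing in for the per-vendor API outputs.
reference = "the news bulletin starts at nine"
hypotheses = {"system_a": "the news bulletin starts at nine",
              "system_b": "the news bulletin start at nine"}
ranking = sorted(hypotheses, key=lambda s: word_error_rate(reference, hypotheses[s]))
print(ranking[0])  # system_a, the system with the lowest WER
```

In a real benchmark the reference transcripts would be manual transcriptions of the broadcast audio, and each system's WER would be averaged over the whole test set before ranking.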


Electronics ◽  
2021 ◽  
Vol 10 (24) ◽  
pp. 3172
Author(s):  
Qingran Zhan ◽  
Xiang Xie ◽  
Chenguang Hu ◽  
Juan Zuluaga-Gomez ◽  
Jing Wang ◽  
...  

Phonologically based features (articulatory features, AFs) describe the movements of the vocal organs, which are shared across languages. This paper investigates a domain-adversarial neural network (DANN) for extracting reliable AFs, and different multi-stream techniques are used for cross-lingual speech recognition. First, a novel universal definition of phonological attributes is proposed for Mandarin, English, German, and French. Then a DANN-based AF detector is trained on the source languages (English, German, and French). For cross-lingual speech recognition, the AF detectors transfer phonological knowledge from the source languages to the target language (Mandarin). Two multi-stream approaches are introduced to fuse the acoustic features and cross-lingual AFs. In addition, a monolingual AF system (i.e., AFs extracted directly from the target language) is also investigated. Experiments show that the performance of the AF detector can be improved by using convolutional neural networks (CNNs) with domain-adversarial learning. The multi-head attention (MHA) based multi-stream approach reaches the best performance compared with the baseline, the cross-lingual adaptation approach, and the other approaches. More specifically, the MHA mode with cross-lingual AFs yields significant improvements over monolingual AFs under restricted training data sizes and can easily be extended to other low-resource languages.
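The attention-based stream fusion can be illustrated with a single-head scaled dot-product attention sketch, a simplification of the paper's multi-head variant: each acoustic frame (query) attends over articulatory-feature frames (keys/values) and receives a weighted mix of the AF stream. All frame values below are toy numbers, not features from any trained detector:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_fuse(queries, keys, values):
    """Single-head scaled dot-product attention over two streams."""
    dk = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this acoustic frame to every AF frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dk)
                  for k in keys]
        weights = softmax(scores)
        # Weighted mix of AF values for this frame.
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(len(values[0]))])
    return fused

# Toy 2-D frames: acoustic stream as queries, AF stream as keys/values.
acoustic = [[1.0, 0.0], [0.0, 1.0]]
af_keys = [[1.0, 0.0], [0.0, 1.0]]
af_vals = [[0.9, 0.1], [0.1, 0.9]]
fused = attention_fuse(acoustic, af_keys, af_vals)
```

A multi-head version would run several such attentions in parallel over learned projections of the streams and concatenate the results; the single head above keeps only the core fusion mechanism.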

