An Effective Learning Method for Automatic Speech Recognition in Korean CI Patients’ Speech

Electronics ◽  
2021 ◽  
Vol 10 (7) ◽  
pp. 807
Author(s):  
Jiho Jeong ◽  
S. I. M. M. Raton Mondol ◽  
Yeon Wook Kim ◽  
Sangmin Lee

Automatic speech recognition (ASR) models usually require a large amount of training data; models trained on small datasets perform considerably worse. Applying ASR to non-standard speech, such as that of cochlear implant (CI) patients, is difficult because such data is hard to obtain owing to privacy concerns and limited access. In this paper, an effective fine-tuning and augmentation method for ASR is proposed. Experiments compare the character error rate (CER) of the ASR model trained with the baseline method and with the proposed method. The proposed method achieved a CER of 36.03% on the CI patients’ speech test set using only 2 h and 30 min of training data, a 62% improvement over the baseline method.
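The CER metric reported above is the character-level edit distance between the recognized text and the reference transcript, normalized by the reference length. A minimal stand-alone sketch of the computation (not the paper's code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(round(cer("speech", "spech"), 3))  # 0.167: one deletion over six characters
```

The same dynamic program, applied to word sequences instead of character sequences, yields the word error rate used by several of the other papers on this page.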

Author(s):  
Aye Nyein Mon ◽  
Win Pa Pa ◽  
Ye Kyaw Thu

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research is conducted worldwide to advance language technologies, and speech corpora are essential for developing ASR systems, especially for low-resourced languages. Myanmar can be regarded as a low-resourced language because few pre-built resources exist for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) was created for Myanmar ASR research. The corpus covers two domains, news and daily conversations, and totals over 42 h of speech: 25 h of web news and 17 h of recorded conversational data. The news data was collected from 177 female and 84 male speakers, and the conversational data from 42 female and 4 male speakers. The corpus was used as training data for developing Myanmar ASR. Three types of acoustic models were built and compared: Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN). Experiments were conducted on different data sizes and evaluated on two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The Myanmar ASR systems trained on this corpus performed satisfactorily on both test sets, reaching word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.


2021 ◽  
Vol 13 (0) ◽  
pp. 1-5
Author(s):  
Mantas Tamulionis

Methods based on artificial neural networks (ANNs) are widely used in various audio signal processing tasks, providing opportunities to optimize processes and save computational resources. One of the main objects needed to numerically capture the acoustics of a room is the room impulse response (RIR). Increasingly, researchers choose not to record these impulse responses in a real room but to generate them using ANNs, since this gives them the freedom to prepare training datasets of unlimited size. Neural networks are also used to augment generated impulse responses to make them resemble actually recorded ones. The widest use of ANNs so far is observed in the evaluation of generated results, for example, in automatic speech recognition (ASR) tasks. This review also describes the datasets of recorded RIRs commonly found in the literature that are used as training data for neural networks.
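As a toy illustration of what a generated RIR looks like (a deliberately simplistic statistical model, not any of the ANN methods the review surveys), an impulse response can be approximated as white noise under an exponential decay envelope whose rate is set by a target reverberation time, then convolved with a dry signal to simulate the room:

```python
import math
import random

def synthetic_rir(rt60=0.4, fs=16000, length_s=0.5, seed=0):
    """Toy RIR: white noise under an exponential decay envelope.

    The decay rate is chosen so the envelope falls by 60 dB after
    rt60 seconds (the usual reverberation-time definition). This
    ignores early reflections and room geometry entirely.
    """
    rng = random.Random(seed)
    n = int(fs * length_s)
    # 60 dB amplitude decay -> factor of 10**(-3) at t = rt60
    tau = rt60 / (3 * math.log(10))
    return [rng.gauss(0, 1) * math.exp(-i / (fs * tau)) for i in range(n)]

def convolve(signal, rir):
    """Direct convolution: apply the room response to a dry signal."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

rir = synthetic_rir()
wet = convolve([1.0, 0.5, 0.25], rir)  # reverberant version of a tiny "signal"
```

Convolving clean speech with such impulse responses is the standard way reverberant training data is produced for ASR augmentation, which is why RIR quality matters for the downstream evaluations the review describes.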


Author(s):  
Lakshika Kavmini ◽  
Thilini Dinushika ◽  
Uthayasanker Thayasivam ◽  
Sanath Jayasena

The recent advancements in conversational Artificial Intelligence (AI) are rapidly being integrated into every realm of human life. Conversational agents that can learn, understand human languages, and mimic the human thinking process have already revolutionized human lifestyles. Understanding a speaker's intention from natural speech is a significant step in conversational AI. A major challenge that hinders the efficacy of this process is the lack of language resources. In this research, we address this issue and develop a domain-specific speech command classification system for Sinhala, a low-resourced language. An effective speech command classification system can be utilized in several value-added applications such as speech dialog systems. Our system is developed by integrating Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). The ASR engine, implemented using a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), converts a Sinhala speech command into a corresponding text representation. The text classifier, implemented as an ensemble of several classifiers, predicts the intent of the speaker from that text output. In this paper, we discuss and evaluate various algorithms and techniques for optimizing the performance of both the ASR engine and the text classifier. We also present our novel Sinhala speech data corpus of 4.15 h, based on the banking domain. As the final outcome, our system reports a Sinhala speech command classification accuracy of 91.03%, outperforming the state-of-the-art speech-to-intent mapping systems developed for the Sinhala language.
The individual evaluation of the ASR system reports a 9.91% Word Error Rate and a 19.95% Sentence Error Rate, suggesting the applicability of advanced speech recognition techniques despite the limited language resources. Finally, our findings deliver useful insights for further research on speech command classification in low-resourced settings.
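The ensemble text classifier described above can be illustrated with a minimal majority-voting sketch. The keyword rules and intent labels below are hypothetical stand-ins for the trained member classifiers (the paper's banking-domain models are not public), so this shows only the voting mechanism:

```python
from collections import Counter

def kw_classifier(keyword_map):
    """Build a trivial keyword-rule classifier; a real ensemble would
    combine trained models (e.g., SVM, naive Bayes, logistic regression)."""
    def classify(text):
        for intent, words in keyword_map.items():
            if any(w in text.lower() for w in words):
                return intent
        return "unknown"
    return classify

# Three hypothetical members with slightly different rules.
classifiers = [
    kw_classifier({"check_balance": ["balance"], "transfer": ["transfer", "send"]}),
    kw_classifier({"check_balance": ["balance", "account"], "transfer": ["send"]}),
    kw_classifier({"transfer": ["transfer"], "check_balance": ["balance"]}),
]

def ensemble_predict(text):
    """Majority vote over member predictions; ties broken by first seen."""
    votes = Counter(c(text) for c in classifiers)
    return votes.most_common(1)[0][0]

print(ensemble_predict("please transfer money"))  # transfer
```

Feeding the ASR output text into such an ensemble, rather than a single classifier, is what gives the pipeline some robustness to individual-model errors on noisy recognized text.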


2007 ◽  
Author(s):  
Waldo Nogueira ◽  
Tamás Harczos ◽  
Bernd Edler ◽  
Jörn Ostermann ◽  
Andreas Büchner

2020 ◽  
Vol 2 (2) ◽  
pp. 7-13
Author(s):  
Andi Nasri

With the continuing development of speech recognition technology, various software aimed at helping deaf people communicate with others has been developed. Such systems translate spoken utterances into sign language, or conversely translate sign language into speech. Systems of this kind have been developed for various languages such as English, Arabic, Spanish, Mexican Spanish, Indonesian, and others. For Indonesian in particular, researchers have begun attempting to build such systems. However, the systems built so far have been limited by the Automatic Speech Recognition (ASR) engines used, which have restricted vocabularies. This research aims to develop a system that translates spoken Indonesian into the Indonesian Sign Language System (SIBI) using a larger corpus and continuous speech recognition to improve system accuracy. Testing showed an average accuracy of 90.50% and a Word Error Rate (WER) of 9.50%. This accuracy is higher than that of the second previous study (48.75%) and the first (66.67%). In addition, the system can recognize continuously spoken words, i.e., full sentences. Performance tests showed processing times of 0.83 s for speech to text and 8.25 s for speech to sign.
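A speech-to-sign pipeline of this kind typically maps the words recognized by the ASR engine to sign glosses through a lexicon. The sketch below uses a tiny hypothetical SIBI lexicon and a letter-by-letter fingerspelling fallback for out-of-vocabulary words (a common strategy in sign synthesis); it is an illustration, not the system described in the paper:

```python
# Hypothetical SIBI gloss lexicon; a real system would map each gloss
# to a sign animation or video clip rather than a string label.
SIBI_LEXICON = {"saya": "SAYA", "makan": "MAKAN", "nasi": "NASI"}

def text_to_sibi(text):
    """Map recognized Indonesian text to a SIBI gloss sequence."""
    glosses = []
    for word in text.lower().split():
        if word in SIBI_LEXICON:
            glosses.append(SIBI_LEXICON[word])
        else:
            glosses.extend(list(word.upper()))  # fingerspell OOV words
    return glosses

print(text_to_sibi("saya makan roti"))  # ['SAYA', 'MAKAN', 'R', 'O', 'T', 'I']
```

The coverage of the lexicon is exactly where the vocabulary limits of earlier systems show up: every out-of-vocabulary word degrades to slow fingerspelling, which is why a larger corpus and continuous-speech ASR matter here.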


Author(s):  
Nguyen Thi My Thanh ◽  
Phan Xuan Dung ◽  
Nguyen Ngoc Hay ◽  
Le Ngoc Bich ◽  
Dao Xuan Quy

This paper presents an evaluation of Vietnamese automatic speech recognition systems (VASP, Vietnamese Automatic Speech Recognition) on news broadcasts, covering systems from leading Vietnamese companies such as Vais (Vietnam AI System), Viettel, Zalo, and Fpt, as well as from the world-leading company Google. To evaluate the systems, we compute the Word Error Rate (WER) on the transcripts produced by the Vais VASP, Viettel VASP, Zalo VASP, Fpt VASP, and Google VASP systems. Here, we feed news-broadcast audio files to the APIs of these systems to obtain the corresponding recognized text. The WER comparison across Vais, Viettel, Zalo, Fpt, and Google shows that the Vietnamese speech recognition systems from Viettel, Zalo, Fpt, and Google all perform well on news broadcasts, with Vais giving the best results.
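The evaluation procedure (scoring each system's transcript against a reference by WER, then ranking) can be sketched as follows. The transcripts and system names here are hypothetical placeholders, not outputs of the actual vendor APIs:

```python
def word_error_rate(ref, hyp):
    """WER via word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

# Hypothetical transcripts standing in for the per-vendor API outputs.
reference = "the news bulletin starts at nine"
hypotheses = {"system_a": "the news bulletin starts at nine",
              "system_b": "the news bulletin start at nine"}
ranking = sorted(hypotheses, key=lambda s: word_error_rate(reference, hypotheses[s]))
print(ranking[0])  # system_a, the system with the lowest WER
```

In a real benchmark the reference transcripts would be manual transcriptions of the broadcast audio, and each system's WER would be averaged over the whole test set before ranking.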


Electronics ◽  
2021 ◽  
Vol 10 (24) ◽  
pp. 3172
Author(s):  
Qingran Zhan ◽  
Xiang Xie ◽  
Chenguang Hu ◽  
Juan Zuluaga-Gomez ◽  
Jing Wang ◽  
...  

Phonologically based features (articulatory features, AFs) describe the movements of the vocal organs, which are shared across languages. This paper investigates a domain-adversarial neural network (DANN) for extracting reliable AFs, and different multi-stream techniques are used for cross-lingual speech recognition. First, a novel universal definition of phonological attributes is proposed for Mandarin, English, German, and French. Then a DANN-based AF detector is trained on the source languages (English, German, and French). For cross-lingual speech recognition, the AF detectors transfer phonological knowledge from the source languages to the target language (Mandarin). Two multi-stream approaches are introduced to fuse the acoustic features and cross-lingual AFs. In addition, a monolingual AF system (i.e., AFs extracted directly from the target language) is also investigated. Experiments show that the performance of the AF detector can be improved by using convolutional neural networks (CNNs) with domain-adversarial learning. The multi-head attention (MHA) based multi-stream approach reaches the best performance compared with the baseline, the cross-lingual adaptation approach, and the other approaches. More specifically, the MHA mode with cross-lingual AFs yields significant improvements over monolingual AFs under restricted training data sizes and can easily be extended to other low-resource languages.
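The attention-based stream fusion can be illustrated with a single-head scaled dot-product attention sketch, a simplification of the paper's multi-head variant: each acoustic frame (query) attends over articulatory-feature frames (keys/values) and receives a weighted mix of the AF stream. All frame values below are toy numbers, not features from any trained detector:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_fuse(queries, keys, values):
    """Single-head scaled dot-product attention over two streams."""
    dk = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this acoustic frame to every AF frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dk)
                  for k in keys]
        weights = softmax(scores)
        # Weighted mix of AF values for this frame.
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(len(values[0]))])
    return fused

# Toy 2-D frames: acoustic stream as queries, AF stream as keys/values.
acoustic = [[1.0, 0.0], [0.0, 1.0]]
af_keys = [[1.0, 0.0], [0.0, 1.0]]
af_vals = [[0.9, 0.1], [0.1, 0.9]]
fused = attention_fuse(acoustic, af_keys, af_vals)
```

A multi-head version would run several such attentions in parallel over learned projections of the streams and concatenate the results; the single head above keeps only the core fusion mechanism.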

