Effects of Bayesian predictive classification using variational Bayesian posteriors for sparse training data in speech recognition

In this paper, we propose a new method for code-switching (CS) automatic speech recognition (ASR) in Korean. First, the phonetic variations in English pronunciation spoken by Korean speakers should be considered. Thus, we tried to find a unified pronunciation model based on phonetic knowledge and deep learning. Second, we extracted the CS sentences semantically similar to the target domain and then applied the language model (LM) adaptation to solve the biased modeling toward Korean due to the imbalanced training data. In this experiment, training data were AI Hub (1033 h) in Korean and Librispeech (960 h) in English. As a result, when compared to the baseline, the proposed method improved the error reduction rate (ERR) by up to 11.6% with phonetic variant modeling and by 17.3% when semantically similar sentences were applied to the LM adaptation. If we considered only English words, the word correction rate improved up to 24.2% compared to that of the baseline. The proposed method seems to be very effective in CS speech recognition.


Limited Training Data Robust Speech Recognition Using Kernel-Based Acoustic Models

2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings ◽

10.1109/icassp.2006.1660226 ◽

2006 ◽

Cited By ~ 1

Author(s):

M. Schaffoner ◽

S.E. Kruger ◽

E. Andelic ◽

M. Katz ◽

A. Wendemuth

Keyword(s):

Speech Recognition ◽

Training Data ◽

Robust Speech Recognition ◽

Acoustic Models


An i-vector based approach to training data clustering for improved speech recognition

10.21437/interspeech.2011-179 ◽

2011 ◽

Author(s):

Yu Zhang ◽

Jian Xu ◽

Zhi-Jie Yan ◽

Qiang Huo

Keyword(s):

Speech Recognition ◽

Data Clustering ◽

Training Data


UCSY-SC1: A Myanmar speech corpus for automatic speech recognition

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v9i4.pp3194-3202 ◽

2019 ◽

Vol 9 (4) ◽

pp. 3194 ◽

Cited By ~ 1

Author(s):

Aye Nyein Mon ◽

Win Pa Pa ◽

Ye Kyaw Thu

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Automatic Speech Recognition ◽

Gaussian Mixture ◽

Error Rates ◽

Training Data ◽

Speech Corpus ◽

Total Size ◽

Test Sets ◽

Web News

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research has been conducted by researchers around the world to improve their language technologies. Speech corpora are important in developing ASR, and creating them is especially necessary for low-resourced languages. Myanmar can be regarded as a low-resourced language because of the lack of pre-created resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) is created for Myanmar ASR research. The corpus covers two domains: news and daily conversations. The total size of the speech corpus is over 42 hrs: 25 hrs of web news and 17 hrs of recorded conversational data. The corpus was collected from 177 females and 84 males for the news data and 42 females and 4 males for the conversational domain. This corpus was used as training data for developing Myanmar ASR. Three types of acoustic models, Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN), were built and their results compared. Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The Myanmar ASR systems trained on this corpus gave satisfactory results on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
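The word error rates quoted in this abstract are conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the authors' scoring tool, the function name `word_error_rate` is hypothetical, and whitespace tokenization is an assumption (Myanmar text would need its own word segmentation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.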


Harnessing GANs for Zero-Shot Learning of New Classes in Visual Speech Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i03.5649 ◽

2020 ◽

Vol 34 (03) ◽

pp. 2645-2652 ◽

Cited By ~ 2

Author(s):

Yaman Kumar ◽

Dhruva Sahrawat ◽

Shubham Maheshwari ◽

Debanjan Mahata ◽

Amanda Stent ◽

...

Keyword(s):

Speech Recognition ◽

Classification Problem ◽

Visual Speech ◽

Training Data ◽

Generative Adversarial Networks ◽

Adversarial Networks ◽

Novel Approach ◽

Visual Speech Recognition ◽

Training Samples ◽

English Training

Visual Speech Recognition (VSR) is the process of recognizing or interpreting speech by watching the lip movements of the speaker. Recent machine learning based approaches model VSR as a classification problem; however, the scarcity of training data leads to error-prone systems with very low accuracies in predicting unseen classes. To solve this problem, we present a novel approach to zero-shot learning by generating new classes using Generative Adversarial Networks (GANs), and show how the addition of unseen class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases. We also show that our models are language agnostic and therefore capable of seamlessly generating, using English training data, videos for a new language (Hindi). To the best of our knowledge, this is the first work to show empirical evidence of the use of GANs for generating training samples of unseen classes in the domain of VSR, hence facilitating zero-shot learning. We make the added videos for new classes publicly available along with our code.


Training Wideband Acoustic Models in the Cepstral Domain Using Mixed-Bandwidth Training Data for Speech Recognition

The Journal of the Acoustical Society of America ◽

10.1121/1.3625676 ◽

2011 ◽

Vol 130 (2) ◽

pp. 1087

Author(s):

Michael L. Seltzer ◽

Alejandro Acero

Keyword(s):

Speech Recognition ◽

Training Data ◽

Acoustic Models


Hyperparameter estimation for speech recognition based on variational Bayesian approach

The Journal of the Acoustical Society of America ◽

10.1121/1.4787217 ◽

2006 ◽

Vol 120 (5) ◽

pp. 3042-3042 ◽

Cited By ~ 1

Author(s):

Kei Hashimoto ◽

Heiga Zen ◽

Yoshihiko Nankaku ◽

Akinobu Lee ◽

Keiichi Tokuda

Keyword(s):

Speech Recognition ◽

Bayesian Approach ◽

Variational Bayesian ◽

Hyperparameter Estimation


High-performance robust speech recognition using stereo training data

2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221) ◽

10.1109/icassp.2001.940827 ◽

2002 ◽

Cited By ~ 46

Author(s):

Li Deng ◽

A. Acero ◽

Li Jiang ◽

J. Droppo ◽

Xuedong Huang

Keyword(s):

Speech Recognition ◽

High Performance ◽

Training Data ◽

Robust Speech Recognition


Improving speech recognition using limited accent diverse British English training data with deep neural networks

2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP) ◽

10.1109/mlsp.2016.7738854 ◽

2016 ◽

Cited By ~ 1

Author(s):

Maryam Najafian ◽

Saeid Safavi ◽

John H. L. Hansen ◽

Martin Russell

Keyword(s):

Neural Networks ◽

Speech Recognition ◽

Deep Neural Networks ◽

Training Data ◽

British English ◽

English Training


Application of an Isolated Word Speech Recognition System in the Field of Mental Health Consultation: Development and Usability Study

JMIR Medical Informatics ◽

10.2196/18677 ◽

2020 ◽

Vol 8 (6) ◽

pp. e18677

Author(s):

Weifeng Fu

Keyword(s):

Mental Health ◽

Speech Recognition ◽

Psychological Treatment ◽

Recognition System ◽

Training Data ◽

Mental Health Consultation ◽

Speech Recognition System ◽

Parallel Operation ◽

Endpoint Detection ◽

Health Counseling

Background: Speech recognition is a technology that enables machines to understand human language. Objective: In this study, speech recognition of isolated words from a small vocabulary was applied to the field of mental health counseling. Methods: A software platform was used to establish a human-machine chat for psychological counselling. The software uses voice recognition technology to decode the user's voice information. The software system analyzes and processes the user's voice information according to many internal related databases, and then gives the user accurate feedback. For users who need psychological treatment, the system provides them with psychological education. Results: The speech recognition system included features such as speech extraction, endpoint detection, feature value extraction, training data, and speech recognition. Conclusions: The Hidden Markov Model was adopted, based on multithread programming under a VC2005 compilation environment, to realize the parallel operation of the algorithm and improve the efficiency of speech recognition. After the design was completed, simulation debugging was performed in the laboratory. The experimental results showed that the designed program met the basic requirements of a speech recognition system.
