Generative Adversarial Training Data Adaptation for Very Low-Resource Automatic Speech Recognition

This paper introduces a speech corpus which is developed for Myanmar Automatic Speech Recognition (ASR) research. Automatic Speech Recognition (ASR) research has been conducted by the researchers around the world to improve their language technologies. Speech corpora are important in developing the ASR and the creation of the corpora is necessary especially for low-resourced languages. Myanmar language can be regarded as a low-resourced language because of lack of pre-created resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) is created for Myanmar ASR research. The corpus consists of two types of domain: news and daily conversations. The total size of the speech corpus is over 42 hrs. There are 25 hrs of web news and 17 hrs of conversational recorded data.<br />The corpus was collected from 177 females and 84 males for the news data and 42 females and 4 males for conversational domain. This corpus was used as training data for developing Myanmar ASR. Three different types of acoustic models such as Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN) models were built and compared their results. Experiments were conducted on different data sizes and evaluation is done by two test sets: TestSet1, web news and TestSet2, recorded conversational data. It showed that the performance of Myanmar ASRs using this corpus gave satisfiable results on both test sets. The Myanmar ASR using this corpus leading to word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.<br /><br />

Download Full-text

AN OVERVIEW OF METHODS FOR GENERATING, AUGMENTING AND EVALUATING ROOM IMPULSE RESPONSE USING ARTIFICIAL NEURAL NETWORKS

Mokslas - Lietuvos ateitis ◽

10.3846/mla.2021.15152 ◽

2021 ◽

Vol 13 (0) ◽

pp. 1-5

Author(s):

Mantas Tamulionis

Keyword(s):

Neural Networks ◽

Signal Processing ◽

Artificial Neural Networks ◽

Speech Recognition ◽

Impulse Response ◽

Automatic Speech Recognition ◽

Audio Signal ◽

Training Data ◽

Audio Signal Processing ◽

Artificial Neural

Methods based on artificial neural networks (ANN) are widely used in various audio signal processing tasks. This provides opportunities to optimize processes and save resources required for calculations. One of the main objects we need to get to numerically capture the acoustics of a room is the room impulse response (RIR). Increasingly, research authors choose not to record these impulses in a real room but to generate them using ANN, as this gives them the freedom to prepare unlimited-sized training datasets. Neural networks are also used to augment the generated impulses to make them similar to the ones actually recorded. The widest use of ANN so far is observed in the evaluation of the generated results, for example, in automatic speech recognition (ASR) tasks. This review also describes datasets of recorded RIR impulses commonly found in various studies that are used as training data for neural networks.

Download Full-text

Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks

Applied Sciences ◽

10.3390/app11188412 ◽

2021 ◽

Vol 11 (18) ◽

pp. 8412

Author(s):

Hyeong-Ju Na ◽

Jeong-Sik Park

Keyword(s):

Speech Recognition ◽

Domain Adaptation ◽

Training Data ◽

Baseline Model ◽

Linguistic Differences ◽

Computational Costs ◽

Accented Speech ◽

Feature Extractor ◽

Adversarial Training ◽

End To End

The performance of automatic speech recognition (ASR) may be degraded when accented speech is recognized because the speech has some linguistic differences from standard speech. Conventional accented speech recognition studies have utilized the accent embedding method, in which the accent embedding features are directly fed into the ASR network. Although the method improves the performance of accented speech recognition, it has some restrictions, such as increasing the computational costs. This study proposes an efficient method of training the ASR model for accented speech in a domain adversarial way based on the Domain Adversarial Neural Network (DANN). The DANN plays a role as a domain adaptation in which the training data and test data have different distributions. Thus, our approach is expected to construct a reliable ASR model for accented speech by reducing the distribution differences between accented speech and standard speech. DANN has three sub-networks: the feature extractor, the domain classifier, and the label predictor. To adjust the DANN for accented speech recognition, we constructed these three sub-networks independently, considering the characteristics of accented speech. In particular, we used an end-to-end framework based on Connectionist Temporal Classification (CTC) to develop the label predictor, a very important module that directly affects ASR results. To verify the efficiency of the proposed approach, we conducted several experiments of accented speech recognition for four English accents including Australian, Canadian, British (England), and Indian accents. The experimental results showed that the proposed DANN-based model outperformed the baseline model for all accents, indicating that the end-to-end domain adversarial training effectively reduced the distribution differences between accented speech and standard speech.

Download Full-text

An Effective Learning Method for Automatic Speech Recognition in Korean CI Patients’ Speech

Electronics ◽

10.3390/electronics10070807 ◽

2021 ◽

Vol 10 (7) ◽

pp. 807

Author(s):

Jiho Jeong ◽

S. I. M. M. Raton Mondol ◽

Yeon Wook Kim ◽

Sangmin Lee

Keyword(s):

Speech Recognition ◽

Cochlear Implant ◽

Automatic Speech Recognition ◽

Error Rate ◽

Training Data ◽

Learning Method ◽

Test Dataset ◽

Privacy Concerns ◽

Effective Learning ◽

Speech Test

The automatic speech recognition (ASR) model usually requires a large amount of training data to provide better results compared with the ASR models trained with a small amount of training data. It is difficult to apply the ASR model to non-standard speech such as that of cochlear implant (CI) patients, owing to privacy concerns or difficulty of access. In this paper, an effective finetuning and augmentation ASR model is proposed. Experiments compare the character error rate (CER) after training the ASR model with the basic and the proposed method. The proposed method achieved a CER of 36.03% on the CI patient’s speech test dataset using only 2 h and 30 min of training data, which is a 62% improvement over the basic method.

Download Full-text

Supervised and unsupervised active learning for automatic speech recognition of low-resource languages

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2016.7472693 ◽

2016 ◽

Cited By ~ 2

Author(s):

Ali Raza Syed ◽

Andrew Rosenberg ◽

Ellen Kislal

Keyword(s):

Speech Recognition ◽

Active Learning ◽

Automatic Speech Recognition ◽

Low Resource

Download Full-text

Minimally Supervised Number Normalization

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00114 ◽

2016 ◽

Vol 4 ◽

pp. 507-519 ◽

Cited By ~ 4

Author(s):

Kyle Gorman ◽

Richard Sproat

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Recurrent Neural Network ◽

Training Data ◽

Minimal Amount ◽

Low Resource ◽

Finite State Transducers ◽

Perfect Performance ◽

Finite State ◽

Minimally Supervised

We propose two models for verbalizing numbers, a key component in speech recognition and synthesis systems. The first model uses an end-to-end recurrent neural network. The second model, drawing inspiration from the linguistics literature, uses finite-state transducers constructed with a minimal amount of training data. While both models achieve near-perfect performance, the latter model can be trained using several orders of magnitude less data than the former, making it particularly useful for low-resource languages.

Download Full-text