Audio Codec Simulation based Data Augmentation for Telephony Speech Recognition

Author(s): Thi-Ly Vu, Zhiping Zeng, Haihua Xu, Eng-Siong Chng

2022 · Vol 14 (2) · pp. 614
Author(s): Taniya Hasija, Virender Kadyan, Kalpna Guleria, Abdullah Alharbi, Hashem Alyami, et al.

Speech recognition has been an active field of research over the last few decades because it facilitates better human–computer interaction. Automatic speech recognition (ASR) systems for native languages are still underdeveloped, and Punjabi ASR is in its infancy: most research has focused on adult speech, and far less work has addressed Punjabi children's speech. This research aimed to build a prosodic-feature-based automatic children's speech recognition system using discriminative modeling techniques. The corpus of Punjabi children's speech poses various runtime challenges, such as acoustic variation across speaker ages. To overcome these issues, out-of-domain data augmentation was implemented using a Tacotron-based text-to-speech synthesizer. Prosodic features were extracted from the Punjabi children's speech corpus, and selected prosodic features were coupled with Mel-Frequency Cepstral Coefficient (MFCC) features before being fed to the ASR framework. The system modeling process investigated several approaches: Maximum Mutual Information (MMI), Boosted Maximum Mutual Information (bMMI), and feature-space Maximum Mutual Information (fMMI). Out-of-domain data augmentation was then performed to enlarge the corpus; prosodic features were also extracted from the extended corpus, and experiments were conducted on both individual and integrated prosodic-based acoustic features. The fMMI technique exhibited a 20% to 25% relative improvement in word error rate compared with the MMI and bMMI techniques. This was further improved, by 13% relative to the earlier baseline system, using the augmented dataset and hybrid front-end features (MFCC + POV + F0 + voice quality).
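To illustrate how prosodic cues can be coupled with MFCCs at the front end, the sketch below extracts MFCCs, F0, and a probability-of-voicing-like feature and concatenates them frame by frame. This is a minimal sketch assuming librosa is available; the function name and all parameter choices are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch: coupling prosodic features (F0, voicing probability)
# with MFCCs, as described in the abstract. Requires librosa and numpy;
# parameter values below are illustrative, not the paper's configuration.
import librosa
import numpy as np

def mfcc_plus_prosody(wav_path, sr=16000, n_mfcc=13, hop_length=160):
    y, sr = librosa.load(wav_path, sr=sr)
    # 13-dimensional MFCCs, one column per 10 ms frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop_length)
    # pYIN yields F0 plus a per-frame voicing probability (a POV-like cue).
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop_length)
    f0 = np.nan_to_num(f0)  # unvoiced frames: F0 set to 0
    # Frame counts match because MFCC and pYIN share the same hop length.
    feats = np.vstack([mfcc, f0[np.newaxis, :], voiced_prob[np.newaxis, :]])
    return feats.T  # shape: (n_frames, n_mfcc + 2)
```

The concatenated matrix can then be passed to the acoustic-model front end in place of plain MFCCs.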


Author(s): M. Jameela, L. Chen, A. Sit, J. Yoo, C. Verheggen, et al.

Abstract. LiDAR (Light Detection and Ranging) mounted on static and mobile vehicles has been rapidly adopted as a primary sensor for mapping natural and built environments in a range of civil and military applications. Recently, advances in electro-optical engineering have enabled acquiring laser returns at high pulse repetition frequencies (PRF), from 100 kHz to 2 MHz for airborne LiDAR, which significantly increases the density of the 3D point cloud. Traditional systems with lower PRF had a single pulse-in-air (PIA) zone large enough to avoid mismatches between pulse pairs at the receiver. Modern multiple-pulses-in-air (MPIA) technology provides multiple windows of operational range for a single flight line with no blind zones; the downside is that atmospheric returns are projected close to the PIA zone of neighbouring ground points and are more likely to overlap with objects of interest. These noise characteristics compromise the quality of the scene and motivate the use of a noise-filtering neural network, since existing filters are not effective. A noise-filtering deep neural network, however, requires a considerable volume of diverse annotated data, which is expensive. We developed a simulation for data augmentation based on physical priors and a Gaussian generative function. Our study compares deep learning networks for noise filtering and shows a performance gain for 3D U-Net. We then evaluate 3D U-Net with simulation-based data augmentation, which shows an increase in precision and F1-score. We also provide an analysis of the underlying spatial distribution of points and its impact on data augmentation and noise filtering.
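A minimal sketch of the kind of Gaussian generative noise injection described above is given below, assuming an (N, 3) numpy point cloud with binary noise labels. The spatial priors (height band, spread, point count) are illustrative assumptions, not the parameters used in the study.

```python
# Illustrative sketch: augment a labelled LiDAR tile with synthetic
# "atmospheric" returns drawn from a Gaussian generative function.
# The priors below (centroid range, jitter scale, count) are assumptions,
# not the study's actual simulation parameters.
import numpy as np

def inject_gaussian_noise(points, labels, n_noise=500, rng=None):
    """points: (N, 3) xyz array; labels: (N,) with 0 = clean, 1 = noise."""
    rng = np.random.default_rng(rng)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    # Place noise centroids up to half a tile-height above the scene,
    # where MPIA atmospheric returns tend to be projected.
    hi_noise = hi.copy()
    hi_noise[2] += 0.5 * (hi[2] - lo[2])
    centers = rng.uniform(lo, hi_noise, size=(n_noise, 3))
    # Gaussian jitter around each centroid models the diffuse scatter.
    noise = centers + rng.normal(scale=0.3, size=(n_noise, 3))
    aug_points = np.vstack([points, noise])
    aug_labels = np.concatenate([labels, np.ones(n_noise, dtype=labels.dtype)])
    return aug_points, aug_labels
```

Training a filtering network such as 3D U-Net on tiles augmented this way exposes it to labelled noise without costly manual annotation.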


2021
Author(s): Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, et al.

2021 · pp. 1-25
Author(s): Charles Chen, Razvan Bunescu, Cindy Marling

Abstract. We propose a new setting for question answering (QA) in which users can query the system using both natural language and direct interactions within a graphical user interface that displays multiple time series associated with an entity of interest. The user interacts with the interface in order to understand the entity's state and behavior, entailing sequences of actions and questions whose answers may depend on previous factual or navigational interactions. We describe a pipeline implementation where spoken questions are first transcribed into text, which is then semantically parsed into logical forms that can be used to automatically extract the answer from the underlying database. The speech recognition module is implemented by adapting a pre-trained long short-term memory (LSTM)-based architecture to the user's speech, whereas for the semantic parsing component we introduce an LSTM-based encoder–decoder architecture that models context dependency through copying mechanisms and multiple levels of attention over inputs and previous outputs. When evaluated separately, with and without data augmentation, both models are shown to substantially outperform several strong baselines. Furthermore, the full pipeline evaluation shows only a small degradation in semantic parsing accuracy, demonstrating that the semantic parser is robust to mistakes in the speech recognition output. The new QA paradigm proposed in this paper has the potential to improve the presentation and navigation of the large amounts of sensor data and life events that are generated in many areas of medicine.
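The pipeline described above can be wired together in a few lines; the hedged sketch below uses placeholder callables (transcribe, parse_to_logical_form, execute) that stand in for the adapted LSTM recognizer, the encoder–decoder semantic parser, and the time-series database. All names here are hypothetical, not the paper's API.

```python
# Hedged sketch of the three-stage QA pipeline from the abstract:
# speech -> text -> logical form -> answer. The three callables are
# hypothetical stand-ins for the adapted LSTM recognizer, the
# encoder-decoder semantic parser, and the database back end.
from typing import Any, Callable, List

def answer_question(audio: bytes,
                    history: List[str],
                    transcribe: Callable[[bytes], str],
                    parse_to_logical_form: Callable[[str, List[str]], str],
                    execute: Callable[[str], Any]) -> Any:
    text = transcribe(audio)  # ASR stage
    # The parser conditions on previous interactions, since an answer may
    # depend on earlier factual or navigational context.
    logical_form = parse_to_logical_form(text, history)
    history.append(logical_form)
    return execute(logical_form)  # query the underlying database
```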

