Audio Codec Simulation based Data Augmentation for Telephony Speech Recognition

Author(s): Thi-Ly Vu, Zhiping Zeng, Haihua Xu, Eng-Siong Chng

2022 · Vol 14 (2) · pp. 614
Author(s): Taniya Hasija, Virender Kadyan, Kalpna Guleria, Abdullah Alharbi, Hashem Alyami, et al.

Speech recognition has been an active field of research over the last few decades because it facilitates better human–computer interaction. Automatic speech recognition (ASR) systems for native languages are still underdeveloped, and Punjabi ASR is in its infancy: most research has focused on adult speech, and far less work has addressed Punjabi children's speech. This research aimed to build a prosodic-feature-based automatic children's speech recognition system using discriminative modeling techniques. The corpus of Punjabi children's speech poses various runtime challenges, such as acoustic variation across speaker ages. To overcome these issues, out-of-domain data augmentation was implemented using a Tacotron-based text-to-speech synthesizer. Prosodic features were extracted from the Punjabi children's speech corpus, and selected prosodic features were coupled with Mel-Frequency Cepstral Coefficient (MFCC) features before being fed to the ASR framework. The system modeling process investigated several approaches: Maximum Mutual Information (MMI), Boosted Maximum Mutual Information (bMMI), and feature-space Maximum Mutual Information (fMMI). Out-of-domain data augmentation was then performed to enlarge the corpus; prosodic features were also extracted from the extended corpus, and experiments were conducted on both individual and integrated prosodic-based acoustic features. The fMMI technique exhibited a 20% to 25% relative improvement in word error rate compared with the MMI and bMMI techniques. This was further improved, by 13% relative to the earlier baseline system, using the augmented dataset and hybrid front-end features (MFCC + POV + F0 + voice quality).
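To illustrate how prosodic cues can be coupled with MFCCs at the front end, the sketch below extracts MFCCs, F0, and a probability-of-voicing-like feature and concatenates them frame by frame. This is a minimal sketch assuming librosa is available; the function name and all parameter choices are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch: coupling prosodic features (F0, voicing probability)
# with MFCCs, as described in the abstract. Requires librosa and numpy;
# parameter values below are illustrative, not the paper's configuration.
import librosa
import numpy as np

def mfcc_plus_prosody(wav_path, sr=16000, n_mfcc=13, hop_length=160):
    y, sr = librosa.load(wav_path, sr=sr)
    # 13-dimensional MFCCs, one column per 10 ms frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop_length)
    # pYIN yields F0 plus a per-frame voicing probability (a POV-like cue).
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop_length)
    f0 = np.nan_to_num(f0)  # unvoiced frames: F0 set to 0
    # Frame counts match because MFCC and pYIN share the same hop length.
    feats = np.vstack([mfcc, f0[np.newaxis, :], voiced_prob[np.newaxis, :]])
    return feats.T  # shape: (n_frames, n_mfcc + 2)
```

The concatenated matrix can then be passed to the acoustic-model front end in place of plain MFCCs.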


Author(s): M. Jameela, L. Chen, A. Sit, J. Yoo, C. Verheggen, et al.

Abstract. LiDAR (Light Detection and Ranging) mounted on static and mobile vehicles has been rapidly adopted as a primary sensor for mapping natural and built environments in a range of civil and military applications. Recently, advances in electro-optical engineering have enabled acquiring laser returns at high pulse repetition frequencies (PRF), from 100 kHz to 2 MHz for airborne LiDAR, which significantly increases the density of the 3D point cloud. Traditional systems with lower PRF had a single pulse-in-air (PIA) zone large enough to avoid mismatches between pulse pairs at the receiver. Modern multiple-pulses-in-air (MPIA) technology provides multiple windows of operational range for a single flight line with no blind zones; the downside is that atmospheric returns are projected close to the PIA zone of neighbouring ground points and are more likely to overlap with objects of interest. These noise characteristics compromise the quality of the scene and motivate the use of a noise-filtering neural network, since existing filters are not effective. A noise-filtering deep neural network, however, requires a considerable volume of diverse annotated data, which is expensive. We developed a simulation for data augmentation based on physical priors and a Gaussian generative function. Our study compares deep learning networks for noise filtering and shows a performance gain for 3D U-Net. We then evaluate 3D U-Net with simulation-based data augmentation, which shows an increase in precision and F1-score. We also provide an analysis of the underlying spatial distribution of points and its impact on data augmentation and noise filtering.
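A minimal sketch of the kind of Gaussian generative noise injection described above is given below, assuming an (N, 3) numpy point cloud with binary noise labels. The spatial priors (height band, spread, point count) are illustrative assumptions, not the parameters used in the study.

```python
# Illustrative sketch: augment a labelled LiDAR tile with synthetic
# "atmospheric" returns drawn from a Gaussian generative function.
# The priors below (centroid range, jitter scale, count) are assumptions,
# not the study's actual simulation parameters.
import numpy as np

def inject_gaussian_noise(points, labels, n_noise=500, rng=None):
    """points: (N, 3) xyz array; labels: (N,) with 0 = clean, 1 = noise."""
    rng = np.random.default_rng(rng)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    # Place noise centroids up to half a tile-height above the scene,
    # where MPIA atmospheric returns tend to be projected.
    hi_noise = hi.copy()
    hi_noise[2] += 0.5 * (hi[2] - lo[2])
    centers = rng.uniform(lo, hi_noise, size=(n_noise, 3))
    # Gaussian jitter around each centroid models the diffuse scatter.
    noise = centers + rng.normal(scale=0.3, size=(n_noise, 3))
    aug_points = np.vstack([points, noise])
    aug_labels = np.concatenate([labels, np.ones(n_noise, dtype=labels.dtype)])
    return aug_points, aug_labels
```

Training a filtering network such as 3D U-Net on tiles augmented this way exposes it to labelled noise without costly manual annotation.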


2021
Author(s): Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, et al.

2021 · pp. 1-25
Author(s): Charles Chen, Razvan Bunescu, Cindy Marling

Abstract. We propose a new setting for question answering (QA) in which users can query the system using both natural language and direct interactions within a graphical user interface that displays multiple time series associated with an entity of interest. The user interacts with the interface in order to understand the entity's state and behavior, entailing sequences of actions and questions whose answers may depend on previous factual or navigational interactions. We describe a pipeline implementation where spoken questions are first transcribed into text, which is then semantically parsed into logical forms that can be used to automatically extract the answer from the underlying database. The speech recognition module is implemented by adapting a pre-trained long short-term memory (LSTM)-based architecture to the user's speech, whereas for the semantic parsing component we introduce an LSTM-based encoder–decoder architecture that models context dependency through copying mechanisms and multiple levels of attention over inputs and previous outputs. When evaluated separately, with and without data augmentation, both models are shown to substantially outperform several strong baselines. Furthermore, the full pipeline evaluation shows only a small degradation in semantic parsing accuracy, demonstrating that the semantic parser is robust to mistakes in the speech recognition output. The new QA paradigm proposed in this paper has the potential to improve the presentation and navigation of the large amounts of sensor data and life events that are generated in many areas of medicine.
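The pipeline described above can be wired together in a few lines; the hedged sketch below uses placeholder callables (transcribe, parse_to_logical_form, execute) that stand in for the adapted LSTM recognizer, the encoder–decoder semantic parser, and the time-series database. All names here are hypothetical, not the paper's API.

```python
# Hedged sketch of the three-stage QA pipeline from the abstract:
# speech -> text -> logical form -> answer. The three callables are
# hypothetical stand-ins for the adapted LSTM recognizer, the
# encoder-decoder semantic parser, and the database back end.
from typing import Any, Callable, List

def answer_question(audio: bytes,
                    history: List[str],
                    transcribe: Callable[[bytes], str],
                    parse_to_logical_form: Callable[[str, List[str]], str],
                    execute: Callable[[str], Any]) -> Any:
    text = transcribe(audio)  # ASR stage
    # The parser conditions on previous interactions, since an answer may
    # depend on earlier factual or navigational context.
    logical_form = parse_to_logical_form(text, history)
    history.append(logical_form)
    return execute(logical_form)  # query the underlying database
```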

