Improving speech recognition using limited accent diverse British English training data with deep neural networks

This article presents an analysis of the effectiveness of object detection in digital images with the application of a limited quantity of input. The possibility of using a limited set of learning data was achieved by developing a detailed scenario of the task, which strictly defined the conditions of detector operation in the considered case of a convolutional neural network. The described solution utilizes known architectures of deep neural networks in the process of learning and object detection. The article presents comparisons of results from detecting the most popular deep neural networks while maintaining a limited training set composed of a specific number of selected images from diagnostic video. The analyzed input material was recorded during an inspection flight conducted along high-voltage lines. The object detector was built for a power insulator. The main contribution of the presented papier is the evidence that a limited training set (in our case, just 60 training frames) could be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. The decision of which network will generate the best result for such a limited training set is not a trivial task. Conducted research suggests that the deep neural networks will achieve different levels of effectiveness depending on the amount of training data. The most beneficial results were obtained for two convolutional neural networks: the faster region-convolutional neural network (faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision) at a level of 0.8 for 60 frames. The R-FCN model gained a worse AP result; however, it can be noted that the relationship between the number of input samples and the obtained results has a significantly lower influence than in the case of other CNN models, which, in the authors’ assessment, is a desired feature in the case of a limited training set.

Download Full-text

Towards better performance with heterogeneous training data in acoustic modeling using deep neural networks

10.21437/interspeech.2014-214 ◽

2014 ◽

Author(s):

Yan Huang ◽

Malcolm Slaney ◽

Michael L. Seltzer ◽

Yifan Gong

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Training Data ◽

Acoustic Modeling

Download Full-text

Small-Footprint Deep Neural Networks with Highway Connections for Speech Recognition

10.21437/interspeech.2016-39 ◽

2016 ◽

Cited By ~ 5

Author(s):

Liang Lu ◽

Steve Renals

Keyword(s):

Neural Networks ◽

Speech Recognition ◽

Deep Neural Networks ◽

Small Footprint

Download Full-text

An investigation of spectral restoration algorithms for deep neural networks based noise robust speech recognition

10.21437/interspeech.2013-278 ◽

2013 ◽

Cited By ~ 1

Author(s):

Bo Li ◽

Yu Tsao ◽

Khe Chai Sim

Keyword(s):

Neural Networks ◽

Speech Recognition ◽

Deep Neural Networks ◽

Robust Speech Recognition ◽

Noise Robust Speech Recognition ◽

Noise Robust ◽

Restoration Algorithms

Download Full-text

Harnessing GANs for Zero-Shot Learning of New Classes in Visual Speech Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i03.5649 ◽

2020 ◽

Vol 34 (03) ◽

pp. 2645-2652 ◽

Cited By ~ 2

Author(s):

Yaman Kumar ◽

Dhruva Sahrawat ◽

Shubham Maheshwari ◽

Debanjan Mahata ◽

Amanda Stent ◽

...

Keyword(s):

Speech Recognition ◽

Classification Problem ◽

Visual Speech ◽

Training Data ◽

Generative Adversarial Networks ◽

Adversarial Networks ◽

Novel Approach ◽

Visual Speech Recognition ◽

Training Samples ◽

English Training

Visual Speech Recognition (VSR) is the process of recognizing or interpreting speech by watching the lip movements of the speaker. Recent machine learning based approaches model VSR as a classification problem; however, the scarcity of training data leads to error-prone systems with very low accuracies in predicting unseen classes. To solve this problem, we present a novel approach to zero-shot learning by generating new classes using Generative Adversarial Networks (GANs), and show how the addition of unseen class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases. We also show that our models are language agnostic and therefore capable of seamlessly generating, using English training data, videos for a new language (Hindi). To the best of our knowledge, this is the first work to show empirical evidence of the use of GANs for generating training samples of unseen classes in the domain of VSR, hence facilitating zero-shot learning. We make the added videos for new classes publicly available along with our code1.

Download Full-text