Improving speech recognition using limited accent diverse British English training data with deep neural networks

Author(s):  
Maryam Najafian ◽  
Saeid Safavi ◽  
John H. L. Hansen ◽  
Martin Russell
2020 ◽  
Vol 10 (6) ◽  
pp. 2104
Author(s):  
Michał Tomaszewski ◽  
Paweł Michalski ◽  
Jakub Osuchowski

This article presents an analysis of the effectiveness of object detection in digital images with the application of a limited quantity of input. The possibility of using a limited set of learning data was achieved by developing a detailed scenario of the task, which strictly defined the conditions of detector operation in the considered case of a convolutional neural network. The described solution utilizes known architectures of deep neural networks in the process of learning and object detection. The article presents comparisons of results from detecting the most popular deep neural networks while maintaining a limited training set composed of a specific number of selected images from diagnostic video. The analyzed input material was recorded during an inspection flight conducted along high-voltage lines. The object detector was built for a power insulator. The main contribution of the presented papier is the evidence that a limited training set (in our case, just 60 training frames) could be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. The decision of which network will generate the best result for such a limited training set is not a trivial task. Conducted research suggests that the deep neural networks will achieve different levels of effectiveness depending on the amount of training data. The most beneficial results were obtained for two convolutional neural networks: the faster region-convolutional neural network (faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision) at a level of 0.8 for 60 frames. The R-FCN model gained a worse AP result; however, it can be noted that the relationship between the number of input samples and the obtained results has a significantly lower influence than in the case of other CNN models, which, in the authors’ assessment, is a desired feature in the case of a limited training set.


2020 ◽  
Vol 34 (03) ◽  
pp. 2645-2652 ◽  
Author(s):  
Yaman Kumar ◽  
Dhruva Sahrawat ◽  
Shubham Maheshwari ◽  
Debanjan Mahata ◽  
Amanda Stent ◽  
...  

Visual Speech Recognition (VSR) is the process of recognizing or interpreting speech by watching the lip movements of the speaker. Recent machine learning based approaches model VSR as a classification problem; however, the scarcity of training data leads to error-prone systems with very low accuracies in predicting unseen classes. To solve this problem, we present a novel approach to zero-shot learning by generating new classes using Generative Adversarial Networks (GANs), and show how the addition of unseen class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases. We also show that our models are language agnostic and therefore capable of seamlessly generating, using English training data, videos for a new language (Hindi). To the best of our knowledge, this is the first work to show empirical evidence of the use of GANs for generating training samples of unseen classes in the domain of VSR, hence facilitating zero-shot learning. We make the added videos for new classes publicly available along with our code1.


Sign in / Sign up

Export Citation Format

Share Document