Improving Deep Learning based Automatic Speech Recognition for Gujarati

Author(s):  
Deepang Raval ◽  
Vyom Pathak ◽  
Muktan Patel ◽  
Brijesh Bhatt

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network and Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as the loss function. To improve the performance of the system given the limited size of the dataset, we present a combined language model (word-level and character-level) based prefix decoding technique and a Bidirectional Encoder Representations from Transformers based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we use the system's inferences and propose different analysis methods. These insights help us understand and improve the ASR system, and also provide intuition about the language the system is built for. We trained the model on the Microsoft Speech Corpus and observe a 5.87% decrease in Word Error Rate (WER) with respect to the base model's WER.
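
As an illustration of the model family this abstract describes, the following is a minimal Keras sketch of a convolutional front-end feeding stacked bidirectional LSTMs and dense layers, with a per-frame output suitable for CTC training. All layer sizes and the output vocabulary are illustrative assumptions, not the authors' configuration.

```python
# A minimal sketch of a CNN + BiLSTM + Dense + CTC acoustic model in Keras.
# Layer widths and N_CLASSES are assumed values, not the paper's settings.
import tensorflow as tf
from tensorflow.keras import layers

N_CLASSES = 60      # assumed: Gujarati characters + 1 CTC blank symbol
N_MELS = 80         # assumed: mel-spectrogram input features per frame

inputs = layers.Input(shape=(None, N_MELS))                  # (time, features)
x = layers.Conv1D(256, 11, strides=2, padding="same", activation="relu")(inputs)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Dense(256, activation="relu")(x)
logits = layers.Dense(N_CLASSES)(x)       # per-frame scores incl. CTC blank

model = tf.keras.Model(inputs, logits)
# Training would minimize CTC loss (e.g. tf.nn.ctc_loss) over these logits.
model.summary()
```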

2020 ◽  
Vol 8 (5) ◽  
pp. 1677-1681

Stuttering, or stammering, is a speech disorder in which sounds, syllables, or words are repeated or prolonged, disrupting the normal flow of speech. Stuttering can make it hard to communicate with other people, which often affects a person's quality of life. An Automatic Speech Recognition (ASR) system is a technology that converts an audio speech signal into the corresponding text. ASR systems now play a major role in controlling or providing inputs to various applications. Such ASR systems and machine translation applications suffer considerably from stuttering (speech dysfluency). Dysfluencies degrade the word recognition accuracy of an ASR system by increasing word insertion, substitution, and deletion rates. In this work we focus on detecting and removing prolongations, silent pauses, and repetitions in order to generate a proper text sequence for a given stuttered speech signal. The stuttered speech recognition pipeline consists of two stages, namely classification using LSTM and testing in ASR. The major phases of the classification system are re-sampling, segmentation, pre-emphasis, epoch extraction, and classification. The work is carried out on the UCLASS stuttering dataset using MATLAB, with a 4% to 6% increase in accuracy compared with ANN and SVM.
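
The pre-emphasis phase named above is a standard first-order high-pass filter, y[n] = x[n] - a·x[n-1], applied before feature extraction. Below is a minimal Python sketch; the coefficient 0.97 is the conventional choice, assumed here rather than taken from the paper (which uses MATLAB).

```python
# Pre-emphasis: a first-order high-pass filter that boosts high frequencies
# before feature extraction. coeff=0.97 is a common default, assumed here.
import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """Apply a first-order pre-emphasis filter to a 1-D audio signal."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# Example: filter one second of synthetic "audio" sampled at 16 kHz
audio = np.random.randn(16000)
emphasized = pre_emphasis(audio)
```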


2018 ◽  
Vol 29 (1) ◽  
pp. 959-976
Author(s):  
Mohit Dua ◽  
Rajesh Kumar Aggarwal ◽  
Mantosh Biswas

An automatic speech recognition (ASR) system translates spoken words or utterances (isolated, connected, continuous, and spontaneous) into text format. State-of-the-art ASR systems mainly use Mel frequency (MF) cepstral coefficients (MFCC), perceptual linear prediction (PLP), and Gammatone frequency (GF) cepstral coefficients (GFCC) for extracting features in the training phase. The paper first proposes sequential combinations of these three feature extraction methods, taken two at a time. Six combinations, MF-PLP, PLP-MFCC, MF-GFCC, GF-MFCC, GF-PLP, and PLP-GFCC, are used, and the accuracy of the proposed system is tested with each of them. The results show that the GF-MFCC and MF-GFCC integrations outperform all the other proposed integrations. These two feature-vector integrations are then optimized using three different optimization methods: particle swarm optimization (PSO), PSO with crossover, and PSO with quadratic crossover (Q-PSO). The results demonstrate that the Q-PSO-optimized GF-MFCC integration shows significant improvement over all other optimized combinations.
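
The integration idea is per-frame concatenation of two feature streams into one combined vector. Below is a minimal Python sketch of a GF-MFCC style combination; librosa supplies the MFCCs, while the GFCC extractor (not part of librosa) is stood in for by a random matrix, so the shapes and the concatenation step are what the example actually demonstrates.

```python
# Feature integration sketch: concatenate two per-frame feature matrices
# (e.g. GFCC and MFCC) along the feature axis. The GFCC matrix here is a
# random stand-in; a real extractor would use a gammatone filterbank.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)            # 1 s of synthetic audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape: (13, n_frames)

gfcc = np.random.randn(13, mfcc.shape[1])             # stand-in GFCC features

# GF-MFCC integration: per-frame concatenation -> shape (26, n_frames)
gf_mfcc = np.vstack([gfcc, mfcc])
print(gf_mfcc.shape)
```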


Symmetry ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 290 ◽  
Author(s):  
Huseyin Polat ◽  
Saadin Oyucu

To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step in developing an ASR system for a language with few transcribed speech resources available. Turkish is a language with limited resources for ASR. Therefore, developing a Turkish transcribed speech corpus comparable to the corpora of high-resource languages is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed-corpus preparation techniques for collecting Turkish speech data. The presented approach uses three different methods. In the first step, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for speech utterances extracted from movies. In the second step, data were collected via a mobile application. In the third step, a transfer learning approach was applied to session records (video and text) of the Grand National Assembly of Turkey. We also provide initial speech recognition results for artificial neural network and Gaussian mixture-model-based acoustic models for Turkish. The newly collected corpus and other existing corpora published by the Linguistic Data Consortium were used to train the models. The test results on the existing corpora show the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus was more evident as the amount of verified data increased, compensating for the status of Turkish as a low-resource language. The results also show the importance of the corpus and the language model to the success of a Turkish ASR system.
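
The subtitle-based collection step pairs each subtitle cue with the corresponding stretch of a movie's audio track. The authors do not publish their tooling, but a plausible first stage is parsing an SRT file into (start, end, text) segments, as in this Python sketch; the timestamp pattern and file handling are illustrative assumptions.

```python
# Sketch: parse an SRT subtitle file into (start_sec, end_sec, text) segments
# that could be cut from an audio track and paired with their transcriptions.
import re

CUE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(path):
    """Yield one (start, end, text) tuple per subtitle cue."""
    segments = []
    with open(path, encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = [ln for ln in block.strip().splitlines() if ln]
        if len(lines) < 3:
            continue                      # need index, timing, and text lines
        m = CUE.match(lines[1])
        if not m:
            continue
        g = m.groups()
        segments.append((to_seconds(*g[:4]), to_seconds(*g[4:]),
                         " ".join(lines[2:])))
    return segments
```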


2014 ◽  
Vol 679 ◽  
pp. 189-193 ◽  
Author(s):  
Rosemary T. Salaja ◽  
Ronan Flynn ◽  
Michael Russell

Research in speech recognition has produced different approaches that have been used for the classification of speech utterances in the back-end of an automatic speech recognition (ASR) system. As speech recognition is a pattern recognition problem, classification is an important part of any speech recognition system. This paper proposes a new back-end classifier that is based on artificial life (ALife) and describes how the proposed classifier can be used in a speech recognition system.


Author(s):  
Rene Avalloni de Morais ◽  
Baidya Nath Saha

Deep learning algorithms have made dramatic progress in the areas of natural language processing and automatic human speech recognition. However, the accuracy of deep learning algorithms depends on the amount and quality of the data, and training deep models requires high-performance computing resources. Against this backdrop, this paper addresses an end-to-end speech recognition system in which we fine-tune the Mozilla DeepSpeech architecture using two different datasets: the LibriSpeech clean dataset and the Harvard speech dataset. We train Long Short Term Memory (LSTM) based deep Recurrent Neural Network (RNN) models on the Google Colab platform, using its GPU resources. Extensive experimental results demonstrate that the Mozilla DeepSpeech model can be fine-tuned on different audio datasets to recognize speech successfully.
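
Once a DeepSpeech model has been fine-tuned and exported, it can be queried through the library's Python bindings. A minimal sketch for the deepspeech 0.9.x API follows; the model and scorer file names are placeholders, and the input audio is assumed to be 16-bit PCM mono at 16 kHz, which is what DeepSpeech expects.

```python
# Sketch: inference with a fine-tuned Mozilla DeepSpeech model (0.9.x API).
# File names are placeholders for an exported model and optional scorer.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-finetuned.pbmm")     # assumed exported model file
ds.enableExternalScorer("kenlm.scorer")     # optional external language model

with wave.open("sample.wav", "rb") as w:    # 16-bit PCM mono, 16 kHz
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))                        # decoded transcript
```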


The present manuscript focuses on building an automatic speech recognition (ASR) system for the Marathi language (M-ASR) using the Hidden Markov Model Toolkit (HTK). It details the experimentation and implementation of the M-ASR system with HTK. In this work, a total of 106 speaker-independent isolated Marathi words were recognized. These unique Marathi words are used to train and evaluate the M-ASR system. The speech corpus (database) was created by us from isolated Marathi words uttered by speakers of both genders. The system uses Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and Gaussian mixture models (GMM) for acoustic modeling. A Viterbi algorithm based on token passing is used for decoding, to recognize unknown utterances. The proposed M-ASR system is speaker independent and reports 96.23% word-level recognition accuracy.
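
At the core of HTK's recognizer is Viterbi decoding over an HMM; token passing is an equivalent, more scalable formulation of the same search. Below is a minimal Python sketch of the basic Viterbi recursion in log space, on an assumed toy two-state model, to make the decoding step concrete.

```python
# Sketch: Viterbi decoding over an HMM in log space. The toy two-state
# model at the bottom is an illustrative assumption, not the paper's model.
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path.

    log_init:  (S,)   log initial-state probabilities
    log_trans: (S, S) log transition probabilities [from, to]
    log_emit:  (T, S) log emission probability of each frame in each state
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace backpointers to the start
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 2 states, 4 frames of random emission probabilities
rng = np.random.default_rng(0)
log_emit = np.log(rng.dirichlet([1, 1], size=4))
log_trans = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
log_init = np.log(np.array([0.6, 0.4]))
print(viterbi(log_init, log_trans, log_emit))
```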


Computers ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 76 ◽  
Author(s):  
Laurynas Pipiras ◽  
Rytis Maskeliūnas ◽  
Robertas Damaševičius

Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field of language technology. The large majority of research in this area focuses on widely spoken languages such as English, and the problems of automatic Lithuanian speech recognition have attracted little attention so far. Due to Lithuanian's complicated language structure and the scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language that is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with an attention mechanism. The performance of these models is evaluated on an isolated speech recognition task (with an accuracy of 0.993) and a long-phrase recognition task (with an accuracy of 0.992).
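
To make the phoneme-to-word mapping concrete, here is a minimal Keras sketch of the attention-based encoder-decoder variant: a bidirectional encoder over phoneme IDs, a decoder over previous word IDs, and dot-product attention connecting the two. Vocabulary sizes and layer widths are illustrative assumptions, not the paper's configuration.

```python
# Sketch: attention encoder-decoder mapping phoneme ID sequences to word IDs.
# All sizes below are assumed values for illustration.
import tensorflow as tf
from tensorflow.keras import layers

N_PHONEMES, N_WORDS, DIM = 50, 1000, 128

# Encoder: embed phonemes, run a BiLSTM over the sequence (output dim 2*DIM)
enc_in = layers.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(N_PHONEMES, DIM, mask_zero=True)(enc_in)
enc_seq = layers.Bidirectional(layers.LSTM(DIM, return_sequences=True))(enc_emb)

# Decoder: embed previous word, LSTM (dim 2*DIM so dot-product attention
# against the encoder states is dimensionally valid), then attend and predict
dec_in = layers.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(N_WORDS, DIM, mask_zero=True)(dec_in)
dec_seq = layers.LSTM(2 * DIM, return_sequences=True)(dec_emb)
context = layers.Attention()([dec_seq, enc_seq])    # query, value
merged = layers.Concatenate()([dec_seq, context])
out = layers.Dense(N_WORDS, activation="softmax")(merged)

model = tf.keras.Model([enc_in, dec_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```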

