Improving Deep Learning based Automatic Speech Recognition for Gujarati

Author(s):  
Deepang Raval ◽  
Vyom Pathak ◽  
Muktan Patel ◽  
Brijesh Bhatt

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network and Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as the loss function. To improve the performance of the system given the limited size of the dataset, we present a combined language model (word-level and character-level) based prefix decoding technique and a Bidirectional Encoder Representations from Transformers based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we use the system's inferences and propose different analysis methods. These insights help us understand and improve the ASR system, and also provide intuition about the language the system is built for. We trained the model on the Microsoft Speech Corpus and observe a 5.87% decrease in Word Error Rate (WER) with respect to the base model's WER.
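
As an illustration of the model family this abstract describes, the following is a minimal Keras sketch of a convolutional front-end feeding stacked bidirectional LSTMs and dense layers, with a per-frame output suitable for CTC training. All layer sizes and the output vocabulary are illustrative assumptions, not the authors' configuration.

```python
# A minimal sketch of a CNN + BiLSTM + Dense + CTC acoustic model in Keras.
# Layer widths and N_CLASSES are assumed values, not the paper's settings.
import tensorflow as tf
from tensorflow.keras import layers

N_CLASSES = 60      # assumed: Gujarati characters + 1 CTC blank symbol
N_MELS = 80         # assumed: mel-spectrogram input features per frame

inputs = layers.Input(shape=(None, N_MELS))                  # (time, features)
x = layers.Conv1D(256, 11, strides=2, padding="same", activation="relu")(inputs)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Dense(256, activation="relu")(x)
logits = layers.Dense(N_CLASSES)(x)       # per-frame scores incl. CTC blank

model = tf.keras.Model(inputs, logits)
# Training would minimize CTC loss (e.g. tf.nn.ctc_loss) over these logits.
model.summary()
```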

2020 ◽  
Vol 8 (5) ◽  
pp. 1677-1681

Stuttering, or stammering, is a speech disorder in which sounds, syllables, or words are repeated or prolonged, disrupting the normal flow of speech. Stuttering can make it hard to communicate with other people, which often affects a person's quality of life. An Automatic Speech Recognition (ASR) system is a technology that converts an audio speech signal into the corresponding text. ASR systems now play a major role in controlling or providing inputs to various applications. Such ASR systems and machine translation applications suffer considerably from stuttering (speech dysfluency). Dysfluencies degrade the word recognition accuracy of an ASR system by increasing word insertion, substitution, and deletion rates. In this work we focus on detecting and removing prolongations, silent pauses, and repetitions in order to generate a proper text sequence for a given stuttered speech signal. The stuttered speech recognition pipeline consists of two stages, namely classification using LSTM and testing in ASR. The major phases of the classification system are re-sampling, segmentation, pre-emphasis, epoch extraction, and classification. The work is carried out on the UCLASS stuttering dataset using MATLAB, with a 4% to 6% increase in accuracy compared with ANN and SVM.
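
The pre-emphasis phase named above is a standard first-order high-pass filter, y[n] = x[n] - a·x[n-1], applied before feature extraction. Below is a minimal Python sketch; the coefficient 0.97 is the conventional choice, assumed here rather than taken from the paper (which uses MATLAB).

```python
# Pre-emphasis: a first-order high-pass filter that boosts high frequencies
# before feature extraction. coeff=0.97 is a common default, assumed here.
import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """Apply a first-order pre-emphasis filter to a 1-D audio signal."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# Example: filter one second of synthetic "audio" sampled at 16 kHz
audio = np.random.randn(16000)
emphasized = pre_emphasis(audio)
```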


2018 ◽  
Vol 29 (1) ◽  
pp. 959-976
Author(s):  
Mohit Dua ◽  
Rajesh Kumar Aggarwal ◽  
Mantosh Biswas

An automatic speech recognition (ASR) system translates spoken words or utterances (isolated, connected, continuous, and spontaneous) into text format. State-of-the-art ASR systems mainly use Mel frequency (MF) cepstral coefficients (MFCC), perceptual linear prediction (PLP), and Gammatone frequency (GF) cepstral coefficients (GFCC) for extracting features in the training phase. The paper first proposes sequential combinations of these three feature extraction methods, taken two at a time. Six combinations, MF-PLP, PLP-MFCC, MF-GFCC, GF-MFCC, GF-PLP, and PLP-GFCC, are used, and the accuracy of the proposed system is tested with each of them. The results show that the GF-MFCC and MF-GFCC integrations outperform all the other proposed integrations. These two feature-vector integrations are then optimized using three different optimization methods: particle swarm optimization (PSO), PSO with crossover, and PSO with quadratic crossover (Q-PSO). The results demonstrate that the Q-PSO-optimized GF-MFCC integration shows significant improvement over all other optimized combinations.
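
The integration idea is per-frame concatenation of two feature streams into one combined vector. Below is a minimal Python sketch of a GF-MFCC style combination; librosa supplies the MFCCs, while the GFCC extractor (not part of librosa) is stood in for by a random matrix, so the shapes and the concatenation step are what the example actually demonstrates.

```python
# Feature integration sketch: concatenate two per-frame feature matrices
# (e.g. GFCC and MFCC) along the feature axis. The GFCC matrix here is a
# random stand-in; a real extractor would use a gammatone filterbank.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)            # 1 s of synthetic audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape: (13, n_frames)

gfcc = np.random.randn(13, mfcc.shape[1])             # stand-in GFCC features

# GF-MFCC integration: per-frame concatenation -> shape (26, n_frames)
gf_mfcc = np.vstack([gfcc, mfcc])
print(gf_mfcc.shape)
```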


Symmetry ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 290 ◽  
Author(s):  
Huseyin Polat ◽  
Saadin Oyucu

To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step in developing an ASR system for a language with few transcribed speech resources available. Turkish is a language with limited resources for ASR. Therefore, developing a Turkish transcribed speech corpus comparable to the corpora of high-resource languages is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed-corpus preparation techniques for collecting Turkish speech data. The presented approach uses three different methods. In the first step, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for speech utterances extracted from movies. In the second step, data were collected via a mobile application. In the third step, a transfer learning approach was applied to session records (video and text) of the Grand National Assembly of Turkey. We also provide initial speech recognition results for artificial neural network and Gaussian mixture-model-based acoustic models for Turkish. The newly collected corpus and other existing corpora published by the Linguistic Data Consortium were used to train the models. The test results on the existing corpora show the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus was more evident as the amount of verified data increased, compensating for the status of Turkish as a low-resource language. The results also show the importance of the corpus and the language model to the success of a Turkish ASR system.
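
The subtitle-based collection step pairs each subtitle cue with the corresponding stretch of a movie's audio track. The authors do not publish their tooling, but a plausible first stage is parsing an SRT file into (start, end, text) segments, as in this Python sketch; the timestamp pattern and file handling are illustrative assumptions.

```python
# Sketch: parse an SRT subtitle file into (start_sec, end_sec, text) segments
# that could be cut from an audio track and paired with their transcriptions.
import re

CUE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(path):
    """Yield one (start, end, text) tuple per subtitle cue."""
    segments = []
    with open(path, encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = [ln for ln in block.strip().splitlines() if ln]
        if len(lines) < 3:
            continue                      # need index, timing, and text lines
        m = CUE.match(lines[1])
        if not m:
            continue
        g = m.groups()
        segments.append((to_seconds(*g[:4]), to_seconds(*g[4:]),
                         " ".join(lines[2:])))
    return segments
```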


2014 ◽  
Vol 679 ◽  
pp. 189-193 ◽  
Author(s):  
Rosemary T. Salaja ◽  
Ronan Flynn ◽  
Michael Russell

Research in speech recognition has produced different approaches that have been used for the classification of speech utterances in the back-end of an automatic speech recognition (ASR) system. As speech recognition is a pattern recognition problem, classification is an important part of any speech recognition system. This paper proposes a new back-end classifier that is based on artificial life (ALife) and describes how the proposed classifier can be used in a speech recognition system.


Author(s):  
Rene Avalloni de Morais ◽  
Baidya Nath Saha

Deep learning algorithms have made dramatic progress in the areas of natural language processing and automatic human speech recognition. However, the accuracy of deep learning algorithms depends on the amount and quality of the data, and training deep models requires high-performance computing resources. Against this backdrop, this paper addresses an end-to-end speech recognition system in which we fine-tune the Mozilla DeepSpeech architecture using two different datasets: the LibriSpeech clean dataset and the Harvard speech dataset. We train Long Short Term Memory (LSTM) based deep Recurrent Neural Network (RNN) models on the Google Colab platform, using its GPU resources. Extensive experimental results demonstrate that the Mozilla DeepSpeech model can be fine-tuned on different audio datasets to recognize speech successfully.
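
Once a DeepSpeech model has been fine-tuned and exported, it can be queried through the library's Python bindings. A minimal sketch for the deepspeech 0.9.x API follows; the model and scorer file names are placeholders, and the input audio is assumed to be 16-bit PCM mono at 16 kHz, which is what DeepSpeech expects.

```python
# Sketch: inference with a fine-tuned Mozilla DeepSpeech model (0.9.x API).
# File names are placeholders for an exported model and optional scorer.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-finetuned.pbmm")     # assumed exported model file
ds.enableExternalScorer("kenlm.scorer")     # optional external language model

with wave.open("sample.wav", "rb") as w:    # 16-bit PCM mono, 16 kHz
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))                        # decoded transcript
```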


The present manuscript focuses on building an automatic speech recognition (ASR) system for the Marathi language (M-ASR) using the Hidden Markov Model Toolkit (HTK). It details the experimentation and implementation of the M-ASR system with HTK. In this work, a total of 106 speaker-independent isolated Marathi words were recognized. These unique Marathi words are used to train and evaluate the M-ASR system. The speech corpus (database) was created by us from isolated Marathi words uttered by speakers of both genders. The system uses Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and Gaussian mixture models (GMM) for acoustic modeling. A Viterbi algorithm based on token passing is used for decoding, to recognize unknown utterances. The proposed M-ASR system is speaker independent and reports 96.23% word-level recognition accuracy.
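
At the core of HTK's recognizer is Viterbi decoding over an HMM; token passing is an equivalent, more scalable formulation of the same search. Below is a minimal Python sketch of the basic Viterbi recursion in log space, on an assumed toy two-state model, to make the decoding step concrete.

```python
# Sketch: Viterbi decoding over an HMM in log space. The toy two-state
# model at the bottom is an illustrative assumption, not the paper's model.
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path.

    log_init:  (S,)   log initial-state probabilities
    log_trans: (S, S) log transition probabilities [from, to]
    log_emit:  (T, S) log emission probability of each frame in each state
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace backpointers to the start
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 2 states, 4 frames of random emission probabilities
rng = np.random.default_rng(0)
log_emit = np.log(rng.dirichlet([1, 1], size=4))
log_trans = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
log_init = np.log(np.array([0.6, 0.4]))
print(viterbi(log_init, log_trans, log_emit))
```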


Computers ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 76 ◽  
Author(s):  
Laurynas Pipiras ◽  
Rytis Maskeliūnas ◽  
Robertas Damaševičius

Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field of language technology. The large majority of research in this area focuses on widely spoken languages such as English, and the problems of automatic Lithuanian speech recognition have attracted little attention so far. Due to Lithuanian's complicated language structure and the scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language that is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with an attention mechanism. The performance of these models is evaluated on an isolated speech recognition task (with an accuracy of 0.993) and a long-phrase recognition task (with an accuracy of 0.992).
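
To make the phoneme-to-word mapping concrete, here is a minimal Keras sketch of the attention-based encoder-decoder variant: a bidirectional encoder over phoneme IDs, a decoder over previous word IDs, and dot-product attention connecting the two. Vocabulary sizes and layer widths are illustrative assumptions, not the paper's configuration.

```python
# Sketch: attention encoder-decoder mapping phoneme ID sequences to word IDs.
# All sizes below are assumed values for illustration.
import tensorflow as tf
from tensorflow.keras import layers

N_PHONEMES, N_WORDS, DIM = 50, 1000, 128

# Encoder: embed phonemes, run a BiLSTM over the sequence (output dim 2*DIM)
enc_in = layers.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(N_PHONEMES, DIM, mask_zero=True)(enc_in)
enc_seq = layers.Bidirectional(layers.LSTM(DIM, return_sequences=True))(enc_emb)

# Decoder: embed previous word, LSTM (dim 2*DIM so dot-product attention
# against the encoder states is dimensionally valid), then attend and predict
dec_in = layers.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(N_WORDS, DIM, mask_zero=True)(dec_in)
dec_seq = layers.LSTM(2 * DIM, return_sequences=True)(dec_emb)
context = layers.Attention()([dec_seq, enc_seq])    # query, value
merged = layers.Concatenate()([dec_seq, context])
out = layers.Dense(N_WORDS, activation="softmax")(merged)

model = tf.keras.Model([enc_in, dec_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```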

