Automatic Speech Recognition (ASR) System for Isolated Marathi Words Using HTK

The present manuscript focuses on building an automatic speech recognition (ASR) system for the Marathi language (M-ASR) using the Hidden Markov Model Toolkit (HTK). It details the experimentation and implementation of the M-ASR system with the HTK Toolkit. In this work, a total of 106 speaker-independent isolated Marathi words were recognized; these unique words are used to train and evaluate the M-ASR system. The speech corpus (database) was created by the authors from isolated Marathi words uttered by speakers of both genders. The system uses Mel Frequency Cepstral Coefficients (MFCCs) for feature extraction and Gaussian mixture models (GMMs) for acoustic modeling. A token-passing Viterbi algorithm is used for decoding to recognize unknown utterances. The proposed M-ASR system is speaker independent and reports 96.23% word-level recognition accuracy.
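To make the front end concrete, here is a minimal sketch of MFCC extraction as typically configured for HMM systems. The paper itself uses HTK's own tools; librosa, the 16 kHz sampling rate, the 25 ms/10 ms framing, and the file name below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal MFCC front-end sketch (assumed parameters, not the paper's HTK setup).
import librosa
import numpy as np

def extract_mfcc(wav_path, n_mfcc=13):
    """Return an (n_frames, 39) matrix: MFCCs + deltas + delta-deltas,
    the feature layout HTK-style HMM systems commonly use."""
    y, sr = librosa.load(wav_path, sr=16000)              # assumed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms / 10 ms frames
    delta = librosa.feature.delta(mfcc)                   # first-order dynamics
    delta2 = librosa.feature.delta(mfcc, order=2)         # second-order dynamics
    return np.vstack([mfcc, delta, delta2]).T

features = extract_mfcc("marathi_word_001.wav")           # hypothetical file
print(features.shape)                                     # (n_frames, 39)
```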

Author(s):  
Aye Nyein Mon ◽  
Win Pa Pa ◽  
Ye Kyaw Thu

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research is conducted by researchers around the world to improve their language technologies. Speech corpora are essential for developing ASR systems, and creating them is especially necessary for low-resourced languages. Myanmar can be regarded as a low-resourced language because it lacks pre-existing resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) was created for Myanmar ASR research. The corpus covers two domains, news and daily conversations, and totals over 42 hours of speech: 25 hours of web news and 17 hours of recorded conversational data. The news data were collected from 177 female and 84 male speakers, and the conversational data from 42 female and 4 male speakers. This corpus was used as training data for developing Myanmar ASR. Three different types of acoustic models, Gaussian Mixture Model - Hidden Markov Model (GMM-HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN), were built and their results compared. Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The performance of Myanmar ASR systems trained on this corpus was satisfactory on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
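To illustrate the kind of acoustic-model comparison described above, here are skeletal PyTorch versions of frame-level DNN and CNN classifiers. The layer sizes, the 40-dimensional input, the 11-frame context window, and the senone count are all assumptions for the sketch, not the paper's actual models.

```python
# Illustrative DNN vs. CNN acoustic-model skeletons (assumed dimensions).
import torch
import torch.nn as nn

N_FEATS, N_SENONES = 40, 2000      # assumed: 40-dim features, 2000 HMM states

dnn = nn.Sequential(               # frame-wise feed-forward acoustic model
    nn.Linear(N_FEATS * 11, 1024), nn.ReLU(),   # 11-frame context window
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, N_SENONES),
)

cnn = nn.Sequential(               # convolves over (time, frequency) patches
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 5 * 20, N_SENONES),  # an 11x40 patch pools down to 5x20
)

x = torch.randn(8, 1, 11, 40)      # batch of 11-frame, 40-dim feature patches
print(cnn(x).shape)                # torch.Size([8, 2000])
print(dnn(x.flatten(1)).shape)     # torch.Size([8, 2000])
```

Both models predict per-frame HMM state posteriors; in a full system these would replace the GMM likelihoods during decoding.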


Author(s):  
Mohit Dua ◽  
Pawandeep Singh Sethi ◽  
Vinam Agrawal ◽  
Raghav Chawla

Introduction: An Automatic Speech Recognition (ASR) system recognizes speech utterances and can thus be used to convert speech into text for various purposes. These systems are deployed in different environments, clean or noisy, and are used by people of all ages and types, which presents some of the major difficulties faced in developing an ASR system. An ASR system therefore needs to be efficient, accurate, and robust. Our main goal is to minimize the error rate during both the training and testing phases of an ASR system. ASR performance depends on the combination of feature extraction and back-end techniques used. In this paper, the performance of different combinations of feature extraction and back-end techniques is compared on a continuous speech recognition system.

Methods: Hidden Markov Models (HMMs), Subspace Gaussian Mixture Models (SGMMs), and Deep Neural Networks (DNNs) with DNN-HMM architecture, namely Karel's, Dan's, and a hybrid DNN-SGMM architecture, are used at the back end of the implemented system. Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Gammatone Frequency Cepstral Coefficients (GFCC) are used as feature extraction techniques at the front end. The Kaldi toolkit was used for implementation, and the system was trained on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) English speech corpus.

Results: The experimental results show that MFCC outperforms GFCC and PLP in noiseless conditions, while PLP tends to outperform MFCC and GFCC in noisy conditions. Furthermore, the hybrid of Dan's DNN implementation with SGMM performs best for back-end acoustic modeling. The proposed architecture, with PLP feature extraction at the front end and the hybrid of Dan's DNN implementation with SGMM at the back end, outperforms the other combinations in a noisy environment.

Conclusion: Automatic speech recognition has numerous applications, such as home automation, personal assistants, and robotics, so an ASR system with good performance is highly desirable. ASR performance is affected by factors including vocabulary size; whether the system is speaker dependent or independent; whether the speech is isolated, discontinuous, or continuous; and adverse conditions such as noise. This paper presented an ensemble architecture that uses PLP for feature extraction at the front end and a hybrid of SGMM + Dan's DNN at the back end to build a noise-robust ASR system.

Discussion: The presented work compares continuous ASR systems built from different combinations of front-end feature extraction (MFCC, PLP, and GFCC) and back-end acoustic modeling (mono-phone, tri-phone, SGMM, DNN, and hybrid DNN-SGMM) techniques. Each front-end technique is tested in combination with each back-end technique, and the results are compared to find the best-performing combination in noisy and clean conditions.
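All of the combinations above are scored by word error rate. As a reference point, here is a minimal WER implementation via Levenshtein distance; this is the standard (S + D + I) / N definition, not Kaldi's own scoring script.

```python
# Minimal word error rate (WER) via edit distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)  # (S + D + I) / N

print(wer("she had your dark suit", "she had dark suit"))  # 0.2 (one deletion)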


Symmetry ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 290 ◽  
Author(s):  
Huseyin Polat ◽  
Saadin Oyucu

To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step in developing an ASR system for a language with few transcribed speech documents available. Turkish is a language with limited resources for ASR; therefore, developing a Turkish transcribed speech corpus symmetric to the corpora of high-resource languages is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed-corpus preparation techniques for collecting Turkish speech data, using three different methods. In the first step, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for speech utterances obtained from movies. In the second step, data were collected via a mobile application. In the third step, a transfer learning approach was applied to session records (video and text) of the Grand National Assembly of Turkey. We also provide initial Turkish speech recognition results for artificial neural network and Gaussian-mixture-model-based acoustic models. Models were trained on the newly collected corpus together with existing corpora published by the Linguistic Data Consortium. In light of the test results on the existing corpora, the study shows the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus was more evident as the amount of verified data increased, compensating for the status of Turkish as a low-resource language. For further studies, the importance of the corpus and the language model to the success of a Turkish ASR system is shown.
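The first collection step pairs movie audio with subtitle text. A minimal sketch of that step is shown below: parsing SubRip (.srt) cues into (start, end, transcript) triples ready for segmentation. The regex-based parser and file name are illustrative assumptions; the authors' actual pipeline is not described at this level of detail.

```python
# Sketch: turn subtitle cues into utterance/transcript pairs for ASR training.
import re

TIMESTAMP = re.compile(
    r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")

def parse_srt(path):
    """Yield (start_sec, end_sec, text) for each subtitle cue in an .srt file."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.splitlines()
        if len(lines) < 3:                 # need index, timestamp, and text
            continue
        m = TIMESTAMP.match(lines[1])      # line 0 is the cue index
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        yield start, end, " ".join(lines[2:])

for start, end, text in parse_srt("movie.srt"):   # hypothetical file
    print(f"{start:.2f}-{end:.2f}: {text}")
```

Each triple can then be used to cut the corresponding audio span out of the movie soundtrack, giving a weakly supervised utterance/transcript pair.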


Author(s):  
Deepang Raval ◽  
Vyom Pathak ◽  
Muktan Patel ◽  
Brijesh Bhatt

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep-learning-based approach comprising Convolutional Neural Network layers, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as the loss function. To improve the system's performance given the limited size of the dataset, we present a prefix decoding technique based on a combined language model (a word-level and a character-level language model) and a post-processing technique based on Bidirectional Encoder Representations from Transformers. To gain key insights from our Automatic Speech Recognition (ASR) system, we use its inferences and propose different analysis methods. These insights help us understand and improve the ASR system, as well as provide intuition about the language it is built for. We trained the model on the Microsoft Speech Corpus and observe a 5.87% decrease in Word Error Rate (WER) relative to the base-model WER.
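An illustrative PyTorch rendering of the architecture family described above (Conv, BiLSTM, Dense, trained with CTC loss) follows. The layer dimensions, the assumed 80-dimensional input features, and the character-set size are sketch assumptions, not the authors' exact model.

```python
# Sketch of a Conv + BiLSTM + Dense model trained with CTC (assumed sizes).
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    def __init__(self, n_feats=80, n_chars=60):        # assumed charset size
        super().__init__()
        self.conv = nn.Conv1d(n_feats, 128, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(128, 256, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, n_chars + 1)          # +1 for the CTC blank

    def forward(self, x):                              # x: (batch, time, n_feats)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.lstm(x)
        return self.fc(x).log_softmax(-1)              # per-frame char log-probs

model = CTCModel()
ctc_loss = nn.CTCLoss(blank=60)                        # blank index = n_chars
logits = model(torch.randn(4, 200, 80))                # 4 utterances, 200 frames
targets = torch.randint(0, 60, (4, 30))                # dummy label sequences
loss = ctc_loss(logits.transpose(0, 1),                # CTCLoss wants (T, N, C)
                targets,
                input_lengths=torch.full((4,), 200),
                target_lengths=torch.full((4,), 30))
print(loss.item())
```

At inference time, the per-frame character log-probabilities would feed the paper's prefix decoder, where the combined word/character language model rescores candidate prefixes.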


2021 ◽  
Vol 10 (1) ◽  
pp. 28-40
Author(s):  
Mohamed Hamidi ◽  
Hassan Satori ◽  
Ouissam Zealouk ◽  
Naouar Laaidi

In this study, the authors explore the integration of speaker-independent automatic Amazigh speech recognition technology into interactive applications to extract data from a remote database. Combining interactive voice response (IVR) and automatic speech recognition (ASR) technologies, the authors built an interactive speech system that allows users to interact with it through voice commands. Hidden Markov models (HMMs), Gaussian mixture models (GMMs), and Mel frequency cepstral coefficients (MFCCs) are used to develop a speech system based on the first ten Amazigh digits and six Amazigh words. The best performance obtained is 89.64%, achieved with 3 HMMs and 16 GMMs.
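A minimal sketch of a per-word GMM-HMM recognizer of this kind, using hmmlearn, is shown below. The toolkit choice is ours, not the authors'; we also read the best setting of "3 HMMs and 16 GMMs" as 3 HMM states with 16 Gaussian mixtures per state, which is an assumption.

```python
# Sketch: one GMM-HMM per vocabulary word, scored by likelihood at test time.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_models(training_data):
    """training_data: {word: [MFCC matrix of shape (n_frames, n_dims), ...]}"""
    models = {}
    for word, utterances in training_data.items():
        X = np.vstack(utterances)                     # stack all frames
        lengths = [u.shape[0] for u in utterances]    # per-utterance frame counts
        model = GMMHMM(n_components=3, n_mix=16,      # 3 states, 16 mixtures/state
                       covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[word] = model
    return models

def recognize(models, mfcc):
    """Pick the word whose model assigns the utterance the highest likelihood."""
    return max(models, key=lambda w: models[w].score(mfcc))
```

In the IVR setting described above, the word returned by `recognize` would be mapped to a query against the remote database.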


2017 ◽  
Vol 60 (9) ◽  
pp. 2394-2405 ◽  
Author(s):  
Lionel Fontan ◽  
Isabelle Ferrané ◽  
Jérôme Farinas ◽  
Julien Pinquier ◽  
Julien Tardieu ◽  
...  

Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids.

Method: Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and one comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to fit human performance.

Results: Strong significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance.

Conclusion: Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performance of an ASR-based system. It remains to be determined whether the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.
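The headline result above is a correlation between human and ASR score vectors. A minimal sketch of that comparison with SciPy follows; the score values are dummy placeholders, not the study's data.

```python
# Sketch: Pearson correlation between human and ASR scores (dummy values).
import numpy as np
from scipy.stats import pearsonr

human_scores = np.array([0.95, 0.82, 0.61, 0.40, 0.22])  # e.g., per severity level
asr_scores = np.array([0.93, 0.78, 0.55, 0.31, 0.12])

r, p = pearsonr(human_scores, asr_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```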

