Usage of Prosody Modification and Acoustic Adaptation for Robust Automatic Speech Recognition (ASR) System

Most automatic speech recognition (ASR) systems are trained on adult speech because children's speech datasets are scarce. The recognition rate of such systems is very low when they are tested on children's speech, owing to the inter-speaker acoustic variability between adults' and children's speech. This variability arises mainly from children's higher pitch and lower speaking rate. The main objective of this research is therefore to increase the recognition rate of the Punjabi-ASR system by reducing this variability through prosody modification and speaker adaptive training. Prosody modification alters the pitch period and duration (speaking rate) of the speech signal without affecting its naturalness or message, which helps to overcome the acoustic differences between adults' and children's speech. The developed Punjabi-ASR system is trained on adult speech and prosody-modified adult speech. The prosody-modified speech reduces the need for large amounts of children's speech when training the ASR system and improves the recognition rate. Results show that prosody modification and speaker adaptive training reduce the word error rate (WER) of the Punjabi-ASR system to 8.79% when it is tested on children's speech.
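
The word error rate quoted in this abstract is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that conventional definition, assuming whitespace tokenization; the function name `word_error_rate` is illustrative and does not come from the paper, whose scoring tool is not stated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, a hypothesis that drops one word from a six-word reference scores one deletion over six reference words. Because insertions are counted, the value can exceed 1.0 for very poor hypotheses.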

Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Sensors ◽

10.3390/s21093063 ◽

2021 ◽

Vol 21 (9) ◽

pp. 3063

Author(s):

Aleksandr Laptev ◽

Andrei Andrusenko ◽

Ivan Podluzhny ◽

Anton Mitrofanov ◽

Ivan Medennikov ◽

...

Keyword(s):

Speech Recognition ◽

Error Rate ◽

Rapid Development ◽

Computational Cost ◽

Vocabulary Size ◽

Word Error Rate ◽

Low Resource ◽

Steady Improvement ◽

End To End ◽

ASR System

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
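
The non-deterministic tokenization described in this abstract can be illustrated with a small sketch of BPE-dropout in its commonly described form: at every merge step, each applicable merge is skipped with probability `dropout`, and the highest-priority surviving merge is applied. This is a toy under stated assumptions, not the authors' implementation; the function name, the merge table format (an ordered list of pairs), and the default dropout value are all hypothetical.

```python
import random


def bpe_dropout_tokenize(word, merges, dropout=0.1, rng=None):
    """Segment `word` with BPE merges, randomly dropping merges at each step.

    `merges` is an ordered list of (left, right) pairs; earlier pairs have
    higher priority. dropout=0.0 gives ordinary deterministic BPE and
    dropout=1.0 leaves the word split into single characters.
    """
    rng = rng or random.Random()
    rank = {pair: index for index, pair in enumerate(merges)}
    tokens = list(word)

    while len(tokens) > 1:
        # Collect every adjacent pair that has a merge rule and survives dropout.
        candidates = []
        for pos in range(len(tokens) - 1):
            pair = (tokens[pos], tokens[pos + 1])
            if pair in rank and rng.random() >= dropout:
                candidates.append((rank[pair], pos))
        if not candidates:
            break  # nothing left to merge at this step
        _, pos = min(candidates)  # best priority, leftmost on ties
        tokens[pos:pos + 2] = [tokens[pos] + tokens[pos + 1]]
    return tokens
```

Calling the function repeatedly on the same word with a non-zero dropout yields different segmentations, which is what exposes the model to a wider range of subword contexts during training.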

Using Adaptive Filter and Wavelets to Increase Automatic Speech Recognition Rate in Noisy Environment

MICAI 2007: Advances in Artificial Intelligence - Lecture Notes in Computer Science ◽

10.1007/978-3-540-76631-5_97 ◽

2007 ◽

pp. 1015-1024 ◽

Cited By ~ 1

Author(s):

José Luis Oropeza Rodríguez ◽

Sergio Suárez Guerra

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Adaptive Filter ◽

Recognition Rate ◽

Noisy Environment

Exploring the Role of Speaking-Rate Adaptation on Children's Speech Recognition

2018 International Conference on Signal Processing and Communications (SPCOM) ◽

10.1109/spcom.2018.8724478 ◽

2018 ◽

Author(s):

S. Shahnawazuddin ◽

Hemant K. Kathania ◽

Chaman Singh ◽

Waquar Ahmad ◽

Gayadhar Pradhan

Keyword(s):

Speech Recognition ◽

Rate Adaptation ◽

Speaking Rate ◽

Children's Speech Recognition ◽

Children's Speech

Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners With Simulated Age-Related Hearing Loss

Journal of Speech Language and Hearing Research ◽

10.1044/2017_jslhr-s-16-0269 ◽

2017 ◽

Vol 60 (9) ◽

pp. 2394-2405 ◽

Cited By ~ 6

Author(s):

Lionel Fontan ◽

Isabelle Ferrané ◽

Jérôme Farinas ◽

Julien Pinquier ◽

Julien Tardieu ◽

...

Keyword(s):

Hearing Loss ◽

Speech Recognition ◽

Automatic Speech Recognition ◽

Hearing Aids ◽

Speech Processing ◽

Fine Tuning ◽

Language Models ◽

Age Related ◽

Age Related Hearing Loss ◽

ASR System

Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids. Method: Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and 1 comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to fit human performances. Results: Strong significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance. Conclusion: Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performances of an ASR-based system. In the future, it needs to be determined if the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.

Enhanced Automatic Speech Recognition System Based on Enhancing Power-Normalized Cepstral Coefficients

Applied Sciences ◽

10.3390/app9102166 ◽

2019 ◽

Vol 9 (10) ◽

pp. 2166 ◽

Cited By ~ 3

Author(s):

Mohamed Tamazin ◽

Ahmed Gouda ◽

Mohamed Khedr

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Additive White Gaussian Noise ◽

Recognition Rate ◽

Data Entry ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System ◽

Different Types ◽

A New Technique

Many new consumer applications are based on the use of automatic speech recognition (ASR) systems, such as voice command interfaces, speech-to-text applications, and data entry processes. Although ASR systems have remarkably improved in recent decades, the speech recognition system performance still significantly degrades in the presence of noisy environments. Developing a robust ASR system that can work in real-world noise and other acoustic distorting conditions is an attractive research topic. Many advanced algorithms have been developed in the literature to deal with this problem; most of these algorithms are based on modeling the behavior of the human auditory system with perceived noisy speech. In this research, the power-normalized cepstral coefficient (PNCC) system is modified to increase robustness against the different types of environmental noises, where a new technique based on gammatone channel filtering combined with channel bias minimization is used to suppress the noise effects. The TIDIGITS database is utilized to evaluate the performance of the proposed system in comparison to the state-of-the-art techniques in the presence of additive white Gaussian noise (AWGN) and seven different types of environmental noises. In this research, one word is recognized from a set containing 11 possibilities only. The experimental results showed that the proposed method provides significant improvements in the recognition accuracy at low signal-to-noise ratios (SNR). In the case of subway noise at SNR = 5 dB, the proposed method outperforms the mel-frequency cepstral coefficient (MFCC) and relative spectral (RASTA)–perceptual linear predictive (PLP) methods by 55% and 47%, respectively. Moreover, the recognition rate of the proposed method is higher than the gammatone frequency cepstral coefficient (GFCC) and PNCC methods in the case of car noise. It is enhanced by 40% in comparison to the GFCC method at SNR = 0 dB, while it is improved by 20% in comparison to the PNCC method at SNR = −5 dB.
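
Evaluations like the one described here depend on corrupting clean speech at a controlled SNR. The sketch below shows one common way to add white Gaussian noise at a target SNR in decibels, using SNR = 10 log10(P_signal / P_noise). It is a generic illustration under that definition, not the procedure used in the paper; `add_awgn` is a hypothetical name and the signal is assumed to be a non-empty list of float samples.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return `signal` plus white Gaussian noise scaled to the target SNR (dB)."""
    if not signal:
        raise ValueError("signal must contain at least one sample")
    rng = rng or random.Random()

    # Average power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # Solve SNR_dB = 10 * log10(signal_power / noise_power) for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)  # zero-mean noise: variance equals power

    return [x + rng.gauss(0.0, sigma) for x in signal]
```

Lower `snr_db` values give noisier output, and negative values mean the noise carries more power than the speech. The realized SNR of any finite sample only approximates the target because the noise is random.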

Noise adaptive training using a vector taylor series approach for noise robust automatic speech recognition

2009 IEEE International Conference on Acoustics, Speech and Signal Processing ◽

10.1109/icassp.2009.4960461 ◽

2009 ◽

Cited By ~ 23

Author(s):

Ozlem Kalinli ◽

Michael L. Seltzer ◽

Alex Acero

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Taylor Series ◽

Adaptive Training ◽

Noise Robust

Isolated word Automatic Speech Recognition (ASR) System using MFCC, DTW & KNN

2016 Asia Pacific Conference on Multimedia and Broadcasting (APMediaCast) ◽

10.1109/apmediacast.2016.7878163 ◽

2016 ◽

Cited By ~ 3

Author(s):

Muhammad Atif Imtiaz ◽

Gulistan Raja

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Isolated Word ◽

ASR System

Indigenous Vocabulary Reformulation for Continuous Yorùbá Speech Recognition in M-Commerce Using Acoustic Nudging-Based Gaussian Mixture Model

10.21203/rs.3.rs-211622/v1 ◽

2021 ◽

Author(s):

Kehinde Lydia Ajayi ◽

Victor Azeta ◽

Isaac Odun-Ayo ◽

Ambrose Azeta ◽

Ajayi Peter Taiwo ◽

...

Keyword(s):

Speech Recognition ◽

Gaussian Mixture Model ◽

Mixture Model ◽

Error Rate ◽

System Performance ◽

Recognition Rate ◽

Gaussian Mixture ◽

Computer Applications ◽

Word Error Rate ◽

The Mean

Speech recognition, in which computer applications aid the recognition of speech signals, is a current research area. In this paper, an Acoustic Nudging (AN) model is used to reformulate persistent automatic speech recognition (ASR) errors that involve users' irrational acoustic behavior, which alters speech recognition accuracy. A Gaussian mixture model (GMM) helped to address the low-resourced nature of the Yorùbá language and to achieve better accuracy and system performance. The simulated results show that the proposed Acoustic Nudging-based Gaussian Mixture Model (ANGM) improves accuracy and system performance, evaluated in terms of Word Recognition Rate (WRR) and Word Error Rate (WER) as given by validation, testing, and training accuracy. The mean WRR achieved by the ANGM model is 95.277% and the mean WER is 4.723% when compared with existing models. The approach thereby reduces the error rate by 1.1%, 0.5%, 0.8%, 0.3%, and 1.4% when compared with other models. This work therefore provides a foundation for advancing the current understanding of under-resourced languages and for developing accurate and precise speech recognition models.

Automatic Speech Recognition with Stuttering Speech Removal using Long Short-Term Memory (LSTM)

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.e6230.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 1677-1681

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Speech Signal ◽

Short Term Memory ◽

Long Short Term Memory ◽

Increase In Accuracy ◽

Two Stages ◽

The Given ◽

ASR System

Stuttering, or stammering, is a speech disorder in which sounds, syllables, or words are repeated or prolonged, disrupting the normal flow of speech. Stuttering can make it hard to communicate with other people, which often affects a person's quality of life. An automatic speech recognition (ASR) system converts an audio speech signal into the corresponding text. ASR systems now play a major role in controlling or providing input to various applications. ASR systems and machine translation applications suffer considerably from stuttering (speech dysfluency). Dysfluencies reduce the word recognition accuracy of an ASR system by increasing word insertion, substitution, and deletion rates. In this work, we focus on detecting and removing prolongations, silent pauses, and repetitions to generate a proper text sequence for a given stuttered speech signal. Stuttered speech recognition consists of two stages: classification using LSTM and testing in ASR. The major phases of the classification system are re-sampling, segmentation, pre-emphasis, epoch extraction, and classification. The work is carried out on the UCLASS stuttering dataset using MATLAB, with a 4% to 6% increase in accuracy compared with ANN and SVM.
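
Of the phases listed in this abstract, pre-emphasis is the simplest to show concretely. In its usual form it is a first-order high-pass filter, y[n] = x[n] - a * x[n-1], that boosts high frequencies before feature extraction. The sketch below assumes that standard form and a typical coefficient of 0.97; the abstract states neither, and the paper's implementation is in MATLAB, so the Python function name and default here are illustrative only.

```python
def pre_emphasis(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample passes through unchanged."""
    if not samples:
        return []
    output = [samples[0]]
    # Pair each sample with its predecessor and subtract the scaled predecessor.
    for previous, current in zip(samples, samples[1:]):
        output.append(current - coeff * previous)
    return output
```

A constant (DC) input is strongly attenuated after the first sample, while rapid sample-to-sample changes are largely preserved, which is the intended high-frequency emphasis.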

Study of algorithms to combine multiple automatic speech recognition (ASR) system outputs

10.17760/d10019273 ◽

2009 ◽

Author(s):

Harish Kashyap Krishnamurthy

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

ASR System
