Developing Speech Recognition System for Quranic Verse Recitation Learning Software

Author(s):  
Budiman Putra, B. Atmaja, D. Prananto

The Quran, as the holy book of Muslims, prescribes many rules that must be observed to recite its verses properly. If a recitation does not follow these rules, the meaning of the recited verse can differ from the original. Intensive learning is needed to achieve correct recitation, yet the limited availability of teachers and of time to study recitation together in class can be an obstacle to learning. To reduce this obstacle and ease the learning process, we apply speech recognition techniques based on Mel Frequency Cepstral Coefficient (MFCC) features and Gaussian Mixture Model (GMM) modeling, and have designed and developed Quran verse recitation learning software at the prototype stage. The software is interactive multimedia with many features for flexible and effective learning. This paper describes the development of the speech recognition system behind the learning software, which is built to evaluate and correct Quran recitation. The authors present the prototype as built and tested, based on experimental data.
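As a rough illustration of the MFCC + GMM approach described above, the sketch below scores a learner's recitation against a GMM trained on reference recitations of the same verse. The libraries (librosa, scikit-learn), file names, and acceptance threshold are assumptions for illustration; the paper does not disclose its exact toolchain.

    # A minimal sketch of MFCC + GMM recitation scoring, assuming librosa and
    # scikit-learn; file names and the threshold value are hypothetical.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
        # Return MFCCs as an (n_frames, n_mfcc) matrix.
        y, _ = librosa.load(wav_path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    # Train one GMM per verse on frames pooled from correct reference recitations.
    train_frames = np.vstack([extract_mfcc(p)
                              for p in ["verse1_ref_a.wav", "verse1_ref_b.wav"]])
    gmm = GaussianMixture(n_components=16, covariance_type="diag",
                          max_iter=200, random_state=0).fit(train_frames)

    # Score a learner's attempt by its average per-frame log-likelihood.
    score = gmm.score(extract_mfcc("verse1_attempt.wav"))
    THRESHOLD = -45.0  # hypothetical; tuned on held-out recitations in practice
    print("accepted" if score > THRESHOLD else "needs correction", round(score, 2))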

2019, Vol 8 (3), pp. 7827-7831

Kannada is a regional language of India, spoken in Karnataka. This paper presents the development of a continuous Kannada speech recognition system using monophone and triphone modelling in HTK. Mel Frequency Cepstral Coefficients (MFCCs) are used for feature extraction; they exploit the cepstral and perceptual frequency scales, which leads to good recognition accuracy. A Hidden Markov Model is used as the classifier. Gaussian mixture splitting is performed to capture the variations of the phones. The paper reports the performance of the continuous Kannada Automatic Speech Recognition (ASR) system with 2, 4, 8, 16, and 32 Gaussian mixtures under both monophone and context-dependent triphone modelling. The experimental results show that context-dependent triphone modelling achieves better recognition accuracy than monophone modelling as the number of Gaussian mixtures is increased.
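The mixture-splitting step can be pictured with a small numpy sketch. HTK-style mixture incrementing, as commonly described, clones the heaviest component, halves its weight, and perturbs the two copies' means by plus/minus 0.2 standard deviations; the sketch below follows that recipe on diagonal-covariance parameters. It is a schematic of the splitting rule only: in a real HTK pipeline each doubling is followed by Baum-Welch re-estimation.

    # A schematic numpy sketch of Gaussian mixture splitting (HTK-style
    # mixture incrementing); variable names and shapes are illustrative.
    import numpy as np

    def split_heaviest(weights, means, variances):
        # Split the highest-weight diagonal-covariance Gaussian into two,
        # halving its weight and offsetting the means by +/-0.2 std.
        k = int(np.argmax(weights))
        std = np.sqrt(variances[k])
        new_w = np.concatenate([weights, [weights[k] / 2.0]])
        new_w[k] = weights[k] / 2.0
        new_m = np.vstack([means, means[k] + 0.2 * std])
        new_m[k] = means[k] - 0.2 * std
        new_v = np.vstack([variances, variances[k]])
        return new_w, new_m, new_v

    def double_mixtures(weights, means, variances):
        # Split repeatedly until the component count doubles.
        target = 2 * len(weights)
        while len(weights) < target:
            weights, means, variances = split_heaviest(weights, means, variances)
        return weights, means, variances

    # Grow a 39-dimensional model from 1 to 32 mixtures: 1 -> 2 -> 4 -> 8 -> 16 -> 32;
    # re-estimation would follow each doubling in practice.
    w, m, v = np.array([1.0]), np.zeros((1, 39)), np.ones((1, 39))
    for _ in range(5):
        w, m, v = double_mixtures(w, m, v)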


Author(s):  
Masoud Geravanchizadeh, Elnaz Forouhandeh, Meysam Bashirpour

The performance of speech recognition systems trained on neutral utterances degrades significantly when they are tested on emotional speech. Since anyone may speak emotionally in real-world environments, the emotional state of speech must be taken into account in the performance of an automatic speech recognition (ASR) system. Limited work has been done in the field of emotion-affected speech recognition; so far, most research has focused on the classification of speech emotions. In this paper, the vocal tract length normalization method is employed to enhance the robustness of an emotion-affected speech recognition system. For this purpose, two speech recognition architectures are used, hybrids of a hidden Markov model with either a Gaussian mixture model or a deep neural network. Frequency warping is applied to the filterbank and/or discrete-cosine-transform domain(s) in the feature extraction process, in such a way that the emotional feature components are normalized toward their corresponding neutral feature components. The performance of the proposed system is evaluated in neutrally trained/emotionally tested conditions for different speech features and emotional states (i.e., Anger, Disgust, Fear, Happy, and Sad), with frequency warping applied to different acoustic features. The emotion-affected speech recognition system is built on the Kaldi ASR toolkit, with the Persian emotional speech database and the crowd-sourced emotional multimodal actors dataset as input corpora. The experimental simulations reveal that, in general, warped emotional features yield better recognition performance than their unwarped counterparts, and that the deep neural network-hidden Markov model hybrid outperforms the Gaussian mixture model hybrid.
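A piecewise-linear warp of the frequency axis, applied to the mel filterbank centre frequencies, is a standard way to realize this kind of filterbank-domain normalization; the sketch below shows one such warp. The cut-off ratio and the example warping factor alpha = 0.9 are conventional VTLN choices for illustration, not values reported by the paper.

    # A minimal sketch of filterbank-domain frequency warping, assuming a
    # conventional piecewise-linear VTLN warp; alpha = 0.9 is illustrative.
    import numpy as np

    def piecewise_linear_warp(f, alpha, f_max=8000.0, cut_ratio=0.85):
        # Scale frequencies by alpha below the cut-off, then interpolate
        # linearly above it so that f_max still maps onto f_max.
        f = np.asarray(f, dtype=float)
        f_cut = cut_ratio * f_max
        warped = alpha * f
        high = f > f_cut
        slope = (f_max - alpha * f_cut) / (f_max - f_cut)
        warped[high] = alpha * f_cut + slope * (f[high] - f_cut)
        return warped

    # Warp the mel filterbank centre frequencies before building the filters;
    # alpha < 1 compresses the spectrum, alpha > 1 stretches it.
    centres = np.linspace(100.0, 7600.0, 26)
    warped_centres = piecewise_linear_warp(centres, alpha=0.9)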


2012, Vol 2012, pp. 1-9
Author(s):  
Peng Dai, Ing Yann Soon, Rui Tao

A new log-power-domain feature enhancement algorithm, named NLPS, is developed. It consists of two parts: direct solution of the nonlinear system model, and log-power subtraction. In contrast to other methods, the proposed algorithm does not need a prior statistical model of speech or noise; instead, it works by directly solving the nonlinear function derived from the speech recognition system. The second part, log-power subtraction, uses separate steps to refine the accuracy of the estimated cepstrum. The proposed algorithm resolves the discontinuity in the speech probability density function (PDF) caused by traditional spectral subtraction algorithms. Its effectiveness is evaluated extensively on the standard AURORA2 database. The results show that significant improvement is achieved: the proposed algorithm reaches a recognition rate of over 86% on noisy speech (averaged over SNRs from 0 dB to 20 dB), a 48% error reduction over the baseline Mel-frequency Cepstral Coefficient (MFCC) system.
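Since the exact NLPS equations are not reproduced here, the sketch below illustrates only the generic idea behind the second stage: subtracting an estimated noise power and taking logs, with a floor to keep the power positive (that flooring is precisely what creates the PDF discontinuity the paper addresses). Estimating noise from the first frames and the flooring constant are common simplifications, not the paper's formulation.

    # A generic sketch of subtraction in the (log-)power domain; the noise
    # estimate and flooring constant are assumptions, not the NLPS equations.
    import numpy as np

    def log_power_subtract(noisy_power, n_noise_frames=10, floor=1e-3):
        # noisy_power: (n_frames, n_bins) STFT power spectrogram.
        noise = noisy_power[:n_noise_frames].mean(axis=0)   # noise power estimate
        clean = noisy_power - noise                         # per-bin subtraction
        clean = np.maximum(clean, floor * noisy_power)      # keep power positive
        return np.log(clean)                                # log-power features

    # The log-power features would then pass through the usual MFCC pipeline
    # (mel filterbank, then DCT) before reaching the recognizer.
    demo = np.abs(np.random.randn(100, 257)) ** 2           # stand-in spectrogram
    features = log_power_subtract(demo)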

