Automatic Speech Recognition System
Recently Published Documents


TOTAL DOCUMENTS: 174 (FIVE YEARS: 60)

H-INDEX: 9 (FIVE YEARS: 3)

2021, Vol 1 (1), pp. 453-478
Author(s): Heriyanto Heriyanto, Herlina Jayadianti, Juwairiah Juwairiah

There are two approaches to Qur'an recitation, namely talaqqi and qira'ati. Both approaches use the science of recitation, which contains the rules and procedures for reading the Qur'an properly. Talaqqi requires the teacher and students to sit facing each other, while qira'ati is the recitation of the Qur'an with rhythms and tones. Many studies have developed automatic speech recognition systems for Qur'an recitation to support the learning process. Common feature extraction models use Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC). The MFCC method achieves an accuracy of 50% to 60%, while LPC reaches only 45% to 50%, so the non-linear MFCC method is more accurate than the linear approach. The cepstral coefficient features used range from index 0 to 23, i.e., 24 cepstral coefficients, and the frames taken range from 0 to 10, i.e., eleven frames. Voting over 300 recorded voice samples was tested against 200 voice recordings of both male and female voices, sampled at 44.1 kHz, 16-bit stereo. This study aims to obtain good accuracy by selecting the right cepstral coefficient feature using MFCC feature extraction and by matching accuracy through selection of the cepstral coefficient feature with Dominant Weight Normalization (NBD) at TPA Nurul Huda Plus Purbayan. The results showed that the MFCC method with the 23rd cepstral coefficient selected achieved a higher accuracy of 90.2% than the other configurations. It can be concluded that selecting the right feature, the 23rd cepstral coefficient, affects the accuracy of recognizing the voice of Qur'an recitation.
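As an illustration of the kind of feature extraction this abstract describes, the sketch below extracts 24 cepstral coefficients (indices 0 to 23) per frame and then selects the 23rd coefficient across the first eleven frames. It assumes librosa as the extraction tool; the file name, helper names, and parameter choices are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch of MFCC feature extraction with cepstral-coefficient selection.
# librosa is an assumed dependency; the file name, frame count, and helper names
# are illustrative and not taken from the authors' implementation.
import librosa

def extract_mfcc(path, n_mfcc=24, n_frames=11, sr=44100):
    """Return the first `n_frames` frames of `n_mfcc` cepstral coefficients
    (indices 0..n_mfcc-1) for one recording."""
    y, sr = librosa.load(path, sr=sr, mono=True)             # downmix stereo to mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, total_frames)
    return mfcc[:, :n_frames]

features = extract_mfcc("recitation_sample.wav")   # hypothetical recording
coef_23 = features[23, :]   # the 23rd cepstral coefficient across the selected frames
```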


Author(s): Masoud Geravanchizadeh, Elnaz Forouhandeh, Meysam Bashirpour

The performance of speech recognition systems trained with neutral utterances degrades significantly when these systems are tested with emotional speech. Since everybody can speak emotionally in real-world environments, it is necessary to account for the emotional state of speech when evaluating the performance of an automatic speech recognition system. Limited work has been done in the field of emotion-affected speech recognition, and most research so far has focused on the classification of speech emotions. In this paper, the vocal tract length normalization method is employed to enhance the robustness of the emotion-affected speech recognition system. For this purpose, two speech recognition architectures are used, based on hybrids of a hidden Markov model with either a Gaussian mixture model or a deep neural network. To achieve this goal, frequency warping is applied to the filterbank and/or discrete-cosine-transform domains in the feature extraction stage of the automatic speech recognition system. The warping is designed to normalize the emotional feature components so that they approach their corresponding neutral feature components. The performance of the proposed system is evaluated in neutrally trained/emotionally tested conditions for different speech features and emotional states (i.e., Anger, Disgust, Fear, Happy, and Sad), with frequency warping applied to different acoustical features. The constructed emotion-affected speech recognition system is based on the Kaldi automatic speech recognition toolkit, with the Persian Emotional Speech Database and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) as input corpora. The experimental simulations reveal that, in general, the warped emotional features yield better performance of the emotion-affected speech recognition system than their unwarped counterparts. The deep neural network-hidden Markov model hybrid also outperforms the system employing the Gaussian mixture model hybrid.
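The following is a minimal sketch of filterbank-domain frequency warping in the spirit of vocal tract length normalization, not the authors' exact warping function: a piecewise-linear warp with factor alpha is applied to the FFT bin frequencies before the mel filterbank is summed. librosa is assumed; the helper names and the knee at 0.85 of the Nyquist frequency are illustrative choices.

```python
# Hedged sketch of filterbank-domain frequency warping (VTLN-style); this is
# not the authors' exact method. librosa is assumed; the piecewise-linear warp
# and the 0.85 knee are illustrative choices.
import numpy as np
import librosa

def warp_frequencies(freqs, alpha, f_max):
    """Piecewise-linear warp: scale by `alpha` up to a knee, then interpolate
    linearly so that f_max still maps to f_max."""
    knee = 0.85 * f_max
    return np.where(
        freqs <= knee,
        alpha * freqs,
        alpha * knee + (f_max - alpha * knee) * (freqs - knee) / (f_max - knee),
    )

def warped_logmel(y, sr, alpha=1.0, n_mels=26, n_fft=512):
    """Log-mel energies computed after warping the FFT bin frequencies."""
    power = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2
    fft_freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    warped = warp_frequencies(fft_freqs, alpha, f_max=sr / 2)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    # Evaluate each mel filter at the warped bin positions (simple interpolation),
    # which stretches or compresses the filters along the frequency axis.
    mel_fb_warped = np.stack([np.interp(warped, fft_freqs, fb) for fb in mel_fb])
    return np.log(mel_fb_warped @ power + 1e-10)
```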


2021, Vol 3
Author(s): Roozbeh Sadeghian, J. David Schaffer, Stephen A. Zahorian

Automatic Speech Recognition (ASR) is widely used in many applications and tools. Smartphones, video games, and cars are a few examples where people use ASR routinely, often daily. A less common but potentially very important arena for ASR is the health domain. For some people, the impact on life could be enormous. The goal of this work is to develop an easy-to-use, non-invasive, inexpensive speech-based diagnostic test for dementia that can be applied easily in a clinician's office or even at home. While considerable work has been published along these lines, and its volume has increased dramatically in recent years, it is primarily of theoretical value and not yet practical to apply. A large gap exists between current scientific understanding and the creation of a diagnostic test for dementia. The aim of this paper is to bridge this gap between theory and practice by engineering a practical test. Experimental evidence suggests that strong discrimination between subjects with a diagnosis of probable Alzheimer's and matched normal controls can be achieved with a combination of acoustic features from speech, linguistic features extracted from a transcription of the speech, and the results of a mini mental state exam. A fully automatic speech recognition system tuned for the speech-to-text aspect of this application, including automatic punctuation, is also described.
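A hedged sketch of the kind of combined-feature classification the abstract describes follows: acoustic features, linguistic features from a transcript, and a mini mental state exam (MMSE) score are concatenated per subject and fed to a simple classifier. scikit-learn is assumed, and all data, dimensions, and labels are hypothetical placeholders rather than the authors' pipeline.

```python
# Illustrative sketch (not the authors' pipeline): acoustic, linguistic, and
# MMSE features are concatenated and classified. All data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def combine_features(acoustic_vec, linguistic_vec, mmse_score):
    """Concatenate the three feature sources into one vector per subject."""
    return np.concatenate([acoustic_vec, linguistic_vec, [mmse_score]])

# Hypothetical data: 40 subjects, 12 acoustic + 8 linguistic features + MMSE.
rng = np.random.default_rng(0)
X = np.stack([
    combine_features(rng.normal(size=12), rng.normal(size=8), rng.integers(10, 30))
    for _ in range(40)
])
y = rng.integers(0, 2, size=40)   # 1 = probable Alzheimer's, 0 = control (placeholder labels)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy
```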

