Test Of Vowels In Speech Recognition Using Continuous Density Hidden Markov Model And Development Of Phonetically Balanced-Words In The Filipino Language

AbstractAn Automatic Speech Recognition (ASR) converts the speech signals into words. The recognized words can be the final output or it can be an input for a natural language processing. In this paper, vowel recognizer using Continuous density HMM and Mel-Frequency Cepstral Coefficient (MFCC) were used for feature extraction for its development, and phonetically balanced words (PBW) in Filipino were developed. Thus, this study is a preparation for Filipino Language ASR using HMM. For vowel recognizer, forty speakers were trained (20 male and 20 female speakers). An average accuracy rate of 94.5% was achieved for speaker-dependent test and 90.8% for speaker independent test. For PBW, 2 word lists were developed consisting of 257 words for the 2-syllable Filipino PBW word list and 212 words for the 3-syllable Filipino PBW word list.

Download Full-text

Speaker Independent Continuous Speech Recognition Using Continuous Density Hidden Markov Models

Speech Recognition and Understanding ◽

10.1007/978-3-642-76626-8_16 ◽

1992 ◽

pp. 135-163 ◽

Cited By ~ 2

Author(s):

Chin-Hui Lee ◽

Lawrence R. Rabiner ◽

Roberto Pieraccini

Keyword(s):

Speech Recognition ◽

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Continuous Speech ◽

Continuous Speech Recognition ◽

Continuous Density ◽

Speaker Independent

Download Full-text

Classification of Javanese Script Hanacara Voice Using Mel Frequency Cepstral Coefficient MFCC and Selection of Dominant Weight Features

JURNAL INFOTEL ◽

10.20895/infotel.v13i2.657 ◽

2021 ◽

Vol 13 (2) ◽

pp. 84-93

Author(s):

Heriyanto Heriyanto ◽

Tenia Wahyuningrum ◽

Gita Fadila Fitriana

Keyword(s):

Feature Extraction ◽

Speech Recognition ◽

Female Voice ◽

Dominant Weight ◽

Accuracy Rate ◽

Male And Female ◽

The Right ◽

Mel Frequency Cepstral Coefficient ◽

Selection Of

This study investigates the sound of Hanacaraka in Javanese to select the best frame feature in checking the reading sound. Selection of the right frame feature is needed in speech recognition because certain frames have accuracy at their dominant weight, so it is necessary to match frames with the best accuracy. Common and widely used feature extraction models include the Mel Frequency Cepstral Coefficient (MFCC). The MFCC method has an accuracy of 50% to 60%. This research uses MFCC and the selection of Dominant Weight features for the Javanese language script sound Hanacaraka which produces a frame and cepstral coefficient as feature extraction. The use of the cepstral coefficient ranges from 0 to 23 or as many as 24 cepstral coefficients. In comparison, the captured frame consists of 0 to 10 frames or consists of eleven frames. A sound sampling of 300 recorded voice sampling was tested on 300 voice recordings of both male and female voice recordings. The frequency used is 44,100 kHz 16-bit stereo. The accuracy results show that the MFCC method with the ninth frame selection has a higher accuracy rate of 86% than other frames.

Download Full-text

Real-Time Indonesian Language Speech Recognition with MFCC Algorithms and Python-Based SVM

IJITEE (International Journal of Information Technology and Electrical Engineering) ◽

10.22146/ijitee.49426 ◽

2019 ◽

Vol 3 (2) ◽

pp. 55

Author(s):

Wening Mustikarini ◽

Risanuri Hidayat ◽

Agus Bejo

Keyword(s):

Speech Recognition ◽

Recognition Rate ◽

Recognition System ◽

Training Data ◽

Support Vector ◽

Data Set ◽

Human Voice ◽

Average Accuracy ◽

Speech Data ◽

Mel Frequency Cepstral Coefficient

Abstract — Automatic Speech Recognition (ASR) is a technology that uses machines to process and recognize human voice. One way to increase recognition rate is to use a model of language you want to recognize. In this paper, a speech recognition application is introduced to recognize words "atas" (up), "bawah" (down), "kanan" (right), and "kiri" (left). This research used 400 samples of speech data, 75 samples from each word for training data and 25 samples for each word for test data. This speech recognition system was designed using Mel Frequency Cepstral Coefficient (MFCC) as many as 13 coefficients as features and Support Vector Machine (SVM) as identifiers. The system was tested with linear kernels and RBF, various cost values, and three sample sizes (n = 25, 75, 50). The best average accuracy value was obtained from SVM using linear kernels, a cost value of 100 and a data set consisted of 75 samples from each class. During the training phase, the system showed a f1-score (trade-off value between precision and recall) of 80% for the word "atas", 86% for the word "bawah", 81% for the word "kanan", and 100% for the word "kiri". Whereas by using 25 new samples per class for system testing phase, the f1-score was 76% for the "atas" class, 54% for the "bawah" class, 44% for the "kanan" class, and 100% for the "kiri" class.

Download Full-text

Automatic Speech Recognition (ASR) System for Isolated Marathi Words: using HTK

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l2651.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 3702-3705

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Viterbi Algorithm ◽

Gaussian Mixture ◽

Speech Corpus ◽

Word Level ◽

Speaker Independent ◽

Token Passing ◽

Mel Frequency Cepstral Coefficient ◽

Asr System

The present manuscript focuses on building automatic speech recognition (ASR) system for Marathi language (M-ASR) using Hidden Markov Model Toolkit (HTK). The M-ASR system gives the detail about experimentation and implementation using the HTK Toolkit. In this work total 106 speaker independent Marathi isolated words were recognized. These unique Marathi words are used to train and evaluate M-ASR system. The speech corpus (database) is created by us using isolated Marathi words uttered with mixed gender people. The system uses Mel Frequency Cepstral Coefficient (MFCC) for the purpose of extracting features using Gaussian mixture model (GMM). Viterbi algorithm based on token passing is used for decoding to recognize unknown utterances. The proposed M-ASR system is speaker independent. The proposed system has reported 96.23% word level recognition accuracy.

Download Full-text

Isolated Word based Spoken Dialogue System using Odia Phones

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813666191218114024 ◽

2019 ◽

Vol 13 ◽

Author(s):

Basanta Kuamr Swain ◽

Sanghamitra Mohanty ◽

Chiranji Lal Chowdhary

Keyword(s):

Speech Recognition ◽

Speaker Verification ◽

Dialogue System ◽

Accuracy Rate ◽

Spoken Dialogue ◽

Spoken Dialogue System ◽

Fuzzy C Means ◽

Average Accuracy ◽

Isolated Word ◽

Speech Recognition Engine

: In this research paper, we have developed a spoken dialogue system using Odia phone set. We have also added additional security feature to our developed spoken dialogue system by integrating with speaker verification module, which allows the services to only genuine users. The spoken dialogue system can give the bouquet of services relating to opening of frequently usage applications, files and folders that are either installed or stored in user’s computers. The spoken dialogue system also responds to the users in synthesized speech mode relating to the service. The spoken dialogue system can be used to keep the desktop of computer from free of clutter. We have used HMM based Odia isolated word speech recognition engine and fuzzy c-means based speaker verification module in development of spoken dialogue system. The accuracy of Odia speech recognition engine is found as 78.22 % and 62.31% for seen and unseen users respectively and the average accuracy rate of speaker verification module is found as 66.2%.

Download Full-text

Hierarchical Phoneme Classification for Improved Speech Recognition

Applied Sciences ◽

10.3390/app11010428 ◽

2021 ◽

Vol 11 (1) ◽

pp. 428

Author(s):

Donghoon Oh ◽

Jeong-Sik Park ◽

Ji-Hwan Kim ◽

Gil-Jin Jang

Keyword(s):

Speech Recognition ◽

Language Processing ◽

Confusion Matrix ◽

Critical Factor ◽

Recognition System ◽

Classification Performance ◽

Language Models ◽

Successful Implementation ◽

Phoneme Classification ◽

Improved Performance

Speech recognition consists of converting input sound into a sequence of phonemes, then finding text for the input using language models. Therefore, phoneme classification performance is a critical factor for the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics is still a challenging problem even for state-of-the-art classification methods, and the classification errors are hard to be recovered in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method to exploit more suitable recognition models to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Using automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. According to the results of a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% and 71.7% for the baseline and proposed hierarchical models, showing a 2.2% overall improvement.

Download Full-text

Speaker-Independent Visual Speech Recognition with the Inception V3 Model

2021 IEEE Spoken Language Technology Workshop (SLT) ◽

10.1109/slt48900.2021.9383540 ◽

2021 ◽

Author(s):

Timothy Israel Santos ◽

Andrew Abel ◽

Nick Wilson ◽

Yan Xu

Keyword(s):

Speech Recognition ◽

Visual Speech ◽

Visual Speech Recognition ◽

Speaker Independent

Download Full-text

In search for the relevant parameters for speaker independent speech recognition

IEEE International Conference on Acoustics Speech and Signal Processing ◽

10.1109/icassp.1993.319403 ◽

1993 ◽

Cited By ~ 4

Author(s):

J. Smolders ◽

D. Van Compernolle

Keyword(s):

Speech Recognition ◽

Speaker Independent

Download Full-text

A Diagonally Weighted Binary Memristor Crossbar Architecture Based on Multilayer Neural Network for Better Accuracy Rate in Speech Recognition Application

Advances in Electrical and Computer Engineering ◽

10.4316/aece.2019.02010 ◽

2019 ◽

Vol 19 (2) ◽

pp. 75-82

Author(s):

M.-H. VO

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Accuracy Rate ◽

Crossbar Architecture ◽

Memristor Crossbar

Download Full-text

Comparison of Two Forced Alignments Systems for Aligning Bribri Speech

CLEI electronic journal ◽

10.19153/cleiej.20.1.2 ◽

2017 ◽

Author(s):

Sofía Flores Solórzano ◽

Rolando Coto-Solano

Keyword(s):

Natural Language Processing ◽

Speech Recognition ◽

Language Processing ◽

Indigenous Languages ◽

Future Research ◽

Automated Speech Recognition ◽

Recognition Systems ◽

Para Determinar ◽

Future Work ◽

Manual Correction

Abstract: Forced alignment provides drastic savings in time when aligning speech recordings and is particularly useful for the study of Indigenous languages, which are severely under-resourced in corpora and models. Here we compare two forced alignment systems, FAVE-align and EasyAlign, to determine which one provides more precision when processing running speech in the Chibchan language Bribri. We aligned a segment of a story narrated in Bribri and compared the errors in finding the center of the words and the edges of phonemes when compared with the manual correction. FAVE-align showed better performance: It has an error of 7% compared to 24% with EasyAlign when finding the center of words, and errors of 22~24 ms when finding the edges of phonemes, compared to errors of 86~130 ms with EasyAlign. In addition to this, EasyAlign failed to detect 7% of phonemes, while also inserting 58 spurious phones into the transcription. Future research includes verifying these results for other genres and other Chibchan languages. Finally, these results provide additional evidence for the applicability of natural language processing methods to Chibchan languages and point to future work such as the construction of corpora and the training of automated speech recognition systems. Spanish Abstract: El alineamiento forzado provee un ahorro drástico de tiempo al alinear grabaciones del habla, y es útil para el estudio de las lenguas indígenas, las cuales cuentan con pocos recursos para generar corpus y modelos computacionales. Aquí comparamos dos sistemas de alineamiento, FAVE-align e EasyAlign, para determinar cuál provee mayor precisión al alinear habla en la lengua chibcha bribri. Alineamos una narración y comparamos el error al tratar de encontrar el centro de las palabras y los bordes de los fonemas con sus equivalentes en una corrección manual. FAVE-align tuvo mejor rendimiento, con un error de 7% comparado con 24% de EasyAlign para el centro de las palabras, y con errores de 22~24 ms para el borde de los fonemas, comparado con 86~130 ms con EasyAlign. Además, EasyAlign no pudo detectar el 7% de los fonemas, y al mismo tiempo añadió 58 sonidos espurios a la transcripción. Como trabajo futuro verificaremos estos resultados con otros géneros hablados y con otras lenguas chibchas. Finalmente, estos resultados comprueban la aplicabilidad de los métodos de procesamiento de lengua natural a las lenguas chibchas, y apuntan a trabajo futuro en la construcción de corpus y el entrenamiento de sistemas de reconocimiento automático del habla.

Download Full-text