Two-stage phone recognition system using articulatory and spectral features

Author(s):  
K E Manjunath ◽  
K. Sreenivasa Rao ◽  
M Gurunath Reddy


Author(s):  
Manjunath K. E. ◽  
Srinivasa Raghavan K. M. ◽  
K. Sreenivasa Rao ◽  
Dinesh Babu Jayagopi ◽  
V. Ramasubramanian

In this study, we evaluate and compare two different approaches for multilingual phone recognition in code-switched and non-code-switched scenarios. The first approach is a front-end Language Identification (LID) system switched to one of several monolingual phone recognizers (LID-Mono), each trained individually on one of the languages present in the multilingual dataset. In the second approach, a common multilingual phone-set derived from the International Phonetic Alphabet (IPA) transcription of the multilingual dataset is used to develop a Multilingual Phone Recognition System (Multi-PRS). The bilingual code-switching experiments are conducted using Kannada and Urdu. In the first approach, LID is performed using state-of-the-art i-vectors. Both the monolingual and the multilingual phone recognition systems are trained using Deep Neural Networks. The performance of the LID-Mono and Multi-PRS approaches is compared and analysed in detail. The Multi-PRS approach is found to outperform the more conventional LID-Mono approach in both code-switched and non-code-switched scenarios. For code-switched speech, the effect of the length of the segments used to perform LID on the performance of the LID-Mono system is studied by varying the window size from 500 ms to 5.0 s, as well as using the full utterance. The LID-Mono approach depends heavily on the accuracy of the LID system, and LID errors cannot be recovered downstream. In contrast, the Multi-PRS system, by virtue of not requiring front-end LID switching and being designed around a common multilingual phone-set derived from several languages, is not constrained by LID accuracy, and hence performs effectively on both code-switched and non-code-switched speech, offering lower Phone Error Rates than the LID-Mono system.
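
The contrast between the two pipelines can be made concrete with a minimal Python sketch. The recognizers and the LID front-end below are stand-in stubs (all function names and placeholder outputs are hypothetical, not taken from the paper); in the actual systems the LID front-end uses i-vectors and both recognizers are DNN-based.

# Illustrative sketch, not the authors' code: contrasts the LID-Mono
# pipeline with the Multi-PRS pipeline described above.

def identify_language(segment):          # stand-in for the i-vector LID
    return "kannada" if segment.endswith("_kn") else "urdu"

def mono_recognizer(language, segment):  # per-language DNN PRS (stub)
    return [f"{language}_phone"]         # placeholder phone sequence

def multi_recognizer(segment):           # single DNN over the common IPA set
    return ["ipa_phone"]                 # placeholder phone sequence

def lid_mono(segment):
    # Stage 1: front-end LID picks the language; stage 2: the segment is
    # routed to that language's monolingual recognizer. An LID error here
    # cannot be recovered downstream.
    lang = identify_language(segment)
    return mono_recognizer(lang, segment)

def multi_prs(segment):
    # One recognizer trained on the pooled, IPA-derived phone set;
    # there is no language switch, so no LID error to propagate.
    return multi_recognizer(segment)

if __name__ == "__main__":
    for seg in ["utt1_kn", "utt2_ur"]:
        print(seg, lid_mono(seg), multi_prs(seg))

The structural point is visible even in the stub: Multi-PRS has one fewer decision in the pipeline, which is exactly why its errors cannot compound the way LID-Mono's can on code-switched input.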


2013 ◽  
Vol 6 (1) ◽  
pp. 266-271
Author(s):  
Anurag Upadhyay ◽  
Chitranjanjit Kaur

This paper addresses the problem of speech recognition, with the goal of identifying various modes of speech data. Speaker sounds are the acoustic sounds of speech. Statistical models of speech have been widely used for speech recognition under neural networks. In this paper we propose, and attempt to justify, a new model in which speech coarticulation (the effect of phonetic context on speech sounds) is modeled explicitly within a statistical framework. We study phone recognition using recurrent neural networks and SOUL neural networks. A general framework for recurrent neural networks and considerations for network training are discussed in detail. The SOUL NN clusters the large vocabulary, compressing huge speech data sets. This project also covers different Indian languages uttered by different speakers in different modes such as aggressive, happy, sad, and angry. Many alternative energy measures and training methods are proposed and implemented. A speaker-independent phone recognition rate of 82% with a 25% frame error rate has been achieved on the neural database. Neural speech recognition experiments on the NTIMIT database result in a phone recognition rate of 68% correct. The results of this research are competitive with the best results reported in the literature.
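
To make the recurrent modelling concrete, here is a minimal sketch (Python with NumPy, toy dimensions, random untrained weights; nothing here comes from the paper) of frame-level phone classification with an Elman-style recurrent network: each acoustic frame updates a hidden state that carries phonetic context forward, and a per-frame softmax over the phone set yields a label.

# Minimal sketch under assumed toy dimensions: an Elman RNN forward pass
# that assigns one phone label per acoustic frame.
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_hidden, n_phones = 13, 32, 40     # e.g. MFCC dim, hidden units, phone set

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_feat))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_phones, n_hidden))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def recognize(frames):
    """Return the argmax phone index for each frame."""
    h = np.zeros(n_hidden)
    phones = []
    for x in frames:
        h = np.tanh(W_xh @ x + W_hh @ h)    # hidden state carries context
        phones.append(int(np.argmax(softmax(W_hy @ h))))
    return phones

utterance = rng.normal(size=(100, n_feat))  # 100 frames of toy features
print(recognize(utterance)[:10])

The recurrence `W_hh @ h` is what lets the network model coarticulation implicitly: the label for a frame depends on the frames that preceded it.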


2021 ◽  
Author(s):  
Talieh Seyed Tabtabae

Automatic Emotion Recognition (AER) is an emerging research area in the Human-Computer Interaction (HCI) field. As computers become more popular every day, the study of interaction between humans (users) and computers is attracting more attention. In order to have a more natural and friendly interface between humans and computers, it would be beneficial to give computers the ability to recognize situations the same way a human does. Equipped with an emotion recognition system, computers would be able to recognize their users' emotional state and react appropriately. In today's HCI systems, machines can recognize the speaker and the content of the speech, using speech recognition and speaker identification techniques. If machines are also equipped with emotion recognition techniques, they can know "how it is said", react more appropriately, and make the interaction more natural. One of the most important human communication channels is the auditory channel, which carries speech and vocal intonation. In fact, people can perceive each other's emotional state by the way they talk. Therefore, in this work the speech signals are analyzed in order to build an automatic system that recognizes the human emotional state. Six discrete emotional states are considered and categorized in this research: anger, happiness, fear, surprise, sadness, and disgust. A set of novel spectral features is proposed in this contribution. Two approaches are applied and their results compared. In the first approach, all the acoustic features are extracted from consecutive frames along the speech signals. The statistical values of the features constitute the feature vectors. A Support Vector Machine (SVM), a relatively new approach in the field of machine learning, is used to classify the emotional states. In the second approach, spectral features are extracted from non-overlapping, logarithmically-spaced frequency sub-bands. In order to make use of all the extracted information, sequence-discriminant SVMs are adopted. The empirical results show that the employed techniques are very promising.
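
As an illustration of the first approach, the following sketch uses scikit-learn's SVC on synthetic feature vectors (the feature dimension, data, and labels are invented for illustration, not taken from the thesis): utterance-level statistics of spectral features are classified into the six emotions with an SVM.

# Hedged sketch of approach one: SVM over utterance-level feature
# statistics. All data below is synthetic stand-in material.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
EMOTIONS = ["anger", "happiness", "fear", "surprise", "sadness", "disgust"]

# Pretend features: per-utterance mean/std of assumed spectral measures.
X = rng.normal(size=(300, 24))
y = rng.integers(0, len(EMOTIONS), size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("accuracy on synthetic data:", clf.score(X_te, y_te))
print([EMOTIONS[i] for i in clf.predict(X_te[:5])])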


2020 ◽  
Vol 6 (11) ◽  
pp. 120
Author(s):  
Chengzhang Zhong ◽  
Amy R. Reibman ◽  
Hansel A. Mina ◽  
Amanda J. Deering

A majority of foodborne illnesses result from inappropriate food handling practices. One proven practice to reduce pathogens is to perform effective hand-hygiene before all stages of food handling. In this paper, we design a multi-camera system that uses video analytics to recognize hand-hygiene actions, with the goal of improving hand-hygiene effectiveness. Our proposed two-stage system processes untrimmed video from both egocentric and third-person cameras. In the first stage, a low-cost coarse classifier efficiently localizes the hand-hygiene period; in the second stage, more complex refinement classifiers recognize seven specific actions within the hand-hygiene period. We demonstrate that our two-stage system has significantly lower computational requirements without a loss of recognition accuracy. Specifically, the computationally complex refinement classifiers process less than 68% of the untrimmed videos, and we anticipate further computational gains in videos that contain a larger fraction of non-hygiene actions. Our results demonstrate that a carefully designed video action recognition system can play an important role in improving hand hygiene for food safety.
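
The two-stage control flow can be sketched in a few lines of Python (the classifiers below are trivial stubs and the "motion" score is a hypothetical stand-in, not the paper's models): stage one cheaply decides whether a segment belongs to the hand-hygiene period, and the expensive stage-two classifier runs only on those segments.

# Sketch of the two-stage idea with stub classifiers.

def coarse_is_hygiene(segment):          # low-cost stage-1 stub
    return segment["motion"] > 0.5       # hypothetical motion score

def refine_action(segment):              # complex stage-2 stub
    return "rub_palms"                   # stands in for one of the seven actions

def two_stage(video_segments):
    labels = []
    for seg in video_segments:
        if coarse_is_hygiene(seg):       # stage 2 runs only when stage 1 fires
            labels.append(refine_action(seg))
        else:
            labels.append("non-hygiene")
    return labels

segments = [{"motion": 0.9}, {"motion": 0.1}, {"motion": 0.7}]
print(two_stage(segments))

This gating is the source of the reported savings: the more non-hygiene footage an untrimmed video contains, the fewer segments ever reach the costly refinement stage.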

