A Study on Speaker Independent Emotion Recognition from Speech Signals

Author(s):  
B. Rajasekhar ◽  
M. Kamaraju ◽  
V. Sumalatha


2018 ◽  
Vol 33 (3) ◽  
pp. 229-246
Author(s):  
Đào Thị Lệ Thủy ◽  
Trinh Van Loan ◽  
Nguyen Hong Quang

This paper presents the results of GMM-based recognition for four basic emotions of Vietnamese speech: neutral, sadness, anger, and happiness. The characteristic parameters of these emotions are extracted from the speech signals and divided into different parameter sets for the experiments. The experiments cover speaker-dependent and speaker-independent as well as content-dependent and content-independent recognition. The results show that the recognition scores are rather high when the full combination of parameters is used: MFCC with its first and second derivatives, fundamental frequency, energy, formants with their corresponding bandwidths, spectral characteristics, and F0 variants. On average, the speaker-dependent and content-dependent recognition score is 89.21%. The average score is 82.27% for speaker-dependent and content-independent recognition, 70.35% for speaker-independent and content-dependent recognition, and 66.99% for speaker-independent and content-independent recognition. Information on F0 significantly increased the recognition score.
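To make the pipeline described in this abstract concrete, the sketch below shows a minimal GMM-based emotion classifier in the same spirit: one diagonal-covariance GMM per emotion, trained on frame-level vectors of MFCCs, their first and second derivatives, energy, and F0, with the predicted emotion being the model with the highest average log-likelihood. The librosa and scikit-learn calls, the 32-component mixtures, and the yin-based F0 extraction are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of GMM-based emotion classification (assumed libraries: librosa, scikit-learn).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Frame-level features: MFCC, delta, delta-delta, RMS energy, and F0."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (n_mfcc, T)
    d1 = librosa.feature.delta(mfcc)                            # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)                   # second derivative
    energy = librosa.feature.rms(y=y)                           # (1, T)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)               # (T,)
    T = min(mfcc.shape[1], energy.shape[1], len(f0))            # align frame counts
    feats = np.vstack([mfcc[:, :T], d1[:, :T], d2[:, :T],
                       energy[:, :T], f0[None, :T]])
    return feats.T                                              # (T, feature_dim)

def train_gmms(files_by_emotion, n_components=32):
    """files_by_emotion: dict mapping an emotion label to a list of wav paths."""
    models = {}
    for emotion, paths in files_by_emotion.items():
        X = np.vstack([extract_features(p) for p in paths])
        models[emotion] = GaussianMixture(n_components=n_components,
                                          covariance_type="diag",
                                          max_iter=200).fit(X)
    return models

def predict_emotion(models, wav_path):
    """Return the emotion whose GMM gives the highest average log-likelihood."""
    X = extract_features(wav_path)
    return max(models, key=lambda emo: models[emo].score(X))
```

In this setup, the speaker-independent and content-independent conditions reported in the abstract correspond to holding entire speakers or sentences out of the training partition before fitting the per-emotion models.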


2021 ◽  
Author(s):  
Talieh Seyed Tabtabae

Automatic Emotion Recognition (AER) is an emerging research area in the Human-Computer Interaction (HCI) field. As computers become more and more widespread, the study of interaction between humans (users) and computers attracts increasing attention. In order to have a more natural and friendly interface between humans and computers, it would be beneficial to give computers the ability to recognize situations the same way a human does. Equipped with an emotion recognition system, computers would be able to recognize their users' emotional state and react appropriately. In today's HCI systems, machines can recognize the speaker and the content of the speech using speech recognition and speaker identification techniques. If machines are also equipped with emotion recognition techniques, they can know "how it is said," react more appropriately, and make the interaction more natural. One of the most important human communication channels is the auditory channel, which carries speech and vocal intonation. In fact, people can perceive each other's emotional state by the way they talk. Therefore, in this work the speech signals are analyzed in order to build an automatic system that recognizes the human emotional state. Six discrete emotional states are considered and categorized in this research: anger, happiness, fear, surprise, sadness, and disgust. A set of novel spectral features is proposed in this contribution. Two approaches are applied and the results are compared. In the first approach, all the acoustic features are extracted from consecutive frames along the speech signals, and the statistical values of the features constitute the feature vectors. A Support Vector Machine (SVM), a relatively recent approach in the field of machine learning, is used to classify the emotional states. In the second approach, spectral features are extracted from non-overlapping, logarithmically spaced frequency sub-bands. In order to make use of all the extracted information, sequence-discriminant SVMs are adopted. The empirical results show that the employed techniques are very promising.
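The first approach described above (per-frame acoustic features summarized by utterance-level statistics and classified with an SVM) can be sketched roughly as follows. The particular statistics (mean, standard deviation, min, max), the RBF kernel, and the scikit-learn API are assumptions made for illustration, not the thesis's exact feature set or configuration.

```python
# Rough sketch: utterance-level statistics over frame features, classified with an SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def utterance_vector(frame_feats):
    """frame_feats: (T, dim) array of per-frame features -> fixed-length statistics vector."""
    return np.concatenate([frame_feats.mean(axis=0),
                           frame_feats.std(axis=0),
                           frame_feats.min(axis=0),
                           frame_feats.max(axis=0)])

def train_svm(frame_feature_list, labels):
    """frame_feature_list: list of (T_i, dim) arrays; labels: emotion label per utterance."""
    X = np.stack([utterance_vector(f) for f in frame_feature_list])
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=10.0, gamma="scale"))
    return clf.fit(X, labels)
```

The second approach in the abstract, sequence-discriminant SVMs over sub-band features, operates on the frame sequence directly rather than on these pooled statistics, and is not reproduced here.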


2015 ◽  
Vol 14 ◽  
pp. 57-76
Author(s):  
Hasrul Mohd Nazid ◽  
Hariharan Muthusamy ◽  
Vikneswaran Vijean ◽  
Sazali Yaacob


Author(s):  
Revathi A. ◽  
Sasikaladevi N.

This chapter on multi-speaker independent emotion recognition covers the use of perceptual features with filters spaced on the equivalent rectangular bandwidth (ERB) and Bark scales, a vector quantization (VQ) classifier for classifying emotion groups, and an artificial neural network trained with the back-propagation algorithm for classifying emotions within a group. Performance can be improved by using a large amount of data for each emotion to adequately train the system. Even with the limited data set, the proposed system provided consistently better accuracy for the perceptual features with critical-band analysis done on the ERB scale.
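A rough sketch of the VQ group-classification stage mentioned above is given below: one KMeans codebook per emotion group, with a test utterance assigned to the group whose codebook quantizes its frames with the least average distortion. The ERB/Bark perceptual front end and the back-propagation ANN used within a group are not reproduced here, and the codebook size of 64 and the scikit-learn KMeans implementation are assumptions.

```python
# Sketch of a VQ classifier: per-group codebooks, decision by minimum quantization distortion.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(frames_by_group, codebook_size=64):
    """frames_by_group: dict mapping a group label to an (N, dim) array of feature frames."""
    return {group: KMeans(n_clusters=codebook_size, n_init=5).fit(X)
            for group, X in frames_by_group.items()}

def classify_group(codebooks, frames):
    """Pick the group whose codebook quantizes the test frames with least average distortion."""
    def distortion(km):
        # Distance from each frame to its nearest codeword, averaged over all frames.
        d = np.min(np.linalg.norm(frames[:, None, :] - km.cluster_centers_[None, :, :],
                                  axis=-1), axis=1)
        return d.mean()
    return min(codebooks, key=lambda group: distortion(codebooks[group]))
```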


Sensors ◽  
2020 ◽  
Vol 20 (21) ◽  
pp. 6008 ◽  
Author(s):  
Misbah Farooq ◽  
Fawad Hussain ◽  
Naveed Khan Baloch ◽  
Fawad Riasat Raja ◽  
Heejung Yu ◽  
...  

Speech emotion recognition (SER) plays a significant role in human–machine interaction. Recognizing emotion from speech and classifying it precisely is a challenging task because a machine does not understand the context of an utterance. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they are not efficient enough to accurately depict the emotional state of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotion datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). In speaker-dependent experiments, the proposed method achieves accuracies of 95.10% on Emo-DB, 82.10% on SAVEE, 83.80% on IEMOCAP, and 81.30% on RAVDESS. Moreover, the method yields the best speaker-independent SER results when compared with existing handcrafted-feature-based SER approaches.
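The overall pipeline (pretrained deep features, feature selection, conventional classifier) can be approximated as in the sketch below. The specific choices here are stand-ins, not the paper's method: a frozen torchvision ResNet-18 applied to mel-spectrogram "images" plays the role of the pretrained DCNN, and an ANOVA-based SelectKBest step stands in for the correlation-based feature selection.

```python
# Illustrative extract-select-classify pipeline (assumed libraries: torch, torchvision, librosa, scikit-learn).
import numpy as np
import torch
import torchvision.models as models
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Frozen pretrained CNN used purely as a feature extractor (stand-in for the paper's DCNN).
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()          # drop the classification head; keep 512-d embeddings
resnet.eval()

def deep_features(wav_path, sr=16000):
    """Mel-spectrogram -> 3-channel 'image' -> 512-d embedding from the frozen CNN."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    img = torch.tensor(mel, dtype=torch.float32)[None].repeat(3, 1, 1)   # (3, 128, T)
    with torch.no_grad():
        emb = resnet(img[None])           # (1, 512)
    return emb.squeeze(0).numpy()

def build_classifier(wav_paths, labels, k_features=128):
    """Extract embeddings, keep the k most discriminative dimensions, train an SVM."""
    X = np.stack([deep_features(p) for p in wav_paths])
    clf = make_pipeline(StandardScaler(),
                        SelectKBest(f_classif, k=k_features),   # stand-in for correlation-based selection
                        SVC(kernel="rbf", C=10.0))
    return clf.fit(X, labels)
```

The point of the sketch is only the extract-select-classify structure; in practice the embeddings could equally come from a network pretrained on audio, and the random forest, k-nearest neighbors, and neural network classifiers mentioned in the abstract would slot in where the SVM does.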

