An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition

Shreya Kumar; Swarnalaxmi Thiruvenkadam

doi:10.3991/ijes.v9i2.22983

An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition

International Journal of Recent Contributions from Engineering Science & IT (iJES) ◽

10.3991/ijes.v9i2.22983 ◽

2021 ◽

Vol 9 (2) ◽

pp. 87

Author(s):

Shreya Kumar ◽

Swarnalaxmi Thiruvenkadam

Keyword(s):

Feature Extraction ◽

Emotion Recognition ◽

Prediction Accuracy ◽

Speech Emotion Recognition ◽

German Language ◽

Spectral Contrast ◽

Recognition Systems ◽

The Impact ◽

Contrast Feature

Feature extraction is an integral part of speech emotion recognition. Some emotions become indistinguishable from others because their features closely resemble one another, which results in low prediction accuracy. This paper analyses the impact of the spectral contrast feature on the accuracy for such emotions. The RAVDESS dataset was chosen for this study. The SAVEE, CREMA-D and JL corpus datasets were also used to test its performance across different English accents, and the EmoDB dataset was used to study its performance in the German language. The spectral contrast feature increased the prediction accuracy of speech emotion recognition systems because it performs well in distinguishing emotions with significant differences in arousal levels; this is discussed in detail.
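
As a rough illustration of what a spectral contrast feature measures, the sketch below computes, for one frame's magnitude spectrum, the log difference between the strongest and weakest bins in each sub-band. It is a simplified version under stated assumptions, not the paper's extraction pipeline: the function name `spectral_contrast`, the equal-width bands (the usual formulation uses octave-spaced bands), and the parameters `n_bands`, `alpha` and `eps` are all hypothetical choices made for this example.

```python
import math

def spectral_contrast(magnitudes, n_bands=4, alpha=0.2, eps=1e-10):
    """Per-band log(peak) - log(valley) for one frame's magnitude spectrum.

    Assumption: equal-width bands for simplicity; octave-spaced bands are
    the more common choice in practice.
    """
    band_size = len(magnitudes) // n_bands
    if band_size == 0:
        raise ValueError("need at least one spectral bin per band")
    contrasts = []
    for b in range(n_bands):
        start = b * band_size
        # The last band absorbs any leftover bins.
        end = len(magnitudes) if b == n_bands - 1 else start + band_size
        band = sorted(magnitudes[start:end])
        # Average the weakest and strongest alpha-fraction of bins.
        k = max(1, int(round(alpha * len(band))))
        valley = sum(band[:k]) / k
        peak = sum(band[-k:]) / k
        contrasts.append(math.log(peak + eps) - math.log(valley + eps))
    return contrasts

flat = [1.0] * 16                    # no peaks: contrast near zero
peaky = [0.1, 0.1, 0.1, 10.0] * 4    # one strong bin per band: high contrast
print(spectral_contrast(flat))
print(spectral_contrast(peaky))
```

A flat spectrum yields values near zero and a spectrum with strong harmonic peaks yields large values, which fits the abstract's observation that the feature helps separate emotions that differ in arousal level.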

No Sample Left Behind: Towards a Comprehensive Evaluation of Speech Emotion Recognition Systems

10.21437/smm.2019-3 ◽

2019 ◽

Cited By ~ 1

Author(s):

Pablo Riera ◽

Luciana Ferrer ◽

Agustín Gravano ◽

Lara Gauder

Keyword(s):

Emotion Recognition ◽

Comprehensive Evaluation ◽

Speech Emotion Recognition ◽

Left Behind ◽

Recognition Systems

A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

Mathematical Problems in Engineering ◽

10.1155/2014/749604 ◽

2014 ◽

Vol 2014 ◽

pp. 1-7 ◽

Cited By ~ 21

Author(s):

Chenchen Huang ◽

Wei Gong ◽

Wenlong Fu ◽

Dongyu Feng

Keyword(s):

Feature Extraction ◽

Emotion Recognition ◽

Recognition Rate ◽

Original Method ◽

Speech Emotion Recognition ◽

High Dimensional ◽

Svm Classifier ◽

Multiple Classifier System ◽

Classifier System ◽

Multiple Classifier

Feature extraction is a very important part of speech emotion recognition. To address the feature extraction problem, this paper proposes a new method that uses deep belief networks (DBNs) in a DNN to extract emotional features from the speech signal automatically. A five-layer DBN is trained to extract speech emotion features, and multiple consecutive frames are combined to form a high-dimensional feature. The features produced by the trained DBN are the input to a nonlinear SVM classifier, which yields a multiple-classifier speech emotion recognition system. The system's speech emotion recognition rate reached 86.5%, which was 7% higher than the original method.
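
The step of combining multiple consecutive frames into one high-dimensional feature can be sketched as follows. This is only an illustration of frame stacking in general: the helper name `stack_frames`, the symmetric `context` window and the edge-repetition padding are assumptions, since the abstract does not say how many frames are combined or how the boundaries are handled.

```python
def stack_frames(frames, context=2):
    """Concatenate each frame with its `context` neighbours on both sides.

    Assumption: frames at the sequence edges are repeated as padding.
    """
    n = len(frames)
    stacked = []
    for i in range(n):
        row = []
        for offset in range(-context, context + 1):
            j = min(max(i + offset, 0), n - 1)  # clamp index to valid range
            row.extend(frames[j])
        stacked.append(row)
    return stacked

frames = [[1, 2], [3, 4], [5, 6]]       # three frames, two features each
print(stack_frames(frames, context=1))  # each row now has 6 values
```

With a window of `2 * context + 1` frames, the stacked dimension is the per-frame dimension multiplied by the window length, which is how a modest per-frame feature becomes high dimensional.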

Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment

Sensors ◽

10.3390/s20082297 ◽

2020 ◽

Vol 20 (8) ◽

pp. 2297

Author(s):

Zhen-Tao Liu ◽

Bao-Han Wu ◽

Dan-Yun Li ◽

Peng Xiao ◽

Jun-Wei Mao

Keyword(s):

Emotion Recognition ◽

Feature Selection Method ◽

Sampling Technique ◽

Small Sample ◽

Speech Emotion Recognition ◽

Gradient Boosting ◽

Data Imbalance ◽

The Arts ◽

The Impact ◽

Sample Environment

Speech emotion recognition often encounters the problems of data imbalance and redundant features in different application scenarios. Researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small sample environment is proposed. A data imbalance processing method based on the selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, a feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which can exclude redundant features that possess poor emotional representation. Experiments on speech emotion recognition using three databases (CASIA, Emo-DB and SAVEE) show that the method obtains average recognition accuracies of 90.28% (CASIA), 75.00% (SAVEE) and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition, which is superior to some state-of-the-art works.
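
For readers unfamiliar with synthetic minority over-sampling, the sketch below shows the interpolation idea behind plain SMOTE: a new sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is not the SISMOTE variant proposed in the paper, whose selection rule is not described in the abstract; the function name `smote_samples` and the parameters `k` and `seed` are hypothetical.

```python
import random

def smote_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by plain SMOTE interpolation.

    Assumption: `minority` holds at least two distinct feature vectors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for point in smote_samples(minority, n_new=3):
    print(point)
```

Because every synthetic point lies between two existing minority samples, the minority class grows without leaving the region it already occupies.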

Feature extraction algorithms to improve the speech emotion recognition rate

International Journal of Speech Technology ◽

10.1007/s10772-020-09672-4 ◽

2020 ◽

Vol 23 (1) ◽

pp. 45-55 ◽

Cited By ~ 7

Author(s):

Anusha Koduru ◽

Hima Bindu Valiveti ◽

Anil Kumar Budati

Keyword(s):

Feature Extraction ◽

Emotion Recognition ◽

Recognition Rate ◽

Speech Emotion Recognition

Deep Convolutional Neural Networks for Feature Extraction in Speech Emotion Recognition

Human-Computer Interaction. Recognition and Interaction Technologies - Lecture Notes in Computer Science ◽

10.1007/978-3-030-22643-5_9 ◽

2019 ◽

pp. 117-132 ◽

Cited By ~ 1

Author(s):

Panikos Heracleous ◽

Yasser Mohammad ◽

Akio Yoneyama

Keyword(s):

Neural Networks ◽

Feature Extraction ◽

Emotion Recognition ◽

Convolutional Neural Networks ◽

Speech Emotion Recognition ◽

Deep Convolutional Neural Networks

A feature extraction scheme based on enhanced wavelet coefficients for Speech Emotion Recognition

2014 IEEE 57th International Midwest Symposium on Circuits and Systems (MWSCAS) ◽

10.1109/mwscas.2014.6908609 ◽

2014 ◽

Cited By ~ 4

Author(s):

C. Shahnaz ◽

S. Sultana

Keyword(s):

Feature Extraction ◽

Emotion Recognition ◽

Speech Emotion Recognition ◽

Wavelet Coefficients ◽

Extraction Scheme

Convolutional Recurrent Neural Networks Based Speech Emotion Recognition

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9321 ◽

2020 ◽

Vol 17 (8) ◽

pp. 3786-3789

Author(s):

P. Gayathri ◽

P. Gowri Priya ◽

L. Sravani ◽

Sandra Johnson ◽

Visanth Sampath

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Recurrent Neural Networks ◽

Machine Learning Techniques ◽

Speech Emotion Recognition ◽

Emotional Information ◽

Feature Representations ◽

Emotional Factors ◽

Learning Techniques ◽

The Impact

Recognition of emotions is an aspect of speech recognition that is gaining more attention, and the need for it is growing rapidly. Although there are methods to identify emotion using machine learning techniques, we assume in this paper that calculating deltas and delta-deltas for customized features not only preserves effective emotional information but also reduces the impact of irrelevant emotional factors, leading to a reduction in misclassification. Furthermore, speech emotion recognition (SER) often suffers from silent frames and emotionally irrelevant frames. Meanwhile, the attention mechanism has demonstrated exceptional performance in learning relevant feature representations for specific tasks. Inspired by this, we propose an attention-based convolutional recurrent neural network (ACRNN) to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Finally, experimental results show the feasibility of the proposed method, which attains state-of-the-art performance in terms of unweighted average recall.
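
Deltas and delta-deltas are first- and second-order temporal derivatives of a feature track. A minimal sketch, assuming the common regression formula and edge-repetition padding (the abstract does not say which variant the authors used; `deltas` and `width` are names chosen for this example):

```python
def deltas(seq, width=2):
    """Regression-based delta of a 1-D feature track.

    d[t] = sum_k k * (seq[t+k] - seq[t-k]) / (2 * sum_k k^2), k = 1..width,
    with indices clamped at the edges (an assumption of this sketch).
    """
    n = len(seq)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        num = 0.0
        for k in range(1, width + 1):
            ahead = seq[min(t + k, n - 1)]
            behind = seq[max(t - k, 0)]
            num += k * (ahead - behind)
        out.append(num / denom)
    return out

track = [0.0, 1.0, 2.0, 3.0, 4.0]   # one Mel band over five frames
d1 = deltas(track, width=1)          # delta
d2 = deltas(d1, width=1)             # delta-delta
print(d1)
print(d2)
```

Applying the function to every Mel band and stacking the static, delta and delta-delta tracks gives the three-channel input of the kind the abstract describes.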

Study of prosodic feature extraction for multidialectal Odia speech emotion recognition

2016 IEEE Region 10 Conference (TENCON) ◽

10.1109/tencon.2016.7848296 ◽

2016 ◽

Cited By ~ 1

Author(s):

Monorama Swain ◽

Aurobinda Routray ◽

P. Kabisatpathy ◽

Jogendra N. Kundu

Keyword(s):

Feature Extraction ◽

Emotion Recognition ◽

Speech Emotion Recognition ◽

Prosodic Feature

An Appraisal on Speech and Emotion Recognition Technologies based on Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.e5715.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 2266-2276 ◽

Cited By ~ 1

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Emotion Recognition ◽

Speech Development ◽

Speech Emotion Recognition ◽

Part Of Speech ◽

Classification Feature ◽

The Way

In earlier days, people used speech as a means of communication, conveying meaning to a listener by voice or expression. Interaction with machines, however, requires machine learning and various related methods for recognizing speech. With the voice used as a biometric of growing significance, speech has become an important area of development. In this article, we explain a variety of speech and emotion recognition techniques and compare several methods based on existing algorithms, most of them speech-based. We list and distinguish speech technologies in terms of specifications, databases, classification, feature extraction, enhancement, segmentation and the process of speech emotion recognition.

End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network

EURASIP Journal on Audio Speech and Music Processing ◽

10.1186/s13636-021-00208-5 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Duowei Tang ◽

Peter Kuppens ◽

Luc Geurts ◽

Toon van Waterschoot

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Speech Signal ◽

State Of The Art ◽

Speech Emotion Recognition ◽

Model Parameters ◽

Proposed Model ◽

End To End ◽

The Impact ◽

Temporal Dependencies

Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. Therefore, in this work, we propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, while avoiding the inherent lack of parallelisability occurring with recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e. raw audio samples vs log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information.
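
The claim that dilated causal convolutions reach a large receptive field at low cost follows from a simple count: each layer with kernel size k and dilation d extends the receptive field by (k - 1) * d samples. The sketch below works this out for doubling dilations. The kernel size and dilation schedule are assumptions for illustration only, since the abstract does not give the paper's actual configuration, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Assumed schedule: kernel size 2, dilation doubling at every layer.
for n_layers in (4, 8, 12):
    dilations = [2 ** i for i in range(n_layers)]
    print(n_layers, "layers ->", receptive_field(2, dilations), "samples")
```

With doubling dilations the receptive field grows exponentially with depth while the parameter count grows only linearly, which is why such stacks can cover long input sequences without recurrent layers.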
