Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function

Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning, and human–machine/robot interaction. In this paper, we introduce a novel speech emotion recognition method based on the Squeeze-and-Excitation ResNet (SE-ResNet) model and fed with spectrogram inputs. To overcome the limitations of state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves global accuracy rates of 83.35% and 64.92% on the publicly available RAVDESS and CREMA-D datasets, respectively. Compared with the results provided by human observers, the gains in global accuracy exceed 24%. Finally, an objective comparative evaluation against state-of-the-art techniques demonstrates accuracy gains of more than 3%.
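For readers unfamiliar with triplet losses, the following is a minimal sketch of the standard (unconstrained) triplet margin loss on plain Python lists. It is not the emotionally constrained variant from the paper, whose exact form the abstract does not give. The function names, the Euclidean distance, and the margin value of 1.0 are assumptions chosen for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not the paper's variant).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Toy 2-D embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # same emotion as the anchor, distance 1
negative = [3.0, 4.0]   # different emotion, distance 5

loss_easy = triplet_loss(anchor, positive, negative)  # already well separated
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped: violated
print(loss_easy, loss_hard)
```

In the first call the negative is already far enough away, so the triplet contributes nothing. In the second the roles are swapped and the loss grows with the size of the violation.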

Speech emotion recognition via a max-margin framework incorporating a loss function based on the Watson and Tellegen's emotion model

2009 IEEE International Conference on Acoustics, Speech and Signal Processing ◽

10.1109/icassp.2009.4960547 ◽

2009 ◽

Cited By ~ 6

Author(s):

Sungrack Yun ◽

Chang D. Yoo

Keyword(s):

Emotion Recognition ◽

Loss Function ◽

Speech Emotion Recognition ◽

Emotion Model

f-Similarity Preservation Loss for Soft Labels: A Demonstration on Cross-Corpus Speech Emotion Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015725 ◽

2019 ◽

Vol 33 ◽

pp. 5725-5732

Author(s):

Biqiao Zhang ◽

Yuqing Kong ◽

Georg Essl ◽

Emily Mower Provost

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Loss Function ◽

Deep Neural Networks ◽

Metric Learning ◽

Loss Functions ◽

Speech Emotion Recognition ◽

Subjective Data ◽

Dual Form ◽

Deep Metric Learning

In this paper, we propose a Deep Metric Learning (DML) approach that supports soft labels. DML seeks to learn representations that encode the similarity between examples through deep neural networks. DML generally presupposes that data can be divided into discrete classes using hard labels. However, some tasks, such as our exemplary domain of speech emotion recognition (SER), work with inherently subjective data, for which it may not be possible to identify a single hard label. We propose a family of loss functions, f-Similarity Preservation Loss (f-SPL), based on the dual form of f-divergence for DML with soft labels. We show that the minimizer of f-SPL preserves the pairwise label similarities in the learned feature embeddings. We demonstrate the efficacy of the proposed loss function on the task of cross-corpus SER with soft labels. Our approach, which combines f-SPL and classification loss, significantly outperforms a baseline SER system with the same structure but trained with only classification loss in most experiments. We show that the presented techniques are more robust to over-training and can learn an embedding space in which the similarity between examples is meaningful.
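As background for the "dual form of f-divergence" mentioned above, the general textbook definitions are given below. How f-SPL is built on top of them is not stated in the abstract, so this block shows only the standard definitions. The symbols $P$, $Q$, $T$ and $f^{*}$ are notation introduced here, not taken from the paper.

```latex
% f-divergence for a convex f with f(1) = 0
D_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{dP}{dQ}(x) \right) \right]

% Dual (variational) form, with f^* the convex conjugate of f
D_f(P \,\|\, Q) = \sup_{T} \Big\{ \mathbb{E}_{x \sim P}\big[ T(x) \big]
                  - \mathbb{E}_{x \sim Q}\big[ f^{*}(T(x)) \big] \Big\},
\qquad
f^{*}(t) = \sup_{u} \big\{ u t - f(u) \big\}
```

The dual form replaces the density ratio with a supremum over functions $T$, which is what makes it usable as a trainable objective when $T$ is parameterized by a network.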

Modeling variable length phoneme sequences — A step towards linguistic information for speech emotion recognition in wider world

2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII) ◽

10.1109/acii.2017.8273648 ◽

2017 ◽

Cited By ~ 2

Author(s):

Kalani Wataraka Gamage ◽

Vidhyasaharan Sethu ◽

Eliathamby Ambikairajah

Keyword(s):

Emotion Recognition ◽

Variable Length ◽

Speech Emotion Recognition ◽

Linguistic Information

End-to-end Triplet Loss based Emotion Embedding System for Speech Emotion Recognition

2020 25th International Conference on Pattern Recognition (ICPR) ◽

10.1109/icpr48806.2021.9413144 ◽

2021 ◽

Author(s):

Puneet Kumar ◽

Sidharth Jain ◽

Balasubramanian Raman ◽

Partha Pratim Roy ◽

Masakazu Iwamura

Keyword(s):

Emotion Recognition ◽

Speech Emotion Recognition ◽

Triplet Loss ◽

End To End

Emotion recognition from speech using deep learning on spectrograms

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-191129 ◽

2020 ◽

Vol 39 (3) ◽

pp. 2791-2796

Author(s):

Xingguang Li ◽

Wenjun Song ◽

Zonglin Liang

Keyword(s):

Deep Learning ◽

Emotion Recognition ◽

Input Sequence ◽

Variable Length ◽

Speech Emotion Recognition ◽

Sample Length ◽

Network Training ◽

Effective Part ◽

And Performance ◽

Deep Learning Model

In speech emotion recognition, most emotional corpora suffer from inconsistent sample lengths and imbalanced sample categories. To address these problems, this paper proposes a variable-length-input CRNN deep learning model based on focal loss for recognizing anger, happiness, neutrality and sadness in the IEMOCAP emotional corpus. First, a variable-length strategy is introduced to feed the speech spectra of the padded speech samples into a CNN. Second, the effective part of the input sequence is preserved and output by means of a masking matrix and a convolution layer. Third, the effective output of the input sequence is fed into a BiGRU network for learning. Finally, the focal loss is used for network training to control and adjust the contribution of the various samples to the total loss. Simulations show that, compared with traditional speech emotion recognition models, the method can effectively improve the accuracy and performance of emotion recognition.
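The way focal loss adjusts each sample's contribution can be illustrated with a minimal sketch. It uses the commonly cited form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The values gamma = 2 and alpha = 1, the function names, and the toy probabilities are assumptions, since the abstract does not report the paper's settings.

```python
import math

def cross_entropy(probs, target):
    """Plain cross-entropy for one sample given predicted class probabilities."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample (assumed common form, assumed hyperparameters).

    The factor (1 - p_t) ** gamma shrinks the loss of confidently
    correct samples, so harder samples dominate the total.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical softmax outputs over four emotion classes; the true class is 0.
easy = [0.9, 0.05, 0.03, 0.02]   # confidently correct
hard = [0.2, 0.5, 0.2, 0.1]      # misclassified

for name, probs in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(probs, 0)
    fl = focal_loss(probs, 0)
    print(name, round(ce, 4), round(fl, 4), round(fl / ce, 4))
```

With gamma = 2 the easy sample keeps about 1% of its cross-entropy loss while the hard sample keeps 64%, which is how the loss shifts weight toward poorly classified samples.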

Speech Emotion Recognition Based on Sparse Representation

Archives of Acoustics ◽

10.2478/aoa-2013-0055 ◽

2013 ◽

Vol 38 (4) ◽

pp. 465-470 ◽

Cited By ~ 11

Author(s):

Jingjie Yan ◽

Xiaolan Wang ◽

Weiyi Gu ◽

LiLi Ma

Keyword(s):

Dimensionality Reduction ◽

Emotion Recognition ◽

Least Squares ◽

Partial Least Squares ◽

Partial Least Squares Regression ◽

Speech Emotion Recognition ◽

Least Squares Regression ◽

Computer Science Pedagogy ◽

Reduction Methods ◽

Analysis Computer

Speech emotion recognition is regarded as a meaningful yet difficult problem across a number of domains, including sentiment analysis, computer science, and pedagogy. In this study, we investigate in depth speech emotion recognition based on the sparse partial least squares regression (SPLSR) approach. We use the SPLSR method to perform feature selection and dimensionality reduction on the full set of acquired speech emotion features. With the SPLSR method, the components of redundant and uninformative speech emotion features are shrunk to zero, while the useful and informative features are retained and passed on to the subsequent classification step. A number of tests on the Berlin database show that the recognition rate of the SPLSR method reaches up to 79.23% and is superior to that of the other dimensionality reduction methods compared.
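The idea of shrinking uninformative feature weights to exactly zero can be sketched with the soft-thresholding operator, a standard building block of L1-penalized methods. Whether the paper's SPLSR uses this exact operator is an assumption. The weight values, the threshold of 0.2, and the function name are hypothetical.

```python
import math

def soft_threshold(weights, lam):
    """Shrink each weight toward zero by `lam`; small weights become exactly 0.

    This is the generic L1 soft-thresholding operator, shown as an assumed
    stand-in for the sparsity step, not the paper's algorithm.
    """
    shrunk = []
    for w in weights:
        magnitude = abs(w) - lam
        shrunk.append(math.copysign(magnitude, w) if magnitude > 0 else 0.0)
    return shrunk

# Hypothetical loading weights for five acoustic features.
weights = [0.9, -0.05, 0.3, 0.1, -0.6]
sparse = soft_threshold(weights, lam=0.2)

# Features with non-zero weight are the ones kept for classification.
selected = [i for i, w in enumerate(sparse) if w != 0.0]
print(sparse, selected)
```

Features 1 and 3 have weights below the threshold and are dropped, so selection and shrinkage happen in a single step.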
