Survey of Deep Representation Learning for Speech Emotion Recognition

Author(s):  
Siddique Latif ◽  
Rajib Rana ◽  
Sara Khalifa ◽  
Raja Jurdak ◽  
Junaid Qadir ◽  
...  

Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges a gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in the general setting without focusing on SER.
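As a concrete illustration (not taken from the survey itself), the sketch below shows one minimal form of deep representation learning: a small PyTorch autoencoder that learns an embedding of log-Mel spectrogram patches purely from reconstruction error, with no handcrafted features. All names, dimensions, and parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: an autoencoder whose encoder output serves as a learned
# intermediate representation of the speech signal. Training only minimises
# reconstruction error, so no manual feature engineering is involved.
class SpectrogramAutoencoder(nn.Module):
    def __init__(self, n_mels=64, frames=100, rep_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),                        # (batch, n_mels * frames)
            nn.Linear(n_mels * frames, 512),
            nn.ReLU(),
            nn.Linear(512, rep_dim),             # the learned representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(rep_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_mels * frames),
        )

    def forward(self, x):                        # x: (batch, n_mels, frames)
        z = self.encoder(x)
        recon = self.decoder(z).view(x.shape)
        return recon, z

model = SpectrogramAutoencoder()
x = torch.randn(8, 64, 100)                      # stand-in log-Mel batch
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)          # z can feed a downstream SER classifier
```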

2021 ◽  
Vol 15 ◽  
Author(s):  
Shiqing Zhang ◽  
Ruixin Liu ◽  
Xin Tao ◽  
Xiaoming Zhao

Automatic speech emotion recognition (SER) is a challenging component of human-computer interaction (HCI). The existing literature mainly focuses on evaluating SER performance by training and testing on a single corpus in a single language. In many practical applications, however, there are great differences between the training and testing corpora. Owing to the diversity of emotional speech corpora and languages, most previous SER methods do not perform well when applied in real-world cross-corpus or cross-language scenarios. Inspired by the powerful feature-learning ability of recently emerged deep learning techniques, various advanced deep learning models have increasingly been adopted for cross-corpus SER. This paper aims to provide an up-to-date and comprehensive survey of cross-corpus SER, especially of the deep learning techniques associated with supervised, unsupervised, and semi-supervised learning in this area. The paper also highlights the challenges and opportunities of cross-corpus SER tasks and points out future trends.
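To make the unsupervised cross-corpus setting concrete, here is a hedged sketch of one widely used technique within the survey's scope, domain-adversarial training (DANN-style): a gradient-reversal layer lets a domain discriminator push the shared feature extractor toward corpus-invariant representations. This is a generic illustration, not a method from the surveyed paper; all module names and shapes are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of domain-adversarial training for cross-corpus SER. The
# gradient-reversal layer flips gradients flowing back from the domain
# discriminator, so the shared extractor learns corpus-invariant features.
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

features = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # shared extractor
emotion_head = nn.Linear(64, 4)                           # e.g. 4 emotion classes
domain_head = nn.Linear(64, 2)                            # source vs. target corpus

def forward_pass(x, lambd=1.0):
    h = features(x)
    emo_logits = emotion_head(h)                          # supervised on labelled source corpus
    dom_logits = domain_head(GradReverse.apply(h, lambd)) # adversarial, uses both corpora
    return emo_logits, dom_logits
```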


2020 ◽  
Vol 17 (8) ◽  
pp. 3786-3789
Author(s):  
P. Gayathri ◽  
P. Gowri Priya ◽  
L. Sravani ◽  
Sandra Johnson ◽  
Visanth Sampath

Emotion recognition is an aspect of speech processing that is attracting increasing attention, and the need for it is growing enormously. Although there are methods to identify emotion using machine learning techniques, we assume in this paper that calculating deltas and delta-deltas for customized features not only preserves effective emotional information but also reduces the influence of emotionally irrelevant factors, leading to fewer misclassifications. Furthermore, speech emotion recognition (SER) often suffers from silent frames and emotionally irrelevant frames. Meanwhile, attention mechanisms have demonstrated exceptional performance in learning task-specific feature representations. Inspired by this, we propose an attention-based convolutional recurrent neural network (ACRNN) to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Experimental results show the feasibility of the proposed method, which attains state-of-the-art performance in terms of unweighted average recall.
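The three-channel input this abstract describes (log-Mel spectrogram plus deltas and delta-deltas) can be assembled in a few lines. Below is a minimal sketch using librosa; the file name and parameters (sample rate, number of Mel bands) are placeholder assumptions, not values from the paper.

```python
import numpy as np
import librosa

# Build the 3-channel input: log-Mel spectrogram plus its first- and
# second-order temporal derivatives. "clip.wav" is a placeholder path.
y, sr = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)
delta = librosa.feature.delta(log_mel)            # first-order derivative
delta2 = librosa.feature.delta(log_mel, order=2)  # second-order (delta-delta)
features = np.stack([log_mel, delta, delta2])     # shape: (3, n_mels, n_frames)
```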


Author(s):  
Sourabh Suke ◽  
Ganesh Regulwar ◽  
Nikesh Aote ◽  
Pratik Chaudhari ◽  
Rajat Ghatode ◽  
...  

This project describes "VoiEmo- A Speech Emotion Recognizer", a system for recognizing the emotional state of an individual from his/her speech. For example, speech becomes loud and fast, with a higher and wider pitch range, in a state of fear, anger, or joy, whereas the voice is generally slow and low-pitched in sadness and tiredness. We have developed classification models for speech emotion detection based on Convolutional Neural Networks (CNNs), Support Vector Machines (SVMs), and Multilayer Perceptron (MLP) classification, which make predictions from acoustic features of the speech signal such as Mel-Frequency Cepstral Coefficients (MFCCs). Our models have been trained to recognize eight common emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise). For training and testing the models, we have used relevant data from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). The system is advantageous in that it can provide a general idea of the emotional state of an individual from the acoustic features of speech, irrespective of the language the speaker uses; moreover, it saves time and effort. Speech emotion recognition systems have applications in various fields such as call centers and BPOs, criminal investigation, psychiatric therapy, the automobile industry, etc.
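As a hedged sketch of the pipeline this abstract describes (the helper function, parameters, and model settings are illustrative assumptions, not the authors' code), per-clip MFCC features can be extracted with librosa and fed to the kinds of scikit-learn classifiers the abstract mentions:

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def mfcc_vector(path, n_mfcc=40):
    """Average MFCCs over time to get one fixed-length vector per clip."""
    y, sr = librosa.load(path, sr=None)          # keep the file's native rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                     # shape: (n_mfcc,)

# Assumed setup: wav_paths and labels prepared from RAVDESS/TESS clips,
# with labels drawn from the eight emotion classes above.
# X = np.stack([mfcc_vector(p) for p in wav_paths])
# mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X, labels)
# svm = SVC(kernel="rbf").fit(X, labels)
```

Averaging MFCCs over time is one simple way to obtain the fixed-length input these classifiers require; it discards temporal dynamics, which is why the CNN variant mentioned in the abstract would instead consume the full time-frequency representation.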

