Deep Multimodal Emotion Recognition on Human Speech: A Review

This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods, based on the way they handle the inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not significantly model the temporal dimension in either unimodal or multimodal interaction; (ii) pseudo-temporal architectures (PTA), which also oversimplify the temporal dimension, but in only one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies. In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results on the reported methodologies. Finally, we conclude this work with an in-depth analysis of the future challenges related to validation procedures, representation learning and method robustness.

Representation Learning with Spectro-Temporal-Channel Attention for Speech Emotion Recognition

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

10.1109/icassp39728.2021.9414006

2021

Author(s): Lili Guo, Longbiao Wang, Chenglin Xu, Jianwu Dang, Eng Siong Chng, ...

Keyword(s): Emotion Recognition, Representation Learning, Speech Emotion Recognition

Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition

10.21437/interspeech.2021-2067

2021

Author(s): Jiaxing Liu, Yaodong Song, Longbiao Wang, Jianwu Dang, Ruiguo Yu

Keyword(s): Emotion Recognition, Representation Learning, Speech Emotion Recognition, Convolutional Network, Time Frequency, Frequency Representation

An Attention Pooling Based Representation Learning Method for Speech Emotion Recognition

10.21437/interspeech.2018-1242

2018

Cited By ~ 25

Author(s): Pengcheng Li, Yan Song, Ian McLoughlin, Wu Guo, Lirong Dai

Keyword(s): Emotion Recognition, Representation Learning, Speech Emotion Recognition, Learning Method

Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition

10.21437/interspeech.2020-2572

2020

Author(s): Weiquan Fan, Xiangmin Xu, Xiaofen Xing, Dongyan Huang

Keyword(s): Emotion Recognition, Representation Learning, Speech Emotion Recognition

Metric Learning Based Feature Representation with Gated Fusion Model for Speech Emotion Recognition

10.21437/interspeech.2021-1133

2021

Author(s): Yuan Gao, Jiaxing Liu, Longbiao Wang, Jianwu Dang

Keyword(s): Emotion Recognition, Metric Learning, Feature Representation, Speech Emotion Recognition, Fusion Model

Time-Frequency Deep Representation Learning for Speech Emotion Recognition Integrating Self-attention

Communications in Computer and Information Science - Neural Information Processing

10.1007/978-3-030-36808-1_74

2019

pp. 681-689

Author(s): Jiaxing Liu, Zhilei Liu, Longbiao Wang, Lili Guo, Jianwu Dang

Keyword(s): Emotion Recognition, Representation Learning, Speech Emotion Recognition, Time Frequency

Feature representation for speech emotion recognition

2017 Iranian Conference on Electrical Engineering (ICEE)

10.1109/iraniancee.2017.7985273

2017

Cited By ~ 1

Author(s): Mehdi Abdollahpour, Jafar Zamani, Hamidreza Saligheh Rad

Keyword(s): Emotion Recognition, Feature Representation, Speech Emotion Recognition

Survey of Deep Representation Learning for Speech Emotion Recognition

10.36227/techrxiv.16689484

2021

Author(s): Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, ...

Keyword(s): Emotion Recognition, General Setting, Representation Learning, Data Driven, Speech Emotion Recognition, Feature Engineering, Acoustic Features, Learning Techniques, Comprehensive Survey, Hierarchical Representations

Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques, related challenges and identify important future areas of research. Our survey bridges the gap in the literature since existing surveys either focus on SER with hand-engineered features or representation learning in the general setting without focusing on SER.
