An Attention Pooling Based Representation Learning Method for Speech Emotion Recognition

We propose a novel transfer learning method for speech emotion recognition that yields promising results when only a small amount of training data is available. With as few as 125 examples per emotion class, we reach a higher accuracy than a strong baseline trained on 8 times more data. Our method leverages knowledge contained in pre-trained speech representations extracted from models, such as wav2vec, that are trained on a more general self-supervised task which does not require human annotations. We provide detailed insights into the benefits of our approach by varying the training data size, which can help labeling teams work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a well-benchmarked dataset in the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning. We align acoustic pre-trained representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when both modalities are combined, and it scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state of the art of 73.9% unweighted accuracy (UA).
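The abstract reports its headline result as unweighted accuracy (UA). In SER work, UA is commonly computed as the mean of per-class recalls, so each emotion class counts equally regardless of how many utterances it has. The sketch below illustrates that convention alongside plain (weighted) accuracy. The function names and the toy labels are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy over all utterances, dominated by frequent classes."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical, imbalanced toy labels (illustration only, not IEMOCAP data).
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

ua = unweighted_accuracy(y_true, y_pred)  # averages recall of 1.0 and 0.5
wa = weighted_accuracy(y_true, y_pred)    # 9 of 10 utterances correct
print(f"UA={ua:.2f}  WA={wa:.2f}")
```

On imbalanced data the two metrics can diverge noticeably, which is one reason SER papers often report UA.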

Survey of Deep Representation Learning for Speech Emotion Recognition

10.36227/techrxiv.16689484 ◽ 2021

Author(s): Siddique Latif ◽ Rajib Rana ◽ Sara Khalifa ◽ Raja Jurdak ◽ Junaid Qadir ◽ ...

Keyword(s): Emotion Recognition ◽ General Setting ◽ Representation Learning ◽ Data Driven ◽ Speech Emotion Recognition ◽ Feature Engineering ◽ Acoustic Features ◽ Learning Techniques ◽ Comprehensive Survey ◽ Hierarchical Representations

Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges the gap in the literature, since existing surveys focus either on SER with hand-engineered features or on representation learning in the general setting without focusing on SER.

Speech Emotion Recognition with Local-Global Aware Deep Representation Learning

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽ 10.1109/icassp40776.2020.9053192 ◽ 2020

Author(s): Jiaxing Liu ◽ Zhilei Liu ◽ Longbiao Wang ◽ Lili Guo ◽ Jianwu Dang

Keyword(s): Emotion Recognition ◽ Representation Learning ◽ Speech Emotion Recognition

Deep Multimodal Emotion Recognition on Human Speech: A Review

Applied Sciences ◽ 10.3390/app11177962 ◽ 2021 ◽ Vol 11 (17) ◽ pp. 7962

Author(s): Panagiotis Koromilas ◽ Theodoros Giannakopoulos

Keyword(s): Emotion Recognition ◽ Visual Information ◽ Multimodal Interaction ◽ Representation Learning ◽ Basic Feature ◽ Feature Representation ◽ Speech Emotion Recognition ◽ Temporal Dimension ◽ Multimodal Interactions ◽ Depth Analysis

This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods, based on the way they handle the inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not significantly model the temporal dimension in either unimodal or multimodal interaction; (ii) pseudo-temporal architectures (PTA), which also oversimplify the temporal dimension, but in only one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies. In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results for the reported methodologies. Finally, we conclude with an in-depth analysis of the future challenges related to validation procedures, representation learning and method robustness.

Towards Discriminative Representation Learning for Speech Emotion Recognition

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2019/703 ◽ 2019 ◽ Cited By ~ 2

Author(s): Runnan Li ◽ Zhiyong Wu ◽ Jia Jia ◽ Yaohua Bu ◽ Sheng Zhao ◽ ...

Keyword(s): Emotion Recognition ◽ Short Term Memory ◽ Representation Learning ◽ Speech Emotion Recognition ◽ User Intention ◽ Global Context ◽ Interaction Database ◽ Benchmark Database ◽ Realistic Interaction ◽ Speech Interaction

In intelligent speech interaction, automatic speech emotion recognition (SER) plays an important role in understanding user intention. Because sentimental speech has different speaker characteristics but similar acoustic attributes, one vital challenge in SER is how to learn robust and discriminative representations for emotion inference. In this paper, inspired by human emotion perception, we propose a novel representation learning component (RLC) for SER systems, which is constructed with Multi-head Self-attention and a Global Context-aware Attention Long Short-Term Memory Recurrent Neural Network (GCA-LSTM). Because the Multi-head Self-attention mechanism can model element-wise correlative dependencies, RLC can exploit the common patterns of sentimental speech features to enhance the emotion-salient information brought into representation learning. By employing GCA-LSTM, RLC can selectively focus on emotion-salient factors while considering the entire utterance context, and gradually produce discriminative representations for emotion inference. Experiments on the public emotional benchmark database IEMOCAP and on a very large realistic interaction database demonstrate that the proposed SER framework outperforms state-of-the-art techniques, with 6.6% to 26.7% relative improvement in unweighted accuracy.
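To make the self-attention idea in this abstract more concrete, here is a minimal sketch of single-head scaled dot-product self-attention over a short sequence of feature frames. It is not the authors' RLC or GCA-LSTM: it assumes identity query, key and value projections (a real multi-head layer would use several learned projections in parallel), and the function names and toy frames are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the raw frames themselves
    (identity projections), which keeps the sketch dependency-free.
    """
    d = len(frames[0])
    attended = []
    for q in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        # Each output frame is a weighted average of all input frames.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return attended


# Hypothetical toy "acoustic frames" with 2 features each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(frames)
print(attended)
```

Each output frame mixes information from the whole utterance, which is the property the abstract relies on when it refers to modeling dependencies across elements.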

An Attention Pooling Based Representation Learning Method for Speech Emotion Recognition

Representation Learning with Spectro-Temporal-Channel Attention for Speech Emotion Recognition

Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition

Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition

Time-Frequency Deep Representation Learning for Speech Emotion Recognition Integrating Self-attention

Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning
