Towards Discriminative Representation Learning for Speech Emotion Recognition

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/703 ◽

2019 ◽

Cited By ~ 2

Author(s):

Runnan Li ◽

Zhiyong Wu ◽

Jia Jia ◽

Yaohua Bu ◽

Sheng Zhao ◽

...

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Representation Learning ◽

Speech Emotion Recognition ◽

User Intention ◽

Global Context ◽

Interaction Database ◽

Benchmark Database ◽

Realistic Interaction ◽

Speech Interaction

In intelligent speech interaction, automatic speech emotion recognition (SER) plays an important role in understanding user intention. While sentimental speech has different speaker characteristics but similar acoustic attributes, one vital challenge in SER is how to learn robust and discriminative representations for emotion inferring. In this paper, inspired by human emotion perception, we propose a novel representation learning component (RLC) for SER system, which is constructed with Multi-head Self-attention and Global Context-aware Attention Long Short-Term Memory Recurrent Neutral Network (GCA-LSTM). With the ability of Multi-head Self-attention mechanism in modeling the element-wise correlative dependencies, RLC can exploit the common patterns of sentimental speech features to enhance emotion-salient information importing in representation learning. By employing GCA-LSTM, RLC can selectively focus on emotion-salient factors with the consideration of entire utterance context, and gradually produce discriminative representation for emotion inferring. Experiments on public emotional benchmark database IEMOCAP and a tremendous realistic interaction database demonstrate the outperformance of the proposed SER framework, with 6.6% to 26.7% relative improvement on unweighted accuracy compared to state-of-the-art techniques.

Download Full-text

Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Sensors ◽

10.3390/s21051579 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1579 ◽

Cited By ~ 1

Author(s):

Kyoung Ju Noh ◽

Chi Yoon Jeong ◽

Jiyoun Lim ◽

Seungeun Chung ◽

Gague Kim ◽

...

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Domain Adaptation ◽

Classification Model ◽

Speech Emotion Recognition ◽

Target Domain ◽

Model Generalization ◽

Speech Database ◽

Emotion Labels ◽

Temporal Feature

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.

Download Full-text

Representation Learning with Spectro-Temporal-Channel Attention for Speech Emotion Recognition

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp39728.2021.9414006 ◽

2021 ◽

Author(s):

Lili Guo ◽

Longbiao Wang ◽

Chenglin Xu ◽

Jianwu Dang ◽

Eng Siong Chng ◽

...

Keyword(s):

Emotion Recognition ◽

Representation Learning ◽

Speech Emotion Recognition

Download Full-text

Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition

10.21437/interspeech.2021-2067 ◽

2021 ◽

Author(s):

Jiaxing Liu ◽

Yaodong Song ◽

Longbiao Wang ◽

Jianwu Dang ◽

Ruiguo Yu

Keyword(s):

Emotion Recognition ◽

Representation Learning ◽

Speech Emotion Recognition ◽

Convolutional Network ◽

Time Frequency ◽

Frequency Representation

Download Full-text

An Attention Pooling Based Representation Learning Method for Speech Emotion Recognition

10.21437/interspeech.2018-1242 ◽

2018 ◽

Cited By ~ 25

Author(s):

Pengcheng Li ◽

Yan Song ◽

Ian McLoughlin ◽

Wu Guo ◽

Lirong Dai

Keyword(s):

Emotion Recognition ◽

Representation Learning ◽

Speech Emotion Recognition ◽

Learning Method

Download Full-text

Gender Classification for Emotional Speech using GMFCC and Deep LSTM

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a6109.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 3923-3928

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Identification Accuracy ◽

Speech Emotion Recognition ◽

Gender Classification ◽

Feature Extraction Method ◽

Long Short Term Memory ◽

Pass Through ◽

Mel Frequency Cepstral Coefficient

We have come to the point that one of the important aspects of the process speech emotion recognition is the gender classification. The correct classification of gender will improve the performance of Speech Emotion Recognition (SER) system towards its robustness. Here, we are specifically referring to Gammatone Mel Frequency Cepstral Coefficient (GMFCC) as a feature extraction method that extracts features from IITKGPSESHC dataset, which is very crucial in deciding either male or female in gender classification. The well known classifier “Deep Long Short Term Memory (Deep LSTM)” is itself an important kind of Recurrent Neural Network (RNN) that handles the longrange dependencies more efficiently than the RNNs. The GMFCC feature has to pass through the Deep LSTM and get average percent gender identification accuracy of 98.3%.

Download Full-text

RNN-based Dimensional Speech Emotion Recognition

10.31227/osf.io/wa3vp ◽

2020 ◽

Author(s):

Bagus Tris Atmaja

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Mean Squared Error ◽

Absolute Error ◽

Recognition System ◽

Speech Emotion Recognition ◽

Percentage Error ◽

Concordance Correlation ◽

Acoustic Feature ◽

Dense System

◆ A speech emotion recognition system based on recurrent neural networks is developed using long short-term memory networks.◆ Two of acoustic feature sets are evaluated: 31 Features (3 time-domain features, 5 frequency-domain features, 13 MFCCs, 5 F0s, and 5 Harmonics) and eGeMaps feature set (23 features).◆ To evaluate the performance, some metrics are used i.e. mean squared error (MSE), mean absolute percentage error (MAPE), mean absolute error (MAE) and concordance correlation coefficient (CCC). Among those metrics, CCC is main focus as it is used by other researchers.◆ The developed system used multi-task learning to maximize arousal, valence, and dominance at the same time using CCC loss (1 - CCC). The result shows using LSTM networks improve the CCC score compared to baseline dense system. The best CCC score isobtained on arousal followed by dominance and valence.

Download Full-text

Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition

10.21437/interspeech.2020-2572 ◽

2020 ◽

Author(s):

Weiquan Fan ◽

Xiangmin Xu ◽

Xiaofen Xing ◽

Dongyan Huang

Keyword(s):

Emotion Recognition ◽

Representation Learning ◽

Speech Emotion Recognition

Download Full-text

Speech Emotion Recognition on Small Sample Learning by Hybrid WGAN-LSTM Networks

Journal of Circuits System and Computers ◽

10.1142/s0218126622500736 ◽

2021 ◽

Author(s):

Cunwei Sun ◽

Luping Ji ◽

Hailing Zhong

Keyword(s):

Emotion Recognition ◽

Language Processing ◽

Short Term Memory ◽

Small Sample ◽

New Method ◽

Small Samples ◽

Speech Emotion Recognition ◽

Generative Adversarial Network ◽

Adversarial Network ◽

In Series

The speech emotion recognition based on the deep networks on small samples is often a very challenging problem in natural language processing. The massive parameters of a deep network are much difficult to be trained reliably on small-quantity speech samples. Aiming at this problem, we propose a new method through the systematical cooperation of Generative Adversarial Network (GAN) and Long Short Term Memory (LSTM). In this method, it utilizes the adversarial training of GAN’s generator and discriminator on speech spectrogram images to implement sufficient sample augmentation. A six-layer convolution neural network (CNN), followed in series by a two-layer LSTM, is designed to extract features from speech spectrograms. For accelerating the training of networks, the parameters of discriminator are transferred to our feature extractor. By the sample augmentation, a well-trained feature extraction network and an efficient classifier could be achieved. The tests and comparisons on two publicly available datasets, i.e., EMO-DB and IEMOCAP, show that our new method is effective, and it is often superior to some state-of-the-art methods.

Download Full-text

Time-Frequency Deep Representation Learning for Speech Emotion Recognition Integrating Self-attention

Communications in Computer and Information Science - Neural Information Processing ◽

10.1007/978-3-030-36808-1_74 ◽

2019 ◽

pp. 681-689

Author(s):

Jiaxing Liu ◽

Zhilei Liu ◽

Longbiao Wang ◽

Lili Guo ◽

Jianwu Dang

Keyword(s):

Emotion Recognition ◽

Representation Learning ◽

Speech Emotion Recognition ◽

Time Frequency

Download Full-text

Audio-Textual Emotion Recognition Based on Improved Neural Networks

Mathematical Problems in Engineering ◽

10.1155/2019/2593036 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 4

Author(s):

Linqin Cai ◽

Yaxin Hu ◽

Jiangong Dong ◽

Sitong Zhou

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Short Term Memory ◽

Recognition Accuracy ◽

Recognition System ◽

Speech Emotion Recognition ◽

Short Term ◽

Term Memory ◽

Emotional Recognition ◽

Long Short Term Memory

With the rapid development in social media, single-modal emotion recognition is hard to satisfy the demands of the current emotional recognition system. Aiming to optimize the performance of the emotional recognition system, a multimodal emotion recognition model from speech and text was proposed in this paper. Considering the complementarity between different modes, CNN (convolutional neural network) and LSTM (long short-term memory) were combined in a form of binary channels to learn acoustic emotion features; meanwhile, an effective Bi-LSTM (bidirectional long short-term memory) network was resorted to capture the textual features. Furthermore, we applied a deep neural network to learn and classify the fusion features. The final emotional state was determined by the output of both speech and text emotion analysis. Finally, the multimodal fusion experiments were carried out to validate the proposed model on the IEMOCAP database. In comparison with the single modal, the overall recognition accuracy of text increased 6.70%, and that of speech emotion recognition soared 13.85%. Experimental results show that the recognition accuracy of our multimodal is higher than that of the single modal and outperforms other published multimodal models on the test datasets.

Download Full-text