Unsupervised Feature Learning for Speech Emotion Recognition Based on Autoencoder

Yangwei Ying; Yuanwu Tu; Hong Zhou

doi:10.3390/electronics10172086


Electronics ◽

10.3390/electronics10172086 ◽

2021 ◽

Vol 10 (17) ◽

pp. 2086

Author(s):

Yangwei Ying ◽

Yuanwu Tu ◽

Hong Zhou

Keyword(s):

Emotion Recognition ◽

Data Augmentation ◽

Feature Learning ◽

Human Potential ◽

Speech Emotion Recognition ◽

Unsupervised Feature Learning ◽

Learning Techniques ◽

Speech Data ◽

Data Division ◽

Speech Features

Speech signals contain abundant information on personal emotions, which plays an important part in the representation of human potential characteristics and expressions. However, the scarcity of emotional speech data hinders the development of speech emotion recognition (SER) and limits improvements in recognition accuracy. Currently, the most effective approach is to use unsupervised feature learning techniques to extract speech features from available speech data and to build emotion classifiers on these features. In this paper, we propose using autoencoders, such as a denoising autoencoder (DAE) and an adversarial autoencoder (AAE), to extract features from LibriSpeech for model pre-training, and then conduct classification experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. Considering the imbalanced data distribution in IEMOCAP, we developed a novel data augmentation approach that optimizes the overlap shift between consecutive segments, and we redesigned the data division. The best classification accuracy reached 78.67% (weighted accuracy, WA) and 76.89% (unweighted accuracy, UA) with the AAE. Compared with the best results known to us (76.18% WA and 76.36% UA with a supervised learning method), this is a slight improvement. This suggests that unsupervised learning benefits the development of SER and provides a new approach to addressing the problem of data scarcity.
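The abstract reports both WA and UA without defining them. The sketch below follows the convention common in SER work: WA is overall accuracy across all samples, and UA is the mean of the per-class recalls, so that minority emotions count as much as frequent ones. The function names and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls; every class contributes equally."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    return sum(hit[c] / total[c] for c in total) / len(total)

# Made-up, imbalanced toy labels (not IEMOCAP data): 8 "neutral", 2 "angry".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 9 of 10 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 8/8 and 1/2
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

On imbalanced data the two metrics diverge, as in this toy case, which is why SER results on IEMOCAP are usually reported with both.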


Speech emotion recognition with unsupervised feature learning

Frontiers of Information Technology & Electronic Engineering ◽

10.1631/fitee.1400323 ◽

2015 ◽

Vol 16 (5) ◽

pp. 358-366 ◽

Cited By ~ 21

Author(s):

Zheng-wei Huang ◽

Wen-tao Xue ◽

Qi-rong Mao

Keyword(s):

Emotion Recognition ◽

Feature Learning ◽

Speech Emotion Recognition ◽

Unsupervised Feature Learning


Deep Cross-Corpus Speech Emotion Recognition: Recent Advances and Perspectives

Frontiers in Neurorobotics ◽

10.3389/fnbot.2021.784514 ◽

2021 ◽

Vol 15 ◽

Author(s):

Shiqing Zhang ◽

Ruixin Liu ◽

Xin Tao ◽

Xiaoming Zhao

Keyword(s):

Deep Learning ◽

Emotion Recognition ◽

Feature Learning ◽

Learning Ability ◽

Speech Emotion Recognition ◽

Practical Applications ◽

Learning Techniques ◽

Challenges And Opportunities ◽

Comprehensive Survey ◽

Cross Language

Automatic speech emotion recognition (SER) is a challenging component of human-computer interaction (HCI). The existing literature mainly focuses on evaluating SER performance by training and testing on a single corpus in a single language. However, in many practical applications, the training corpus and the testing corpus differ greatly. Because emotional speech corpora and languages are so diverse, most previous SER methods do not perform well when applied in real-world cross-corpus or cross-language scenarios. Inspired by the powerful feature learning ability of recently emerged deep learning techniques, various advanced deep learning models have increasingly been adopted for cross-corpus SER. This paper aims to provide an up-to-date and comprehensive survey of cross-corpus SER, especially of the deep learning techniques associated with supervised, unsupervised and semi-supervised learning in this area. In addition, this paper highlights the challenges and opportunities of cross-corpus SER tasks and points out future trends.


Speech Emotion Recognition Framework based on User Self-referential Speech Features

2018 IEEE 7th Global Conference on Consumer Electronics (GCCE) ◽

10.1109/gcce.2018.8574676 ◽

2018 ◽

Cited By ~ 1

Author(s):

Kyoungju Noh ◽

Seungeun Chung ◽

Jiyoun Lim ◽

Gague Kim ◽

Hyuntae Jeong

Keyword(s):

Emotion Recognition ◽

Speech Emotion Recognition ◽

Speech Features


Speech Emotion Recognition Using Machine Learning Techniques

Advances in Intelligent Systems and Computing - Congress on Intelligent Systems ◽

10.1007/978-981-33-6984-9_15 ◽

2021 ◽

pp. 169-178

Author(s):

Sreeja Sasidharan Rajeswari ◽

G. Gopakumar ◽

Manjusha Nair

Keyword(s):

Machine Learning ◽

Emotion Recognition ◽

Machine Learning Techniques ◽

Speech Emotion Recognition ◽

Learning Techniques


Intrusion detection systems using classical machine learning techniques vs integrated unsupervised feature learning and deep neural network

Internet Technology Letters ◽

10.1002/itl2.232 ◽

2020 ◽

Cited By ~ 1

Author(s):

Shisrut Rawat ◽

Aishwarya Srinivasan ◽

Vinayakumar Ravi ◽

Uttam Ghosh

Keyword(s):

Neural Network ◽

Machine Learning ◽

Intrusion Detection ◽

Deep Neural Network ◽

Feature Learning ◽

Intrusion Detection Systems ◽

Machine Learning Techniques ◽

Unsupervised Feature Learning ◽

Detection Systems ◽

Learning Techniques


Convolutional Recurrent Neural Networks Based Speech Emotion Recognition

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9321 ◽

2020 ◽

Vol 17 (8) ◽

pp. 3786-3789

Author(s):

P. Gayathri ◽

P. Gowri Priya ◽

L. Sravani ◽

Sandra Johnson ◽

Visanth Sampath

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Recurrent Neural Networks ◽

Machine Learning Techniques ◽

Speech Emotion Recognition ◽

Emotional Information ◽

Feature Representations ◽

Emotional Factors ◽

Learning Techniques ◽

The Impact

Emotion recognition is an aspect of speech recognition that is gaining attention, and the need for it is growing rapidly. Although there are methods to identify emotion using machine learning techniques, we assume in this paper that calculating deltas and delta-deltas for customized features not only preserves effective emotional information but also reduces the impact of irrelevant emotional factors, leading to fewer misclassifications. Furthermore, speech emotion recognition (SER) often suffers from silent frames and emotionally irrelevant frames. Meanwhile, the attention mechanism has demonstrated exceptional performance in learning task-relevant feature representations. Inspired by this, we propose an attention-based convolutional recurrent neural network (ACRNN) to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Finally, experimental results show the feasibility of the proposed method, which attains state-of-the-art performance in terms of unweighted average recall.
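The abstract does not say how the deltas and delta-deltas are computed. As a minimal sketch, the code below assumes the widely used regression formula, d_t = Σ n (c_{t+n} − c_{t−n}) / (2 Σ n²) for n = 1..N, with the first and last frames repeated at the edges. The function names, the `width` parameter, and the toy input are hypothetical and may differ from the paper's setup.

```python
def delta(frames, width=2):
    """Regression-based time derivative of a feature sequence.

    frames: list of equal-length feature vectors, one per time step.
    width:  number of frames on each side used in the regression (assumed value).
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                # Edge padding: clamp indices so the first/last frame is repeated.
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def stack_with_deltas(frames, width=2):
    """Concatenate static features, deltas, and delta-deltas per frame."""
    d1 = delta(frames, width)
    d2 = delta(d1, width)  # delta-delta is the delta of the delta
    return [f + a + b for f, a, b in zip(frames, d1, d2)]

# Toy one-band "spectrogram" that rises by 1 per frame (made-up values).
frames = [[float(t)] for t in range(6)]
d1 = delta(frames)
stacked = stack_with_deltas(frames)
print(d1[2], len(stacked[0]))
```

For a linearly rising input the interior deltas equal the slope, which is a quick sanity check. In image-like pipelines the three feature sets are often arranged as three input channels rather than concatenated along the feature axis; the concatenation here is only for brevity.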


A Study on the Search of the Most Discriminative Speech Features in the Speaker Dependent Speech Emotion Recognition

2012 Fifth International Symposium on Parallel Architectures, Algorithms and Programming ◽

10.1109/paap.2012.31 ◽

2012 ◽

Cited By ~ 11

Author(s):

Tsang-Long Pao ◽

Chun-Hsiang Wang ◽

Yu-Ji Li

Keyword(s):

Emotion Recognition ◽

Speech Emotion Recognition ◽

Speech Features


IMPROVED SPEAKER-INDEPENDENT EMOTION RECOGNITION FROM SPEECH USING TWO-STAGE FEATURE REDUCTION

Journal of Information and Communication Technology ◽

10.32890/jict2015.14.0.8156 ◽

2015 ◽

Author(s):

Hasrul Mohd Nazid ◽

Hariharan Muthusamy ◽

Vikneswaran Vijean ◽

Sazali Yaacob

Keyword(s):

Emotion Recognition ◽

Principal Component ◽

Feature Reduction ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Two Stage ◽

Linear Discriminant ◽

Speaker Independent ◽

Speech Features ◽

And Gender

In recent years, researchers have focused on improving the accuracy of speech emotion recognition. Generally, high accuracies have been obtained for two-class emotion recognition, but multi-class emotion recognition is still a challenging task. The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for improving the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifiers with two different emotional speech databases. From the experimental results, it can be inferred that the proposed method provides better accuracies of 87.48% for the speaker dependent (SD) and gender dependent (GD) ER experiment, 85.15% for the speaker independent (SI) ER experiment, and 87.09% for the gender independent (GI) experiment.
