Transfer Learning of Deep Neural Network for Speech Emotion Recognition

We propose a novel transfer learning method for speech emotion recognition that yields promising results when only a small amount of training data is available. With as few as 125 examples per emotion class, we reach a higher accuracy than a strong baseline trained on 8 times more data. Our method leverages knowledge contained in pre-trained speech representations extracted from models, such as wav2vec, that are trained on a more general self-supervised task requiring no human annotations. We provide detailed insights into the benefits of our approach by varying the training data size, which can help labeling teams work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a widely benchmarked dataset in the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning. We align acoustic pre-trained representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when combining both modalities and scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state-of-the-art unweighted accuracy (UA) of 73.9%.
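The alignment step mentioned in the abstract can be illustrated with a minimal sketch of scaled dot-product attention, in which each word vector attends over the acoustic frames and is then concatenated with the resulting acoustic summary. This is not the authors' architecture: the recurrent part is omitted, the toy vectors stand in for real BERT and wav2vec outputs, and the function names (`align`, `softmax`, `dot`) are hypothetical. It also assumes both modalities already share one dimensionality, which in practice would likely require a learned projection.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def align(word_vectors, frame_vectors):
    """For each word vector, return (attention weights over frames, acoustic summary)."""
    dim = len(frame_vectors[0])
    scale = math.sqrt(dim)
    aligned = []
    for word in word_vectors:
        # Score every acoustic frame against the current word.
        weights = softmax([dot(word, frame) / scale for frame in frame_vectors])
        # Weighted average of the frames: the acoustic context for this word.
        summary = [
            sum(w * frame[d] for w, frame in zip(weights, frame_vectors))
            for d in range(dim)
        ]
        aligned.append((weights, summary))
    return aligned


# Toy stand-ins (assumptions): 2 "word" vectors and 3 "acoustic frame" vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
frames = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]

aligned = align(words, frames)
# Fuse modalities by concatenating each word vector with its acoustic summary.
fused = [word + summary for word, (_, summary) in zip(words, aligned)]
```

In a full model, a sequence like `fused` would presumably be fed to a recurrent layer and a classifier; the sketch stops at the fusion step.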

Speech emotion recognition using deep neural network and extreme learning machine

10.21437/interspeech.2014-57 ◽ 2014 ◽ Cited By ~ 4

Author(s): Kun Han ◽ Dong Yu ◽ Ivan Tashev

Keyword(s): Neural Network ◽ Emotion Recognition ◽ Extreme Learning Machine ◽ Deep Neural Network ◽ Speech Emotion Recognition ◽ Learning Machine

Simulation of English speech emotion recognition based on transfer learning and CNN neural network

Journal of Intelligent & Fuzzy Systems ◽ 10.3233/jifs-189231 ◽ 2020 ◽ pp. 1-12

Author(s): Xuehua Chen

Keyword(s): Neural Network ◽ Emotion Recognition ◽ Transfer Learning ◽ Speech Emotion Recognition ◽ Training Strategy ◽ Model Based ◽ Time Translation ◽ The Difference ◽ Statistical Graph ◽ Weight Transfer

English differs from Chinese in that it emphasizes syllable stress, so recognizing emotion in English speech plays an important role in learning English. This study uses transfer learning to study English speech emotion recognition. The acoustic model based on weight transfer has two training strategies: single-stage training and two-stage training. The performance of a CNN-based English speech emotion recognition model is compared with that of the model proposed in this paper, and the comparison results are plotted as statistical graphs. The results show that transfer learning has certain advantages over other algorithms in English speech emotion recognition. In subsequent research on teaching and real-time translation equipment, transfer learning can be applied to English models.
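The abstract does not define its two training strategies, so the sketch below rests on a common interpretation that should be treated as an assumption: single-stage training updates the transferred weights and the new output layer together, while two-stage training first fits only the new output layer with the transferred weights frozen, then fine-tunes everything. The two-weight linear model, the toy data, and all names (`train`, `source_params`, `target_data`) are hypothetical and chosen only to keep the example self-contained.

```python
def predict(params, x):
    """Tiny linear model: a transferred "feature" weight followed by a "head" weight."""
    return params["head"] * (params["feature"] * x)


def mse(params, data):
    """Mean squared error over (input, target) pairs."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def train(params, data, trainable, lr=0.02, epochs=200):
    """Gradient descent on MSE that updates only the weights named in `trainable`."""
    params = dict(params)  # leave the caller's weights untouched
    n = len(data)
    for _ in range(epochs):
        grad = {"feature": 0.0, "head": 0.0}
        for x, y in data:
            err = predict(params, x) - y
            grad["feature"] += 2 * err * params["head"] * x / n
            grad["head"] += 2 * err * params["feature"] * x / n
        for name in trainable:
            params[name] -= lr * grad[name]
    return params


# Hypothetical weights learned on a source task; only "feature" is transferred.
source_params = {"feature": 1.5, "head": -0.7}
target_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy target task: y = 2x

# Weight transfer: reuse the source feature weight, start the head from zero.
init = {"feature": source_params["feature"], "head": 0.0}

# Assumed single-stage strategy: update everything at once.
single_stage = train(init, target_data, trainable=("feature", "head"))

# Assumed two-stage strategy: fit the head first, then fine-tune all weights.
stage_one = train(init, target_data, trainable=("head",))
two_stage = train(stage_one, target_data, trainable=("feature", "head"))
```

On this toy problem both strategies fit the data. The visible difference is that the first stage of the two-stage run leaves the transferred weight exactly as it was, which is the usual motivation for freezing: the pre-trained representation is not disturbed while the new layer is still far from a good solution.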
