Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation

Author(s):  
Sung-Lin Yeh ◽  
Yun-Shao Lin ◽  
Chi-Chun Lee
2021 ◽  
Author(s):  
Wenliang Dai ◽  
Samuel Cahyawijaya ◽  
Zihan Liu ◽  
Pascale Fung

2021 ◽  
pp. 110-118
Author(s):  
Dawei Liu ◽  
Longbiao Wang ◽  
Sheng Li ◽  
Haoyu Li ◽  
Chenchen Ding ◽  
...  

2020 ◽  
Vol 34 (01) ◽  
pp. 303-311 ◽  
Author(s):  
Sicheng Zhao ◽  
Yunsheng Ma ◽  
Yang Gu ◽  
Jufeng Yang ◽  
Tengfei Xing ◽  
...  

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and then training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attention into a visual 3D CNN and temporal attention into an audio 2D CNN. Further, we design a special classification loss, the polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint, to guide attention generation. Extensive experiments on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
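
The abstract does not give the formula for the polarity-consistent cross-entropy loss, so the following is only a minimal sketch of one plausible reading: ordinary cross-entropy, scaled up when the predicted emotion falls in the opposite polarity from the ground truth. The four-class label set, the polarity mapping, the `penalty` weight, and the function names are all assumptions made for illustration and are not taken from the paper or its repository.

```python
import math

# Hypothetical label set and polarity mapping, chosen only for illustration.
POLARITY = {
    "joy": "positive",
    "trust": "positive",
    "anger": "negative",
    "fear": "negative",
}
CLASSES = list(POLARITY)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pcce_loss(logits, target, penalty=0.5):
    """Cross-entropy for one sample, with an assumed polarity penalty.

    The plain cross-entropy is multiplied by (1 + penalty) when the
    arg-max prediction and the target belong to different polarities.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    predicted = max(range(len(probs)), key=probs.__getitem__)
    mismatch = POLARITY[CLASSES[predicted]] != POLARITY[CLASSES[target]]
    return cross_entropy * (1.0 + penalty) if mismatch else cross_entropy


# Both samples predict "joy". The first target ("trust") shares its polarity,
# the second ("anger") does not, so only the second loss is scaled.
same_polarity = pcce_loss([2.0, 1.0, 0.0, 0.0], target=1)
cross_polarity = pcce_loss([2.0, 0.0, 1.0, 0.0], target=2)
```

Under these assumptions, a wrong prediction within the correct polarity is penalized less than one that crosses polarity, which is the kind of hierarchy constraint the abstract describes.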

