Interactive Multimodal Attention Network for Emotion Recognition in Conversation
2021, pp. 1-1
Author(s): Minjie Ren, Xiangdong Huang, Xiaoqi Shi, Weizhi Nie

IEEE Access, 2020, Vol 8, pp. 203814-203826
Author(s): Dong Yoon Choi, Deok-Hwan Kim, Byung Cheol Song

IEEE Access, 2020, Vol 8, pp. 215851-215862
Author(s): Gang Chen, Shiqing Zhang, Xin Tao, Xiaoming Zhao

2020, Vol 34 (01), pp. 303-311
Author(s): Sicheng Zhao, Yunsheng Ma, Yang Gu, Jufeng Yang, Tengfei Xing, ...

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and then training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attention into a visual 3D CNN and temporal attention into an audio 2D CNN. Further, we design a special classification loss, i.e., a polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
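A minimal PyTorch sketch of how a polarity-consistent cross-entropy loss of this kind might look, assuming the loss simply up-weights the standard cross-entropy whenever the predicted emotion's polarity (positive vs. negative) disagrees with the ground-truth polarity; the emotion-to-polarity mapping and the penalty weight are illustrative assumptions, not the authors' exact formulation:

import torch
import torch.nn.functional as F

# Hypothetical polarity of each of the 8 emotion classes (1 = positive, 0 = negative).
# The exact mapping depends on the dataset's label order and is assumed here.
EMOTION_POLARITY = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])

def polarity_consistent_ce(logits, targets, penalty=0.5):
    # logits: (batch, num_emotions) raw class scores; targets: (batch,) emotion indices.
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample cross-entropy
    pred = logits.argmax(dim=1)                              # predicted emotion class
    mismatch = (EMOTION_POLARITY[pred] != EMOTION_POLARITY[targets]).float()
    # Scale the loss up for samples whose predicted polarity contradicts the label polarity.
    return ((1.0 + penalty * mismatch) * ce).mean()

# Usage with random data: 4 clips, 8 emotion classes.
loss = polarity_consistent_ce(torch.randn(4, 8), torch.tensor([0, 3, 5, 7]))
print(loss.item())

In the paper the polarity constraint also guides attention generation; the sketch above only rescales the classification loss for polarity-inconsistent predictions.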


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Xiaodong Liu ◽  
Miao Wang

Recognition of human emotion from facial expressions is affected by distortions in picture quality and by facial pose, factors that are often ignored by traditional video emotion recognition methods. On the other hand, context information can provide varying degrees of extra clues that can further improve recognition accuracy. In this paper, we first build a video dataset with seven categories of human emotion, named Human Emotion In Video (HEIV). With the HEIV dataset, we train a context-aware attention network (CAAN) to recognize human emotion. The network consists of two subnetworks that process face and context information, respectively. Features from facial expressions and context clues are fused to represent the emotion of each video frame and are then passed through an attention network to generate per-frame emotion scores. The emotion features of all frames are then aggregated according to their emotion scores. Experimental results show that our proposed method is effective on the HEIV dataset.
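A minimal PyTorch sketch of the fuse-then-attend pipeline described above, assuming precomputed per-frame face and context features; the feature dimensions, the single-linear-layer attention head, and the softmax weighting over frames are illustrative assumptions rather than the authors' exact architecture:

import torch
import torch.nn as nn

class ContextAwareAttention(nn.Module):
    def __init__(self, face_dim=512, ctx_dim=512, num_classes=7):
        super().__init__()
        fused_dim = face_dim + ctx_dim
        self.attn = nn.Linear(fused_dim, 1)                  # per-frame emotion score
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, face_feats, ctx_feats):
        # face_feats, ctx_feats: (batch, num_frames, dim) per-frame features
        fused = torch.cat([face_feats, ctx_feats], dim=-1)   # fuse face and context clues
        weights = torch.softmax(self.attn(fused), dim=1)     # normalize scores over frames
        video_feat = (weights * fused).sum(dim=1)            # score-weighted aggregation
        return self.classifier(video_feat)                   # video-level emotion logits

# Usage with random features: 2 videos, 16 frames each.
model = ContextAwareAttention()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 7])

The score-weighted sum lets frames with clearer emotional evidence dominate the video-level representation, which matches the aggregation step the abstract describes.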

