HAN-ReGRU: hierarchical attention network with residual gated recurrent unit for emotion recognition in conversation

Author(s): Hui Ma, Jian Wang, Lingfei Qian, Hongfei Lin
IEEE Access, 2020, Vol. 8, pp. 215851-215862

Author(s): Gang Chen, Shiqing Zhang, Xin Tao, Xiaoming Zhao
2020, Vol. 34 (01), pp. 303-311

Author(s): Sicheng Zhao, Yunsheng Ma, Yang Gu, Jufeng Yang, Tengfei Xing, ...

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and then training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e., a polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
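
To make the polarity-emotion hierarchy constraint concrete, below is a minimal PyTorch sketch of one differentiable way to build a polarity-consistent cross-entropy loss: standard cross-entropy plus a penalty on probability mass assigned to emotions of the wrong polarity. This is not the authors' released implementation; the emotion-to-polarity mapping, the penalty form, and the weight `lambda_p` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical emotion-to-polarity mapping (assumption, not from the paper's code):
# four positive-polarity classes (1) followed by four negative-polarity classes (0).
EMOTION_POLARITY = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])

def polarity_consistent_ce(logits, targets, lambda_p=0.5):
    """Cross-entropy plus a penalty when predicted probability mass falls on
    emotions whose polarity disagrees with the ground-truth emotion's polarity.

    logits:   (batch, num_emotions) raw scores from the classifier head
    targets:  (batch,) ground-truth emotion indices
    lambda_p: weight of the polarity penalty (assumed value)
    """
    ce = F.cross_entropy(logits, targets)

    # Per sample, sum the probability mass on wrong-polarity emotions.
    probs = F.softmax(logits, dim=1)
    polarity = EMOTION_POLARITY.to(logits.device)                      # (num_emotions,)
    target_polarity = polarity[targets]                                # (batch,)
    wrong_polarity = polarity.unsqueeze(0) != target_polarity.unsqueeze(1)
    wrong_mass = (probs * wrong_polarity).sum(dim=1)                   # (batch,)

    # -log(1 - p_wrong) grows without bound as wrong-polarity mass approaches 1.
    penalty = -torch.log(torch.clamp(1.0 - wrong_mass, min=1e-7)).mean()
    return ce + lambda_p * penalty
```

In training, such a loss would simply replace plain cross-entropy; pushing predicted mass toward the correct polarity is the mechanism the abstract describes for guiding attention generation.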


Author(s): Jiacheng Yao, Jing Zhang, Jiafeng Li, Li Zhuo

With the rapid growth of online live streaming platforms, some anchors seek profit and popularity by mixing inappropriate content into their live programs. After being blacklisted, these anchors even forge their identities and switch platforms to continue streaming, causing great harm to the network environment. We therefore propose an anchor voiceprint recognition method for live streaming, based on RawNet-SA and a gated recurrent unit (GRU), for anchor identification on live platforms. First, the anchor's speech is extracted from the live stream using voice activity detection (VAD) and speech separation. Then, a voiceprint feature sequence is generated from the raw speech waveform with the self-attention network RawNet-SA. Finally, the feature sequence is aggregated by a GRU into a deep voiceprint feature vector for anchor recognition. Experiments conducted on the VoxCeleb, CN-Celeb, and MUSAN datasets yield competitive results, demonstrating that our method can effectively recognize anchor voiceprints in live streams.
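
As an illustration of the final aggregation step, here is a minimal PyTorch sketch of how a GRU can pool a frame-level voiceprint feature sequence (such as one a front-end like RawNet-SA would produce) into a single utterance-level embedding. The feature dimension, hidden size, embedding size, and cosine-similarity scoring are assumptions chosen for the example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUVoiceprintAggregator(nn.Module):
    """Aggregates a (batch, frames, feat_dim) feature sequence into a
    fixed-size voiceprint embedding. All sizes are illustrative assumptions."""

    def __init__(self, feat_dim=128, hidden_dim=256, embed_dim=192):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, feature_seq):
        # feature_seq: (batch, frames, feat_dim), e.g. frame-level features
        # from a waveform front-end. The GRU's last hidden state summarizes
        # the whole sequence.
        _, last_hidden = self.gru(feature_seq)          # (1, batch, hidden_dim)
        embedding = self.proj(last_hidden.squeeze(0))   # (batch, embed_dim)
        return F.normalize(embedding, dim=1)            # unit-length voiceprint

# Usage: compare two utterances by cosine similarity of their embeddings.
model = GRUVoiceprintAggregator()
utt_a = torch.randn(1, 300, 128)  # 300 frames of 128-dim features (dummy data)
utt_b = torch.randn(1, 300, 128)
score = (model(utt_a) * model(utt_b)).sum(dim=1)  # cosine similarity in [-1, 1]
```

Using the last hidden state is the simplest GRU pooling choice; the normalized output vector can then be matched against enrolled anchor voiceprints by similarity scoring.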

