Audio Features for Music Emotion Recognition: A Survey

Author(s):  
Renato Panda ◽  
Ricardo Manuel Malheiro ◽  
Rui Pedro Paiva
Sensors ◽  
2021 ◽  
Vol 21 (12) ◽  
pp. 4233
Author(s):  
Bogdan Mocanu ◽  
Ruxandra Tapu ◽  
Titus Zaharia

Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning, and human–machine/robot interaction. In this paper, we introduce a novel speech emotion recognition method based on the Squeeze-and-Excitation ResNet (SE-ResNet) model fed with spectrogram inputs. To overcome the limitations of state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves 83.35% and 64.92% global accuracy rates on the publicly available RAVDESS and CREMA-D datasets, respectively. Compared with the results provided by human observers, the gains in global accuracy exceed 24%. Finally, an objective comparative evaluation against state-of-the-art techniques demonstrates accuracy gains of more than 3%.
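
The triplet-loss component described in this abstract can be illustrated with a short sketch. The PyTorch code below is not the authors' implementation: the embedding dimension, margin value, and batch size are illustrative assumptions, and the "emotional constraint" on the margin is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def emotion_triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss on L2-normalized utterance embeddings: pull same-emotion
    pairs together, push different-emotion pairs at least `margin` apart."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)
    d_pos = (anchor - positive).pow(2).sum(dim=-1)  # distance to same-emotion utterance
    d_neg = (anchor - negative).pow(2).sum(dim=-1)  # distance to different-emotion utterance
    return F.relu(d_pos - d_neg + margin).mean()

# Stand-ins for GhostVLAD-aggregated utterance embeddings (batch of 8, dim 256).
a, p, n = (torch.randn(8, 256) for _ in range(3))
print(emotion_triplet_loss(a, p, n))
```

In the paper's pipeline, these embeddings would come from the SE-ResNet backbone after GhostVLAD aggregation; here random tensors stand in for them.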


Computation ◽  
2017 ◽  
Vol 5 (4) ◽  
pp. 26 ◽  
Author(s):  
Michalis Papakostas ◽  
Evaggelos Spyrou ◽  
Theodoros Giannakopoulos ◽  
Giorgos Siantikos ◽  
Dimitrios Sgouropoulos ◽  
...  

Information ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 145 ◽  
Author(s):  
Zhenglong Xiang ◽  
Xialei Dong ◽  
Yuanxiang Li ◽  
Fei Yu ◽  
Xing Xu ◽  
...  

Most existing research studies the emotion recognition of Minnan songs from the perspectives of music analysis theory and music appreciation; these investigations do not, however, explore the possibility of automatic emotion recognition of Minnan songs. In this paper, we propose a model consisting of four main modules that classifies the emotion of Minnan songs using bimodal data: song lyrics and audio. In the proposed model, an attention-based Long Short-Term Memory (LSTM) neural network extracts lyrical features, and a Convolutional Neural Network (CNN) extracts audio features from the spectrum. The two kinds of extracted features are then fused by multimodal compact bilinear pooling, and the fused features are finally input to the classification module to determine the song emotion. We designed three groups of experiments to investigate the classification performance of combinations of the four main modules, to compare the proposed model with current approaches, and to assess the influence of a few key parameters on recognition performance. The results show that the proposed model outperforms all other experimental groups; with an appropriate combination of parameters, its accuracy, precision, and recall all exceed 0.80.
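
The fusion step lends itself to a compact sketch: multimodal compact bilinear pooling approximates the outer product of the lyric and audio feature vectors via count sketches combined in the frequency domain. The PyTorch code below is a generic MCB implementation, not the paper's code; the feature and output dimensions are illustrative.

```python
import torch

def count_sketch(x, h, s, d):
    """Project features x (batch, n) into d dims using hash indices h and signs s."""
    out = torch.zeros(x.size(0), d)
    out.index_add_(1, h, x * s)
    return out

def mcb_fusion(x, y, d=1024, seed=0):
    """Multimodal compact bilinear pooling: approximates the outer product of
    x and y by circularly convolving their count sketches (done via FFT)."""
    g = torch.Generator().manual_seed(seed)  # fixed random projections
    def proj(n):
        h = torch.randint(0, d, (n,), generator=g)
        s = torch.randint(0, 2, (n,), generator=g).float() * 2 - 1
        return h, s
    hx, sx = proj(x.size(1))
    hy, sy = proj(y.size(1))
    fx = torch.fft.rfft(count_sketch(x, hx, sx, d))
    fy = torch.fft.rfft(count_sketch(y, hy, sy, d))
    return torch.fft.irfft(fx * fy, n=d)

# Illustrative dimensions: 300-d lyric features (LSTM) and 512-d audio features (CNN).
fused = mcb_fusion(torch.randn(4, 300), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 1024])
```

In practice the random projections would be sampled once and stored as buffers so that the same sketch is applied at training and test time.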


2020 ◽  
Vol 34 (01) ◽  
pp. 303-311 ◽  
Author(s):  
Sicheng Zhao ◽  
Yunsheng Ma ◽  
Yang Gu ◽  
Jufeng Yang ◽  
Tengfei Xing ◽  
...  

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and then training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop the Visual-Audio Attention Network (VAANet), a novel deep architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, the polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint, to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
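
The polarity-consistent cross-entropy loss can be sketched briefly. The snippet below is a simplified reading of the idea, not the released VAANet code linked above: it up-weights the usual cross-entropy whenever the predicted emotion's polarity disagrees with the ground truth's. The class-to-polarity mapping and the penalty weight are assumptions.

```python
import torch
import torch.nn.functional as F

# Assumed mapping of 8 emotion classes to polarity (0 = positive, 1 = negative),
# standing in for the paper's polarity-emotion hierarchy.
POLARITY = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])

def pcce_loss(logits, target, penalty=0.5):
    """Cross-entropy, up-weighted when the predicted class falls on the
    wrong side of the polarity hierarchy (simplified PCCE)."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pred = logits.argmax(dim=1)
    mismatch = (POLARITY[pred] != POLARITY[target]).float()
    return ((1.0 + penalty * mismatch) * ce).mean()

logits = torch.randn(16, 8)          # stand-in scores from the fused attention head
target = torch.randint(0, 8, (16,))  # ground-truth emotion labels
print(pcce_loss(logits, target))
```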


2015 ◽  
Vol 29 (4) ◽  
pp. 313-334 ◽  
Author(s):  
Renato Panda ◽  
Bruno Rocha ◽  
Rui Pedro Paiva

2020 ◽  
Vol 12 (1) ◽  
pp. 51-59
Author(s):  
A. A. Moskvin ◽  
A.G. Shishkin

Human emotions play a significant role in everyday life, and automatic emotion recognition has many applications in medicine, e-learning, monitoring, marketing, etc. In this paper, a method and a neural network architecture for real-time human emotion recognition from audio-visual data are proposed. To classify the input into one of seven emotions, deep neural networks, namely convolutional and recurrent neural networks, are used. Visual information is represented by a sequence of 16 frames of 96 × 96 pixels, and audio information by 140 features for each of a sequence of 37 temporal windows; an autoencoder is used to reduce the number of audio features. Audio information used in conjunction with visual information is shown to increase recognition accuracy by up to 12%. The developed system is undemanding of computing resources and is flexible with respect to the choice of parameters and to reducing or increasing the number of emotion classes, and it can easily incorporate and accumulate information from other external devices to further improve classification accuracy.
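
The audio-feature reduction step admits a small sketch. The PyTorch autoencoder below assumes the dimensions given in the abstract (140 features per each of 37 temporal windows); the bottleneck width and hidden sizes are illustrative choices, not the authors'.

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Compress the per-window 140-d audio feature vector to a small bottleneck."""
    def __init__(self, n_features=140, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):            # x: (batch, 37, 140)
        z = self.encoder(x)          # compressed audio features, fed to the RNN
        return self.decoder(z), z

model = AudioAutoencoder()
x = torch.randn(8, 37, 140)                  # a batch of audio feature sequences
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)      # reconstruction objective
print(z.shape, loss.item())
```

After training, only the encoder would be kept, and its compressed sequence would be passed to the recurrent part of the audio-visual network.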


2020 ◽  
Vol 11 (4) ◽  
pp. 614-626 ◽  
Author(s):  
Renato Panda ◽  
Ricardo Malheiro ◽  
Rui Pedro Paiva

Author(s):  
Tarun Krishna ◽  
Ayush Rai ◽  
Shubham Bansal ◽  
Shubham Khandelwal ◽  
Shubham Gupta ◽  
...  
