Dimensional Music Emotion Recognition by Machine Learning

Author(s):  
Junjie Bai ◽  
Lixiao Feng ◽  
Jun Peng ◽  
Jinliang Shi ◽  
Kan Luo ◽  
...  

Music emotion recognition (MER) is a challenging field of studies that has been addressed in multiple disciplines such as cognitive science, physiology, psychology, musicology, and arts. In this paper, music emotions are modeled as a set of continuous variables composed of valence and arousal (VA) values based on the Valence-Arousal model. MER is formulated as a regression problem where 548 dimensions of music features were extracted and selected. A wide range of methods including multivariate adaptive regression spline, support vector regression (SVR), radial basis function, random forest regression (RFR), and regression neural networks are adopted to recognize music emotions. Experimental results show that these regression algorithms have led to good regression effect for MER. The optimal R2 statistics and VA values are 29.3% and 62.5%, respectively, which are obtained by the RFR and SVR algorithms in the relief feature space.

PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0253383
Author(s):  
Badar Almarri ◽  
Sanguthevar Rajasekaran ◽  
Chun-Hsi Huang

The dimensionality of the spatially distributed channels and the temporal resolution of electroencephalogram (EEG) based brain-computer interfaces (BCI) undermine emotion recognition models. Thus, prior to modeling such data, as the final stage of the learning pipeline, adequate preprocessing, transforming, and extracting temporal (i.e., time-series signals) and spatial (i.e., electrode channels) features are essential phases to recognize underlying human emotions. Conventionally, inter-subject variations are dealt with by avoiding the sources of variation (e.g., outliers) or turning the problem into a subject-deponent. We address this issue by preserving and learning from individual particularities in response to affective stimuli. This paper investigates and proposes a subject-independent emotion recognition framework that mitigates the subject-to-subject variability in such systems. Using an unsupervised feature selection algorithm, we reduce the feature space that is extracted from time-series signals. For the spatial features, we propose a subject-specific unsupervised learning algorithm that learns from inter-channel co-activation online. We tested this framework on real EEG benchmarks, namely DEAP, MAHNOB-HCI, and DREAMER. We train and test the selection outcomes using nested cross-validation and a support vector machine (SVM). We compared our results with the state-of-the-art subject-independent algorithms. Our results show an enhanced performance by accurately classifying human affection (i.e., based on valence and arousal) by 16%–27% compared to other studies. This work not only outperforms other subject-independent studies reported in the literature but also proposes an online analysis solution to affection recognition.


2016 ◽  
Vol 7 (1) ◽  
pp. 58-68 ◽  
Author(s):  
Imen Trabelsi ◽  
Med Salim Bouhlel

Automatic Speech Emotion Recognition (SER) is a current research topic in the field of Human Computer Interaction (HCI) with a wide range of applications. The purpose of speech emotion recognition system is to automatically classify speaker's utterances into different emotional states such as disgust, boredom, sadness, neutral, and happiness. The speech samples in this paper are from the Berlin emotional database. Mel Frequency cepstrum coefficients (MFCC), Linear prediction coefficients (LPC), linear prediction cepstrum coefficients (LPCC), Perceptual Linear Prediction (PLP) and Relative Spectral Perceptual Linear Prediction (Rasta-PLP) features are used to characterize the emotional utterances using a combination between Gaussian mixture models (GMM) and Support Vector Machines (SVM) based on the Kullback-Leibler Divergence Kernel. In this study, the effect of feature type and its dimension are comparatively investigated. The best results are obtained with 12-coefficient MFCC. Utilizing the proposed features a recognition rate of 84% has been achieved which is close to the performance of humans on this database.


2020 ◽  
Vol 13 (4) ◽  
pp. 4-24 ◽  
Author(s):  
V.A. Barabanschikov ◽  
E.V. Suvorova

The article is devoted to the results of approbation of the Geneva Emotion Recognition Test (GERT), a Swiss method for assessing dynamic emotional states, on Russian sample. Identification accuracy and the categorical fields’ structure of emotional expressions of a “living” face are analysed. Similarities and differences in the perception of affective groups of dynamic emotions in the Russian and Swiss samples are considered. A number of patterns of recognition of multi-modal expressions with changes in valence and arousal of emotions are described. Differences in the perception of dynamics and statics of emotional expressions are revealed. GERT method confirmed it’s high potential for solving a wide range of academic and applied problems.


Symmetry ◽  
2018 ◽  
Vol 10 (11) ◽  
pp. 627 ◽  
Author(s):  
Daniel Chalupa ◽  
Jan Mikulka

The rather impressive extension library of medical image-processing platform 3D Slicer lacks a wide range of machine-learning toolboxes. The authors have developed such a toolbox that incorporates commonly used machine-learning libraries. The extension uses a simple graphical user interface that allows the user to preprocess data, train a classifier, and use that classifier in common medical image-classification tasks, such as tumor staging or various anatomical segmentations without a deeper knowledge of the inner workings of the classifiers. A series of experiments were carried out to showcase the capabilities of the extension and quantify the symmetry between the physical characteristics of pathological tissues and the parameters of a classifying model. These experiments also include an analysis of the impact of training vector size and feature selection on the sensitivity and specificity of all included classifiers. The results indicate that training vector size can be minimized for all classifiers. Using the data from the Brain Tumor Segmentation Challenge, Random Forest appears to have the widest range of parameters that produce sufficiently accurate segmentations, while optimal Support Vector Machines’ training parameters are concentrated in a narrow feature space.


2020 ◽  
Vol 10 (5) ◽  
pp. 1619 ◽  
Author(s):  
Chao Pan ◽  
Cheng Shi ◽  
Honglang Mu ◽  
Jie Li ◽  
Xinbo Gao

Emotion plays a nuclear part in human attention, decision-making, and communication. Electroencephalogram (EEG)-based emotion recognition has developed a lot due to the application of Brain-Computer Interface (BCI) and its effectiveness compared to body expressions and other physiological signals. Despite significant progress in affective computing, emotion recognition is still an unexplored problem. This paper introduced Logistic Regression (LR) with Gaussian kernel and Laplacian prior for EEG-based emotion recognition. The Gaussian kernel enhances the EEG data separability in the transformed space. The Laplacian prior promotes the sparsity of learned LR regressors to avoid over-specification. The LR regressors are optimized using the logistic regression via variable splitting and augmented Lagrangian (LORSAL) algorithm. For simplicity, the introduced method is noted as LORSAL. Experiments were conducted on the dataset for emotion analysis using EEG, physiological and video signals (DEAP). Various spectral features and features by combining electrodes (power spectral density (PSD), differential entropy (DE), differential asymmetry (DASM), rational asymmetry (RASM), and differential caudality (DCAU)) were extracted from different frequency bands (Delta, Theta, Alpha, Beta, Gamma, and Total) with EEG signals. The Naive Bayes (NB), support vector machine (SVM), linear LR with L1-regularization (LR_L1), linear LR with L2-regularization (LR_L2) were used for comparison in the binary emotion classification for valence and arousal. LORSAL obtained the best classification accuracies (77.17% and 77.03% for valence and arousal, respectively) on the DE features extracted from total frequency bands. This paper also investigates the critical frequency bands in emotion recognition. The experimental results showed the superiority of Gamma and Beta bands in classifying emotions. It was presented that DE was the most informative and DASM and DCAU had lower computational complexity with relatively ideal accuracies. An analysis of LORSAL and the recently deep learning (DL) methods is included in the discussion. Conclusions and future work are presented in the final section.


Electronics ◽  
2021 ◽  
Vol 10 (23) ◽  
pp. 2891
Author(s):  
Shihan Huang ◽  
Hua Dang ◽  
Rongkun Jiang ◽  
Yue Hao ◽  
Chengbo Xue ◽  
...  

Speech Emotion Recognition (SER) plays a significant role in the field of Human–Computer Interaction (HCI) with a wide range of applications. However, there are still some issues in practical application. One of the issues is the difference between emotional expression amongst various individuals, and another is that some indistinguishable emotions may reduce the stability of the SER system. In this paper, we propose a multi-layer hybrid fuzzy support vector machine (MLHF-SVM) model, which includes three layers: feature extraction layer, pre-classification layer, and classification layer. The MLHF-SVM model solves the above-mentioned issues by fuzzy c-means (FCM) based on identification information of human and multi-layer SVM classifiers, respectively. In addition, to overcome the weakness that FCM tends to fall into local minima, an improved natural exponential inertia weight particle swarm optimization (IEPSO) algorithm is proposed and integrated with fuzzy c-means for optimization. Moreover, in the feature extraction layer, non-personalized features and personalized features are combined to improve accuracy. In order to verify the effectiveness of the proposed model, all emotions in three popular datasets are used for simulation. The results show that this model can effectively improve the success rate of classification and the maximum value of a single emotion recognition rate is 97.67% on the EmoDB dataset.


2019 ◽  
Vol 11 (5) ◽  
pp. 105 ◽  
Author(s):  
Yongrui Huang ◽  
Jianhao Yang ◽  
Siyu Liu ◽  
Jiahui Pan

Emotion recognition plays an essential role in human–computer interaction. Previous studies have investigated the use of facial expression and electroencephalogram (EEG) signals from single modal for emotion recognition separately, but few have paid attention to a fusion between them. In this paper, we adopted a multimodal emotion recognition framework by combining facial expression and EEG, based on a valence-arousal emotional model. For facial expression detection, we followed a transfer learning approach for multi-task convolutional neural network (CNN) architectures to detect the state of valence and arousal. For EEG detection, two learning targets (valence and arousal) were detected by different support vector machine (SVM) classifiers, separately. Finally, two decision-level fusion methods based on the enumerate weight rule or an adaptive boosting technique were used to combine facial expression and EEG. In the experiment, the subjects were instructed to watch clips designed to elicit an emotional response and then reported their emotional state. We used two emotion datasets—a Database for Emotion Analysis using Physiological Signals (DEAP) and MAHNOB-human computer interface (MAHNOB-HCI)—to evaluate our method. In addition, we also performed an online experiment to make our method more robust. We experimentally demonstrated that our method produces state-of-the-art results in terms of binary valence/arousal classification, based on DEAP and MAHNOB-HCI data sets. Besides this, for the online experiment, we achieved 69.75% accuracy for the valence space and 70.00% accuracy for the arousal space after fusion, each of which has surpassed the highest performing single modality (69.28% for the valence space and 64.00% for the arousal space). The results suggest that the combination of facial expressions and EEG information for emotion recognition compensates for their defects as single information sources. The novelty of this work is as follows. To begin with, we combined facial expression and EEG to improve the performance of emotion recognition. Furthermore, we used transfer learning techniques to tackle the problem of lacking data and achieve higher accuracy for facial expression. Finally, in addition to implementing the widely used fusion method based on enumerating different weights between two models, we also explored a novel fusion method, applying boosting technique.


Sensors ◽  
2020 ◽  
Vol 21 (1) ◽  
pp. 52
Author(s):  
Tianyi Zhang ◽  
Abdallah El Ali ◽  
Chen Wang ◽  
Alan Hanjalic ◽  
Pablo Cesar

Recognizing user emotions while they watch short-form videos anytime and anywhere is essential for facilitating video content customization and personalization. However, most works either classify a single emotion per video stimuli, or are restricted to static, desktop environments. To address this, we propose a correlation-based emotion recognition algorithm (CorrNet) to recognize the valence and arousal (V-A) of each instance (fine-grained segment of signals) using only wearable, physiological signals (e.g., electrodermal activity, heart rate). CorrNet takes advantage of features both inside each instance (intra-modality features) and between different instances for the same video stimuli (correlation-based features). We first test our approach on an indoor-desktop affect dataset (CASE), and thereafter on an outdoor-mobile affect dataset (MERCA) which we collected using a smart wristband and wearable eyetracker. Results show that for subject-independent binary classification (high-low), CorrNet yields promising recognition accuracies: 76.37% and 74.03% for V-A on CASE, and 70.29% and 68.15% for V-A on MERCA. Our findings show: (1) instance segment lengths between 1–4 s result in highest recognition accuracies (2) accuracies between laboratory-grade and wearable sensors are comparable, even under low sampling rates (≤64 Hz) (3) large amounts of neutral V-A labels, an artifact of continuous affect annotation, result in varied recognition performance.


Agriculture ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 371
Author(s):  
Yu Jin ◽  
Jiawei Guo ◽  
Huichun Ye ◽  
Jinling Zhao ◽  
Wenjiang Huang ◽  
...  

The remote sensing extraction of large areas of arecanut (Areca catechu L.) planting plays an important role in investigating the distribution of arecanut planting area and the subsequent adjustment and optimization of regional planting structures. Satellite imagery has previously been used to investigate and monitor the agricultural and forestry vegetation in Hainan. However, the monitoring accuracy is affected by the cloudy and rainy climate of this region, as well as the high level of land fragmentation. In this paper, we used PlanetScope imagery at a 3 m spatial resolution over the Hainan arecanut planting area to investigate the high-precision extraction of the arecanut planting distribution based on feature space optimization. First, spectral and textural feature variables were selected to form the initial feature space, followed by the implementation of the random forest algorithm to optimize the feature space. Arecanut planting area extraction models based on the support vector machine (SVM), BP neural network (BPNN), and random forest (RF) classification algorithms were then constructed. The overall classification accuracies of the SVM, BPNN, and RF models optimized by the RF features were determined as 74.82%, 83.67%, and 88.30%, with Kappa coefficients of 0.680, 0.795, and 0.853, respectively. The RF model with optimized features exhibited the highest overall classification accuracy and kappa coefficient. The overall accuracy of the SVM, BPNN, and RF models following feature optimization was improved by 3.90%, 7.77%, and 7.45%, respectively, compared with the corresponding unoptimized classification model. The kappa coefficient also improved. The results demonstrate the ability of PlanetScope satellite imagery to extract the planting distribution of arecanut. Furthermore, the RF is proven to effectively optimize the initial feature space, composed of spectral and textural feature variables, further improving the extraction accuracy of the arecanut planting distribution. This work can act as a theoretical and technical reference for the agricultural and forestry industries.


Sign in / Sign up

Export Citation Format

Share Document