scholarly journals A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset

2021 ◽  
Vol 12 (1) ◽  
pp. 327
Author(s):  
Cristina Luna-Jiménez ◽  
Ricardo Kleinlein ◽  
David Griol ◽  
Zoraida Callejas ◽  
Juan M. Montero ◽  
...  

Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we proposed an automatic emotion recognizer system that consisted of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the whole model by appending a multilayer perceptron on top of it, confirming that the training was more robust when it did not start from scratch and the previous knowledge of the network was similar to the task to adapt. Regarding the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance between employing static models against sequential models. Results showed that sequential models beat static models by a narrow difference. Error analysis reported that the visual systems could improve with a detector of high-emotional load frames, which opened a new line of research to discover new ways to learn from videos. Finally, combining these two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. Results demonstrated that these modalities carried relevant information to detect users’ emotional state and their combination allowed to improve the final system performance.

Sensors ◽  
2021 ◽  
Vol 21 (22) ◽  
pp. 7665
Author(s):  
Cristina Luna-Jiménez ◽  
David Griol ◽  
Zoraida Callejas ◽  
Ricardo Kleinlein ◽  
Juan M. Montero ◽  
...  

Emotion Recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as in healthcare or in road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and Fine-Tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that the training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism. The error analysis reported that the frame-based systems could present some problems when they were used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover new ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, from the combination of these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users’ emotional state and their combination enables improvement of system performance.


Author(s):  
Yiqin Yang ◽  
Zhe Wu ◽  
Qingyang Xu ◽  
Fabao Yan

Deep neural network (DNN) has many advantages. Autonomous driving has become a popular topic now. In this paper, an improved stack autoencoder based on the deep learning techniques is proposed to learn the driving characteristics of an autonomous car. These techniques realize the input data adjustment and solving diffusion gradient problem. A Raspberry Pi and a camera module are mounted on the top of the car. The camera module provides the images needed for training the DNN. There are two stages in the training. In the pre-training process, an improved autoencoder is trained by the unsupervised learning mechanism, and the characterization of the track is extracted. In the fine-tuning stage, the whole network is trained according to the labeled data, and then this model learns the driving characteristics better according to the samples. In the experimental stage, the car will predict the action of the car by the trained model in the autonomous mode. The experiment exhibits the effectiveness of the proposed model. Compared with the traditional neural network, the improved stack autoencoder has a better generalization ability and faster convergence speed.


2013 ◽  
Vol 61 (1) ◽  
pp. 7-15 ◽  
Author(s):  
Daniel Dittrich ◽  
Gregor Domes ◽  
Susi Loebel ◽  
Christoph Berger ◽  
Carsten Spitzer ◽  
...  

Die vorliegende Studie untersucht die Hypothese eines mit Alexithymie assoziierten Defizits beim Erkennen emotionaler Gesichtsaudrücke an einer klinischen Population. Darüber hinaus werden Hypothesen zur Bedeutung spezifischer Emotionsqualitäten sowie zu Gender-Unterschieden getestet. 68 ambulante und stationäre psychiatrische Patienten (44 Frauen und 24 Männer) wurden mit der Toronto-Alexithymie-Skala (TAS-20), der Montgomery-Åsberg Depression Scale (MADRS), der Symptom-Check-List (SCL-90-R) und der Emotional Expression Multimorph Task (EEMT) untersucht. Als Stimuli des Gesichtererkennungsparadigmas dienten Gesichtsausdrücke von Basisemotionen nach Ekman und Friesen, die zu Sequenzen mit sich graduell steigernder Ausdrucksstärke angeordnet waren. Mittels multipler Regressionsanalyse untersuchten wir die Assoziation von TAS-20 Punktzahl und facial emotion recognition (FER). Während sich für die Gesamtstichprobe und den männlichen Stichprobenteil kein signifikanter Zusammenhang zwischen TAS-20-Punktzahl und FER zeigte, sahen wir im weiblichen Stichprobenteil durch die TAS-20 Punktzahl eine signifikante Prädiktion der Gesamtfehlerzahl (β = .38, t = 2.055, p < 0.05) und den Fehlern im Erkennen der Emotionen Wut und Ekel (Wut: β = .40, t = 2.240, p < 0.05, Ekel: β = .41, t = 2.214, p < 0.05). Für wütende Gesichter betrug die Varianzaufklärung durch die TAS-20-Punktzahl 13.3 %, für angeekelte Gesichter 19.7 %. Kein Zusammenhang bestand zwischen der Zeit, nach der die Probanden die emotionalen Sequenzen stoppten, um ihre Bewertung abzugeben (Antwortlatenz) und Alexithymie. Die Ergebnisse der Arbeit unterstützen das Vorliegen eines mit Alexithymie assoziierten Defizits im Erkennen emotionaler Gesichtsausdrücke bei weiblchen Probanden in einer heterogenen, klinischen Stichprobe. Dieses Defizit könnte die Schwierigkeiten Hochalexithymer im Bereich sozialer Interaktionen zumindest teilweise begründen und so eine Prädisposition für psychische sowie psychosomatische Erkrankungen erklären.


2017 ◽  
Vol 32 (8) ◽  
pp. 698-709 ◽  
Author(s):  
Ryan Sutcliffe ◽  
Peter G. Rendell ◽  
Julie D. Henry ◽  
Phoebe E. Bailey ◽  
Ted Ruffman

2020 ◽  
Vol 35 (2) ◽  
pp. 295-315 ◽  
Author(s):  
Grace S. Hayes ◽  
Skye N. McLennan ◽  
Julie D. Henry ◽  
Louise H. Phillips ◽  
Gill Terrett ◽  
...  

Sign in / Sign up

Export Citation Format

Share Document