Kernel Fusion of Audio and Visual Information for Emotion Recognition

Author(s): Yongjin Wang, Rui Zhang, Ling Guan, A. N. Venetsanopoulos

2020, Vol 12 (1), pp. 51-59
Author(s): A. A. Moskvin, A. G. Shishkin

Human emotions play a significant role in everyday life, and automatic emotion recognition has many applications in medicine, e-learning, monitoring, marketing, etc. In this paper, a method and neural network architecture for real-time human emotion recognition from audio-visual data are proposed. To classify one of seven emotions, deep neural networks, namely convolutional and recurrent neural networks, are used. Visual information is represented by a sequence of 16 frames of 96 × 96 pixels, and audio information by 140 features for each of a sequence of 37 temporal windows. An autoencoder was used to reduce the number of audio features. Audio information used in conjunction with visual information is shown to increase recognition accuracy by up to 12%. The developed system is not demanding of computing resources and is flexible with respect to the selection of parameters, reducing or increasing the number of emotion classes, and the ability to easily add, accumulate, and use information from other external devices to further improve classification accuracy.
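The abstract specifies only the input shapes and the overall CNN-plus-RNN design, so as an illustration, a minimal PyTorch-style sketch of such a pipeline is given below. The layer sizes, the choice of LSTMs, the autoencoder bottleneck width, and the late-concatenation fusion are assumptions made for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AudioVisualEmotionNet(nn.Module):
    """Illustrative CNN + RNN fusion model (assumed layout, not the authors' exact network)."""
    def __init__(self, n_classes=7, audio_feats=140, bottleneck=32):
        super().__init__()
        # Visual branch: a small CNN applied independently to each 96x96 grayscale frame
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 96 -> 48
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 48 -> 24
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),                 # -> 32 * 4 * 4 = 512
        )
        self.visual_rnn = nn.LSTM(512, 128, batch_first=True)
        # Audio branch: an autoencoder-style bottleneck compresses the 140 features per window
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_feats, 64), nn.ReLU(), nn.Linear(64, bottleneck))
        self.audio_rnn = nn.LSTM(bottleneck, 64, batch_first=True)
        # Late fusion: concatenate the final hidden states of both streams
        self.classifier = nn.Linear(128 + 64, n_classes)

    def forward(self, video, audio):
        # video: (batch, 16, 1, 96, 96); audio: (batch, 37, 140)
        b, t = video.shape[:2]
        frame_feats = self.frame_cnn(video.flatten(0, 1)).view(b, t, -1)
        _, (h_v, _) = self.visual_rnn(frame_feats)
        _, (h_a, _) = self.audio_rnn(self.audio_encoder(audio))
        fused = torch.cat([h_v[-1], h_a[-1]], dim=-1)
        return self.classifier(fused)

# Example forward pass with random tensors of the sizes quoted in the abstract
model = AudioVisualEmotionNet()
logits = model(torch.randn(2, 16, 1, 96, 96), torch.randn(2, 37, 140))
print(logits.shape)  # torch.Size([2, 7])
```

Under this assumed layout, the visual stream summarizes the 16 frames with a recurrent state, the audio stream does the same for the 37 windows of autoencoder-compressed features, and the two summaries are concatenated before the 7-way classifier.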


2021, Vol 25, pp. 233121652110453
Author(s): Minke J. de Boer, Tim Jürgens, Deniz Başkent, Frans W. Cornelissen

Since emotion recognition involves integration of the visual and auditory signals, it is likely that sensory impairments worsen emotion recognition. In emotion recognition, young adults can compensate for unimodal sensory degradations if the other modality is intact. However, most sensory impairments occur in the elderly population, and it is unknown whether older adults are similarly capable of compensating for signal degradations. As a step towards studying potential effects of real sensory impairments, this study examined how degraded signals affect emotion recognition in older adults with normal hearing and vision. The degradations were designed to approximate some aspects of sensory impairments. Besides emotion recognition accuracy, we recorded eye movements to capture perceptual strategies for emotion recognition. Overall, older adults were as good as younger adults at integrating auditory and visual information and at compensating for degraded signals. However, accuracy was lower overall for older adults, indicating that aging leads to a general decrease in emotion recognition. In addition to decreased accuracy, older adults showed smaller adaptations of perceptual strategies in response to video degradations. In conclusion, this study showed that emotion recognition declines with age, but that integration and compensation abilities are retained. In addition, we speculate that the reduced ability of older adults to adapt their perceptual strategies may be related to the increased time it takes them to direct their attention to scene aspects that are relatively far away from fixation.


2021, Vol 11 (17), pp. 7962
Author(s): Panagiotis Koromilas, Theodoros Giannakopoulos

This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods, based on the way they handle the inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not meaningfully model the temporal dimension in either the unimodal or the multimodal interactions; (ii) pseudo-temporal architectures (PTA), which also oversimplify the temporal dimension, but only in one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies. In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results on the reported methodologies. Finally, we conclude this work with an in-depth analysis of the future challenges related to validation procedures, representation learning and method robustness.
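To make the distinction between non-temporal and temporal fusion concrete, the following sketch (an assumed illustration, not taken from any of the surveyed methods) contrasts NTA-style pooling, which discards temporal structure before fusion, with a TA-style cross-modal attention step that lets one modality attend to the other across time:

```python
import torch
import torch.nn as nn

audio = torch.randn(2, 37, 64)   # (batch, time, features) per-window audio embeddings
video = torch.randn(2, 16, 64)   # (batch, time, features) per-frame visual embeddings

# NTA-style fusion: average over time first, so all temporal structure is discarded
nta_fused = torch.cat([audio.mean(dim=1), video.mean(dim=1)], dim=-1)  # (2, 128)

# TA-style fusion: cross-modal attention preserves per-timestep interactions
cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attended, _ = cross_attn(query=video, key=audio, value=audio)   # video attends to audio over time
ta_fused = torch.cat([video, attended], dim=-1).mean(dim=1)     # (2, 128) after temporal pooling
```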


Author(s): Manisha S*, H Saida Nafisa, Nandita Gopal, Roshni P Anand

The predominant communication channel for conveying relevant, high-impact information is the emotion embedded in our communication. Researchers have tried to exploit these emotions in recent years for human-robot interaction (HRI) and human-computer interaction (HCI). Emotion recognition through speech alone or through facial expression alone is termed single-mode emotion recognition. The accuracy of these single-mode approaches is improved in the proposed bimodal method by combining the speech and facial-expression modalities and recognizing emotions with a Convolutional Neural Network (CNN) model. The proposed bimodal emotion recognition system contains three major parts: processing of audio, processing of video, and fusion of data for detecting a person's emotion. The fusion of visual information with audio data obtained from two different channels enhances the emotion recognition rate by providing complementary data. The proposed method aims to classify 7 basic emotions (anger, disgust, fear, happy, neutral, sad, surprise) from an input video; audio and image frames are taken from the video input to predict the final emotion of a person. The dataset used is an audio-visual dataset uniquely suited for the study of multimodal emotion expression and perception: the RAVDESS dataset, which contains audio-visual, video-only, and audio-only subsets. For bimodal emotion detection, the audio-visual subset is used.
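The abstract does not detail the preprocessing of the two channels, but a typical pipeline for a RAVDESS clip might extract a resized face frame with OpenCV and a log-mel spectrogram with librosa before feeding the two CNN branches. The sketch below is an assumed illustration; the file paths, frame choice, image size, and spectrogram parameters are hypothetical and not taken from the paper.

```python
import cv2
import librosa
import numpy as np

# Hypothetical RAVDESS file locations; actual filenames encode actor, emotion, intensity, etc.
VIDEO_PATH = "RAVDESS/Video_Speech_Actor_01/01-01-03-01-01-01-01.mp4"
AUDIO_PATH = "RAVDESS/Audio_Speech_Actor_01/03-01-03-01-01-01-01.wav"

# Visual channel: grab the middle frame, convert to grayscale, resize (size chosen arbitrarily here)
cap = cv2.VideoCapture(VIDEO_PATH)
cap.set(cv2.CAP_PROP_POS_FRAMES, int(cap.get(cv2.CAP_PROP_FRAME_COUNT) // 2))
ok, frame = cap.read()
cap.release()
face = cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (96, 96))

# Audio channel: load the speech waveform and compute a log-mel spectrogram for the audio CNN branch
waveform, sr = librosa.load(AUDIO_PATH, sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(face.shape, log_mel.shape)  # e.g. (96, 96) and (64, number_of_audio_frames)
```

Each of the two inputs would then be fed to its own CNN branch, with the branch outputs fused before the final 7-way classification, as described in the abstract.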


Autism, 2019, Vol 24 (1), pp. 258-262
Author(s): Melissa H Black, Nigel TM Chen, Ottmar V Lipp, Sven Bölte, Sonya Girdler

While altered gaze behaviour during facial emotion recognition has been observed in autistic individuals, there remains marked inconsistency in findings, with the majority of previous research focused on the processing of basic emotional expressions. There is a need to examine whether atypical gaze during facial emotion recognition extends to more complex emotional expressions, which are experienced as part of everyday social functioning. The eye gaze of 20 autistic and 20 IQ-matched neurotypical adults was examined during a facial emotion recognition task involving complex, dynamic emotion displays. Autistic adults fixated longer on the mouth region when viewing complex emotions compared to neurotypical adults, indicating that altered prioritization of visual information may contribute to facial emotion recognition impairment. Results confirm the need for more ecologically valid stimuli for the elucidation of the mechanisms underlying facial emotion recognition difficulty in autistic individuals.

