Multi-Modal Residual Perceptron Network for Audio–Video Emotion Recognition

Sensors ◽  
2021 ◽  
Vol 21 (16) ◽  
pp. 5452
Author(s):  
Xin Chang ◽  
Władysław Skarbek

Emotion recognition is an important research field for human–computer interaction. Audio–video emotion recognition is now attacked with deep neural network modeling tools. In published papers, as a rule, the authors show only cases of the superiority in multi-modality over audio-only or video-only modality. However, there are cases of superiority in uni-modality that can be found. In our research, we hypothesize that for fuzzy categories of emotional events, the within-modal and inter-modal noisy information represented indirectly in the parameters of the modeling neural network impedes better performance in the existing late fusion and end-to-end multi-modal network training strategies. To take advantage of and overcome the deficiencies in both solutions, we define a multi-modal residual perceptron network which performs end-to-end learning from multi-modal network branches, generalizing better multi-modal feature representation. For the proposed multi-modal residual perceptron network and the novel time augmentation for streaming digital movies, the state-of-the-art average recognition rate was improved to 91.4% for the Ryerson Audio–Visual Database of Emotional Speech and Song dataset and to 83.15% for the Crowd-Sourced Emotional Multi Modal Actors dataset. Moreover, the multi-modal residual perceptron network concept shows its potential for multi-modal applications dealing with signal sources not only of optical and acoustical types.

2011 ◽  
Vol 84-85 ◽  
pp. 373-377
Author(s):  
Wei Zhang Wang

Current well-cementing designs rely mostly on the designers’ experience and calculations, which cannot predict the engineering quality that results once the designs are applied. Moreover, some problems in the designs cannot be resolved before construction. Based on a detailed evaluation of every influential factor under the given construction and environmental conditions, this article presents a fuzzy neural network model for cementing, built with the 2nsoftEditor neural network modeling tools. Stable software systems that combine artificial neural networks with fuzzy logic rules are expected to improve the credibility of cementing quality prediction. Construction practice shows that predicting cementing quality with a fuzzy neural network system before cementing can greatly reduce cementing costs and improve the cementing success ratio.


Author(s):  
Xinge Zhu ◽  
Liang Li ◽  
Weigang Zhang ◽  
Tianrong Rao ◽  
Min Xu ◽  
...  

Visual emotion recognition aims to associate images with appropriate emotions. Visual stimuli at different levels, from low-level to high-level (such as color, texture, part, and object), can affect human emotion. However, most existing methods treat the different levels of features as independent entities, without an effective method for feature fusion. In this paper, we propose a unified CNN-RNN model that predicts emotion from the fused features of different levels by exploiting the dependencies among them. Our proposed architecture leverages a convolutional neural network (CNN) with multiple layers to extract different levels of features within a multi-task learning framework, in which two related loss functions are introduced to learn the feature representation. Considering the dependencies between the low-level and high-level features, a new bidirectional recurrent neural network (RNN) is proposed to integrate the features learned from different layers of the CNN model. Extensive experiments on both Internet images and art photo datasets demonstrate that our method outperforms the state-of-the-art methods with at least 7% performance improvement.


2016 ◽  
Vol 140 (4) ◽  
pp. 3116-3116
Author(s):  
Hitoshi Ito ◽  
Aiko Hagiwara ◽  
Manon Ichiki ◽  
Takeshi Mishima ◽  
Shoei Sato ◽  
...  

Author(s):  
Duowei Tang ◽  
Peter Kuppens ◽  
Luc Geurts ◽  
Toon van Waterschoot

Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. Therefore, in this work, we propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, while avoiding the inherent lack of parallelisability occurring with recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e. raw audio samples vs log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information.
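The link between dilation and receptive field mentioned in this abstract can be illustrated with a minimal sketch of a generic dilated causal convolution. This is not the authors' architecture: the kernel size of 2, the doubling dilation schedule, and the function names `receptive_field` and `dilated_causal_conv` are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    # Each layer extends the field by (kernel_size - 1) * dilation samples.
    return 1 + (kernel_size - 1) * sum(dilations)


def dilated_causal_conv(x: Sequence[float], weights: Sequence[float],
                        dilation: int) -> List[float]:
    """1-D causal convolution: out[t] uses only x[t], x[t-d], x[t-2d], ...

    weights[0] multiplies the current sample and weights[j] the sample
    j * dilation steps in the past. Samples before the start of the
    signal are treated as zeros (implicit left padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:  # never look into the future or before the signal
                acc += w * x[idx]
        out.append(acc)
    return out


# Hypothetical configuration: kernel size 2, dilation doubling in each layer.
dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
rf = receptive_field(2, dilations)       # 1 + (1 + 2 + ... + 512) = 1024
```

With this assumed schedule, ten layers already cover 1024 input samples, whereas ten undilated layers of the same kernel size would cover only 11. This shows in general terms how a dilated stack can reach a large receptive field at low cost.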


2021 ◽  
Vol 11 (11) ◽  
pp. 4782
Author(s):  
Huan-Chung Li ◽  
Telung Pan ◽  
Man-Hua Lee ◽  
Hung-Wen Chiu

In recent years, a wide range of research has continued to advance human speech and emotion recognition. As facial emotion recognition has gradually matured along with speech recognition, this study provides more accurate recognition of complex human emotional expression, moving speech emotion identification from subjective human interpretation toward automatic computer interpretation of the speaker’s emotional expression. The work focuses on medical care, where it can be used to understand the current feelings of physicians and patients during a visit and to improve medical treatment through the relationship between illness and interaction. The voice data are divided into one-second observation segments, and the first through thirteenth dimensions of the frequency cepstrum coefficients are used as feature vectors for speech emotion recognition. For each dimension, the maximum, minimum, average, median, and standard deviation are computed, giving 65 feature values in total for the construction of an artificial neural network. The results of the artificial neural network classification are compared with those of the sentiment recognition system developed by the hospital, and both sets of results are then analyzed together to understand the interaction between doctor and patient. Using this experimental module, the emotion recognition rate is 93.34%, and the accuracy of facial emotion recognition can reach 86.3%.
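The 65-value input described above (13 cepstrum dimensions times 5 statistics per one-second segment) can be sketched as follows. The function name `segment_features`, the ordering of the statistics within the vector, and the choice of the population standard deviation are assumptions; the abstract does not specify them, and the cepstrum coefficients themselves are assumed to be computed elsewhere.

```python
import statistics
from typing import List, Sequence


def segment_features(frames: Sequence[Sequence[float]]) -> List[float]:
    """Summarise one segment as 5 statistics per coefficient dimension.

    `frames` holds one row of cepstrum coefficients per analysis frame
    within the segment. For 13 coefficients the result has 13 * 5 = 65 values.
    """
    n_coeffs = len(frames[0])
    features: List[float] = []
    for k in range(n_coeffs):
        # Trajectory of the k-th coefficient across the segment's frames.
        track = [frame[k] for frame in frames]
        features.extend([
            max(track),
            min(track),
            statistics.fmean(track),
            statistics.median(track),
            statistics.pstdev(track),  # population std; an assumption
        ])
    return features


# Hypothetical segment: 4 frames, each with 13 made-up coefficient values.
example_frames = [[float(i + k) for k in range(13)] for i in range(4)]
example_vector = segment_features(example_frames)
assert len(example_vector) == 65
```

A vector of this shape would then serve as one input sample for the neural network classifier.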


2022 ◽  
Vol 12 ◽  
Author(s):  
Xiaofeng Lu

This exploration aims to study the emotion recognition of speech and graphic visualization of expressions of learners under the intelligent learning environment of the Internet. After comparing the performance of several neural network algorithms related to deep learning, an improved convolution neural network-Bi-directional Long Short-Term Memory (CNN-BiLSTM) algorithm is proposed, and a simulation experiment is conducted to verify the performance of this algorithm. The experimental results indicate that the Accuracy of CNN-BiLSTM algorithm reported here reaches 98.75%, which is at least 3.15% higher than that of other algorithms. Besides, the Recall is at least 7.13% higher than that of other algorithms, and the recognition rate is not less than 90%. Evidently, the improved CNN-BiLSTM algorithm can achieve good recognition results, and provide significant experimental reference for research on learners’ emotion recognition and graphic visualization of expressions in an intelligent learning environment.


Author(s):  
Hong Zhao ◽  
Lupeng Yue ◽  
Weijie Wang ◽  
Zeng Xiangyan

A speech signal is time-varying and is strongly affected by the individual speaker and the environment. To improve the end-to-end voiceprint recognition rate, the original speech signal needs some preprocessing. An end-to-end voiceprint recognition algorithm based on a convolutional neural network is proposed. In this algorithm, the convolution and down-sampling operations of the convolutional neural network are used to preprocess the speech signals. One-dimensional and two-dimensional convolution operations were established to extract Mel-frequency cepstral coefficient feature parameters from the preprocessed signals, and the classical universal background model was used to build the voiceprint recognition model. In this study, the principle of end-to-end voiceprint recognition was first analyzed, and the recognition process, the recognition features, and the Res-FD-CNN network structure were studied. Then the convolutional neural network recognition model was constructed, the data were preprocessed to form the convolutional layer in the frequency domain, and the algorithm was tested.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
JinGen Tang

This paper investigates the extraction of volleyball players’ skeleton information and provides a deep learning-based solution for recognizing the players’ actions, using a convolutional neural network-based approach. Because the Lie group skeleton is used to represent the features retrieved from the model, it has a large data dimension. The convolutional neural network is used for feature learning and classification in order to process the high-dimensional data, reduce the complexity of the recognition process, and speed up the calculation. In the feature extraction stage, the Lie group skeleton representation model extracts the geometric features of the skeleton information; in the feature representation stage, the geometric transformations (rotation and translation) between different limbs represent the volleyball players’ movements. The approach is evaluated on the Florence3D actions, MSR action pairs, and UTKinect action datasets. The average recognition rate of our method is 93.00%, which is higher than that of widely cited existing methods and reflects better accuracy and robustness.

