The Impact of Attention Mechanisms on Speech Emotion Recognition

Speech emotion recognition (SER) plays an important role in real-time applications of human-machine interaction. The Attention Mechanism is widely used to improve the performance of SER. However, the applicable rules of attention mechanism are not deeply discussed. This paper discussed the difference between Global-Attention and Self-Attention and explored their applicable rules to SER classification construction. The experimental results show that the Global-Attention can improve the accuracy of the sequential model, while the Self-Attention can improve the accuracy of the parallel model when conducting the model with the CNN and the LSTM. With this knowledge, a classifier (CNN-LSTM×2+Global-Attention model) for SER is proposed. The experiments result show that it could achieve an accuracy of 85.427% on the EMO-DB dataset.

Download Full-text

Databases, Features and Classification Techniques for Speech Emotion Recognition

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f3487.049620 ◽

2020 ◽

Vol 9 (6) ◽

pp. 185-190

Keyword(s):

Emotion Recognition ◽

Research Area ◽

Research Field ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Human Machine Interaction ◽

Classification Techniques ◽

Classification Feature ◽

Physical Gestures ◽

Machine Interaction

Emotion recognition is a rapidly growing research field. Emotions can be effectively expressed through speech and can provide insight about speaker’s intentions. Although, humans can easily interpret emotions through speech, physical gestures, and eye movement but to train a machine to do the same with similar preciseness is quite a challenging task. SER systems can improve human-machine interaction when used with automatic speech recognition, as emotions have the tendency to change the semantics of a sentence. Many researchers have contributed their extremely impressive work in this research area, leading to development of numerous classification, feature selection, feature extraction and emotional speech databases. This paper reviews recent accomplishments in the area of speech emotion recognition. It also present a detailed review of various types of emotional speech databases, and different classification techniques which can be used individually or in combination and a brief description of various speech features for emotion recognition.

Download Full-text

Speech Emotion Recognition Using MLP Classifier

International Journal of Scientific Research in Science and Technology ◽

10.32628/cseit217446 ◽

2021 ◽

pp. 218-222

Author(s):

Nagaraja N Poojary ◽

Dr. Shivakumar G S ◽

Akshath Kumar B.H

Keyword(s):

Neural Network ◽

Computer Vision ◽

Social Interaction ◽

Emotion Recognition ◽

Basic Medium ◽

Speech Emotion Recognition ◽

Speech Sample ◽

Human Machine Interaction ◽

Mlp Classifier ◽

Machine Interaction

Language is human's most important communication and speech is basic medium of communication. Emotion plays a crucial role in social interaction. Recognizing the emotion in a speech is important as well as challenging because here we are dealing with human machine interaction. Emotion varies from person to person were same person have different emotions all together has different way express it. When a person express his emotion each will be having different energy, pitch and tone variation are grouped together considering upon different subject. Therefore the speech emotion recognition is a future goal of computer vision. The aim of our project is to develop the smart emotion recognition speech based on the convolutional neural network. Which uses different modules for emotion recognition and the classifier are used to differentiate emotion such as happy sad angry surprise. The machine will convert the human speech signals into waveform and process its routine at last it will display the emotion. The data is speech sample and the characteristics are extracted from the speech sample using librosa package. We are using RAVDESS dataset which are used as an experimental dataset. This study shows that for our dataset all classifiers achieve an accuracy of 68%.

Download Full-text

A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism

Electronics ◽

10.3390/electronics10101163 ◽

2021 ◽

Vol 10 (10) ◽

pp. 1163

Author(s):

Eva Lieskovská ◽

Maroš Jakubec ◽

Roman Jarina ◽

Michal Chmulík

Keyword(s):

Emotion Recognition ◽

User Satisfaction ◽

Deep Neural Networks ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Multimedia Content ◽

Human Interactions ◽

Benchmark Database ◽

The Impact ◽

Significant Factors

Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. speech emotion recognition (SER) modules also play an important role in the development of human–computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown as suitable tools for mining information that is unevenly time distributed in multimedia content. The attention mechanism has been recently incorporated in DNN architectures to emphasise also emotional salient information. This paper provides a review of the recent development in SER and also examines the impact of various attention mechanisms on SER performance. Overall comparison of the system accuracies is performed on a widely used IEMOCAP benchmark database.

Download Full-text

Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment

Sensors ◽

10.3390/s20082297 ◽

2020 ◽

Vol 20 (8) ◽

pp. 2297

Author(s):

Zhen-Tao Liu ◽

Bao-Han Wu ◽

Dan-Yun Li ◽

Peng Xiao ◽

Jun-Wei Mao

Keyword(s):

Emotion Recognition ◽

Feature Selection Method ◽

Sampling Technique ◽

Small Sample ◽

Speech Emotion Recognition ◽

Gradient Boosting ◽

Data Imbalance ◽

The Arts ◽

The Impact ◽

Sample Environment

Speech emotion recognition often encounters the problems of data imbalance and redundant features in different application scenarios. Researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small sample environment is proposed. A data imbalance processing method based on selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which can exclude the redundant features that possess poor emotional representation. Results of experiments of speech emotion recognition on three databases (i.e., CASIA, Emo-DB, SAVEE) show that our method obtains average recognition accuracy of 90.28% (CASIA), 75.00% (SAVEE) and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition which is superior to some state-of-the-arts works.

Download Full-text

Combining Gated Convolutional Networks and Self-Attention Mechanism for Speech Emotion Recognition

2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) ◽

10.1109/aciiw.2019.8925283 ◽

2019 ◽

Author(s):

Chao Li ◽

Jinlong Jiao ◽

Yiqin Zhao ◽

Ziping Zhao

Keyword(s):

Emotion Recognition ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Convolutional Networks

Download Full-text

Speech Emotion Recognition Based on Speech Segment Using LSTM with Attention Model

2019 IEEE International Conference on Signals and Systems (ICSigSys) ◽

10.1109/icsigsys.2019.8811080 ◽

2019 ◽

Author(s):

Bagus Tris Atmaja ◽

Masato Akagi

Keyword(s):

Emotion Recognition ◽

Speech Emotion Recognition ◽

Attention Model ◽

Speech Segment

Download Full-text

Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network

Sensors ◽

10.3390/s20216008 ◽

2020 ◽

Vol 20 (21) ◽

pp. 6008 ◽

Cited By ~ 1

Author(s):

Misbah Farooq ◽

Fawad Hussain ◽

Naveed Khan Baloch ◽

Fawad Riasat Raja ◽

Heejung Yu ◽

...

Keyword(s):

Neural Network ◽

Feature Selection ◽

Convolutional Neural Network ◽

Emotion Recognition ◽

Deep Convolutional Neural Network ◽

Speech Emotion Recognition ◽

Support Vector ◽

Emotional Speech ◽

Human Machine Interaction ◽

Speaker Independent

Speech emotion recognition (SER) plays a significant role in human–machine interaction. Emotion recognition from speech and its precise classification is a challenging task because a machine is unable to understand its context. For an accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotional classification from speech signals; however, they are not efficient enough to accurately depict the emotional states of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotional datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). Our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS, for speaker-dependent SER experiments. Moreover, our method yields the best results for speaker-independent SER with existing handcrafted features-based SER approaches.

Download Full-text

CNN-based Speech Emotion Recognition Model Applying Transfer Learning and Attention Mechanism

Journal of KIISE ◽

10.5626/jok.2020.47.7.665 ◽

2020 ◽

Vol 47 (7) ◽

pp. 665-673

Author(s):

Jung Hyun Lee ◽

Ui Nyoung Yoon ◽

Geun-Sik Jo

Keyword(s):

Emotion Recognition ◽

Transfer Learning ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Recognition Model

Download Full-text

Convolutional Recurrent Neural Networks Based Speech Emotion Recognition

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9321 ◽

2020 ◽

Vol 17 (8) ◽

pp. 3786-3789

Author(s):

P. Gayathri ◽

P. Gowri Priya ◽

L. Sravani ◽

Sandra Johnson ◽

Visanth Sampath

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Recurrent Neural Networks ◽

Machine Learning Techniques ◽

Speech Emotion Recognition ◽

Emotional Information ◽

Feature Representations ◽

Emotional Factors ◽

Learning Techniques ◽

The Impact

Recognition of emotions is the aspect of speech recognition that is gaining more attention and the need for it is growing enormously. Although there are methods to identify emotion using machine learning techniques, we assume in this paper that calculating deltas and delta-deltas for customized features not only preserves effective emotional information, but also that the impact of irrelevant emotional factors, leading to a reduction in misclassification. Furthermore, Speech Emotion Recognition (SER) often suffers from the silent frames and irrelevant emotional frames. Meanwhile, the process of attention has demonstrated exceptional performance in learning related feature representations for specific tasks. Inspired by this, propose a Convolutionary Recurrent Neural Networks (ACRNN) based on Attention to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Finally, experimental results show the feasibility of the proposed method and attain state-of-the-art performance in terms of unweighted average recall.

Download Full-text

An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition

International Journal of Recent Contributions from Engineering Science & IT (iJES) ◽

10.3991/ijes.v9i2.22983 ◽

2021 ◽

Vol 9 (2) ◽

pp. 87

Author(s):

Shreya Kumar ◽

Swarnalaxmi Thiruvenkadam

Keyword(s):

Feature Extraction ◽

Emotion Recognition ◽

Prediction Accuracy ◽

Speech Emotion Recognition ◽

German Language ◽

Spectral Contrast ◽

Recognition Systems ◽

The Impact ◽

Contrast Feature

Feature extraction is an integral part in speech emotion recognition. Some emotions become indistinguishable from others due to high resemblance in their features, which results in low prediction accuracy. This paper analyses the impact of spectral contrast feature in increasing the accuracy for such emotions. The RAVDESS dataset has been chosen for this study. The SAVEE dataset, CREMA-D dataset and JL corpus dataset were also used to test its performance over different English accents. In addition to that, EmoDB dataset has been used to study its performance in the German language. The use of spectral contrast feature has increased the prediction accuracy in speech emotion recognition systems to a good degree as it performs well in distinguishing emotions with significant differences in arousal levels, and it has been discussed in detail.<div> </div>

Download Full-text