Speech Emotion Recognition using Time Distributed CNN and LSTM

2021, Vol 40, pp. 03006
Author(s): Beenaa Salian, Omkar Narvade, Rujuta Tambewagh, Smita Bharne

Speech carries several distinguishing characteristic features, making it a valuable source of information for audio analysis. Our aim is to develop an emotion recognition system based on these speech features that can accurately and efficiently recognize emotions from audio. In this article, we employ a hybrid neural network comprising four blocks of time-distributed convolutional layers followed by a Long Short-Term Memory layer. The audio samples for the speech dataset are assembled from the RAVDESS, TESS, and SAVEE audio datasets and are further augmented by injecting noise. Mel spectrograms computed from the audio samples are used to train the neural network. The system achieves a testing accuracy of about 89.26%.
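A minimal sketch of this kind of architecture is shown below, assuming the Mel spectrograms are split into fixed-size time chunks; the chunk count, layer widths, and number of emotion classes are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch: four time-distributed CNN blocks followed by an LSTM over chunked Mel spectrograms.
import tensorflow as tf
from tensorflow.keras import layers, models

N_CHUNKS, CHUNK_FRAMES, N_MELS = 8, 16, 128  # assumed input shaping

def conv_block(x):
    x = layers.TimeDistributed(layers.Conv2D(32, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.BatchNormalization())(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    return x

inputs = layers.Input(shape=(N_CHUNKS, CHUNK_FRAMES, N_MELS, 1))
x = inputs
for _ in range(4):                      # four time-distributed CNN blocks
    x = conv_block(x)
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.LSTM(64)(x)                  # single LSTM layer over the chunk sequence
outputs = layers.Dense(8, activation="softmax")(x)  # assumed 8 emotion classes

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```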

2019, Vol 2019, pp. 1-9
Author(s): Linqin Cai, Yaxin Hu, Jiangong Dong, Sitong Zhou

With the rapid development of social media, single-modal emotion recognition can hardly satisfy the demands of current emotion recognition systems. To optimize the performance of the emotion recognition system, a multimodal emotion recognition model combining speech and text is proposed in this paper. Considering the complementarity between different modalities, a CNN (convolutional neural network) and an LSTM (long short-term memory) network were combined in the form of binary channels to learn acoustic emotion features, while an effective Bi-LSTM (bidirectional long short-term memory) network was adopted to capture the textual features. Furthermore, a deep neural network was applied to learn and classify the fused features. The final emotional state was determined by the output of both the speech and text emotion analysis. Finally, multimodal fusion experiments were carried out to validate the proposed model on the IEMOCAP database. In comparison with the single-modal baselines, the overall recognition accuracy increased by 6.70% for text and by 13.85% for speech emotion recognition. Experimental results show that the recognition accuracy of the multimodal model is higher than that of the single-modal ones and outperforms other published multimodal models on the test datasets.
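A hedged sketch of such a speech-and-text late-fusion model is given below; the input shapes, vocabulary size, layer widths, and four-class output are illustrative assumptions, not the configuration reported in the paper.

```python
# Sketch: CNN+LSTM acoustic branch, Bi-LSTM text branch, fused and classified by a DNN.
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES, N_MFCC = 300, 40          # assumed acoustic input shape
MAX_WORDS, VOCAB = 50, 10000        # assumed text input shape

# Acoustic branch: 1-D CNN followed by LSTM
audio_in = layers.Input(shape=(N_FRAMES, N_MFCC))
a = layers.Conv1D(64, 5, activation="relu", padding="same")(audio_in)
a = layers.MaxPooling1D(2)(a)
a = layers.LSTM(64)(a)

# Text branch: embedding followed by Bi-LSTM
text_in = layers.Input(shape=(MAX_WORDS,))
t = layers.Embedding(VOCAB, 128)(text_in)
t = layers.Bidirectional(layers.LSTM(64))(t)

# Fusion: concatenate both feature vectors and classify with a DNN
x = layers.concatenate([a, t])
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(4, activation="softmax")(x)   # assumed 4 emotion classes

model = models.Model([audio_in, text_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```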


2020
Author(s): Bagus Tris Atmaja

◆ A speech emotion recognition system based on recurrent neural networks is developed using long short-term memory (LSTM) networks.
◆ Two acoustic feature sets are evaluated: a 31-feature set (3 time-domain features, 5 frequency-domain features, 13 MFCCs, 5 F0 features, and 5 harmonic features) and the eGeMAPS feature set (23 features).
◆ To evaluate performance, several metrics are used: mean squared error (MSE), mean absolute percentage error (MAPE), mean absolute error (MAE), and concordance correlation coefficient (CCC). Among these, CCC is the main focus as it is used by other researchers.
◆ The developed system uses multi-task learning to maximize arousal, valence, and dominance predictions at the same time using a CCC loss (1 - CCC), as shown in the sketch below. The results show that using LSTM networks improves the CCC score compared to the baseline dense system. The best CCC score is obtained on arousal, followed by dominance and valence.
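The (1 - CCC) objective can be sketched as follows; the three-column target layout (arousal, valence, dominance) and the use of TensorFlow are assumptions for illustration.

```python
# Sketch: multi-task CCC loss (1 - CCC) averaged over arousal, valence, and dominance.
import tensorflow as tf

def ccc(y_true, y_pred):
    """Concordance correlation coefficient for a batch of 1-D targets."""
    mean_true = tf.reduce_mean(y_true)
    mean_pred = tf.reduce_mean(y_pred)
    var_true = tf.math.reduce_variance(y_true)
    var_pred = tf.math.reduce_variance(y_pred)
    cov = tf.reduce_mean((y_true - mean_true) * (y_pred - mean_pred))
    return 2.0 * cov / (var_true + var_pred + tf.square(mean_true - mean_pred) + 1e-8)

def ccc_loss(y_true, y_pred):
    # Assumed layout: column 0 = arousal, 1 = valence, 2 = dominance.
    losses = [1.0 - ccc(y_true[:, i], y_pred[:, i]) for i in range(3)]
    return tf.add_n(losses) / 3.0
```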


2021, Vol 23 (12), pp. 212-223
Author(s): P Jothi Thilaga, S Kavipriya, K Vijayalakshmi, ...

Emotions are elementary for humans, impacting perception and everyday activities such as communication, learning, and decision-making. Speech Emotion Recognition (SER) systems aim to facilitate natural interaction with machines through direct voice interaction rather than traditional input devices, to understand verbal content and make it straightforward for human listeners to react. This SER system is primarily composed of two sections, the feature extraction phase and the feature classification phase. SER can be implemented on bots to communicate with humans in a non-lexical manner. The speech emotion recognition algorithm here is based on a Convolutional Neural Network (CNN) model, which uses various modules for emotion recognition and classifiers to differentiate emotions such as happiness, calm, anger, neutral state, sadness, and fear. The success of the classification depends on the extracted features. Finally, the emotion of a speech signal can be determined.
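The two-phase structure (feature extraction followed by CNN classification) could look roughly like the sketch below; the MFCC settings, input shape, and layer sizes are assumed for illustration and are not taken from the paper.

```python
# Sketch: feature extraction phase (MFCCs) and feature classification phase (2-D CNN).
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

def extract_features(path, n_mfcc=40, max_frames=200):
    """Feature extraction phase: fixed-size MFCC matrix per utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc = librosa.util.fix_length(mfcc, size=max_frames, axis=1)
    return mfcc[..., np.newaxis]            # shape: (n_mfcc, max_frames, 1)

# Feature classification phase: a small CNN over the MFCC "image"
emotions = ["happy", "calm", "angry", "neutral", "sad", "fearful"]
model = models.Sequential([
    layers.Input(shape=(40, 200, 1)),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(len(emotions), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```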


2021, Vol 11 (11), pp. 4782
Author(s): Huan-Chung Li, Telung Pan, Man-Hua Lee, Hung-Wen Chiu

In recent years, much research has continued to improve the environment for human speech and emotion recognition. As facial emotion recognition has gradually matured alongside speech recognition, this study provides more accurate recognition of complex human emotional performance, and speech emotion identification moves from human subjective interpretation to the use of computers that automatically interpret the speaker's emotional expression. The work is focused on medical care, where it can be used to understand the current feelings of physicians and patients during a visit and to improve medical treatment through the relationship between illness and interaction. By transforming the voice data into one observation segment per second, the first to the thirteenth dimensions of the frequency cepstrum coefficients are used as the speech emotion recognition eigenvalue vectors. The statistics computed over these vectors are the maximum, minimum, average, median, and standard deviation, giving 65 eigenvalues in total for the construction of an artificial neural network. The sentiment recognition system developed by the hospital is used as a comparison against the sentiment recognition results of the artificial neural network classification, and the foregoing results are then used for a comprehensive analysis to understand the interaction between the doctor and the patient. Using this experimental module, the emotion recognition rate is 93.34%, and the accuracy of the facial emotion recognition results reaches 86.3%.
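The 65-value eigenvector construction described above (13 cepstral coefficients per one-second segment, summarized by five statistics, 13 x 5 = 65) can be sketched as follows; the sampling rate and the use of librosa MFCCs as the cepstral coefficients are assumptions.

```python
# Sketch: one 65-dimensional feature vector per one-second observation segment.
import numpy as np
import librosa

def segment_features(y, sr):
    seg_len = sr                                     # one second of samples
    vectors = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)   # 13 x n_frames
        stats = [mfcc.max(axis=1), mfcc.min(axis=1), mfcc.mean(axis=1),
                 np.median(mfcc, axis=1), mfcc.std(axis=1)]
        vectors.append(np.concatenate(stats))        # 13 coefficients x 5 stats = 65
    return np.array(vectors)
```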


Author(s): Yedilkhan Amirgaliyev, Kuanyshbay Kuanyshbay, Aisultan Shoiynbek

This paper evaluates and compares the performance of three well-known optimization algorithms (Adagrad, Adam, Momentum) for faster training of the neural network in the CTC algorithm for speech recognition. For the CTC algorithm, a recurrent neural network has been used, specifically Long Short-Term Memory (LSTM), which is an effective and widely used model. The data were downloaded from the VCTK corpus of the University of Edinburgh. The results of the optimization algorithms have been evaluated by label error rate and CTC loss.
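A rough sketch of comparing the three optimizers on an LSTM model trained with CTC loss is shown below; the feature dimension, output alphabet size, and learning rates are illustrative assumptions.

```python
# Sketch: LSTM acoustic model with CTC loss, trained under three different optimizers.
import tensorflow as tf
from tensorflow.keras import layers, models

N_FEATS, N_CLASSES = 40, 29            # assumed feature size; characters + CTC blank

def build_model():
    return models.Sequential([
        layers.Input(shape=(None, N_FEATS)),         # variable-length utterances
        layers.LSTM(128, return_sequences=True),
        layers.Dense(N_CLASSES),                     # per-frame logits
    ])

def ctc_loss(labels, logits, label_len, logit_len):
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels=labels, logits=logits,
        label_length=label_len, logit_length=logit_len,
        logits_time_major=False, blank_index=-1))

optimizers = {
    "adagrad": tf.keras.optimizers.Adagrad(learning_rate=0.01),
    "adam": tf.keras.optimizers.Adam(learning_rate=0.001),
    "momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
}

# One training step per optimizer would then compute the CTC loss under
# tf.GradientTape and apply the gradients with the chosen optimizer.
```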


2021, Vol 11 (4), pp. 1890
Author(s): Sung-Woo Byun, Seok-Pil Lee

The goal of the human interface is to recognize the user's emotional state precisely. In speech emotion recognition research, the most important issue is the effective parallel use of the extraction of proper speech features and an appropriate classification engine. Well-defined speech databases are also needed to accurately recognize and analyze emotions from speech signals. In this work, we constructed a Korean emotional speech database for speech emotion analysis and proposed a feature combination that can improve emotion recognition performance using a recurrent neural network model. To investigate the acoustic features that can reflect distinct momentary changes in emotional expression, we extracted F0, Mel-frequency cepstrum coefficients, spectral features, harmonic features, and others. Statistical analysis was performed to select an optimal combination of acoustic features that affect the emotion in speech. We used a recurrent neural network model to classify emotions from speech. The results show the proposed system has more accurate performance than previous studies.
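A simplified sketch of extracting frame-level acoustic features (F0, MFCCs, a spectral descriptor) and feeding them to a recurrent classifier is given below; it does not reproduce the exact feature combination selected by the authors' statistical analysis, and the class count is assumed.

```python
# Sketch: frame-level acoustic features fed to a bidirectional recurrent classifier.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

def frame_features(y, sr=16000):
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)                  # F0 contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # cepstral features
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)       # spectral feature
    n = min(len(f0), mfcc.shape[1], centroid.shape[1])
    return np.vstack([f0[None, :n], mfcc[:, :n], centroid[:, :n]]).T  # (frames, 15)

model = models.Sequential([
    layers.Input(shape=(None, 15)),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(7, activation="softmax"),   # assumed 7 emotion categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```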


2021, Vol 12
Author(s): Hua Zhang, Ruoyun Gou, Jili Shang, Fangyao Shen, Yifan Wu, ...

Speech emotion recognition (SER) is a difficult and challenging task because of the affective variance between different speakers. The performance of SER is highly reliant on the features extracted from speech signals, and establishing an effective feature extraction and classification model is still a challenging task. In this paper, we propose a new method for SER based on a Deep Convolutional Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and dataset balancing. Secondly, we extract three-channel log Mel-spectrograms (static, delta, and delta-delta) as the DCNN input. Then the DCNN model, pre-trained on the ImageNet dataset, is applied to generate segment-level features, which are stacked into utterance-level features for each sentence. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases obtain unweighted average recalls (UAR) of 87.86% and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our proposed method.
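Building the three-channel (static, delta, delta-delta) log Mel-spectrogram input can be sketched as follows; the sampling rate, Mel-band count, and default window settings are assumptions.

```python
# Sketch: three-channel log Mel-spectrogram (static, delta, delta-delta) as DCNN input.
import numpy as np
import librosa

def three_channel_logmel(y, sr=16000, n_mels=64):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                   # static channel
    delta = librosa.feature.delta(log_mel)               # first derivative
    delta2 = librosa.feature.delta(log_mel, order=2)     # second derivative
    return np.stack([log_mel, delta, delta2], axis=-1)   # (n_mels, frames, 3)
```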


2019, Vol 10 (1), pp. 205
Author(s): Chunjun Zheng, Chunli Wang, Ning Jia

Speech emotion recognition is a challenging and widely examined research topic in the field of speech processing. The accuracy of existing models in speech emotion recognition tasks is not high, and their generalization ability is not strong. Since the feature set and model design directly affect the accuracy of speech emotion recognition, research on features and models is important. Because emotional expression is often correlated with the global features, local features, and model design of speech, it is often difficult to find a universal solution for effective speech emotion recognition. Based on this, the main purpose of this paper is to generate general emotion features from speech signals from different angles and to use an ensemble learning model to perform the emotion recognition task. The work is divided into the following aspects: (1) Three expert roles for speech emotion recognition are designed. Expert 1 focuses on three-dimensional feature extraction of local signals; expert 2 focuses on extraction of comprehensive information in local data; and expert 3 emphasizes global features: acoustic feature descriptors (low-level descriptors (LLDs)), high-level statistics functionals (HSFs), and local features and their timing relationships. A single-/multiple-level deep learning model that matches each expert's characteristics is designed, including a convolutional neural network (CNN), a bi-directional long short-term memory (BLSTM) network, and a gated recurrent unit (GRU). A convolutional recurrent neural network (CRNN) combined with an attention mechanism is used for the internal training of the experts. (2) By designing an ensemble learning model, each expert can play to its own advantages and evaluate speech emotions from different focuses. (3) Through experiments, the performance of the various experts and ensemble learning models in emotion recognition is compared on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and the validity of the proposed model is verified.
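The ensemble step can be sketched as weighted soft voting over the experts' class probabilities; the expert models, their feature views, and the weighting scheme are illustrative assumptions.

```python
# Sketch: combining per-expert class probabilities with weighted soft voting.
import numpy as np

def ensemble_predict(experts, feature_views, weights=None):
    """experts: list of trained models whose predict(x) returns class-probability arrays.
    feature_views: one feature matrix per expert (e.g. local 3-D, local context, global HSF)."""
    weights = weights if weights is not None else [1.0 / len(experts)] * len(experts)
    probs = [w * m.predict(x) for m, x, w in zip(experts, feature_views, weights)]
    return np.argmax(np.sum(probs, axis=0), axis=-1)   # weighted soft voting
```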

