Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Frontiers in Physiology, 2021, Vol 12. DOI: 10.3389/fphys.2021.643202

Author(s): Hua Zhang, Ruoyun Gou, Jili Shang, Fangyao Shen, Yifan Wu, ...

Keyword(s): Neural Network, Emotion Recognition, Short Term Memory, Convolution Neural Network, Classification Model, Speech Emotion Recognition, Deep Convolution Neural Network, Long Short Term Memory, High Level, Better Than

Speech emotion recognition (SER) is a challenging task because of the affective variance between speakers. SER performance relies heavily on the features extracted from speech signals, and establishing an effective feature extraction and classification model remains difficult. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on the ImageNet dataset is applied to generate segment-level features. We stack the features of a sentence into utterance-level features. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that focuses on emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases obtain unweighted average recall (UAR) of 87.86% and 68.50%, respectively, which is better than most popular SER methods and demonstrates the effectiveness of our proposed method.
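
Two ideas in this abstract can be illustrated with a minimal sketch: an attention layer that summarizes frame-level features into one utterance-level vector, and the unweighted average recall (UAR) metric. The abstract does not specify the attention formulation, so the dot-product scoring and the `query` vector below are assumptions made for illustration, and the code uses plain Python lists rather than a deep learning framework.

```python
import math
from collections import defaultdict


def attention_pool(frames, query):
    """Collapse a sequence of feature vectors into one vector.

    frames: list of T feature vectors (e.g. BLSTM outputs), each of length D.
    query:  hypothetical learned vector of length D used to score each frame.
    """
    # Score each time step by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Softmax over time (shifted by the max score for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the frames is the utterance-level representation.
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        counts[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / counts[c] for c in counts) / len(counts)
```

Because UAR averages recall per class, a rare emotion class influences the score as much as a frequent one, which is why it is often preferred over plain accuracy on imbalanced emotion corpora.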

Audio-Textual Emotion Recognition Based on Improved Neural Networks

Mathematical Problems in Engineering, 2019, Vol 2019, pp. 1-9. DOI: 10.1155/2019/2593036. Cited By ~ 4

Author(s): Linqin Cai, Yaxin Hu, Jiangong Dong, Sitong Zhou

Keyword(s): Neural Network, Emotion Recognition, Short Term Memory, Recognition Accuracy, Recognition System, Speech Emotion Recognition, Short Term, Term Memory, Emotional Recognition, Long Short Term Memory

With the rapid development of social media, single-modal emotion recognition can hardly satisfy the demands of current emotion recognition systems. To optimize the performance of the emotion recognition system, a multimodal emotion recognition model from speech and text is proposed in this paper. Considering the complementarity between different modalities, a CNN (convolutional neural network) and an LSTM (long short-term memory) were combined in the form of binary channels to learn acoustic emotion features; meanwhile, a Bi-LSTM (bidirectional long short-term memory) network was used to capture the textual features. Furthermore, we applied a deep neural network to learn and classify the fusion features. The final emotional state was determined by the output of both speech and text emotion analysis. Finally, multimodal fusion experiments were carried out to validate the proposed model on the IEMOCAP database. Compared with the single-modal baselines, overall recognition accuracy increased by 6.70% for text and by 13.85% for speech emotion recognition. Experimental results show that the recognition accuracy of our multimodal model is higher than that of the single-modal models and that it outperforms other published multimodal models on the test datasets.
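
As a rough illustration of combining the outputs of a speech model and a text model, the sketch below averages their per-class probabilities. This is a simple stand-in, not the paper's method (the abstract describes a deep neural network trained on the fused features); the function name and the `speech_weight` parameter are hypothetical.

```python
def fuse_predictions(speech_probs, text_probs, speech_weight=0.5):
    """Combine two per-class probability lists by a weighted average.

    speech_weight is a hypothetical tuning knob; the text model receives
    (1 - speech_weight).
    """
    if len(speech_probs) != len(text_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    text_weight = 1.0 - speech_weight
    fused = [speech_weight * s + text_weight * t
             for s, t in zip(speech_probs, text_probs)]
    # The index of the highest fused probability is the predicted class.
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted
```

When the two modalities disagree, the fused prediction follows whichever model is more confident after weighting, which is one simple way the modalities can complement each other.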

Speech emotion recognition using convolutional long short-term memory neural network and support vector machines

2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017. DOI: 10.1109/apsipa.2017.8282315. Cited By ~ 1

Author(s): Nattapong Kurpukdee, Tomoki Koriyama, Takao Kobayashi, Sawit Kasuriya, Chai Wutiwiwatchai, ...

Keyword(s): Neural Network, Support Vector Machines, Emotion Recognition, Short Term Memory, Speech Emotion Recognition, Support Vector, Short Term, Term Memory, Vector Machines, Long Short Term Memory

Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Sensors, 2021, Vol 21 (5), pp. 1579. DOI: 10.3390/s21051579. Cited By ~ 1

Author(s): Kyoung Ju Noh, Chi Yoon Jeong, Jiyoun Lim, Seungeun Chung, Gague Kim, ...

Keyword(s): Emotion Recognition, Short Term Memory, Domain Adaptation, Classification Model, Speech Emotion Recognition, Target Domain, Model Generalization, Speech Database, Emotion Labels, Temporal Feature

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To deploy SER models in real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of SER models to an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a feature extractor transferred from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously from multiple losses according to the association of emotion labels in the discrete and dimensional models. To evaluate MPGLN SER on multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-language Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed F1-score improvements of 3.7% and 3.5%, respectively, when comparing MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
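
Learning from several losses at once can be sketched as a weighted sum of a loss on discrete emotion labels and a loss on dimensional labels. The abstract does not give the actual loss functions or how the paper groups them, so the cross-entropy and mean-squared-error terms, the `dim_weight` parameter, and all function names below are assumptions chosen only to show the general pattern.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the true discrete emotion class."""
    return -math.log(max(probs[target_index], 1e-12))


def mean_squared_error(predicted, target):
    """Average squared error over dimensional attributes (e.g. valence, arousal)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def combined_loss(class_probs, class_target, dim_pred, dim_target, dim_weight=1.0):
    """Weighted sum of a discrete-label loss and a dimensional-label loss."""
    return (cross_entropy(class_probs, class_target)
            + dim_weight * mean_squared_error(dim_pred, dim_target))
```

Minimizing the sum pushes the shared feature extractor to serve both label views at the same time; `dim_weight` controls how much the dimensional labels matter relative to the discrete ones.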

Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients

Multimedia Tools and Applications, 2021. DOI: 10.1007/s11042-020-10329-2

Author(s): Manju D. Pawar, Rajendra D. Kokate

Keyword(s): Neural Network, Emotion Recognition, Convolution Neural Network, Speech Emotion Recognition

Hybrid technique for heart diseases diagnosis based on convolution neural network and long short-term memory

Applications of Big Data in Healthcare, 2021, pp. 261-280. DOI: 10.1016/b978-0-12-820203-6.00009-6

Author(s): Abdelmegeid Amin Ali, Hassan Shaban Hassan, Eman M. Anwar, Ashish Khanna

Keyword(s): Neural Network, Short Term Memory, Heart Diseases, Convolution Neural Network, Hybrid Technique, Short Term, Term Memory, Long Short Term Memory

Comparison Performance of Long Short-Term Memory and Convolution Neural Network Variants on Online Learning Tweet Sentiment Analysis

2021, pp. 3-17. DOI: 10.1007/978-981-16-7334-4_1

Author(s): Muhammad Syamil Ali, Marina Yusoff

Keyword(s): Neural Network, Online Learning, Sentiment Analysis, Short Term Memory, Convolution Neural Network, Short Term, Term Memory, Long Short Term Memory

Chinese Text Classification Model Based on Deep Learning

Future Internet, 2018, Vol 10 (11), pp. 113. DOI: 10.3390/fi10110113. Cited By ~ 17

Author(s): Yue Li, Xutao Wang, Pengjian Xu

Keyword(s): Neural Network, Deep Learning, Language Processing, Chinese Text, Text Classification, Short Term Memory, Classification Model, Short Term, Term Memory, Long Short Term Memory

Text classification is important in natural language processing, as massive amounts of valuable text information need to be classified into different categories for further use. To classify text better, our paper builds a deep learning model that achieves better classification results on Chinese text than other researchers’ models. After comparing different methods, long short-term memory (LSTM) and convolutional neural network (CNN) methods were selected as the deep learning methods to classify Chinese text. LSTM is a special kind of recurrent neural network (RNN) that is capable of processing serialized information through its recurrent structure. By contrast, CNN has shown its ability to extract features from visual imagery. Therefore, two layers of LSTM and one layer of CNN were integrated into our new model, the BLSTM-C model (BLSTM stands for bi-directional long short-term memory, while C stands for CNN). The LSTM layers were responsible for obtaining a sequence output based on past and future contexts, which was then input to the convolutional layer for feature extraction. In our experiments, the proposed BLSTM-C model was evaluated in several ways. In the results, the model exhibited remarkable performance in text classification, especially on Chinese texts.
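
The step in which a recurrent layer's sequence output is passed to a convolutional layer can be pictured as sliding a filter along the time axis. The sketch below shows a single filter in plain Python; the filter width, the ReLU activation, and the function name are assumptions, since the abstract does not describe the convolutional layer in detail.

```python
def conv1d_over_time(sequence, kernel, bias=0.0):
    """Slide one convolution filter along the time axis ("valid" padding).

    sequence: list of T vectors of length D (e.g. per-token recurrent outputs).
    kernel:   list of K vectors of length D (a hypothetical filter of width K).
    Returns T - K + 1 activations, one per window position.
    """
    width = len(kernel)
    if width > len(sequence):
        raise ValueError("kernel is wider than the sequence")
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Multiply the window element-wise with the filter and sum everything.
        activation = bias + sum(
            x * w
            for vec, kvec in zip(window, kernel)
            for x, w in zip(vec, kvec)
        )
        outputs.append(max(0.0, activation))  # ReLU non-linearity (an assumption)
    return outputs
```

Each output value summarizes a short window of consecutive positions, so the filter acts as a local pattern detector on top of the context-aware recurrent outputs.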

Bimodal Emotion Recognition Model for Minnan Songs

Information, 2020, Vol 11 (3), pp. 145. DOI: 10.3390/info11030145. Cited By ~ 1

Author(s): Zhenglong Xiang, Xialei Dong, Yuanxiang Li, Fei Yu, Xing Xu, ...

Keyword(s): Neural Network, Emotion Recognition, Short Term Memory, Music Appreciation, Research Papers, Audio Features, Analysis Theory, Proposed Model, Song Lyrics, Long Short Term Memory

Most existing research papers study the emotion recognition of Minnan songs from the perspectives of music analysis theory and music appreciation. However, these investigations do not explore the possibility of automatic emotion recognition for Minnan songs. In this paper, we propose a model that consists of four main modules to classify the emotion of Minnan songs using bimodal data: song lyrics and audio. In the proposed model, an attention-based Long Short-Term Memory (LSTM) neural network is applied to extract lyrical features, and a Convolutional Neural Network (CNN) is used to extract the audio features from the spectrum. Then, the two kinds of extracted features are combined by multimodal compact bilinear pooling, and finally, the combined features are input to the classifying module to determine the song emotion. We designed three groups of experiments to investigate the classification performance of combinations of the four main modules, to compare the proposed model with current approaches, and to examine the influence of a few key parameters on emotion recognition performance. The results show that the proposed model performs better than all other experimental groups. The accuracy, precision, and recall of the proposed model exceed 0.80 with an appropriate combination of parameters.
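
Compact bilinear pooling is commonly described as approximating the outer product of two feature vectors by projecting each with a count sketch and then taking the circular convolution of the two sketches. The following is a minimal sketch of that general technique in plain Python, not the paper's implementation; the function names and the `out_dim` value are hypothetical, and the convolution is computed directly rather than with the FFT that practical implementations typically use.

```python
import random


def make_projection(in_dim, out_dim, rng):
    """Draw the fixed random bucket indices and +/-1 signs for one modality."""
    hashes = [rng.randrange(out_dim) for _ in range(in_dim)]
    signs = [rng.choice((-1, 1)) for _ in range(in_dim)]
    return hashes, signs


def count_sketch(vector, hashes, signs, out_dim):
    """Add each entry, with its random sign, to its hashed output bucket."""
    sketch = [0.0] * out_dim
    for value, bucket, sign in zip(vector, hashes, signs):
        sketch[bucket] += sign * value
    return sketch


def circular_convolution(a, b):
    """Direct O(d^2) circular convolution of two equal-length vectors."""
    d = len(a)
    return [sum(a[j] * b[(i - j) % d] for j in range(d)) for i in range(d)]


def compact_bilinear_pool(x, y, proj_x, proj_y, out_dim):
    """Approximate the outer product of x and y in out_dim dimensions."""
    hashes_x, signs_x = proj_x
    hashes_y, signs_y = proj_y
    sketch_x = count_sketch(x, hashes_x, signs_x, out_dim)
    sketch_y = count_sketch(y, hashes_y, signs_y, out_dim)
    return circular_convolution(sketch_x, sketch_y)


# Example: fuse a 3-dim lyric feature with a 2-dim audio feature into 4 dims.
rng = random.Random(0)
proj_lyrics = make_projection(3, 4, rng)
proj_audio = make_projection(2, 4, rng)
fused = compact_bilinear_pool([1.0, 2.0, 3.0], [0.5, -1.0],
                              proj_lyrics, proj_audio, 4)
```

The projections are drawn once and then kept fixed, so the same input pair always maps to the same fused vector. The result equals a count sketch of the full outer product, which captures pairwise interactions between the two modalities without materializing every product term.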

EEG-Based Emotion Recognition with Deep Convolution Neural Network

2019 IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS), 2019. DOI: 10.1109/ddcls.2019.8908880

Author(s): Hui-Min Shao, Jian-Guo Wang, Yu Wang, Yuan Yao, Junjiang Liu

Keyword(s): Neural Network, Emotion Recognition, Convolution Neural Network, Deep Convolution Neural Network

Quantification of Mental Workload Using a Cascaded Deep One-dimensional Convolution Neural Network and Bi-directional Long Short-Term Memory Model

2021. DOI: 10.36227/techrxiv.15066642

Author(s): Vipul Sharma, Mitul Kumar Ahirwal

Keyword(s): Neural Network, Deep Learning, Short Term Memory, Mental Workload, Binary Classification, Convolution Neural Network, Short Term, Term Memory, One Dimensional, Long Short Term Memory

In this paper, a new cascaded one-dimensional convolution neural network (1DCNN) and bidirectional long short-term memory (BLSTM) model has been developed for binary and ternary classification of mental workload (MWL). MWL assessment is important for increasing safety and efficiency in Brain-Computer Interface (BCI) systems and in professions where multi-tasking is required. Given the necessity of MWL assessment, a two-fold study is presented. First, binary classification is performed to classify MWL into Low and High classes. Second, ternary classification is applied to classify MWL into Low, Moderate, and High classes. The cascaded 1DCNN-BLSTM deep learning architecture has been developed and tested on the Simultaneous Task EEG Workload (STEW) dataset. Unlike recent research in MWL, no handcrafted feature extraction or engineering is performed; instead, end-to-end deep learning is applied to 14-channel EEG signals for classification. Accuracies exceeding those of previous state-of-the-art studies have been obtained: 96.77% for binary and 95.36% for ternary classification, with 7-fold cross-validation.
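
The 7-fold cross-validation protocol mentioned above can be sketched as follows. This is a generic illustration in plain Python, not the authors' evaluation code: the function names are hypothetical, and the folds are contiguous and unshuffled, whereas practical setups usually shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k=7):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so that fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validated_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single reported score."""
    return sum(fold_accuracies) / len(fold_accuracies)
```

Every sample is used for testing exactly once and for training in the remaining folds, so the averaged accuracy reflects performance across the whole dataset rather than a single split.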
