Gender Classification for Emotional Speech using GMFCC and Deep LSTM

We have come to the point that one of the important aspects of the process speech emotion recognition is the gender classification. The correct classification of gender will improve the performance of Speech Emotion Recognition (SER) system towards its robustness. Here, we are specifically referring to Gammatone Mel Frequency Cepstral Coefficient (GMFCC) as a feature extraction method that extracts features from IITKGPSESHC dataset, which is very crucial in deciding either male or female in gender classification. The well known classifier “Deep Long Short Term Memory (Deep LSTM)” is itself an important kind of Recurrent Neural Network (RNN) that handles the longrange dependencies more efficiently than the RNNs. The GMFCC feature has to pass through the Deep LSTM and get average percent gender identification accuracy of 98.3%.

Download Full-text

A EEG-based emotion recognition model with rhythm and time characteristics

Brain Informatics ◽

10.1186/s40708-019-0100-y ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 7

Author(s):

Jianzhuo Yan ◽

Shangbin Chen ◽

Sinuo Deng

Keyword(s):

Emotion Recognition ◽

Time Scales ◽

Time Scale ◽

Short Term Memory ◽

Recognition Model ◽

Memory Network ◽

Long Short Term Memory ◽

Valence And Arousal ◽

Memory Characteristics

Abstract As an advanced function of the human brain, emotion has a significant influence on human studies, works, and other aspects of life. Artificial Intelligence has played an important role in recognizing human emotion correctly. EEG-based emotion recognition (ER), one application of Brain Computer Interface (BCI), is becoming more popular in recent years. However, due to the ambiguity of human emotions and the complexity of EEG signals, the EEG-ER system which can recognize emotions with high accuracy is not easy to achieve. Based on the time scale, this paper chooses the recurrent neural network as the breakthrough point of the screening model. According to the rhythmic characteristics and temporal memory characteristics of EEG, this research proposes a Rhythmic Time EEG Emotion Recognition Model (RT-ERM) based on the valence and arousal of Long–Short-Term Memory Network (LSTM). By applying this model, the classification results of different rhythms and time scales are different. The optimal rhythm and time scale of the RT-ERM model are obtained through the results of the classification accuracy of different rhythms and different time scales. Then, the classification of emotional EEG is carried out by the best time scales corresponding to different rhythms. Finally, by comparing with other existing emotional EEG classification methods, it is found that the rhythm and time scale of the model can contribute to the accuracy of RT-ERM.

Download Full-text

Audio-Textual Emotion Recognition Based on Improved Neural Networks

Mathematical Problems in Engineering ◽

10.1155/2019/2593036 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 4

Author(s):

Linqin Cai ◽

Yaxin Hu ◽

Jiangong Dong ◽

Sitong Zhou

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Short Term Memory ◽

Recognition Accuracy ◽

Recognition System ◽

Speech Emotion Recognition ◽

Short Term ◽

Term Memory ◽

Emotional Recognition ◽

Long Short Term Memory

With the rapid development in social media, single-modal emotion recognition is hard to satisfy the demands of the current emotional recognition system. Aiming to optimize the performance of the emotional recognition system, a multimodal emotion recognition model from speech and text was proposed in this paper. Considering the complementarity between different modes, CNN (convolutional neural network) and LSTM (long short-term memory) were combined in a form of binary channels to learn acoustic emotion features; meanwhile, an effective Bi-LSTM (bidirectional long short-term memory) network was resorted to capture the textual features. Furthermore, we applied a deep neural network to learn and classify the fusion features. The final emotional state was determined by the output of both speech and text emotion analysis. Finally, the multimodal fusion experiments were carried out to validate the proposed model on the IEMOCAP database. In comparison with the single modal, the overall recognition accuracy of text increased 6.70%, and that of speech emotion recognition soared 13.85%. Experimental results show that the recognition accuracy of our multimodal is higher than that of the single modal and outperforms other published multimodal models on the test datasets.

Download Full-text

Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database

Electronics ◽

10.3390/electronics9050713 ◽

2020 ◽

Vol 9 (5) ◽

pp. 713 ◽

Cited By ~ 3

Author(s):

Yeonguk Yu ◽

Yoon-Joong Kim

Keyword(s):

Emotion Recognition ◽

Motion Capture ◽

Short Term Memory ◽

Speech Emotion Recognition ◽

Main Study ◽

Short Term ◽

Attention Model ◽

Proposed Model ◽

Long Short Term Memory ◽

Spectrogram Feature

We propose a speech-emotion recognition (SER) model with an “attention-long Long Short-Term Memory (LSTM)-attention” component to combine IS09, a commonly used feature for SER, and mel spectrogram, and we analyze the reliability problem of the interactive emotional dyadic motion capture (IEMOCAP) database. The attention mechanism of the model focuses on emotion-related elements of the IS09 and mel spectrogram feature and the emotion-related duration from the time of the feature. Thus, the model extracts emotion information from a given speech signal. The proposed model for the baseline study achieved a weighted accuracy (WA) of 68% for the improvised dataset of IEMOCAP. However, the WA of the proposed model of the main study and modified models could not achieve more than 68% in the improvised dataset. This is because of the reliability limit of the IEMOCAP dataset. A more reliable dataset is required for a more accurate evaluation of the model’s performance. Therefore, in this study, we reconstructed a more reliable dataset based on the labeling results provided by IEMOCAP. The experimental results of the model for the more reliable dataset confirmed a WA of 73%.

Download Full-text

Attention-based convolution skip bidirectional long short-term memory network for speech emotion recognition

IEEE Access ◽

10.1109/access.2020.3047395 ◽

2020 ◽

pp. 1-1

Author(s):

Huiyun Zhang ◽

Heming Huang ◽

Henry Han

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Speech Emotion Recognition ◽

Short Term ◽

Term Memory ◽

Memory Network ◽

Long Short Term Memory

Download Full-text

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Frontiers in Physiology ◽

10.3389/fphys.2021.643202 ◽

2021 ◽

Vol 12 ◽

Author(s):

Hua Zhang ◽

Ruoyun Gou ◽

Jili Shang ◽

Fangyao Shen ◽

Yifan Wu ◽

...

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Short Term Memory ◽

Convolution Neural Network ◽

Classification Model ◽

Speech Emotion Recognition ◽

Deep Convolution Neural Network ◽

Long Short Term Memory ◽

High Level ◽

Better Than

Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performances of SER are extremely reliant on the extracted features from speech signals. To establish an effective features extracting and classification model is still a challenging task. In this paper, we propose a new method for SER based on Deep Convolution Neural Network (DCNN) and Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and datasets balancing. Secondly, we extract three-channel of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on EMO-DB and IEMOCAP database obtain the unweighted average recall (UAR) of 87.86 and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our propose method.

Download Full-text

Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks

International Journal of Informatics, Information System and Computer Engineering (INJIISCOM) ◽

10.34010/injiiscom.v1i1.4023 ◽

2020 ◽

Vol 1 (1) ◽

Author(s):

Bagus Tris Atmaja ◽

◽

Reda Elbarougy ◽

Masato Akagi ◽

◽

...

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Initial Result ◽

Speech Emotion Recognition ◽

Acoustic Features ◽

Dense Networks ◽

Recognition Result ◽

Text Features ◽

Long Short Term Memory ◽

Single Modality

Emotion can be inferred from tonal and verbal information, where both features can be extracted from speech. While most researchers conducted studies on categorical emotion recognition from a single modality, this research presents a dimensional emotion recognition combining acoustic and text features. A number of 31 acoustic features are extracted from speech, while word vector is used as text features. The initial result on single modality emotion recognition can be used as a cue to combine both features with improving the recognition result. The latter result shows that a combination of acoustic and text features decreases the error of dimensional emotion score prediction by about 5% from the acoustic system and 1% from the text system. This smallest error is achieved by combining the text system with Long Short-Term Memory (LSTM) networks and acoustic systems with bidirectional LSTM networks and concatenated both systems with dense networks

Download Full-text

Speech Emotion Recognition for Indonesian Language Using Long Short-Term Memory

2018 International Conference on Computer, Control, Informatics and its Applications (IC3INA) ◽

10.1109/ic3ina.2018.8629525 ◽

2018 ◽

Cited By ~ 2

Author(s):

Jeremia Jason Lasiman ◽

Dessi Puji Lestari

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Speech Emotion Recognition ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory

Download Full-text

Speech emotion recognition using convolutional long short-term memory neural network and support vector machines

2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) ◽

10.1109/apsipa.2017.8282315 ◽

2017 ◽

Cited By ~ 1

Author(s):

Nattapong Kurpukdee ◽

Tomoki Koriyama ◽

Takao Kobayashi ◽

Sawit Kasuriya ◽

Chai Wutiwiwatchai ◽

...

Keyword(s):

Neural Network ◽

Support Vector Machines ◽

Emotion Recognition ◽

Short Term Memory ◽

Speech Emotion Recognition ◽

Support Vector ◽

Short Term ◽

Term Memory ◽

Vector Machines ◽

Long Short Term Memory

Download Full-text

Speech Gender Classification Using Bidirectional Long Short Term Memory

2020 3rd International Seminar on Research of Information Technology and Intelligent Systems (ISRITI) ◽

10.1109/isriti51436.2020.9315380 ◽

2020 ◽

Author(s):

Rangga Dwi Alamsyah ◽

Suyanto Suyanto

Keyword(s):

Short Term Memory ◽

Gender Classification ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory

Download Full-text

Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Sensors ◽

10.3390/s21051579 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1579 ◽

Cited By ~ 1

Author(s):

Kyoung Ju Noh ◽

Chi Yoon Jeong ◽

Jiyoun Lim ◽

Seungeun Chung ◽

Gague Kim ◽

...

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Domain Adaptation ◽

Classification Model ◽

Speech Emotion Recognition ◽

Target Domain ◽

Model Generalization ◽

Speech Database ◽

Emotion Labels ◽

Temporal Feature

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.

Download Full-text