When Old Meets New: Emotion Recognition from Speech Signals

Author(s):  
Keith April Araño ◽  
Peter Gloor ◽  
Carlotta Orsenigo ◽  
Carlo Vercellis

Abstract: Speech is one of the most natural communication channels for expressing human emotions. Therefore, speech emotion recognition (SER) has been an active area of research with an extensive range of applications that can be found in several domains, such as biomedical diagnostics in healthcare and human–machine interactions. Recent works in SER have been focused on end-to-end deep neural networks (DNNs). However, the scarcity of emotion-labeled speech datasets inhibits the full potential of training a deep network from scratch. In this paper, we propose new approaches for classifying emotions from speech by combining conventional mel-frequency cepstral coefficients (MFCCs) with image features extracted from spectrograms by a pretrained convolutional neural network (CNN). Unlike prior studies that employ end-to-end DNNs, our methods eliminate the resource-intensive network training process. By using the best prediction model obtained, we also build an SER application that predicts emotions in real time. Among the proposed methods, the hybrid feature set fed into a support vector machine (SVM) achieves an accuracy of 0.713 in a 6-class prediction problem evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, which is higher than the previously published results. Interestingly, MFCCs taken as the sole input into a long short-term memory (LSTM) network achieve a slightly higher accuracy of 0.735. Our results reveal that the proposed approaches lead to an improvement in prediction accuracy. The empirical findings also demonstrate the effectiveness of using a pretrained CNN as an automatic feature extractor for the task of emotion prediction. Moreover, the success of the MFCC-LSTM model is evidence that, despite being conventional features, MFCCs can still outperform more sophisticated deep-learning feature sets.
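
For concreteness, below is a minimal sketch of the hybrid-feature pipeline this abstract describes: utterance-level MFCC statistics concatenated with embeddings from a pretrained image CNN applied to the mel spectrogram, then fed to an SVM. The choice of ResNet-18, the mean/std pooling, and the preprocessing are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the hybrid-feature idea, assuming librosa, PyTorch, and
# scikit-learn; ResNet-18 is an illustrative stand-in for the pretrained CNN.
import numpy as np
import librosa
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=40):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pool the frame sequence into a fixed-length utterance vector.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()  # drop the classifier head; keep the 512-d embedding
cnn.eval()

def spectrogram_features(path):
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    mel = (mel - mel.min()) / (mel.max() - mel.min() + 1e-8)  # scale to [0, 1]
    x = torch.tensor(mel, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)
    x = T.Resize((224, 224), antialias=True)(x).unsqueeze(0)  # pseudo-RGB batch
    with torch.no_grad():
        return cnn(x).squeeze(0).numpy()

# paths and labels are placeholders for a labeled corpus such as RAVDESS.
# X = np.stack([np.concatenate([mfcc_features(p), spectrogram_features(p)])
#               for p in paths])
# clf = SVC(kernel="rbf").fit(X, labels)
```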

2020 ◽  
Vol 9 (3) ◽  
pp. 770 ◽  
Author(s):  
Mai Ezz-Eldin ◽  
Hesham F. A. Hamed ◽  
Ashraf A. M. Khalaf

Recently, recognizing the emotional content of speech signals has received considerable research attention. Consequently, systems have been developed to recognize the emotional content of a spoken utterance. Achieving high accuracy in speech emotion recognition remains a challenging problem due to issues related to feature extraction, type, and size. This study centers on increasing emotion recognition accuracy by porting the bag-of-words (BoW) technique from image to speech for feature processing and clustering. The BoW technique is applied to features extracted from Mel frequency cepstral coefficients (MFCC), which enhances feature quality. The study deploys different classification approaches to examine the performance of the embedded BoW approach. The deployed classifiers include support vector machine (SVM), K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), and extreme gradient boosting (XGBoost). The experiments used the standard RAVDESS audio dataset with eight emotions: angry, calm, happy, surprised, sad, disgusted, fearful, and neutral. The maximum accuracy obtained in the angry class using SVM was 85%, while the overall accuracy was 80.1%. The empirical results show that using BoW achieves better accuracy and processing time than other available methods.
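
The BoW-on-speech step described here follows the classic visual-vocabulary recipe: quantize MFCC frames against a learned codebook and represent each utterance as a histogram of codeword counts. Below is a minimal sketch of that idea; the codebook size and classifier settings are assumptions for illustration, not the study's exact setup.

```python
# Bag-of-words over MFCC frames: k-means codebook + per-utterance histograms,
# classified here with an SVM (one of the five classifiers the study deploys).
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def mfcc_frames(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def fit_codebook(train_paths, k=64):
    # Learn the "visual words" (here, acoustic words) from all training frames.
    all_frames = np.vstack([mfcc_frames(p) for p in train_paths])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_frames)

def bow_vector(path, codebook):
    words = codebook.predict(mfcc_frames(path))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalize away utterance length

# train_paths and labels stand in for the RAVDESS file list used in the study.
# codebook = fit_codebook(train_paths)
# X = np.stack([bow_vector(p, codebook) for p in train_paths])
# clf = SVC(kernel="rbf").fit(X, labels)
```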


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1579 ◽  
Author(s):  
Kyoung Ju Noh ◽  
Chi Yoon Jeong ◽  
Jiyoun Lim ◽  
Seungeun Chung ◽  
Gague Kim ◽  
...  

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a feature extractor transferred from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluations of multi-domain adaptation and domain generalization showed improvements of 3.7% and 3.5%, respectively, in the F1 score when comparing the performance of the MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
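
The multi-path, multi-loss structure can be pictured as two feature paths fused before two label-specific heads. The PyTorch sketch below is schematic only: the dimensions, the fusion layer, and the loss weighting are assumptions, and the VGGish embedding is simulated with a placeholder tensor rather than the real pretrained model.

```python
# Schematic two-path SER model: a BiLSTM temporal path plus a transferred
# embedding path (e.g., from VGGish), trained jointly on a discrete-emotion
# loss and a dimensional (arousal/valence) loss.
import torch
import torch.nn as nn

class TwoPathSER(nn.Module):
    def __init__(self, n_feats=40, emb_dim=128, n_classes=4):
        super().__init__()
        self.bilstm = nn.LSTM(n_feats, 64, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(2 * 64 + emb_dim, 128)
        self.discrete_head = nn.Linear(128, n_classes)  # categorical emotions
        self.dimensional_head = nn.Linear(128, 2)       # arousal/valence

    def forward(self, frames, emb):
        _, (h, _) = self.bilstm(frames)            # h: (2, batch, 64)
        temporal = torch.cat([h[0], h[1]], dim=1)  # forward + backward states
        z = torch.relu(self.fuse(torch.cat([temporal, emb], dim=1)))
        return self.discrete_head(z), self.dimensional_head(z)

model = TwoPathSER()
frames = torch.randn(8, 100, 40)  # batch of 100-frame acoustic feature sequences
emb = torch.randn(8, 128)         # stand-in for pretrained VGGish embeddings
logits, dims = model(frames, emb)
# Joint objective over both label models, mirroring the multi-loss training:
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,))) \
     + nn.MSELoss()(dims, torch.randn(8, 2))
loss.backward()
```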


2013 ◽  
Vol 25 (12) ◽  
pp. 3294-3317 ◽  
Author(s):  
Lijiang Chen ◽  
Xia Mao ◽  
Pengfei Wei ◽  
Angelo Compare

This study proposes two classes of speech emotional features extracted from electroglottography (EGG) and speech signals. The power-law distribution coefficients (PLDC) of voiced-segment duration, pitch rise duration, and pitch down duration are obtained to reflect the information of vocal fold excitation. The real discrete cosine transform coefficients of the normalized spectrum of the EGG and speech signals are calculated to reflect the information of vocal tract modulation. Two experiments are carried out. One compares the proposed features with traditional features using sequential forward floating search and sequential backward floating search. The other performs comparative emotion recognition based on a support vector machine. The results show that the proposed features outperform commonly used features for speaker-independent and content-independent speech emotion recognition.
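
As a rough illustration of the two feature classes, the sketch below fits a power-law exponent to voiced-segment durations (a PLDC-style excitation feature) and takes the leading real-DCT coefficients of the normalized magnitude spectrum (a vocal-tract feature). The voicing segmentation, the fit procedure, and the number of coefficients kept are assumptions, not the paper's exact method.

```python
# PLDC-style exponent from voiced-run durations plus real-DCT spectrum features.
import numpy as np
import librosa
from scipy.fft import dct

def voiced_durations(y, sr):
    # pyin gives a per-frame voicing flag; collect lengths of voiced runs.
    _, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    runs, n = [], 0
    for v in voiced:
        if v:
            n += 1
        elif n:
            runs.append(n); n = 0
    if n:
        runs.append(n)
    return np.array(runs, dtype=float)

def power_law_coeff(durations):
    # Slope of log(count) vs log(duration); a power law appears as a line.
    vals, counts = np.unique(durations, return_counts=True)
    if len(vals) < 2:
        return 0.0  # not enough distinct run lengths to fit
    slope, _ = np.polyfit(np.log(vals), np.log(counts.astype(float)), 1)
    return slope

def spectrum_dct(y, n_coeffs=20):
    spec = np.abs(np.fft.rfft(y))
    spec /= spec.sum() + 1e-12  # normalize the spectrum before the transform
    return dct(spec, type=2, norm="ortho")[:n_coeffs]

y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)  # placeholder audio
feats = np.concatenate([[power_law_coeff(voiced_durations(y, sr))],
                        spectrum_dct(y)])
```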


Author(s):  
Jeena Augustine

Abstract: Emotion recognition from speech is one of the most important subdomains in the field of signal processing. In this work, our system is a two-stage approach, namely feature extraction and a classification engine. First, two feature sets are investigated: thirty-nine Mel-frequency cepstral coefficients (MFCC) and sixty-five MFCC features extracted based on the work of [20]. Second, we use the Support Vector Machine (SVM) as the main classifier engine, since it is the most common technique in the field of speech recognition. Besides that, we investigate the importance of recent advances in machine learning, including deep kernel learning, as well as various types of auto-encoders (the basic auto-encoder and the stacked auto-encoder). A large set of experiments is conducted on the SAVEE audio database. The experimental results show that the DSVM technique outperforms the standard SVM, with classification rates of 69.84% and 68.25%, respectively, using thirty-nine MFCC features. In addition, the auto-encoder technique outperforms the standard SVM, yielding a classification rate of 73.01%. Keywords: Emotion recognition, MFCC, SVM, Deep Support Vector Machine, Basic auto-encoder, Stacked auto-encoder
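
The auto-encoder stage here learns a compact code from the MFCC vectors before classification. Below is a minimal sketch of that pipeline under stated assumptions: layer sizes, training length, and the SVM settings are illustrative, and the random tensor stands in for real 39-dimensional MFCC features.

```python
# Basic auto-encoder over 39-d MFCC vectors, with an SVM on the learned codes.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class BasicAutoencoder(nn.Module):
    def __init__(self, d_in=39, d_code=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_code))
        self.dec = nn.Sequential(nn.Linear(d_code, 64), nn.ReLU(), nn.Linear(64, d_in))

    def forward(self, x):
        return self.dec(self.enc(x))

X = torch.randn(200, 39)  # stand-in for 39-d MFCC feature vectors
ae = BasicAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(100):  # unsupervised reconstruction training
    opt.zero_grad()
    loss = nn.MSELoss()(ae(X), X)
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = ae.enc(X).numpy()  # compact codes for the downstream classifier
# labels is a placeholder for the SAVEE emotion labels.
# clf = SVC().fit(codes, labels)
```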


2016 ◽  
Vol 7 (1) ◽  
pp. 58-68 ◽  
Author(s):  
Imen Trabelsi ◽  
Med Salim Bouhlel

Automatic Speech Emotion Recognition (SER) is a current research topic in the field of Human Computer Interaction (HCI) with a wide range of applications. The purpose of a speech emotion recognition system is to automatically classify a speaker's utterances into different emotional states such as disgust, boredom, sadness, neutral, and happiness. The speech samples in this paper are from the Berlin emotional database. Mel frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), Perceptual Linear Prediction (PLP), and Relative Spectral Perceptual Linear Prediction (Rasta-PLP) features are used to characterize the emotional utterances using a combination of Gaussian mixture models (GMM) and Support Vector Machines (SVM) based on the Kullback-Leibler divergence kernel. In this study, the effects of feature type and dimension are comparatively investigated. The best results are obtained with 12-coefficient MFCC. Utilizing the proposed features, a recognition rate of 84% has been achieved, which is close to the performance of humans on this database.
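
The GMM/SVM combination with a KL-based kernel can be sketched in simplified form: below, each utterance's MFCC frames are modeled by a single diagonal Gaussian (a one-component "GMM") so the symmetric KL divergence has a closed form, and exp(-KL) serves as a precomputed SVM kernel. A real system would use multi-component GMMs with an approximated divergence; this simplification is an assumption for illustration.

```python
# Simplified KL-divergence kernel between per-utterance Gaussians, fed to an SVM.
import numpy as np
from sklearn.svm import SVC

def gaussian_params(frames):
    # Diagonal Gaussian fitted to one utterance's MFCC frames.
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def kl_diag(p, q):
    # Closed-form KL(p || q) for diagonal Gaussians.
    (mu1, v1), (mu2, v2) = p, q
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (mu1 - mu2) ** 2) / v2 - 1)

def kl_kernel(params):
    n = len(params)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = kl_diag(params[i], params[j]) + kl_diag(params[j], params[i])
            K[i, j] = np.exp(-d)  # symmetric KL turned into a similarity
    return K

# Each entry stands in for the 12-d MFCC frame matrix of one Berlin utterance.
utterances = [np.random.randn(120, 12) for _ in range(10)]
labels = np.random.randint(0, 5, 10)
params = [gaussian_params(u) for u in utterances]
clf = SVC(kernel="precomputed").fit(kl_kernel(params), labels)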


Electronics ◽  
2020 ◽  
Vol 9 (5) ◽  
pp. 721 ◽  
Author(s):  
Barath Narayanan Narayanan ◽  
Venkata Salini Priyamvada Davuluru

With the advancement of technology, there is a growing need to classify malware programs that could potentially harm any computer system and/or smaller devices. In this research, an ensemble classification system comprising convolutional and recurrent neural networks is proposed to distinguish malware programs. Microsoft's Malware Classification Challenge (BIG 2015) dataset with nine distinct classes is utilized for this study. This dataset contains an assembly file and a compiled file for each malware program. Compiled files are visualized as images and are classified using Convolutional Neural Networks (CNNs). Assembly files consist of machine language opcodes that are distinguished among classes using Long Short-Term Memory (LSTM) networks after converting them into sequences. In addition, features are extracted from these architectures (CNNs and LSTM) and are classified using a support vector machine or logistic regression. An accuracy of 97.2% is achieved using the LSTM network for distinguishing assembly files, 99.4% using the CNN architecture for classifying compiled files, and an overall accuracy of 99.8% using the proposed ensemble approach, thereby setting a new benchmark. An independent and automated classification system for assembly and/or compiled files allows anti-malware industry experts to choose the type of system depending on their available computational resources.
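
The image-based branch of such a system typically reshapes a binary's raw bytes into a grayscale image before applying a CNN. The sketch below shows that conversion with a small placeholder network; the image width, the network shape, and the file path are assumptions, not the study's architecture.

```python
# Visualizing a compiled binary as a grayscale image and classifying it
# with a toy CNN over the nine BIG 2015 malware classes.
import numpy as np
import torch
import torch.nn as nn

def bytes_to_image(path, width=256):
    data = np.fromfile(path, dtype=np.uint8)
    rows = len(data) // width
    img = data[: rows * width].reshape(rows, width).astype(np.float32) / 255.0
    return torch.tensor(img).unsqueeze(0).unsqueeze(0)  # shape (1, 1, H, W)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((8, 8)),
    nn.Flatten(), nn.Linear(16 * 8 * 8, 9),  # 9 malware classes
)
# logits = cnn(bytes_to_image("sample.bin"))  # "sample.bin" is a placeholder
```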


Sensors ◽  
2020 ◽  
Vol 20 (21) ◽  
pp. 6008 ◽  
Author(s):  
Misbah Farooq ◽  
Fawad Hussain ◽  
Naveed Khan Baloch ◽  
Fawad Riasat Raja ◽  
Heejung Yu ◽  
...  

Speech emotion recognition (SER) plays a significant role in human–machine interaction. Emotion recognition from speech and its precise classification is a challenging task because a machine is unable to understand its context. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they are not efficient enough to accurately depict the emotional states of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotional datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). Our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS in speaker-dependent SER experiments. Moreover, our method yields the best results for speaker-independent SER compared with existing handcrafted feature-based SER approaches.
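
Correlation-based feature selection of the kind described here keeps features that correlate strongly with the class label while discarding ones redundant with features already selected. The sketch below is a minimal version of that idea; the thresholds, the greedy ordering, and the random stand-in data are assumptions rather than the paper's exact procedure.

```python
# Greedy correlation-based feature selection over DCNN feature vectors.
import numpy as np

def select_features(X, y, n_keep=50, redundancy=0.9):
    # Relevance: absolute correlation of each feature with the label.
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    chosen = []
    for j in np.argsort(-relevance):  # most relevant first
        # Redundancy check: skip features too correlated with any chosen one.
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < redundancy for k in chosen):
            chosen.append(j)
        if len(chosen) == n_keep:
            break
    return np.array(chosen)

# X stands in for pretrained-DCNN features, y for integer emotion labels.
X, y = np.random.randn(300, 512), np.random.randint(0, 7, 300)
idx = select_features(X, y)
X_selected = X[:, idx]  # reduced feature set passed to SVM/RF/KNN/NN classifiers
```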

