Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment

Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2297
Author(s):  
Zhen-Tao Liu ◽  
Bao-Han Wu ◽  
Dan-Yun Li ◽  
Peng Xiao ◽  
Jun-Wei Mao

Speech emotion recognition often encounters the problems of data imbalance and redundant features in different application scenarios. Researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small sample environment is proposed. A data imbalance processing method based on the selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, a feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which excludes redundant features with poor emotional representation. Results of speech emotion recognition experiments on three databases (i.e., CASIA, Emo-DB, SAVEE) show that our method obtains average recognition accuracies of 90.28% (CASIA), 75.00% (SAVEE) and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition, which is superior to some state-of-the-art works.
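
As a rough sketch of this pipeline, the snippet below substitutes the standard SMOTE from imbalanced-learn for the paper's SISMOTE variant (whose selective interpolation is not publicly packaged) and combines an ANOVA F-test pre-filter with GBDT importance ranking; all array sizes and data are placeholders.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))        # placeholder acoustic features
y = np.array([0] * 150 + [1] * 50)    # imbalanced emotion labels

# 1. Oversample the minority class (SISMOTE would interpolate selectively).
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# 2. Variance-analysis pre-filter: keep the 30 features with highest F-score.
selector = SelectKBest(f_classif, k=30).fit(X_bal, y_bal)
X_f = selector.transform(X_bal)

# 3. GBDT ranks the remaining features; keep the top 15 by importance.
gbdt = GradientBoostingClassifier(random_state=0).fit(X_f, y_bal)
top = np.argsort(gbdt.feature_importances_)[::-1][:15]

# 4. Train the final emotion classifier on the selected features.
clf = SVC().fit(X_f[:, top], y_bal)
```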

Sensors ◽  
2018 ◽  
Vol 18 (11) ◽  
pp. 3744
Author(s):  
Jaehun Bang ◽  
Taeho Hur ◽  
Dohyeong Kim ◽  
Thien Huynh-The ◽  
Jongwon Lee ◽  
...  

Personalized emotion recognition provides an individual training model for each target user in order to mitigate the accuracy problem of general training models built from multiple users' data. Existing personalized speech emotion recognition research suffers from a cold-start problem: it requires a large amount of emotionally-balanced data from the target user when creating the personalized training model. Such research is difficult to apply in real environments because it is hard to collect large amounts of target-user speech with emotionally-balanced labels. Therefore, we propose the Robust Personalized Emotion Recognition Framework with the Adaptive Data Boosting Algorithm to solve the cold-start problem. The proposed framework incrementally provides a customized training model for the target user by reinforcing the dataset, combining the acquired target-user speech with speech from other users and then applying SMOTE (Synthetic Minority Over-sampling Technique)-based data augmentation. The proposed method proved adaptive to small target-user datasets and emotionally-imbalanced data environments through iterative experiments using the IEMOCAP (Interactive Emotional Dyadic Motion Capture) database.
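
A minimal sketch of the data-reinforcement idea, with placeholder arrays standing in for real speech features; the adaptive boosting logic of the framework itself is not reproduced here.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(1)
# A scarce target-user set and a larger pool from other users (placeholders).
X_target, y_target = rng.normal(size=(30, 40)), rng.integers(0, 4, 30)
X_others, y_others = rng.normal(size=(400, 40)), rng.integers(0, 4, 400)

# Reinforce the scarce target-user data with speech from other users ...
X_pool = np.vstack([X_target, X_others])
y_pool = np.concatenate([y_target, y_others])

# ... then apply SMOTE so every emotion class is equally represented.
X_aug, y_aug = SMOTE(random_state=1).fit_resample(X_pool, y_pool)
```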


Author(s):  
Jian Zhou ◽  
Guoyin Wang ◽  
Yong Yang

Speech emotion recognition is becoming more and more important in computer application fields such as health care and children's education. In order to improve prediction performance or provide a faster and more cost-effective recognition system, attribute selection is often carried out beforehand to select the important attributes from the input attribute set. However, the traditional feature selection methods used in speech emotion recognition are time-consuming when determining an optimal or suboptimal feature subset. Rough set theory offers an alternative, formal methodology that can be employed to reduce the dimensionality of data. The purpose of this study is to investigate the effectiveness of rough set theory in identifying important features in a speech emotion recognition system. The experiments on the CLDC emotion speech database clearly show that this approach can reduce the calculation cost while retaining a suitably high recognition rate.
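
As a toy illustration of rough-set attribute reduction (not the authors' implementation), the snippet below greedily drops attributes whose removal still lets every pair of differently-labelled samples be discerned; the features and labels are invented for the example.

```python
# An attribute is dispensable if removing it leaves the ability to
# discern all pairs of samples with different emotion labels unchanged.
def discernible(rows, labels, attrs):
    """True if samples with different labels differ on some attribute in attrs."""
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if labels[i] != labels[j] and all(rows[i][a] == rows[j][a] for a in attrs):
                return False
    return True

def reduct(rows, labels):
    """Greedily drop attributes whose removal preserves discernibility."""
    attrs = list(range(len(rows[0])))
    for a in list(attrs):
        trial = [x for x in attrs if x != a]
        if trial and discernible(rows, labels, trial):
            attrs = trial
    return attrs

# Discretised toy features (e.g. binned energy, pitch, speaking rate, ZCR).
rows   = [(0, 1, 1, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
labels = ["angry", "happy", "sad", "neutral"]
print(reduct(rows, labels))   # a reduced attribute subset, here [2, 3]
```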


2020 ◽  
Vol 17 (8) ◽  
pp. 3786-3789
Author(s):  
P. Gayathri ◽  
P. Gowri Priya ◽  
L. Sravani ◽  
Sandra Johnson ◽  
Visanth Sampath

Recognition of emotions is an aspect of speech recognition that is gaining increasing attention, and the need for it is growing enormously. Although there are methods to identify emotion using machine learning techniques, we assume in this paper that calculating deltas and delta-deltas for customized features not only preserves effective emotional information but also reduces the impact of emotionally irrelevant factors, leading to a reduction in misclassification. Furthermore, speech emotion recognition (SER) often suffers from silent frames and emotionally irrelevant frames. Meanwhile, the attention mechanism has demonstrated exceptional performance in learning relevant feature representations for specific tasks. Inspired by this, we propose an Attention-based Convolutional Recurrent Neural Network (ACRNN) to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Finally, experimental results show the feasibility of the proposed method, which attains state-of-the-art performance in terms of unweighted average recall.
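
A minimal sketch of preparing such an input with librosa, assuming a log Mel-spectrogram stacked with its deltas and delta-deltas as three channels; the file name and parameter values are placeholders.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
log_mel = librosa.power_to_db(mel)

delta1 = librosa.feature.delta(log_mel, order=1)   # first-order dynamics
delta2 = librosa.feature.delta(log_mel, order=2)   # second-order dynamics

# Stack into a 3-channel "image" (channels, mel bands, frames) for the CRNN.
features = np.stack([log_mel, delta1, delta2], axis=0)
print(features.shape)
```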


Author(s):  
Shreya Kumar ◽  
Swarnalaxmi Thiruvenkadam

Feature extraction is an integral part of speech emotion recognition. Some emotions become indistinguishable from others due to the high resemblance in their features, which results in low prediction accuracy. This paper analyses the impact of the spectral contrast feature in increasing the accuracy for such emotions. The RAVDESS dataset has been chosen for this study. The SAVEE, CREMA-D and JL corpus datasets were also used to test its performance over different English accents. In addition, the EmoDB dataset has been used to study its performance in the German language. As discussed in detail, the use of the spectral contrast feature has increased the prediction accuracy of speech emotion recognition systems to a good degree, as it performs well in distinguishing emotions with significant differences in arousal levels.
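
For illustration, the spectral contrast feature can be extracted with librosa as sketched below; the file name and the mean/std pooling into an utterance-level vector are assumptions rather than the paper's exact setup.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=None)
S = np.abs(librosa.stft(y))
contrast = librosa.feature.spectral_contrast(S=S, sr=sr)  # (n_bands+1, frames)

# Summarise per-band statistics into a fixed-length utterance vector,
# which can then be appended to MFCCs or other frame-level features.
feature_vec = np.concatenate([contrast.mean(axis=1), contrast.std(axis=1)])
```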


2020 ◽  
Vol 9 (3) ◽  
pp. 770
Author(s):  
Mai Ezz-Eldin ◽  
Hesham F. A. Hamed ◽  
Ashraf A. M. Khalaf

Recently, recognizing the emotional content of speech signals has received considerable research attention. Consequently, systems have been developed to recognize the emotional content of a spoken utterance. Achieving high accuracy in speech emotion recognition remains a challenging problem due to issues related to feature extraction, type, and size. Central to this study is increasing emotion recognition accuracy by porting the bag-of-words (BoW) technique from image to speech for feature processing and clustering. The BoW technique is applied to features extracted from Mel frequency cepstral coefficients (MFCC), which enhances feature quality. The study considers deployment of different classification approaches to examine the performance of the embedded BoW approach. The deployed classifiers include support vector machine (SVM), K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), and extreme gradient boosting (XGBoost). In this study, experiments used the standard RAVDESS audio dataset with eight emotions: angry, calm, happy, surprised, sad, disgusted, fearful, and neutral. The maximum accuracy obtained in the angry class using SVM was 85%, while overall accuracy was 80.1%. The empirical results show that using BoW achieves better accuracy and processing time than other available methods.
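
A minimal sketch of the BoW encoding step, with random arrays standing in for frame-level MFCCs: a k-means codebook is learned over pooled frames, and each utterance becomes a normalised codeword histogram. The codebook size and classifier choice are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Pretend each utterance yields ~100 frames of 13-dim MFCCs.
utterances = [rng.normal(size=(100, 13)) for _ in range(50)]
labels = rng.integers(0, 8, size=50)            # eight emotion classes

# 1. Learn a codebook over all frames pooled across utterances.
codebook = KMeans(n_clusters=64, n_init=10, random_state=0)
codebook.fit(np.vstack(utterances))

# 2. Encode each utterance as a normalised histogram of codeword hits.
def bow_encode(frames):
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=64).astype(float)
    return hist / hist.sum()

X = np.array([bow_encode(u) for u in utterances])

# 3. Any of the compared classifiers (SVM, KNN, NB, RF, XGBoost) fits here.
clf = SVC().fit(X, labels)
```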


Author(s):  
Cunwei Sun ◽  
Luping Ji ◽  
Hailing Zhong

Speech emotion recognition based on deep networks with small samples is often a very challenging problem in natural language processing. The massive parameters of a deep network are difficult to train reliably on a small quantity of speech samples. To address this problem, we propose a new method based on the systematic cooperation of a Generative Adversarial Network (GAN) and Long Short-Term Memory (LSTM). The method utilizes the adversarial training of the GAN's generator and discriminator on speech spectrogram images to implement sufficient sample augmentation. A six-layer convolutional neural network (CNN), followed in series by a two-layer LSTM, is designed to extract features from speech spectrograms. To accelerate the training of the networks, the parameters of the discriminator are transferred to our feature extractor. With the sample augmentation, a well-trained feature extraction network and an efficient classifier can be achieved. The tests and comparisons on two publicly available datasets, i.e., EMO-DB and IEMOCAP, show that our new method is effective and often superior to some state-of-the-art methods.
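
A minimal PyTorch sketch of the feature extractor described above, with all layer sizes assumed; the GAN pre-training that would initialise the convolutional weights is omitted.

```python
import torch
import torch.nn as nn

class CnnLstmExtractor(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        chans = [1, 16, 32, 64, 64, 128, 128]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.cnn = nn.Sequential(*layers)          # six conv blocks
        self.lstm = nn.LSTM(input_size=128, hidden_size=64,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, spec):                       # spec: (B, 1, 128, 128)
        f = self.cnn(spec)                         # (B, 128, 2, 2)
        f = f.flatten(2).transpose(1, 2)           # (B, time=4, 128)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])               # last-step logits

logits = CnnLstmExtractor()(torch.randn(2, 1, 128, 128))
```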


Author(s):  
Duowei Tang ◽  
Peter Kuppens ◽  
Luc Geurts ◽  
Toon van Waterschoot

Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. Therefore, in this work, we propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, while avoiding the inherent lack of parallelisability occurring with recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e. raw audio samples vs log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information.
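
As a rough illustration of the core building block (not the authors' architecture), the PyTorch sketch below shows how left-padding keeps a dilated convolution causal while doubling dilations grow the receptive field exponentially with depth; channel counts and depth are assumptions, and the context stacking structure is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation    # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                          # x: (B, C, T)
        return torch.relu(self.conv(F.pad(x, (self.pad, 0))))

# Dilations 1, 2, 4, ..., 128: the receptive field grows to 256 frames.
stack = nn.Sequential(*[DilatedCausalConv(32, 2, 2 ** i) for i in range(8)])
out = stack(torch.randn(1, 32, 400))               # (1, 32, 400), same length
```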


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 200953-200970
Author(s):  
Arijit Dey ◽  
Soham Chattopadhyay ◽  
Pawan Kumar Singh ◽  
Ali Ahmadian ◽  
Massimiliano Ferrara ◽  
...  
