Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment

Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2297
Author(s):  
Zhen-Tao Liu ◽  
Bao-Han Wu ◽  
Dan-Yun Li ◽  
Peng Xiao ◽  
Jun-Wei Mao

Speech emotion recognition often encounters the problems of data imbalance and redundant features in different application scenarios. Researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small sample environment is proposed. A data imbalance processing method based on the selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, a feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which excludes redundant features with poor emotional representation. Results of speech emotion recognition experiments on three databases (i.e., CASIA, Emo-DB, SAVEE) show that our method obtains average recognition accuracies of 90.28% (CASIA), 75.00% (SAVEE) and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition, which is superior to some state-of-the-art works.
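
As a rough sketch of this pipeline, the snippet below substitutes the standard SMOTE from imbalanced-learn for the paper's SISMOTE variant (whose selective interpolation is not publicly packaged) and combines an ANOVA F-test pre-filter with GBDT importance ranking; all array sizes and data are placeholders.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))        # placeholder acoustic features
y = np.array([0] * 150 + [1] * 50)    # imbalanced emotion labels

# 1. Oversample the minority class (SISMOTE would interpolate selectively).
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# 2. Variance-analysis pre-filter: keep the 30 features with highest F-score.
selector = SelectKBest(f_classif, k=30).fit(X_bal, y_bal)
X_f = selector.transform(X_bal)

# 3. GBDT ranks the remaining features; keep the top 15 by importance.
gbdt = GradientBoostingClassifier(random_state=0).fit(X_f, y_bal)
top = np.argsort(gbdt.feature_importances_)[::-1][:15]

# 4. Train the final emotion classifier on the selected features.
clf = SVC().fit(X_f[:, top], y_bal)
```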

Sensors ◽  
2018 ◽  
Vol 18 (11) ◽  
pp. 3744
Author(s):  
Jaehun Bang ◽  
Taeho Hur ◽  
Dohyeong Kim ◽  
Thien Huynh-The ◽  
Jongwon Lee ◽  
...  

Personalized emotion recognition provides an individual training model for each target user in order to mitigate the accuracy problem of general training models built from multiple users' data. Existing personalized speech emotion recognition research suffers from a cold-start problem: it requires a large amount of emotionally-balanced data from the target user when creating the personalized training model. Such research is difficult to apply in real environments because it is hard to collect large amounts of target-user speech with emotionally-balanced labels. Therefore, we propose the Robust Personalized Emotion Recognition Framework with the Adaptive Data Boosting Algorithm to solve the cold-start problem. The proposed framework incrementally provides a customized training model for the target user by reinforcing the dataset, combining the acquired target-user speech with speech from other users and then applying SMOTE (Synthetic Minority Over-sampling Technique)-based data augmentation. The proposed method proved adaptive to small target-user datasets and emotionally-imbalanced data environments through iterative experiments using the IEMOCAP (Interactive Emotional Dyadic Motion Capture) database.
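
A minimal sketch of the data-reinforcement idea, with placeholder arrays standing in for real speech features; the adaptive boosting logic of the framework itself is not reproduced here.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(1)
# A scarce target-user set and a larger pool from other users (placeholders).
X_target, y_target = rng.normal(size=(30, 40)), rng.integers(0, 4, 30)
X_others, y_others = rng.normal(size=(400, 40)), rng.integers(0, 4, 400)

# Reinforce the scarce target-user data with speech from other users ...
X_pool = np.vstack([X_target, X_others])
y_pool = np.concatenate([y_target, y_others])

# ... then apply SMOTE so every emotion class is equally represented.
X_aug, y_aug = SMOTE(random_state=1).fit_resample(X_pool, y_pool)
```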


Author(s):  
Jian Zhou ◽  
Guoyin Wang ◽  
Yong Yang

Speech emotion recognition is becoming more and more important in computer application fields such as health care and children's education. In order to improve prediction performance or provide a faster and more cost-effective recognition system, attribute selection is often carried out beforehand to select the important attributes from the input attribute set. However, the traditional feature selection methods used in speech emotion recognition are time-consuming when determining an optimal or suboptimal feature subset. Rough set theory offers an alternative, formal methodology that can be employed to reduce the dimensionality of data. The purpose of this study is to investigate the effectiveness of rough set theory in identifying important features in a speech emotion recognition system. The experiments on the CLDC emotion speech database clearly show that this approach can reduce the calculation cost while retaining a suitably high recognition rate.
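
As a toy illustration of rough-set attribute reduction (not the authors' implementation), the snippet below greedily drops attributes whose removal still lets every pair of differently-labelled samples be discerned; the features and labels are invented for the example.

```python
# An attribute is dispensable if removing it leaves the ability to
# discern all pairs of samples with different emotion labels unchanged.
def discernible(rows, labels, attrs):
    """True if samples with different labels differ on some attribute in attrs."""
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if labels[i] != labels[j] and all(rows[i][a] == rows[j][a] for a in attrs):
                return False
    return True

def reduct(rows, labels):
    """Greedily drop attributes whose removal preserves discernibility."""
    attrs = list(range(len(rows[0])))
    for a in list(attrs):
        trial = [x for x in attrs if x != a]
        if trial and discernible(rows, labels, trial):
            attrs = trial
    return attrs

# Discretised toy features (e.g. binned energy, pitch, speaking rate, ZCR).
rows   = [(0, 1, 1, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
labels = ["angry", "happy", "sad", "neutral"]
print(reduct(rows, labels))   # a reduced attribute subset, here [2, 3]
```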


2020 ◽  
Vol 17 (8) ◽  
pp. 3786-3789
Author(s):  
P. Gayathri ◽  
P. Gowri Priya ◽  
L. Sravani ◽  
Sandra Johnson ◽  
Visanth Sampath

Recognition of emotions is an aspect of speech recognition that is gaining increasing attention, and the need for it is growing enormously. Although there are methods to identify emotion using machine learning techniques, we assume in this paper that calculating deltas and delta-deltas for customized features not only preserves effective emotional information but also reduces the impact of emotionally irrelevant factors, leading to a reduction in misclassification. Furthermore, speech emotion recognition (SER) often suffers from silent frames and emotionally irrelevant frames. Meanwhile, the attention mechanism has demonstrated exceptional performance in learning relevant feature representations for specific tasks. Inspired by this, we propose an Attention-based Convolutional Recurrent Neural Network (ACRNN) to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Finally, experimental results show the feasibility of the proposed method, which attains state-of-the-art performance in terms of unweighted average recall.
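
A minimal sketch of preparing such an input with librosa, assuming a log Mel-spectrogram stacked with its deltas and delta-deltas as three channels; the file name and parameter values are placeholders.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
log_mel = librosa.power_to_db(mel)

delta1 = librosa.feature.delta(log_mel, order=1)   # first-order dynamics
delta2 = librosa.feature.delta(log_mel, order=2)   # second-order dynamics

# Stack into a 3-channel "image" (channels, mel bands, frames) for the CRNN.
features = np.stack([log_mel, delta1, delta2], axis=0)
print(features.shape)
```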


Author(s):  
Shreya Kumar ◽  
Swarnalaxmi Thiruvenkadam

Feature extraction is an integral part of speech emotion recognition. Some emotions become indistinguishable from others due to the high resemblance in their features, which results in low prediction accuracy. This paper analyses the impact of the spectral contrast feature in increasing the accuracy for such emotions. The RAVDESS dataset has been chosen for this study. The SAVEE, CREMA-D and JL corpus datasets were also used to test its performance over different English accents. In addition, the EmoDB dataset has been used to study its performance in the German language. As discussed in detail, the use of the spectral contrast feature has increased the prediction accuracy of speech emotion recognition systems to a good degree, as it performs well in distinguishing emotions with significant differences in arousal levels.
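
For illustration, the spectral contrast feature can be extracted with librosa as sketched below; the file name and the mean/std pooling into an utterance-level vector are assumptions rather than the paper's exact setup.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=None)
S = np.abs(librosa.stft(y))
contrast = librosa.feature.spectral_contrast(S=S, sr=sr)  # (n_bands+1, frames)

# Summarise per-band statistics into a fixed-length utterance vector,
# which can then be appended to MFCCs or other frame-level features.
feature_vec = np.concatenate([contrast.mean(axis=1), contrast.std(axis=1)])
```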


2020 ◽  
Vol 9 (3) ◽  
pp. 770
Author(s):  
Mai Ezz-Eldin ◽  
Hesham F. A. Hamed ◽  
Ashraf A. M. Khalaf

Recently, recognizing the emotional content of speech signals has received considerable research attention. Consequently, systems have been developed to recognize the emotional content of a spoken utterance. Achieving high accuracy in speech emotion recognition remains a challenging problem due to issues related to feature extraction, type, and size. Central to this study is increasing emotion recognition accuracy by porting the bag-of-words (BoW) technique from image to speech for feature processing and clustering. The BoW technique is applied to features extracted from Mel frequency cepstral coefficients (MFCC), which enhances feature quality. The study considers deployment of different classification approaches to examine the performance of the embedded BoW approach. The deployed classifiers include support vector machine (SVM), K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), and extreme gradient boosting (XGBoost). In this study, experiments used the standard RAVDESS audio dataset with eight emotions: angry, calm, happy, surprised, sad, disgusted, fearful, and neutral. The maximum accuracy obtained in the angry class using SVM was 85%, while overall accuracy was 80.1%. The empirical results show that using BoW achieves better accuracy and processing time than other available methods.
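
A minimal sketch of the BoW encoding step, with random arrays standing in for frame-level MFCCs: a k-means codebook is learned over pooled frames, and each utterance becomes a normalised codeword histogram. The codebook size and classifier choice are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Pretend each utterance yields ~100 frames of 13-dim MFCCs.
utterances = [rng.normal(size=(100, 13)) for _ in range(50)]
labels = rng.integers(0, 8, size=50)            # eight emotion classes

# 1. Learn a codebook over all frames pooled across utterances.
codebook = KMeans(n_clusters=64, n_init=10, random_state=0)
codebook.fit(np.vstack(utterances))

# 2. Encode each utterance as a normalised histogram of codeword hits.
def bow_encode(frames):
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=64).astype(float)
    return hist / hist.sum()

X = np.array([bow_encode(u) for u in utterances])

# 3. Any of the compared classifiers (SVM, KNN, NB, RF, XGBoost) fits here.
clf = SVC().fit(X, labels)
```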


Author(s):  
Cunwei Sun ◽  
Luping Ji ◽  
Hailing Zhong

Speech emotion recognition based on deep networks with small samples is often a very challenging problem in natural language processing. The massive parameters of a deep network are difficult to train reliably on a small quantity of speech samples. To address this problem, we propose a new method based on the systematic cooperation of a Generative Adversarial Network (GAN) and Long Short-Term Memory (LSTM). The method utilizes the adversarial training of the GAN's generator and discriminator on speech spectrogram images to implement sufficient sample augmentation. A six-layer convolutional neural network (CNN), followed in series by a two-layer LSTM, is designed to extract features from speech spectrograms. To accelerate the training of the networks, the parameters of the discriminator are transferred to our feature extractor. With the sample augmentation, a well-trained feature extraction network and an efficient classifier can be achieved. The tests and comparisons on two publicly available datasets, i.e., EMO-DB and IEMOCAP, show that our new method is effective and often superior to some state-of-the-art methods.
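
A minimal PyTorch sketch of the feature extractor described above, with all layer sizes assumed; the GAN pre-training that would initialise the convolutional weights is omitted.

```python
import torch
import torch.nn as nn

class CnnLstmExtractor(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        chans = [1, 16, 32, 64, 64, 128, 128]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.cnn = nn.Sequential(*layers)          # six conv blocks
        self.lstm = nn.LSTM(input_size=128, hidden_size=64,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, spec):                       # spec: (B, 1, 128, 128)
        f = self.cnn(spec)                         # (B, 128, 2, 2)
        f = f.flatten(2).transpose(1, 2)           # (B, time=4, 128)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])               # last-step logits

logits = CnnLstmExtractor()(torch.randn(2, 1, 128, 128))
```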


Author(s):  
Duowei Tang ◽  
Peter Kuppens ◽  
Luc Geurts ◽  
Toon van Waterschoot

Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. Therefore, in this work, we propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, while avoiding the inherent lack of parallelisability occurring with recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e. raw audio samples vs log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information.
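
As a rough illustration of the core building block (not the authors' architecture), the PyTorch sketch below shows how left-padding keeps a dilated convolution causal while doubling dilations grow the receptive field exponentially with depth; channel counts and depth are assumptions, and the context stacking structure is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation    # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                          # x: (B, C, T)
        return torch.relu(self.conv(F.pad(x, (self.pad, 0))))

# Dilations 1, 2, 4, ..., 128: the receptive field grows to 256 frames.
stack = nn.Sequential(*[DilatedCausalConv(32, 2, 2 ** i) for i in range(8)])
out = stack(torch.randn(1, 32, 400))               # (1, 32, 400), same length
```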


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 200953-200970
Author(s):  
Arijit Dey ◽  
Soham Chattopadhyay ◽  
Pawan Kumar Singh ◽  
Ali Ahmadian ◽  
Massimiliano Ferrara ◽  
...  
