3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms

Entropy ◽  
2019 ◽  
Vol 21 (5) ◽  
pp. 479 ◽  
Author(s):  
Noushin Hajarolasvadi ◽  
Hasan Demirel

Detecting human intentions and emotions helps improve human–robot interaction. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features, including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity, for each frame. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering to the extracted features of all frames of each audio signal, we select the k most discriminant frames, namely keyframes, to summarize the speech signal. The sequence of spectrograms corresponding to the keyframes is then encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural Network (CNN) using 10-fold cross-validation. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
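As an illustration of the keyframe-selection step described in this abstract, the following sketch splits a signal into overlapping frames, clusters per-frame features with k-means, and stacks the keyframes' spectrograms into a 3D tensor. It is a minimal sketch, not the authors' implementation: the function name and parameters are hypothetical, and mean MFCCs stand in for the full 88-dimensional feature vector.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def keyframe_tensor(path, k=9, frame_sec=0.5, hop_sec=0.25):
    # Hypothetical helper, not the published code.
    y, sr = librosa.load(path, sr=None)
    frame, hop = int(frame_sec * sr), int(hop_sec * sr)
    # Split the signal into overlapping frames of the same length.
    segments = [y[i:i + frame] for i in range(0, len(y) - frame + 1, hop)]
    # Per-frame feature vector (mean MFCCs stand in for the 88-dim vector).
    feats = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                      for s in segments])
    # Cluster frames; keep the frame nearest each centroid as a keyframe.
    km = KMeans(n_clusters=k, n_init=10).fit(feats)
    key_idx = [int(np.argmin(np.linalg.norm(feats - c, axis=1)))
               for c in km.cluster_centers_]
    # Stack the keyframe spectrograms into a (k, freq, time) tensor
    # that a 3D CNN could consume.
    return np.stack([np.abs(librosa.stft(segments[i])) for i in sorted(key_idx)])
```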

Author(s):  
Pavitra Patel ◽  
A. A. Chaudhari ◽  
M. A. Pund ◽  
D. H. Deshmukh

Speech emotion recognition is an important issue that affects human–machine interaction. Automatic recognition of human emotion in speech aims at recognizing the underlying emotional state of a speaker from the speech signal. Gaussian mixture models (GMMs) and the minimum-error-rate classifier (i.e., the Bayesian optimal classifier) are popular and effective tools for speech emotion recognition. Typically, GMMs are used to model the class-conditional distributions of acoustic features, and their parameters are estimated by the expectation-maximization (EM) algorithm on a training data set. In this paper, we introduce a boosting algorithm for reliably and accurately estimating the class-conditional GMMs; the resulting algorithm is named Boosted-GMM. Our speech emotion recognition experiments show that recognition rates are effectively and significantly boosted by the Boosted-GMM algorithm compared to the EM-GMM algorithm.

During human–machine interaction, human beings have feelings that they want to convey to their communication partner, whether that partner is a human or a machine. This work addresses the recognition of human emotion from the speech signal.

Emotion recognition from a speaker’s speech is difficult for several reasons. Acoustic variability is introduced by differences in sentences, speakers, speaking styles, and speaking rates, and the same utterance may convey different emotions, making such portions of an utterance hard to distinguish. Another problem is that emotional expression depends on the speaker and his or her culture and environment: as culture and environment change, speaking style changes as well, which is a further challenge for a speech emotion recognition system.
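To make the EM-GMM baseline concrete, here is a minimal sketch of a class-conditional GMM classifier: one mixture is fit per emotion class (scikit-learn's GaussianMixture runs EM internally), and prediction takes the class with the highest log-likelihood, assuming equal class priors. The class name and parameters are illustrative; the boosting extension is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class EmGmmClassifier:
    """Hypothetical EM-GMM baseline: one GMM per emotion class."""

    def __init__(self, n_components=8):
        self.n_components = n_components
        self.models = {}

    def fit(self, X, y):
        # Fit a class-conditional GMM to each emotion's feature vectors.
        for label in np.unique(y):
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag")
            self.models[label] = gmm.fit(X[y == label])
        return self

    def predict(self, X):
        labels = list(self.models)
        # score_samples returns per-sample log-likelihoods under each GMM;
        # with equal priors, argmax likelihood is the Bayes decision.
        ll = np.column_stack([self.models[l].score_samples(X) for l in labels])
        return np.array(labels)[np.argmax(ll, axis=1)]
```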


Sensors ◽  
2020 ◽  
Vol 20 (18) ◽  
pp. 5212 ◽  
Author(s):  
Tursunov Anvarjon ◽  
Mustaqeem ◽  
Soonil Kwon

Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, a speech emotion recognition (SER) system evaluates the emotional state of a speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine, and making the machine smart enough to recognize emotions efficiently is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, success rates remain low depending on the language, the emotions, and the database. In this paper, we propose a new lightweight, effective SER model with low computational complexity and high recognition accuracy. The suggested method uses a convolutional neural network (CNN) to learn deep frequency features by means of a plain rectangular filter with a modified pooling strategy that has more discriminative power for SER. The proposed CNN model was trained on the frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated over two benchmarks, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Berlin Emotional Speech Database (EMO-DB) speech datasets, achieving recognition accuracies of 77.01% and 92.02%, respectively. The experimental results demonstrate that the proposed CNN-based SER system can achieve better recognition performance than state-of-the-art SER systems.
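The following sketch shows what a lightweight CNN with rectangular (non-square) kernels and a modified pooling strategy over a time-frequency input could look like. It is in the spirit of the model described above, but the layer sizes, kernel shapes, and class count are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class RectKernelSER(nn.Module):
    """Illustrative rectangular-kernel CNN for SER (not the published model)."""

    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            # Rectangular kernels span more frequency bins than time steps.
            nn.Conv2d(1, 32, kernel_size=(9, 3), padding=(4, 1)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2)),   # non-square pooling window
            nn.Conv2d(32, 64, kernel_size=(9, 3), padding=(4, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # pool to one value per channel
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, 1, freq, time) spectrogram-like input
        return self.classifier(self.features(x).flatten(1))
```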


2020 ◽  
Vol 17 (1) ◽  
pp. 189-194 ◽  
Author(s):  
S. S. Poorna ◽  
S. Devika Nair ◽  
Arsha Narayan ◽  
Veena Prasad ◽  
Parsi Sai Himaja ◽  
...  

A multimodal emotion recognition system is proposed using speech and facial images. For this purpose, a video database was developed containing emotions in three affective states, viz. anger, sadness, and happiness. The audio and the snapshots of facial expressions acquired from the videos constitute the bimodal input for recognizing emotions. The spoken sentences in the database include both text-dependent and text-independent sentences in the Malayalam language. The audio features are obtained by short-time processing of speech: energy, zero-crossing count, pitch, and Mel Frequency Cepstral Coefficients. For facial expressions, landmark features of the face (eyebrows, eyes, and mouth), obtained using the Viola–Jones algorithm, are used. The supervised learning methods K-Nearest Neighbor and Artificial Neural Network are used for emotion analysis. System performance is evaluated for three cases: audio-based features alone, facial features alone, and both features taken together. Further, the effect of text-dependent versus text-independent audio is analyzed. The analysis shows that text-independent videos utilizing both modalities with K-Nearest Neighbor (highest accuracy 82.78%) are the most effective in recognizing emotions from the database considered.
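A minimal sketch of the audio branch described here: short-time energy, zero-crossing rate, and MFCC tracks summarized into a fixed-length vector and fed to a K-Nearest Neighbor classifier. The helper name, frame sizes, and statistics are assumptions; pitch extraction and the facial branch are omitted for brevity.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def audio_feature_vector(y, sr, frame=1024, hop=512):
    # Hypothetical helper: short-time tracks over the whole utterance.
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame,
                                             hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    # Summarize each track by its mean and standard deviation over time.
    tracks = np.vstack([energy, zcr, mfcc])
    return np.concatenate([tracks.mean(axis=1), tracks.std(axis=1)])

# Usage sketch: fit KNN on feature vectors of labeled training utterances.
# knn = KNeighborsClassifier(n_neighbors=5).fit(train_vectors, train_labels)
```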


Sensors ◽  
2021 ◽  
Vol 21 (12) ◽  
pp. 4233 ◽  
Author(s):  
Bogdan Mocanu ◽  
Ruxandra Tapu ◽  
Titus Zaharia

Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning, and human–machine/robot interaction. In this paper, we introduce a novel speech emotion recognition method based on the Squeeze-and-Excitation ResNet (SE-ResNet) model fed with spectrogram inputs. In order to overcome the limitations of state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves global accuracy rates of 83.35% and 64.92% on the publicly available RAVDESS and CREMA-D datasets, respectively. Compared with the results provided by human observers, the gains in global accuracy are more than 24%. Finally, an objective comparative evaluation against state-of-the-art techniques demonstrates accuracy gains of more than 3%.
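For orientation, here is a plain triplet loss over utterance embeddings, as a stand-in for the emotionally constrained variant described above (the published loss additionally encodes relations between emotion classes, which is not shown). The function and margin value are illustrative.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull same-emotion embeddings together, push different ones apart.
    # anchor/positive/negative: (batch, dim) utterance embeddings.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```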


Computation ◽  
2017 ◽  
Vol 5 (4) ◽  
pp. 26 ◽  
Author(s):  
Michalis Papakostas ◽  
Evaggelos Spyrou ◽  
Theodoros Giannakopoulos ◽  
Giorgos Siantikos ◽  
Dimitrios Sgouropoulos ◽  
...  

2013 ◽  
Vol 25 (12) ◽  
pp. 3294-3317 ◽  
Author(s):  
Lijiang Chen ◽  
Xia Mao ◽  
Pengfei Wei ◽  
Angelo Compare

This study proposes two classes of speech emotional features extracted from electroglottography (EGG) and the speech signal. The power-law distribution coefficients (PLDC) of voiced-segment duration, pitch-rise duration, and pitch-fall duration are obtained to reflect information about vocal-fold excitation. The real discrete cosine transform coefficients of the normalized spectra of the EGG and speech signals are calculated to reflect information about vocal-tract modulation. Two experiments are carried out. The first compares the proposed features with traditional features using sequential forward floating search and sequential backward floating search. The second is a comparative emotion recognition experiment based on a support vector machine. The results show that the proposed features outperform those commonly used for speaker-independent and content-independent speech emotion recognition.
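A small sketch of the vocal-tract feature described here: real DCT coefficients of a normalized magnitude spectrum, which applies equally to an EGG or a speech frame. The function name, normalization choice, and coefficient count are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dct

def dct_spectrum_features(frame, n_coeffs=20):
    # Magnitude spectrum of one analysis frame (EGG or speech).
    spectrum = np.abs(np.fft.rfft(frame))
    # Normalize so the feature is independent of overall signal level.
    spectrum /= spectrum.sum() + 1e-12
    # Keep the first n_coeffs real DCT coefficients as the feature vector.
    return dct(spectrum, type=2, norm="ortho")[:n_coeffs]
```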

