Speech Emotion Recognition: Humans vs Machines

Discourse, 2019, Vol. 5 (5), pp. 136-152
Author(s): S. Werner, G. N. Petrenko

Introduction. The study focuses on emotional speech perception and speech emotion recognition using prosodic cues alone. Theoretical problems of defining prosody, intonation, and emotion, along with the challenges of emotion classification, are discussed. An overview of acoustic and perceptual correlates of emotions found in speech is provided. Technical approaches to speech emotion recognition are also considered in the light of recent automatic classification experiments on emotional speech.

Methodology and sources. The typical "big six" classification commonly used in technical applications is chosen and modified to include the emotions of disgust and shame. A database of emotional speech in Russian is created under sound-laboratory conditions. A perception experiment is run in the experimental environment of the Praat software.

Results and discussion. Cross-cultural emotion recognition possibilities are revealed, as the Finnish and international participants recognised about half of the samples correctly. Nonetheless, native speakers of Russian appear to distinguish a larger proportion of emotions correctly. The effects of foreign-language knowledge, musical training, and gender on performance in the experiment were not prominent. The most commonly confused pairs of emotions, such as shame and sadness, surprise and fear, and anger and disgust, as well as confusions with the neutral emotion, are also given due attention.

Conclusion. The work can contribute to psychological studies, by clarifying emotion classification and the gender aspect of emotionality; to linguistic research, by providing new evidence for prosodic and comparative language studies; and to language technology, by deepening the understanding of possible challenges for SER systems.
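A minimal Python sketch of the kind of prosodic measurement this line of work relies on, using parselmouth (a Python interface to Praat) to extract pitch and intensity contours from a recording; the file name and the summary statistics are illustrative assumptions, not the study's protocol.

# Minimal sketch: extracting prosodic correlates (pitch, intensity) with
# parselmouth, a Python interface to Praat. The file path and parameter
# values are illustrative assumptions, not those used in the study.
import numpy as np
import parselmouth

snd = parselmouth.Sound("sample_utterance.wav")  # hypothetical recording

pitch = snd.to_pitch()                    # F0 contour (Hz)
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                           # drop unvoiced frames

intensity = snd.to_intensity()            # intensity contour (dB)
db = intensity.values.flatten()

# Simple prosodic descriptors often used as perceptual correlates of emotion
features = {
    "f0_mean": float(np.mean(f0)),
    "f0_range": float(np.ptp(f0)),
    "intensity_mean": float(np.mean(db)),
    "speech_duration_s": snd.get_total_duration(),
}
print(features)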

Author(s): Hasrul Mohd Nazid, Hariharan Muthusamy, Vikneswaran Vijean, Sazali Yaacob

In recent years, researchers have focused on improving the accuracy of speech emotion recognition. Generally, high recognition accuracies have been obtained for two-class emotion recognition, but multi-class emotion recognition is still a challenging task. The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to improve the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifiers with two different emotional speech databases. From the experimental results, it can be inferred that the proposed method provides better accuracies of 87.48% for the speaker-dependent (SD) and gender-dependent (GD) ER experiment, 85.15% for the speaker-independent (SI) ER experiment, and 87.09% for the gender-independent (GI) experiment.
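A minimal sketch of the two-stage reduction idea (PCA followed by LDA) in front of a supervised classifier, using scikit-learn; the feature matrix, labels, and dimensions are synthetic placeholders rather than the paper's short-term features.

# Sketch of a two-stage feature reduction (PCA followed by LDA) ahead of a
# supervised classifier. Data shapes and parameter values are illustrative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 120))          # 300 utterances x 120 short-term features
y = rng.integers(0, 6, size=300)         # six emotion classes (placeholder labels)

model = Pipeline([
    ("pca", PCA(n_components=40)),                         # stage 1: unsupervised reduction
    ("lda", LinearDiscriminantAnalysis(n_components=5)),   # stage 2: class-aware projection
    ("svm", SVC(kernel="rbf")),                            # one of several possible classifiers
])

scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean())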


Sensors, 2020, Vol. 20 (19), pp. 5559
Author(s): Minji Seo, Myungho Kim

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, their performance degrades. Cross-corpus SER research identifies speech emotion across different corpora and languages, and recent cross-corpus work has aimed to improve generalization. To improve cross-corpus SER performance, we pretrained on the log-mel spectrograms of the source dataset using our visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps the VACNN learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) database, respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
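A rough sketch of the two ingredients named above, a log-mel spectrogram and a bag-of-visual-words histogram built from local patches; the patch size, vocabulary size, and file name are assumptions, and the authors' actual descriptors and attention network are not reproduced here.

# Rough sketch: log-mel spectrogram + bag-of-visual-words (BOVW) histogram.
# Patch size, vocabulary size and file names are assumptions, not the paper's settings.
import numpy as np
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("utterance.wav", sr=16000)       # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                    # input "image" for the CNN

def patches(spec, size=8, step=4):
    """Flattened local patches used as 'visual word' descriptors."""
    out = []
    for i in range(0, spec.shape[0] - size + 1, step):
        for j in range(0, spec.shape[1] - size + 1, step):
            out.append(spec[i:i + size, j:j + size].ravel())
    return np.array(out)

descriptors = patches(log_mel)
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(descriptors)

# BOVW feature: frequency histogram of visual words for this utterance
words = codebook.predict(descriptors)
bovw_hist = np.bincount(words, minlength=32).astype(float)
bovw_hist /= bovw_hist.sum()
print(bovw_hist.shape)   # (32,) global descriptor used alongside the CNN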


Sensors, 2020, Vol. 20 (21), pp. 6008
Author(s): Misbah Farooq, Fawad Hussain, Naveed Khan Baloch, Fawad Riasat Raja, Heejung Yu, ...

Speech emotion recognition (SER) plays a significant role in human–machine interaction. Emotion recognition from speech and its precise classification is a challenging task because a machine is unable to understand its context. For an accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotional classification from speech signals; however, they are not efficient enough to accurately depict the emotional states of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotional datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). Our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS in the speaker-dependent SER experiments. Moreover, our method yields the best results for speaker-independent SER when compared with existing handcrafted-feature-based SER approaches.
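An illustrative sketch of the overall flow (deep features, correlation-based selection, classical classifier); a simple correlation-with-label ranking stands in for the paper's selection technique, and the data are synthetic placeholders.

# Illustrative sketch only: deep features -> correlation-based selection ->
# classical classifiers. A correlation-with-label ranking stands in for the
# paper's selection technique; data are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
deep_feats = rng.normal(size=(500, 1024))   # e.g. embeddings from a pretrained DCNN
labels = rng.integers(0, 7, size=500)       # seven emotion classes (placeholder)

# Rank features by absolute correlation with the class label and keep the top k
corr = np.array([abs(np.corrcoef(deep_feats[:, j], labels)[0, 1])
                 for j in range(deep_feats.shape[1])])
top_k = np.argsort(corr)[::-1][:128]
selected = deep_feats[:, top_k]

for name, clf in [("SVM", SVC()), ("RandomForest", RandomForestClassifier(n_estimators=200))]:
    acc = cross_val_score(clf, selected, labels, cv=5).mean()
    print(name, round(acc, 3))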


Author(s): Sourabh Suke, Ganesh Regulwar, Nikesh Aote, Pratik Chaudhari, Rajat Ghatode, ...

This project describes "VoiEmo- A Speech Emotion Recognizer", a system for recognizing the emotional state of an individual from his/her speech. For example, one's speech becomes loud and fast, with a higher and wider pitch range, in a state of fear, anger, or joy, whereas the voice is generally slow and low-pitched in sadness or tiredness. We have developed a classification model for speech emotion detection based on Convolutional Neural Networks (CNNs), Support Vector Machine (SVM), and Multilayer Perceptron (MLP) classification, which makes predictions from acoustic features of the speech signal such as Mel-Frequency Cepstral Coefficients (MFCCs). Our models have been trained to recognize eight common emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise). For training and testing the model, we have used relevant data from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset and the Toronto Emotional Speech Set (TESS) dataset. The system is advantageous because it can provide a general idea of the emotional state of the individual from the acoustic features of the speech, irrespective of the language the speaker uses; moreover, it also saves time and effort. Speech emotion recognition systems have applications in various fields such as call centers and BPOs, criminal investigation, psychiatric therapy, and the automobile industry.
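A minimal sketch of the MFCC-plus-MLP portion of such a pipeline with librosa and scikit-learn; file paths, labels, and hyperparameters are placeholders, not the project's configuration.

# Minimal sketch of the MFCC + MLP part of a pipeline like the one above.
# File paths, labels and hyperparameters are illustrative assumptions.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_features(path, n_mfcc=40):
    """Mean MFCC vector over time for one utterance."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# 'files' and 'labels' would come from RAVDESS/TESS file lists (placeholders here)
files = ["angry_01.wav", "sad_01.wav"]    # hypothetical paths
labels = ["angry", "sad"]

X = np.array([mfcc_features(f) for f in files])
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
clf.fit(X, labels)
print(clf.predict(X))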


2019, Vol. 24 (5)
Author(s): Gintautas Tamulevičius, Rasa Karbauskaitė, Gintautas Dzemyda

During the last 10–20 years, many new ideas have been proposed to improve the accuracy of speech emotion recognition: e.g., effective feature sets, complex classification schemes, and multi-modal data acquisition. Nevertheless, speech emotion recognition remains a task with limited success. Considering the nonlinear and fluctuating nature of emotional speech, in this paper we present fractal dimension-based features for speech emotion classification. We employed Katz, Castiglioni, Higuchi, and Hurst exponent-based features and their statistical functionals to establish a 224-dimensional full feature set. The dimensionality was reduced by applying the Sequential Forward Selection technique. The results of the experimental study show a clear superiority of the fractal dimension-based feature sets over the acoustic ones. An average accuracy of 96.5% was obtained using the reduced feature sets. The feature selection enabled us to obtain 4-dimensional and 8-dimensional sets for Lithuanian and German emotions, respectively.
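A hedged sketch of one of the fractal-dimension features mentioned above, Katz's fractal dimension of a framed signal; the test signal and framing are illustrative assumptions.

# Sketch: Katz's fractal dimension of a 1-D signal (e.g. one speech frame).
import numpy as np

def katz_fd(x):
    """Katz fractal dimension of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x) - 1
    # total curve length and maximum distance from the first sample
    dists = np.sqrt(1.0 + np.diff(x) ** 2)
    L = dists.sum()
    d = np.max(np.sqrt(np.arange(1, len(x)) ** 2 + (x[1:] - x[0]) ** 2))
    return np.log10(n) / (np.log10(n) + np.log10(d / L))

# Example: a noisy sinusoid as a stand-in for one speech frame
t = np.linspace(0, 1, 400)
frame = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(round(katz_fd(frame), 3))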


Emotion recognition is a rapidly growing research field. Emotions can be effectively expressed through speech and can provide insight into the speaker's intentions. Although humans can easily interpret emotions from speech, physical gestures, and eye movement, training a machine to do the same with similar precision is quite a challenging task. SER systems can improve human-machine interaction when used with automatic speech recognition, as emotions have the tendency to change the semantics of a sentence. Many researchers have contributed impressive work in this research area, leading to the development of numerous classification, feature selection, and feature extraction techniques and emotional speech databases. This paper reviews recent accomplishments in the area of speech emotion recognition. It also presents a detailed review of the various types of emotional speech databases, the different classification techniques that can be used individually or in combination, and a brief description of various speech features for emotion recognition.


Author(s): Vishal P. Tank, S. K. Hadia

In the last couple of years, emotion recognition has proven its significance in the areas of artificial intelligence and man-machine communication. Emotion recognition can be done using speech or images (facial expressions); this paper deals with speech emotion recognition (SER) only. An emotional speech database is essential for emotion recognition. In this paper we propose an emotional database developed in Gujarati, one of the official languages of India. The proposed speech corpus distinguishes six emotional states: sadness, surprise, anger, disgust, fear, and happiness. To observe the effect of different emotions, the proposed Gujarati speech database is analyzed using speech parameters such as pitch, energy, and MFCC in MATLAB.
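A small sketch of the per-utterance analysis described above (pitch and energy), here with librosa in Python rather than MATLAB; the file name and pitch range are placeholders.

# Sketch of per-utterance pitch and energy analysis; the file path is a placeholder.
import numpy as np
import librosa

y, sr = librosa.load("gujarati_angry_01.wav", sr=None)   # hypothetical recording

# Pitch contour via the pYIN estimator; unvoiced frames come back as NaN
f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
f0_voiced = f0[~np.isnan(f0)]

# Short-time energy (RMS per frame)
rms = librosa.feature.rms(y=y)[0]

print("mean F0 (Hz):", float(np.mean(f0_voiced)))
print("mean RMS energy:", float(np.mean(rms)))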


2020, pp. 1-15
Author(s): Wang Wei, Xinyi Cao, He Li, Lingjie Shen, Yaqin Feng, ...

To improve speech emotion recognition, a U-AWED features model is proposed based on an acoustic words emotion dictionary (AWED). The method models emotional information at the level of acoustic words in different emotion classes. The top-list words in each emotion are selected to generate the AWED vector. Then, the U-AWED model is constructed by combining utterance-level acoustic features with the AWED features. A support vector machine and a convolutional neural network are employed as the classifiers in our experiment. The results show that the proposed method provides significant improvement in unweighted average recall across all four emotion classification tasks.
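A very rough sketch of one possible reading of the dictionary construction: select the top-listed words per emotion class from labelled transcripts and represent an utterance as a frequency vector over that dictionary. This is an interpretation for illustration only, not the authors' exact procedure.

# Rough, hypothetical sketch of building an emotion-word dictionary vector.
from collections import Counter

train = [
    ("i am so happy today", "happy"),        # placeholder transcripts
    ("this is wonderful news", "happy"),
    ("i feel terrible and sad", "sad"),
    ("nothing is going right", "sad"),
]

top_k = 3
dictionary = []
for emotion in {label for _, label in train}:
    counts = Counter(w for text, label in train if label == emotion for w in text.split())
    dictionary += [w for w, _ in counts.most_common(top_k)]
dictionary = sorted(set(dictionary))

def awed_vector(text):
    """Frequency of each dictionary word in one utterance."""
    words = text.split()
    return [words.count(w) for w in dictionary]

print(dictionary)
print(awed_vector("i am sad but this is wonderful"))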


2015, Vol. 781, pp. 551-554
Author(s): Chaidiaw Thiangtham, Jakkree Srinonchat

Speech emotion recognition has been widely researched and applied in applications such as communication with robots, e-learning systems, and emergency calls. Speech emotion feature extraction is a key step in achieving speech emotion recognition, which can then be used for personal identification. Speech emotion features are extracted as several kinds of coefficients, such as Linear Predictive Coefficients (LPC), Linear Spectral Frequency (LSF), Zero-Crossing (ZC), and Mel-Frequency Cepstrum Coefficients (MFCC) [1-6]. Several research works have been done in speech emotion recognition. A study of zero-crossing with peak amplitudes for speech emotion classification is introduced in [4]; the results show that it provides a technique to extract the emotion feature in the time domain, but it still has a problem with amplitude shifting. Emotion recognition from speech is described in [5], which used the Gaussian Mixture Model (GMM) as the speech feature extractor; the GMM gives good results in reducing background noise, but random noise in the GMM recognition model still needs attention. Speech emotion recognition using a hidden Markov model and a support vector machine is explained in [6]; the results show that the average performance of the recognition system, based on the speech emotion features, still contains errors. Thus, the recognition performance reported in [1-6] still requires more focus on speech features.
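A short sketch of two of the classical feature types listed above, zero-crossing rate and linear predictive coefficients, computed with librosa; the audio file is a placeholder.

# Sketch: zero-crossing rate and LPC coefficients with librosa.
import librosa

y, sr = librosa.load("utterance.wav", sr=None)   # hypothetical file

zcr = librosa.feature.zero_crossing_rate(y)[0]   # per-frame zero-crossing rate
lpc = librosa.lpc(y, order=12)                   # 12th-order LPC polynomial coefficients

print("mean ZCR:", float(zcr.mean()))
print("LPC coefficients:", lpc.round(3))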


Sensors, 2021, Vol. 21 (13), pp. 4399
Author(s): Youngja Nam, Chankyu Lee

Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)–CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concept of residual learning to perform denoising; in the second stage, the CNN performs the classification. The classification results for real datasets show that the DnCNN–CNN outperforms the baseline CNN in overall accuracy for both languages. For Korean speech, the DnCNN–CNN achieves an accuracy of 95.8%, whereas the accuracy of the CNN is marginally lower (93.6%). For German speech, the DnCNN–CNN has an overall accuracy of 59.3–76.6%, whereas the CNN has an overall accuracy of 39.4–58.1%. These results demonstrate the feasibility of applying the DnCNN with residual learning to speech denoising and the effectiveness of the CNN-based approach in speech emotion recognition. Our findings provide new insights into speech emotion recognition in adverse conditions and have implications for language-universal speech emotion recognition.
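A hedged PyTorch sketch of a DnCNN-style denoiser with residual learning, where the network predicts the noise and subtracts it from the input, as in the first stage described above; depth, channel count, and input shape are assumptions.

# Sketch of a DnCNN-style residual denoiser (stage 1 of a cascade like the one above).
import torch
import torch.nn as nn

class DnCNN(nn.Module):
    def __init__(self, channels=1, features=64, depth=8):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [
                nn.Conv2d(features, features, 3, padding=1),
                nn.BatchNorm2d(features),
                nn.ReLU(inplace=True),
            ]
        layers.append(nn.Conv2d(features, channels, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        noise = self.body(x)      # residual learning: estimate the noise
        return x - noise          # denoised spectrogram / feature map

# Example: denoise a batch of noisy log-mel spectrogram "images"
noisy = torch.randn(4, 1, 64, 128)
clean_est = DnCNN()(noisy)
print(clean_est.shape)            # torch.Size([4, 1, 64, 128])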

