Constructing emotional speech synthesizers with limited speech database

Author(s):  
Heiga Zen ◽  
Tadashi Kitamura ◽  
Murtaza Bulut ◽  
Shrikanth Narayanan ◽  
Ryosuke Tsuzuki ◽  
...  
2017 ◽  
Vol 14 (4) ◽  
pp. 172988141771983 ◽  
Author(s):  
Changqin Quan ◽  
Bin Zhang ◽  
Xiao Sun ◽  
Fuji Ren

Affective computing is not only a direction of reform in artificial intelligence but also an exemplification of advanced intelligent machines. Emotion is one of the biggest differences between humans and machines; a machine that behaves with emotion will be accepted by more people. Voice is the most natural mode of daily communication and is easily understood and accepted, so the recognition of emotion in speech is an important field of artificial intelligence. However, in emotion recognition, certain pairs of emotions are particularly vulnerable to confusion. This article presents a combined cepstral distance method for two-group multi-class emotion classification in emotional speech recognition. Cepstral distance combined with speech energy is widely used for endpoint detection in speech recognition; in this work, cepstral distance is instead used to measure the similarity between frames of emotional signals and frames of neutral signals. These features are the input to a directed acyclic graph support vector machine classifier. Finally, a two-group classification strategy is adopted to resolve confusion in multi-emotion recognition. In the experiments, a Chinese Mandarin emotion database is used, and a large training set (1134 + 378 utterances) ensures a powerful modelling capability for predicting emotion. The experimental results show that the cepstral distance features increase the recognition rate for sadness and balance the recognition results while eliminating overfitting. On the German Berlin emotional speech database, the recognition rate between sadness and boredom, which are very difficult to distinguish, reaches 95.45%.
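
A minimal sketch of the general idea, not the authors' exact pipeline: per-frame cepstral distances to a neutral reference are pooled into utterance-level statistics and fed to an SVM. The library calls (librosa, scikit-learn), the pooled statistics, and the use of a one-vs-one SVC in place of an explicit DAG-SVM evaluation order are all illustrative assumptions.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def cepstral_distance_features(wav_path, neutral_cepstra, sr=16000, n_mfcc=13):
    """Mean/std/max of per-frame Euclidean distance between an utterance's
    cepstra and the average neutral cepstrum (illustrative pooling only)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    neutral_mean = neutral_cepstra.mean(axis=0)               # reference cepstrum
    d = np.linalg.norm(mfcc - neutral_mean, axis=1)           # distance per frame
    return np.array([d.mean(), d.std(), d.max()])

# One-vs-one SVC approximates the pairwise structure of a DAG-SVM;
# the directed-acyclic evaluation order itself is omitted here.
clf = SVC(kernel="rbf", decision_function_shape="ovo")
# clf.fit(X_train, y_train)  # X_train: stacked per-utterance feature vectors
```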


EONEOHAG ◽  
2015 ◽  
(72) ◽  
pp. 175-199
Author(s):  
손남호 ◽  
Hwang Hyosung ◽  
Ho-Young Lee

2020 ◽  
Vol 24 (5) ◽  
pp. 1065-1086
Author(s):  
Kudakwashe Zvarevashe ◽  
Oludayo O. Olugbara

Speech emotion recognition has become the heart of most human-computer interaction applications in the modern world. The growing need to develop emotionally intelligent devices has opened up many research opportunities. Most researchers in this field have applied handcrafted features and machine learning techniques to recognise speech emotion. However, these techniques require extra processing steps, and handcrafted features are usually not robust; they are also computationally intensive, and the curse of dimensionality results in low discriminating power. Research has shown that deep learning algorithms are effective for extracting robust and salient features from a dataset. In this study, we have developed a custom 2D convolutional neural network that performs both feature extraction and classification of vocal utterances. The network has been evaluated against a deep multilayer perceptron and a deep radial basis function neural network using the Berlin database of emotional speech, the Ryerson audio-visual emotional speech database and the Surrey audio-visual expressed emotion corpus. The described deep learning algorithm achieves the highest precision, recall and F1-scores when compared to other existing algorithms. It is observed that there may be a need to develop customized solutions for different language settings depending on the area of application.
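
A compact sketch of a 2D CNN that classifies emotions from spectrogram-like inputs. The layer sizes, input shape and number of classes are assumptions for illustration; the paper's custom architecture is not reproduced here.

```python
import tensorflow as tf

def build_ser_cnn(input_shape=(128, 128, 1), n_classes=7):
    """Small 2D-CNN: stacked conv/pool blocks followed by a dense classifier."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_ser_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_spectrograms, y_labels, epochs=30, validation_split=0.2)
```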


2019 ◽  
Vol 9 (12) ◽  
pp. 2470 ◽  
Author(s):  
Anvarjon Tursunov ◽  
Soonil Kwon ◽  
Hee-Suk Pang

The most widely used and well-known acoustic features of a speech signal, the Mel-frequency cepstral coefficients (MFCC), cannot sufficiently characterize emotions in speech when the classification must cover both discrete emotions (i.e., anger, happiness, sadness, and neutral) and emotions in the valence dimension (positive and negative). The main reason is that some discrete emotions, such as anger and happiness, share similar acoustic features in the arousal dimension (high and low) but differ in the valence dimension. Timbre is a sound quality that can discriminate between two sounds even when they have the same pitch and loudness. In this paper, we analyzed timbre acoustic features to improve the classification performance for discrete emotions as well as for emotions in the valence dimension. Sequential forward selection (SFS) was used to find the most relevant acoustic features among the timbre features. The experiments were carried out on the Berlin Emotional Speech Database and the Interactive Emotional Dyadic Motion Capture Database. A support vector machine (SVM) and a long short-term memory recurrent neural network (LSTM-RNN) were used to classify emotions. Significant improvements in classification performance were achieved using a combination of the baseline features and the most relevant timbre features, which were found by applying SFS to emotion classification on the Berlin Emotional Speech Database. Extensive experiments showed that timbre acoustic features can sufficiently characterize emotions in speech in the valence dimension.
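
A minimal sketch of sequential forward selection wrapped around an SVM, assuming a precomputed feature matrix X of baseline plus timbre features. The number of selected features, the kernel and the cross-validation setup are illustrative assumptions, not the paper's settings.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0)
sfs = SequentialFeatureSelector(svm, n_features_to_select=10,
                                direction="forward", cv=5)
# X: (n_utterances, n_acoustic_features) matrix; y: emotion or valence labels
# X_scaled = StandardScaler().fit_transform(X)
# sfs.fit(X_scaled, y)                  # greedily adds the most relevant features
# X_selected = sfs.transform(X_scaled)  # keep only the selected feature subset
```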


Author(s):  
Noor Aina Zaidan ◽  
Md Sah Hj Salam

Accurate detection of human emotion is crucial in industry to ensure effective conversations and message delivery. The process of identifying emotions must be carried out properly, using a method that guarantees a high level of emotion recognition. The energy feature is said to be a prosodic information encoder, and ongoing studies on the use of energy in speech prosody motivated us to run an experiment on energy features. We conducted two sets of studies: (1) whether local or global features contribute most to emotion recognition, and (2) the effect of end-part segment length on emotion recognition accuracy using two types of segmentation approach. This paper discusses the Absolute Time Intervals at Relative Positions (ATIR) segmentation approach and global ATIR (GATIR), using end-part segmented global energy features extracted from the Berlin Emotional Speech Database (EMO-DB). We observed that global features contribute more to emotion recognition, and that global features derived from longer segments give higher recognition accuracy than global features derived from short segments. The addition of an utterance-based feature (GTI) to ATIR segmentation increases accuracy by 5% to 8%, and we conclude that GATIR outperforms the ATIR segmentation approach in terms of recognition rate. Almost all sub-tests in this study show improved results, indicating that global features derived from longer segments capture more emotional information and enhance system performance.
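
A rough sketch of the contrast between utterance-level (global) and end-part segment energy statistics. The fixed end-segment length and the RMS framing are assumptions; the exact ATIR/GATIR segment definitions from the paper are not reproduced.

```python
import numpy as np
import librosa

def energy_features(wav_path, sr=16000, end_seconds=1.0):
    """Global and end-part short-time energy statistics (illustrative only)."""
    y, _ = librosa.load(wav_path, sr=sr)
    rms = librosa.feature.rms(y=y)[0]                  # short-time energy contour
    global_feats = [rms.mean(), rms.std(), rms.max()]  # utterance-level (global)
    tail = y[-int(end_seconds * sr):]                  # fixed-length end segment
    tail_rms = librosa.feature.rms(y=tail)[0]
    end_feats = [tail_rms.mean(), tail_rms.std(), tail_rms.max()]
    return np.array(global_feats + end_feats)
```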


Author(s):  
Vishal P. Tank ◽  
S. K. Hadia

In the last couple of years, emotion recognition has proven its significance in the areas of artificial intelligence and man-machine communication. Emotion recognition can be performed using speech or images (facial expressions); this paper deals with speech emotion recognition (SER) only. An emotional speech database is essential for emotion recognition. In this paper we propose an emotional database developed in Gujarati, one of the official languages of India. The proposed speech corpus covers six emotional states: sadness, surprise, anger, disgust, fear, and happiness. To observe the effect of different emotions, an analysis of the proposed Gujarati speech database is carried out using efficient speech parameters such as pitch, energy and MFCC, using MATLAB software.
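
The paper's analysis is carried out in MATLAB; the librosa sketch below only illustrates extraction of the same three parameter types (pitch, energy, MFCC) in Python, with default analysis settings chosen for illustration.

```python
import numpy as np
import librosa

def analyze_utterance(wav_path, sr=16000):
    """Pitch contour, short-time energy and MFCCs for one utterance."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)       # pitch contour (Hz)
    energy = librosa.feature.rms(y=y)[0]                # short-time energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 MFCCs per frame
    return {"f0_mean": float(np.mean(f0)),
            "energy_mean": float(energy.mean()),
            "mfcc_mean": mfcc.mean(axis=1)}
```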


Data ◽  
2021 ◽  
Vol 6 (12) ◽  
pp. 130
Author(s):  
Mathilde Marie Duville ◽  
Luz María Alonso-Valerdi ◽  
David I. Ibarra-Zarate

In this paper, the Mexican Emotional Speech Database (MESD), which contains single-word emotional utterances for anger, disgust, fear, happiness, neutral and sadness with adult (male and female) and child voices, is described. To validate the emotional prosody of the uttered words, a cubic Support Vector Machine classifier was trained on prosodic, spectral and voice quality features for each case study: (1) male adult, (2) female adult and (3) child. In addition, the cultural, semantic, and linguistic shaping of emotional expression was assessed by statistical analysis. This study was registered at BioMed Central and is part of the implementation of a published study protocol. Mean emotional classification accuracies were 93.3%, 89.4% and 83.3% for male, female and child utterances, respectively. Statistical analysis emphasized the shaping of emotional prosodies by semantic and linguistic features. A cultural variation in emotional expression was highlighted by comparing the MESD with the INTERFACE for Castilian Spanish database. The MESD provides reliable content for linguistic emotional prosody shaped by the Mexican cultural environment. To facilitate further investigations, two additional corpora are provided: one controlled for linguistic features and emotional semantics, and one containing words repeated across voices and emotions. The MESD is made freely available.
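
A brief sketch interpreting the "cubic SVM" as a support vector classifier with a degree-3 polynomial kernel, trained on per-utterance prosodic, spectral and voice-quality features. The feature matrix, labels and evaluation scheme are placeholders, not the validation protocol used for the MESD.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Degree-3 polynomial kernel ("cubic") SVM with feature standardization.
cubic_svm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=1.0))
# X: prosodic + spectral + voice-quality features per utterance; y: emotion labels
# scores = cross_val_score(cubic_svm, X, y, cv=5)
# print(scores.mean())  # compare against the reported per-voice accuracies
```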

