Synthesis of Emotional Speech by Prosody Modification of Vowel Segments of Neutral Speech

Author(s):  
Md Shah Fahad ◽  
Shruti Gupta ◽  
Abhinav ◽  
Shreya Singh ◽  
Akshay Deepak

Background: Emotional speech synthesis is the process of synthesising emotions in neutral speech – potentially generated by a text-to-speech system – to make artificial human-machine interaction more human-like. It typically involves the analysis and modification of speech parameters. Existing work on speech synthesis involving modification of prosody parameters does so at the sentence, word, and syllable levels. However, more fine-grained modification at the vowel level has not yet been explored, which motivates our work. Objective: To explore prosody parameters at the vowel level for emotion synthesis. Method: Our work modifies prosody features (duration, pitch, and intensity) for emotion synthesis. Specifically, it modifies the duration parameter of vowel-like and pause regions and the pitch and intensity parameters of vowel-like regions only. The modification is gender-specific, uses emotional speech templates stored in a database, and is performed with the pitch-synchronous overlap-and-add (PSOLA) method. Result: A comparison was made with existing work on prosody modification at the sentence, word, and syllable levels on the IITKGP-SEHSC database. Improvements of 8.14%, 13.56%, and 2.80% in the relative mean opinion score were obtained for the emotions angry, happy, and fear, respectively. This was due to: (1) prosody modification at the vowel level being more fine-grained than at the sentence, word, or syllable level and (2) prosody patterns not being generated for consonant regions, because the vocal cords do not vibrate during consonant production. Conclusion: Our proposed work shows that emotional speech generated using prosody modification at the vowel level is more convincing than that generated by prosody modification at the sentence, word, and syllable levels.
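
To make the method concrete, the following is a minimal sketch of PSOLA-style prosody modification using the praat-parselmouth Python library. The file names, vowel-region boundaries, and scaling factors are hypothetical placeholders, and the sketch applies a single global intensity rescale rather than the paper's region-wise, template-driven, gender-specific modification.

```python
# Minimal sketch of PSOLA-style prosody modification with praat-parselmouth.
# Vowel-region boundaries, file names, and scaling factors are hypothetical;
# the paper's gender-specific, template-based factors are not reproduced here.
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("neutral.wav")          # neutral utterance (hypothetical file)
vowel_regions = [(0.12, 0.21), (0.45, 0.58)]      # assumed vowel-like intervals in seconds

# Build a Praat Manipulation object (PSOLA analysis: pitch floor 75 Hz, ceiling 600 Hz)
manipulation = call(sound, "To Manipulation", 0.01, 75, 600)

# Raise pitch inside vowel-like regions only (e.g., x1.3 as an illustrative factor)
pitch_tier = call(manipulation, "Extract pitch tier")
for start, end in vowel_regions:
    call(pitch_tier, "Multiply frequencies", start, end, 1.3)
call([pitch_tier, manipulation], "Replace pitch tier")

# Shorten vowel durations (e.g., x0.8), keeping the rest of the signal at 1.0
duration_tier = call(manipulation, "Extract duration tier")
eps = 0.001
for start, end in vowel_regions:
    call(duration_tier, "Add point", start - eps, 1.0)
    call(duration_tier, "Add point", start, 0.8)
    call(duration_tier, "Add point", end, 0.8)
    call(duration_tier, "Add point", end + eps, 1.0)
call([duration_tier, manipulation], "Replace duration tier")

# PSOLA resynthesis (overlap-add), then a global intensity rescale in dB
modified = call(manipulation, "Get resynthesis (overlap-add)")
call(modified, "Scale intensity", 72.0)
modified.save("emotional.wav", "WAV")
```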

Author(s):  
Ignasi Iriondo ◽  
Santiago Planet ◽  
Francesc Alías ◽  
Joan-Claudi Socoró ◽  
Elisa Martínez

The use of speech in human-machine interaction is increasing as computer interfaces become more complex but also more usable. These interfaces use information obtained from the user through the analysis of different modalities and respond by means of different media. The origin of multimodal systems can be traced to their precursor, the “Put-That-There” system (Bolt, 1980), an application operated by speech and gesture recognition. The use of speech as one of these modalities, both to receive commands from users and to provide spoken information, makes human-machine communication more natural. A growing number of applications use speech-to-text conversion and animated characters with speech synthesis. One way to improve the naturalness of these interfaces is to incorporate recognition of the user’s emotional state (Campbell, 2000). This generally requires the creation of speech databases with authentic emotional content that allow robust analysis. Cowie, Douglas-Cowie & Cox (2005) present several databases and note an increase in multimodal databases, and Ververidis & Kotropoulos (2006) describe 64 databases and their applications. When creating this kind of database, the main problem that arises is the naturalness of the utterances, which depends directly on the recording method, since recordings must be controlled without interfering with their authenticity. Campbell (2000) and Schröder (2004) propose four different sources for obtaining emotional speech, ordered from less control but more authenticity to more control but less authenticity: i) natural occurrences, ii) provocation of authentic emotions under laboratory conditions, iii) emotions stimulated by means of prepared texts, and iv) acted speech in which the same texts are read with different emotional states, usually performed by actors. On the one hand, corpora designed to synthesize emotional speech are based on studies centred on the listener, following the distinction made by Schröder (2004), because they model the speech parameters in order to transmit a specific emotion. On the other hand, emotion recognition implies studies centred on the speaker, because they relate the speaker’s emotional state to the parameters of the speech. The validation of a corpus used for synthesis involves both kinds of studies: the former because the corpus will be used for synthesis, and the latter because recognition is needed to evaluate its content. The best validation method is the selection of the valid utterances of the corpus by human listeners. However, the large size of a corpus makes this process unaffordable.


Emotion recognition is a rapidly growing research field. Emotions can be expressed effectively through speech and can provide insight into a speaker’s intentions. Although humans can easily interpret emotions from speech, physical gestures, and eye movement, training a machine to do the same with similar precision is quite a challenging task. SER systems can improve human-machine interaction when used with automatic speech recognition, as emotions have the tendency to change the semantics of a sentence. Many researchers have contributed impressive work in this research area, leading to the development of numerous classification techniques, feature selection and extraction methods, and emotional speech databases. This paper reviews recent accomplishments in the area of speech emotion recognition. It also presents a detailed review of various types of emotional speech databases, of different classification techniques that can be used individually or in combination, and a brief description of various speech features for emotion recognition.
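
As a point of reference for the feature/classifier combinations surveyed here, the snippet below sketches one common SER pipeline: utterance-level MFCC statistics classified with an SVM, using librosa and scikit-learn. The corpus file list and labels are hypothetical placeholders; this is only one of many possible configurations, not a specific system from the review.

```python
# Illustrative SER pipeline: MFCC statistics + SVM (one of many combinations
# surveyed in the review). The file list and labels are hypothetical placeholders.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path, sr=16000, n_mfcc=13):
    """Mean and standard deviation of MFCCs over the whole utterance."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical corpus: list of (wav_path, emotion_label) pairs
corpus = [("angry_01.wav", "angry"), ("happy_01.wav", "happy"), ("neutral_01.wav", "neutral")]
X = np.stack([utterance_features(path) for path, _ in corpus])
y = [label for _, label in corpus]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```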


2021 ◽  
pp. 1-9
Author(s):  
Harshadkumar B. Prajapati ◽  
Ankit S. Vyas ◽  
Vipul K. Dabhi

Facial expression recognition (FER) has attracted considerable attention from researchers in the field of computer vision because of its usefulness in security, robotics, and HMI (Human-Machine Interaction) systems. We propose a CNN (Convolutional Neural Network) architecture to address FER and evaluate its performance on the JAFFE dataset to show the effectiveness of the proposed model. We derive a concise CNN architecture for expression classification; the objective of the various experiments is to achieve convincing performance while reducing computational overhead. The proposed CNN model is very compact compared to other state-of-the-art models. We achieved a highest accuracy of 97.10% and an average accuracy of 90.43% over the top 10 runs without applying any pre-processing methods, which demonstrates the effectiveness of our model. Furthermore, we also include visualizations of the CNN layers to observe what the network learns.
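
Since the abstract does not specify the exact layer configuration, the sketch below shows a generic compact CNN of the kind described, written in PyTorch. The input size (48x48 grayscale), layer widths, and class count (7, as in JAFFE) are assumptions for illustration, not the paper's architecture.

```python
# Illustrative compact CNN for facial expression recognition (not the paper's
# exact architecture). Assumes grayscale faces resized to 48x48 and 7 classes.
import torch
import torch.nn as nn

class CompactFERNet(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = CompactFERNet()
dummy = torch.randn(8, 1, 48, 48)   # batch of 8 grayscale 48x48 face images
print(model(dummy).shape)           # torch.Size([8, 7])
```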


Author(s):  
Xiaochen Zhang ◽  
Lanxin Hui ◽  
Linchao Wei ◽  
Fuchuan Song ◽  
Fei Hu

Electric power wheelchairs (EPWs) enhance the mobility of the elderly and the disabled, while human-machine interaction (HMI) determines how precisely human intention is conveyed and how efficiently human-machine cooperation is carried out. A bibliometric quantitative analysis of 1154 publications related to this research field, published between 1998 and 2020, was conducted. We identified the development status, contributors, hot topics, and potential future research directions of this field. We believe that the combination of intelligence and humanization in EPW HMI systems based on human-machine collaboration is an emerging trend in EPW HMI methodology research. Particular attention should be paid to evaluating the applicability and benefits of EPW HMI methodologies for users, as well as how much they contribute to society. This study offers researchers a comprehensive understanding of EPW HMI studies over the past 22 years, of the latest trends drawn from their evolutionary footprint, and forward-looking insights regarding future research.


ATZ worldwide ◽  
2021 ◽  
Vol 123 (3) ◽  
pp. 46-49
Author(s):  
Tobias Hesse ◽  
Michael Oehl ◽  
Uwe Drewitz ◽  
Meike Jipp

Healthcare ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 834
Author(s):  
Magbool Alelyani ◽  
Sultan Alamri ◽  
Mohammed S. Alqahtani ◽  
Alamin Musa ◽  
Hajar Almater ◽  
...  

Artificial intelligence (AI) is a broad, umbrella term that encompasses the theory and development of computer systems able to perform tasks normally requiring human intelligence. The aim of this study is to assess the attitude of the radiology community in Saudi Arabia toward applications of AI. Methods: Data for this study were collected using electronic questionnaires in 2019 and 2020. The study included a total of 714 participants. Data analysis was performed using SPSS Statistics (version 25). Results: The majority of the participants (61.2%) had read or heard about the role of AI in radiology. We also found that radiologists’ responses differed statistically from those of all other specialists and that they tended to read more about AI. In addition, 82% of the participants thought that AI must be included in the curriculum of medical and allied health colleges, and 86% agreed that AI would be essential in the future. Even though human–machine interaction was considered to be one of the most important skills in the future, 89% of the participants thought that AI would never replace radiologists. Conclusion: Because AI plays a vital role in radiology, it is important to ensure that radiologists and radiographers have at least a minimum understanding of the technology. Our findings show an acceptable level of knowledge regarding AI technology and support including AI applications in the curricula of medical and health sciences colleges.

