The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach

Author(s):  
Noé Tits ◽  
Kevin El Haddad ◽  
Thierry Dutoit

As part of the Human-Computer Interaction field, expressive speech synthesis is a very rich domain, as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will get an overview of the main paradigms used in this field through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric, and statistical parametric speech synthesis. Finally, we focus on the last of these and on the latest techniques, which model Text-to-Speech synthesis as a sequence-to-sequence problem. This enables the use of Deep Learning blocks such as Convolutional and Recurrent Neural Networks as well as attention mechanisms. The last part of the chapter assembles the different aspects of the theory and summarizes the concepts.

Statistical parametric speech synthesis (SPSS) has grown faster than the traditional approaches to speech synthesis, and it overcomes several of their shortcomings. Its main advantages over traditional synthesis techniques are greater flexibility to change voice characteristics, support for multiple languages, good acoustic coverage, and robustness. It can also generate high-quality speech from a small training database. Deep neural networks and hidden Markov models are the basic SPSS techniques; Gaussian mixture models and sinusoidal models also fall into this category. We extracted two types of features: spectral features, such as spectral bandwidth and spectral centroid, and excitation features, such as fundamental frequency (F0). We used 722 Punjabi phonemes. Using Sound Forge software, we extracted 200 wave files related to those phonemes from a one-hour pre-recorded wave file. For each phoneme, 28 features were extracted and saved in a database. A text-to-speech (TTS) system generates speech as output when given Punjabi text as input. Many TTS systems have already been developed for various Indian languages; the system we are building targets Punjabi only.
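
The abstract names spectral centroid, spectral bandwidth, and F0 as example features but does not say how they were computed, and it does not list the full set of 28. The following is a minimal sketch of one common way to compute these three features on a single frame, assuming magnitude-weighted spectral moments and autocorrelation peak picking for F0. All function names, the 60 to 400 Hz search range, and the synthetic test tone are illustrative assumptions, not the authors' method. It uses only the Python standard library, so the DFT is a naive one; a real pipeline would likely use an FFT library.

```python
import cmath
import math


def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies in Hz, magnitudes) for the non-negative DFT bins."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        # Naive DFT of bin k; O(n^2) overall, acceptable for one short frame.
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency of the spectrum."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, mags)) / total


def spectral_bandwidth(freqs, mags):
    """Magnitude-weighted standard deviation around the centroid."""
    total = sum(mags)
    if total == 0:
        return 0.0
    centroid = spectral_centroid(freqs, mags)
    variance = sum(m * (f - centroid) ** 2 for f, m in zip(freqs, mags)) / total
    return math.sqrt(variance)


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 as the lag with the largest autocorrelation (assumed range)."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    # Return 0.0 when no positive autocorrelation peak is found (unvoiced frame).
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic 200 Hz tone standing in for a voiced phoneme frame (assumption).
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(400)]

freqs, mags = magnitude_spectrum(frame, SAMPLE_RATE)
features = {
    "spectral_centroid_hz": spectral_centroid(freqs, mags),
    "spectral_bandwidth_hz": spectral_bandwidth(freqs, mags),
    "f0_hz": estimate_f0(frame, SAMPLE_RATE),
}
print(features)
```

For this pure tone, the centroid and the F0 estimate should both come out close to 200 Hz and the bandwidth close to zero. On recorded speech, such features are usually computed frame by frame over each phoneme segment and then aggregated, for example by averaging, before being stored.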


Author(s):  
Beiming Cao ◽  
Myungjong Kim ◽  
Jan van Santen ◽  
Ted Mau ◽  
Jun Wang

2019 ◽  
Author(s):  
Elshadai Tesfaye Biru ◽  
Yishak Tofik Mohammed ◽  
David Tofu ◽  
Erica Cooper ◽  
Julia Hirschberg
