Voice and Speech Synthesis—Highlighting the Control of Prosody

Author(s):  
Keikichi Hirose

Speech synthesis started as an effort to mimic the human process of speech sound generation, and the quality of synthetic speech has since reached a level at which it is difficult to notice that it is synthetic. This is largely due to the development of waveform-concatenation methods, which select the most appropriate speech segments from a huge speech corpus. Although a lack of flexibility in producing various speech qualities/styles has been pointed out, this problem is about to be solved by introducing statistical frameworks into parametric speech synthesis. A speaker can now even speak a foreign language in his/her own voice using advanced voice-conversion techniques. However, current technologies cannot properly handle the prosodic features of speech, which have a hierarchical structure over a long time span; introducing prosody modelling into the speech-synthesis process is therefore necessary. In this chapter, after reviewing the history of voice/speech synthesis, technologies are explained, starting from text-to-speech and concept-to-speech conversion. Then, methods of sound generation are introduced. Statistical parametric speech synthesis, especially HMM-based speech synthesis, is introduced as a technology that enables flexible speech synthesis, that is, synthetic speech with various qualities/styles from a smaller speech corpus. After that, the problem of frame-by-frame processing of prosodic features is addressed and the importance of prosody modelling is pointed out. Prosodic (fundamental-frequency) modelling is surveyed and, finally, the generation process model is introduced along with some experimental results from its application to HMM-based speech synthesis.
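The generation process model referred to above (Fujisaki's model) represents the log-F0 contour as a baseline value plus superposed phrase and accent components, which captures the hierarchical, long-span structure of prosody that frame-by-frame processing misses. A minimal sketch in Python; the model equations are the standard published form, but the parameter values (alpha, beta, gamma, amplitudes, timings) used here are illustrative assumptions, not values from the chapter:

```python
import math

def phrase_component(t, alpha=3.0):
    # Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_component(t, beta=20.0, gamma=0.9):
    # Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0, else 0
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0_contour(times, fb, phrases, accents):
    """ln F0(t) = ln Fb + sum Ap*Gp(t - T0) + sum Aa*(Ga(t - T1) - Ga(t - T2)).

    phrases: list of (Ap, T0) phrase commands (impulses).
    accents: list of (Aa, T1, T2) accent commands (pedestals from T1 to T2).
    Returns F0 in Hz at each time in `times`.
    """
    contour = []
    for t in times:
        ln_f0 = math.log(fb)
        for ap, t0 in phrases:
            ln_f0 += ap * phrase_component(t - t0)
        for aa, t1, t2 in accents:
            ln_f0 += aa * (accent_component(t - t1) - accent_component(t - t2))
        contour.append(math.exp(ln_f0))
    return contour
```

Because the components are defined over whole phrases and accent groups rather than per frame, the resulting contour is smooth and hierarchically structured by construction.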

2013 ◽  
Vol 1 (1) ◽  
pp. 54-67
Author(s):  
Kanu Boku ◽  
Taro Asada ◽  
Yasunari Yoshitomi ◽  
Masayoshi Tabuse

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech-synthesis research. Generating emotional synthetic speech requires controlling the prosodic features of the utterances. The authors propose a case-based method for generating emotional synthetic speech by exploiting the maximum amplitude and utterance time of vowels, and the fundamental frequency, of emotional speech. As an initial investigation, they adopted the utterance of Japanese names, which are semantically neutral. Using the proposed method, emotional synthetic speech made from the emotional speech of one male subject was discriminable with a mean accuracy of 70% when ten subjects listened to synthetic utterances of the Japanese name “Taro” expressing “angry,” “happy,” “neutral,” “sad,” or “surprised.”
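The control described above adjusts three prosodic parameters of vowels per target emotion: maximum amplitude, utterance time (duration), and fundamental frequency. A minimal sketch of that parameter-scaling step; the scaling factors below are hypothetical placeholders for illustration, not values measured in the paper:

```python
# Hypothetical per-emotion scaling factors for (amplitude, duration, F0).
# These values are illustrative assumptions, not taken from the paper.
EMOTION_SCALES = {
    "neutral":   (1.0, 1.0, 1.0),
    "angry":     (1.4, 0.90, 1.20),
    "happy":     (1.2, 0.95, 1.30),
    "sad":       (0.8, 1.20, 0.85),
    "surprised": (1.3, 0.85, 1.40),
}

def modify_vowel(amplitude, duration_s, f0_hz, emotion):
    """Scale one vowel's three prosodic parameters toward a target emotion."""
    a, d, f = EMOTION_SCALES[emotion]
    return amplitude * a, duration_s * d, f0_hz * f
```

In a case-based system, the table would instead be filled per speaker from measurements of that speaker's own emotional utterances, which is what ties the synthetic emotion to the reference cases.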


Biomimetics ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. 12
Author(s):  
Marvin Coto-Jiménez

Statistical parametric speech synthesis based on Hidden Markov Models (HMM) has been an important technique for producing artificial voices, due to its ability to deliver highly intelligible results and sophisticated features such as voice conversion and accent modification with a small footprint, particularly for low-resource languages where deep-learning-based techniques remain unexplored. Despite this progress, the quality of HMM-based results does not reach that of the predominant approaches, based on unit selection of speech segments or on deep learning. One proposal for improving the quality of HMM-based speech has been to incorporate postfiltering stages, which aim to increase quality while preserving the advantages of the process. In this paper, we present a new approach to postfiltering synthesized voices that applies discriminative postfilters built from several long short-term memory (LSTM) deep neural networks. Our motivation stems from modelling separate mappings from synthesized to natural speech for segments corresponding to voiced and unvoiced sounds, since these sounds have different qualities and HMM-based voices can degrade each of them differently. The paper analyses the discriminative postfilters obtained for five voices, evaluated using three objective measures, including Mel cepstral distance, and subjective tests. The results indicate the advantages of the discriminative postfilters in comparison with the HTS voice and with non-discriminative postfilters.
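The discriminative idea above reduces to routing each synthesized frame to the network trained for its sound class (voiced vs. unvoiced), and the improvement can then be scored with Mel cepstral distance. A minimal sketch: the MCD formula is the standard one, but the routing interface and the stand-in filter callables are assumptions for illustration, not the paper's actual LSTM implementation:

```python
import math

def mel_cepstral_distance(mc_ref, mc_syn):
    """Standard MCD in dB between two mel-cepstral frames
    (energy coefficient c0 assumed excluded by the caller)."""
    sq = sum((a - b) ** 2 for a, b in zip(mc_ref, mc_syn))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq)

def discriminative_postfilter(frames, voiced_flags, voiced_net, unvoiced_net):
    """Route each synthesized frame to the postfilter trained for its class.

    voiced_net / unvoiced_net stand in for the two separately trained
    LSTM mappings from synthesized to natural speech features.
    """
    return [voiced_net(f) if v else unvoiced_net(f)
            for f, v in zip(frames, voiced_flags)]
```

A non-discriminative postfilter would pass every frame through one shared network; the routing step is the only structural difference, so the comparison isolates the benefit of class-specific mappings.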


2011 ◽  
Author(s):  
Keikichi Hirose ◽  
Keiko Ochi ◽  
Ryusuke Mihara ◽  
Hiroya Hashimoto ◽  
Daisuke Saito ◽  
...  
