Integrating Rule and Template-Based Approaches to Prosody Generation for Emotional BODO Speech Synthesis

This paper describes a speech synthesis system for Isarn, a regional dialect spoken in the Northeast of Thailand. In this study, we focus on improving the prosody generation of the system by using additional context features. To develop the system, the speech parameters (Mel-cepstrum and fundamental frequencies of phonemes within different phonetic contexts) were modelled using Hidden Markov Models (HMMs). Synthetic speech was generated by converting the input text into context-dependent phonemes. Speech parameters were then generated from the trained HMMs according to the context-dependent phonemes and synthesized through a speech vocoder. Systems were trained using three different feature sets: basic contextual, tonal, and syllable-context features. Objective and subjective tests were conducted to determine the performance of the proposed system. The results indicated that the addition of the syllable-context features significantly improved the naturalness of the synthesized speech.
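
The step of converting input into context-dependent phonemes can be illustrated with a minimal sketch. The abstract does not specify the label format or the exact features, so everything below is an assumption made for illustration: the `Syllable` type, the `context_labels` function, the `/T:` and `/S:` fields, and the toy phoneme strings are all hypothetical. The two flags loosely mirror the idea of adding tonal and syllable-context features on top of a basic phonetic context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    phonemes: List[str]
    tone: int  # hypothetical tone index; the real tone inventory is not given


def context_labels(syllables: List[Syllable],
                   use_tone: bool = False,
                   use_syllable: bool = False) -> List[str]:
    """Build one illustrative context-dependent label per phoneme."""
    # Flatten to (phoneme, syllable index, position inside the syllable).
    flat = [(p, si, pi)
            for si, syl in enumerate(syllables)
            for pi, p in enumerate(syl.phonemes)]
    labels = []
    for i, (p, si, pi) in enumerate(flat):
        # Basic context: previous and next phoneme, "sil" at the edges.
        prev_p = flat[i - 1][0] if i > 0 else "sil"
        next_p = flat[i + 1][0] if i + 1 < len(flat) else "sil"
        label = f"{prev_p}-{p}+{next_p}"
        if use_tone:
            # Tonal feature: tone of the syllable containing this phoneme.
            label += f"/T:{syllables[si].tone}"
        if use_syllable:
            # Syllable context: position of the phoneme in its syllable
            # and position of the syllable in the utterance.
            label += (f"/S:{pi + 1}_{len(syllables[si].phonemes)}"
                      f"@{si + 1}_{len(syllables)}")
        labels.append(label)
    return labels


# Toy input, invented for this example only.
utterance = [Syllable(["k", "a"], tone=1), Syllable(["n", "i", "t"], tone=3)]
basic = context_labels(utterance)
full = context_labels(utterance, use_tone=True, use_syllable=True)
print(basic)
print(full)
```

In an HMM-based system, labels of this kind would typically select which context-dependent model supplies the spectral and F0 parameters for each phoneme; richer labels let the models separate contexts that a basic label would merge.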


Text-to-Speech Synthesis

10.1093/oxfordhb/9780199276349.013.0017 ◽

2012 ◽

Cited By ~ 2

Author(s):

Thierry Dutoit ◽

Yannis Stylianou

Keyword(s):

Language Processing ◽

Speech Synthesis ◽

State Of The Art ◽

Digital Signal ◽

Text To Speech ◽

Waveform Generation ◽

Sentence Level ◽

Text To Speech Synthesis ◽

Commercial Applications ◽

Prosody Generation

This article gives an introduction to state-of-the-art text-to-speech (TTS) synthesis systems, showing both the natural language processing and the digital signal processing problems involved. Text-to-speech (TTS) synthesis is the art of designing talking machines. The article begins with a brief user-oriented description of a general TTS system and comments on its commercial applications. It then gives a functional diagram of a modern TTS system, highlighting its components, and describes its morphosyntactic module. Furthermore, it examines why sentence-level phonetization cannot be achieved by a sequence of dictionary look-ups, and describes possible implementations of the phonetizer. Finally, the article describes prosody generation, outlining how intonation and duration can be approximately computed from text. Prosody refers to certain properties of the speech signal that are related to audible changes in pitch, loudness, and syllable length. This article also introduces the two main existing categories of techniques for waveform generation: synthesis by rule and concatenative synthesis.


Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection

10.21437/interspeech.2020-1411 ◽

2020 ◽

Author(s):

Shubhi Tyagi ◽

Marco Nicolis ◽

Jonas Rohnke ◽

Thomas Drugman ◽

Jaime Lorenzo-Trueba

Keyword(s):

Speech Synthesis ◽

Prosody Generation


Perceptual relevance of pitch contours of Mandarin tones and its efficacy in prosody generation of speech synthesis

10.21437/interspeech.2007-701 ◽

2007 ◽

Author(s):

Shi-Han Chen ◽

Chih-Chung Kuo

Keyword(s):

Speech Synthesis ◽

Pitch Contours ◽

Prosody Generation ◽

Mandarin Tones


Prosody generation by integrating rule and template-based approaches for emotional Malay speech synthesis

TENCON 2008 - 2008 IEEE Region 10 Conference ◽

10.1109/tencon.2008.4766654 ◽

2008 ◽

Author(s):

Mumtaz Begum ◽

Raja N. Ainon ◽

Roziati Zainuddin ◽

Zuraidah M. Don ◽

Gerry Knowles

Keyword(s):

Speech Synthesis ◽

Prosody Generation


Speech synthesis from natural models by hand and by algorithm

PsycEXTRA Dataset ◽

10.1037/e520562012-289 ◽

2009 ◽

Author(s):

Robert E. Remez ◽

Kathryn R. Dubowski ◽

Morgana L. Davids ◽

Emily F. Thomas ◽

Nina Paddu ◽

...

Keyword(s):

Speech Synthesis


Design of English text-to-speech conversion algorithm based on machine learning

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189238 ◽

2020 ◽

pp. 1-12

Author(s):

Li Dongmei

Keyword(s):

Machine Learning ◽

Speech Synthesis ◽

Feature Recognition ◽

Learning Algorithm ◽

Morphological Structure ◽

English Text ◽

Text To Speech ◽

Part Of Speech ◽

Modern Computer ◽

Conversion Algorithm

English text-to-speech conversion is a key topic in modern computer technology research. Its difficulty is that feature recognition during text-to-speech conversion introduces large errors, which makes it hard to apply English text-to-speech conversion algorithms in a working system. To improve the efficiency of English text-to-speech conversion, this article builds on machine learning algorithms: after the original voice waveform is labeled with the pitch, the rhythm is modified through PSOLA, and the C4.5 algorithm is used to train a decision tree for judging the pronunciation of polyphones. To evaluate the performance of the pronunciation discrimination method based on part-of-speech rules and of HMM-based prosody hierarchy prediction in speech synthesis systems, this study constructs a system model. In addition, the waveform stitching method and PSOLA are used to synthesize the sound. For words whose main stress cannot be determined from morphological structure, labels can be learned by machine learning methods. Finally, the study evaluates and analyzes the performance of the algorithm through controlled experiments. The results show that the proposed algorithm performs well and has some practical value.
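
As a rough illustration of how a C4.5-style decision tree might choose a feature for polyphone pronunciation, the sketch below computes the gain ratio, the split criterion C4.5 is known for. It is not the paper's implementation: the feature names (`pos`, `prev_word`), the pronunciation labels, and the four toy rows are invented for this example, and only the criterion for a single split is shown, not a full tree.

```python
import math
from collections import Counter
from typing import Dict, List


def entropy(values: List[str]) -> float:
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(rows: List[Dict[str, str]], feature: str, target: str) -> float:
    """Information gain of `feature` on `target`, divided by split information."""
    base = entropy([r[target] for r in rows])
    # Group target labels by the value the feature takes in each row.
    groups: Dict[str, List[str]] = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = base - remainder
    # Split information penalizes features with many distinct values.
    split_info = entropy([r[feature] for r in rows])
    return gain / split_info if split_info > 0 else 0.0


# Toy data, invented for illustration: one polyphonic word whose
# pronunciation ("A" or "B") depends on its part of speech.
rows = [
    {"pos": "noun", "prev_word": "the", "pron": "A"},
    {"pos": "noun", "prev_word": "a", "pron": "A"},
    {"pos": "verb", "prev_word": "to", "pron": "B"},
    {"pos": "verb", "prev_word": "the", "pron": "B"},
]
pos_score = gain_ratio(rows, "pos", "pron")
prev_score = gain_ratio(rows, "prev_word", "pron")
print(pos_score, prev_score)
```

On this toy data the part-of-speech feature separates the two pronunciations completely, so it scores higher than the previous-word feature and would be chosen as the split, which is consistent with the abstract's emphasis on part-of-speech rules.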
