Combining concatenation and formant synthesis for improved intelligibility and naturalness in text-to-speech systems

1997 ◽  
Vol 1 (2) ◽  
pp. 103-107
Author(s):  
Steve Pearson ◽  
Frode Holm ◽  
Kazue Hata


Author(s):  
Thierry Dutoit ◽  
Yannis Stylianou

Text-to-speech (TTS) synthesis is the art of designing talking machines. Seen from this functional perspective, the task looks simple, but this chapter shows that delivering intelligible, natural-sounding, and expressive speech, while also taking engineering costs into account, is a real challenge. Speech synthesis has come a long way since the great controversy of the 1980s between MIT’s formant synthesis and Bell Labs’ diphone-based concatenative synthesis. While unit selection technology, which appeared in the mid-1990s, can be seen as an extension of diphone-based approaches, the appearance of hidden Markov model (HMM) synthesis around 2005 marked a major shift back to models. More recently, statistical approaches supported by advanced deep learning architectures have been shown to advance text analysis and normalization as well as waveform generation. Important recent milestones have been Google’s WaveNet (September 2016) and the sequence-to-sequence models referred to as Tacotron (1 and 2).


2020 ◽  
Author(s):  
Zofia Malisz ◽  
Gustav Eje Henter ◽  
Cassia Valentini-Botinhao ◽  
Oliver Watts ◽  
Jonas Beskow ◽  
...  

Decades of gradual advances in speech synthesis have recently culminated in exponential improvements fuelled by deep learning. This quantum leap has the potential to finally deliver realistic, controllable, and robust synthetic stimuli for speech experiments. In this article, we discuss these and other implications for phonetic sciences. We substantiate our argument by evaluating classic rule-based formant synthesis against state-of-the-art synthesisers on a) subjective naturalness ratings and b) a behavioural measure (reaction times in a lexical decision task). We also differentiate between text-to-speech and speech-to-speech methods. Naturalness ratings indicate that all modern systems are substantially closer to natural speech than formant synthesis. Reaction times for several modern systems do not differ substantially from natural speech, meaning that the processing gap observed in older systems, and reproduced with our formant synthesiser, is no longer evident. Importantly, some speech-to-speech methods are nearly indistinguishable from natural speech on both measures.
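For readers who want to reproduce the behavioural comparison in spirit, the following is a minimal sketch, not the authors' actual analysis pipeline: it assumes per-trial lexical-decision reaction times stored in a CSV with hypothetical columns `system` and `rt_ms`, and compares each synthesiser against the natural-speech condition with Welch's t-tests (the paper's own statistical modelling may differ).

```python
# Hypothetical sketch of the reaction-time comparison: per-system lexical-decision
# RTs are compared against the natural-speech condition with Welch's t-tests.
# The CSV layout (columns "system" and "rt_ms") and the file name are assumptions,
# not the authors' actual analysis pipeline.
import pandas as pd
from scipy import stats

df = pd.read_csv("lexical_decision_rts.csv")   # columns: system, rt_ms
natural = df.loc[df["system"] == "natural", "rt_ms"]

for system, group in df[df["system"] != "natural"].groupby("system"):
    t, p = stats.ttest_ind(group["rt_ms"], natural, equal_var=False)
    print(f"{system}: mean RT {group['rt_ms'].mean():.0f} ms "
          f"(natural {natural.mean():.0f} ms), Welch t = {t:.2f}, p = {p:.3f}")
```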


2012 ◽  
Vol 21 (2) ◽  
pp. 60-71 ◽  
Author(s):  
Ashley Alliano ◽  
Kimberly Herriger ◽  
Anthony D. Koutsoftas ◽  
Theresa E. Bartolotta

Using the iPad tablet for Augmentative and Alternative Communication (AAC) purposes can facilitate many communicative needs, is cost-effective, and is socially acceptable. Many individuals with communication difficulties can use iPad applications (apps) to augment communication, provide an alternative form of communication, or target receptive and expressive language goals. In this paper, we review a collection of iPad apps that can be used to address a variety of receptive and expressive communication needs. Based on recommendations from Gosnell, Costello, and Shane (2011), we systematically identified 21 apps that use symbols only, symbols and text-to-speech, or text-to-speech only, and we describe their features as a reference guide for speech-language pathologists. For each app we describe its purpose along with the following features: speech settings, representation, display, feedback features, rate enhancement, access, motor competencies, and cost. We also discuss how individuals with complex communication needs can use these apps for a variety of communication purposes and to target a variety of treatment goals. The information is presented in a user-friendly table format that clinicians can use as a reference guide.


2020 ◽  
pp. 1-12
Author(s):  
Li Dongmei

English text-to-speech conversion is a key topic in modern computing research. Its main difficulty is that text-to-speech feature recognition introduces large errors during conversion, which makes it hard to deploy English text-to-speech conversion algorithms in practical systems. To improve conversion efficiency, this article takes a machine learning approach: the original speech waveform is first annotated with pitch marks, the prosody is then modified with PSOLA, and the C4.5 algorithm is used to train a decision tree that decides the pronunciation of polyphones. To evaluate the performance of pronunciation discrimination based on part-of-speech rules and of HMM-based prosodic hierarchy prediction in a speech synthesis system, the study constructs a system model; waveform concatenation together with PSOLA is used to synthesize the output speech. For words whose main stress cannot be determined from morphological structure alone, the labels can be learned with machine learning methods. Finally, the study evaluates the algorithm's performance in controlled experiments. The results show that the proposed algorithm performs well and is of practical use.
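As a rough illustration of the polyphone step described above, here is a minimal sketch of C4.5-style pronunciation disambiguation. It is an assumption-laden stand-in, not the paper's implementation: scikit-learn's DecisionTreeClassifier with the entropy (information-gain) criterion substitutes for true C4.5, and the context features and training examples are hypothetical.

```python
# Hypothetical sketch: C4.5-style polyphone disambiguation.
# scikit-learn's DecisionTreeClassifier with the entropy criterion stands in
# for C4.5; the features and training examples below are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Each sample describes one occurrence of the polyphonic word "read":
# its part-of-speech tag, the POS tags of its neighbours, and the target
# pronunciation (/ri:d/ vs /red/).
train = [
    ({"pos": "VB",  "prev_pos": "TO",  "next_pos": "DT"},  "riyd"),
    ({"pos": "VBD", "prev_pos": "PRP", "next_pos": "DT"},  "rehd"),
    ({"pos": "VBN", "prev_pos": "VBD", "next_pos": "IN"},  "rehd"),
    ({"pos": "VBP", "prev_pos": "PRP", "next_pos": "NNS"}, "riyd"),
]
X, y = zip(*train)

model = make_pipeline(
    DictVectorizer(sparse=False),                  # one-hot encode the context features
    DecisionTreeClassifier(criterion="entropy"),   # information-gain splits, as in C4.5
)
model.fit(list(X), list(y))

# Predict the pronunciation for a new context.
print(model.predict([{"pos": "VBD", "prev_pos": "NN", "next_pos": "."}]))
```

In practice the feature set would come from the text-analysis front end (POS tags, neighbouring words, punctuation), and the learned tree would feed its chosen pronunciation to the PSOLA/concatenation back end described above.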

