Modern speech synthesis for phonetic sciences: a discussion and an evaluation

Decades of gradual advances in speech synthesis have recently culminated in exponential improvements fuelled by deep learning. This quantum leap has the potential to finally deliver realistic, controllable, and robust synthetic stimuli for speech experiments. In this article, we discuss these and other implications for phonetic sciences. We substantiate our argument by evaluating classic rule-based formant synthesis against state-of-the-art synthesisers on a) subjective naturalness ratings and b) a behavioural measure (reaction times in a lexical decision task). We also differentiate between text-to-speech and speech-to-speech methods. Naturalness ratings indicate that all modern systems are substantially closer to natural speech than formant synthesis. Reaction times for several modern systems do not differ substantially from natural speech, meaning that the processing gap observed in older systems, and reproduced with our formant synthesiser, is no longer evident. Importantly, some speech-to-speech methods are nearly indistinguishable from natural speech on both measures.

Integration of rule-based formant synthesis and waveform concatenation: a hybrid approach to text-to-speech synthesis

Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002. ◽

10.1109/wss.2002.1224379 ◽

2004 ◽

Cited By ~ 6

Author(s):

S.R. Hertz

Keyword(s):

Speech Synthesis ◽

Hybrid Approach ◽

Text To Speech ◽

Rule Based ◽

Formant Synthesis ◽

Text To Speech Synthesis

A rule-based phrase parser for real-time text-to-speech synthesis

Natural Language Engineering ◽

10.1017/s1351324900000140 ◽

1995 ◽

Vol 1 (2) ◽

pp. 191-212 ◽

Cited By ~ 1

Author(s):

Joan Bachenko ◽

Eileen Fitzpatrick ◽

Jeffrey Daugherty

Keyword(s):

Real Time ◽

Hard Of Hearing ◽

Speech Synthesis ◽

Break Point ◽

Linguistic Context ◽

Text To Speech ◽

Rule Based ◽

Front End ◽

Text To Speech Synthesis ◽

Break Points

Text-to-speech systems are currently designed to work on complete sentences and paragraphs, thereby allowing front-end processors access to large amounts of linguistic context. Problems with this design arise when applications require text to be synthesized in near real time, as it is being typed. How does the system decide which incoming words should be collected and synthesized as a group when prior and subsequent word groups are unknown? We describe a rule-based parser that uses a three-cell buffer and phrasing rules to identify break points for incoming text. Words up to the break point are synthesized as new text is moved into the buffer; no hierarchical structure is built beyond the lexical level. The parser was developed for use in a system that synthesizes written telecommunications by Deaf and hard of hearing people. These are texts written entirely in upper case, with little or no punctuation, and using a nonstandard variety of English (e.g. WHEN DO I WILL CALL BACK YOU). The parser performed well in a three-month field trial utilizing tens of thousands of texts. Laboratory tests indicate that the parser exhibited a low error rate when compared with a human reader.
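The buffering idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' parser: the abstract does not give the phrasing rules, so the `PHRASE_STARTERS` set, the `is_break` rule, and the function name `incremental_chunks` are all hypothetical stand-ins. The sketch only shows how a fixed-size word buffer lets a system release word groups for synthesis before the rest of the text has arrived.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List

# Hypothetical phrasing rule data: words assumed to open a new phrase.
# The real parser's rules are not given in the abstract.
PHRASE_STARTERS = {"AND", "BUT", "BECAUSE", "WHEN", "IF", "SO"}


def is_break(prev_word: str, next_word: str) -> bool:
    """Hypothetical rule: place a break before a phrase-starting word."""
    return next_word in PHRASE_STARTERS


def incremental_chunks(
    words: Iterable[str], buffer_size: int = 3
) -> Iterator[List[str]]:
    """Yield word groups as soon as a break point is found.

    Only `buffer_size` words of lookahead are held at any time, and no
    structure above the word level is built.
    """
    if buffer_size < 2:
        raise ValueError("buffer_size must be at least 2")
    buffer: Deque[str] = deque()
    pending: List[str] = []  # words already committed to the current group

    for word in words:
        buffer.append(word)
        if len(buffer) < buffer_size:
            continue
        # Buffer is full: commit the oldest word, then test for a break
        # between it and the word that follows.
        oldest = buffer.popleft()
        pending.append(oldest)
        if is_break(oldest, buffer[0]):
            yield pending
            pending = []

    # End of input: drain the buffer with the same rule.
    while buffer:
        oldest = buffer.popleft()
        pending.append(oldest)
        if buffer and is_break(oldest, buffer[0]):
            yield pending
            pending = []
    if pending:
        yield pending


if __name__ == "__main__":
    for group in incremental_chunks("I WILL CALL YOU WHEN I GET HOME".split()):
        print(" ".join(group))
```

With these assumed rules, the first group is released as soon as the phrase-starting word enters the lookahead window, which is the property that matters for near-real-time synthesis of typed text.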

Deep Syntactic Analysis and Rule Based Accentuation in Text-to-Speech Synthesis

Text, Speech and Dialogue - Lecture Notes in Computer Science ◽

10.1007/978-3-540-87391-4_68 ◽

2008 ◽

pp. 535-542 ◽

Cited By ~ 1

Author(s):

Antti Suni ◽

Martti Vainio

Keyword(s):

Speech Synthesis ◽

Syntactic Analysis ◽

Text To Speech ◽

Rule Based ◽

Text To Speech Synthesis

Generating the Voice of the Interactive Virtual Assistant

10.5772/intechopen.95510 ◽

2021 ◽

Author(s):

Adriana Stan ◽

Beáta Lőrincz

Keyword(s):

Speech Synthesis ◽

Text Processing ◽

Research Field ◽

Text To Speech ◽

Rule Based ◽

Acoustic Modelling ◽

Research Problems ◽

Text To Speech Synthesis ◽

Main Components ◽

The Voice

This chapter presents an overview of the current approaches for generating spoken content using text-to-speech synthesis (TTS) systems, and thus the voice of an Interactive Virtual Assistant (IVA). The overview builds upon the issues that make spoken content generation a non-trivial task, and introduces the two main components of a TTS system: text processing and acoustic modelling. It then gives the reader the minimum scientific detail on the terminology and methods involved in speech synthesis, yet enough knowledge to make the initial decisions regarding the choice of technology for the vocal identity of the IVA. The description of speech synthesis methodologies begins with basic, easy-to-run, low-requirement rule-based synthesis and ends with the state-of-the-art deep learning landscape. To bring this extremely complex and extensive research field closer to commercial deployment, the chapter provides an extensive index of the readily and freely available resources and tools required to build a TTS system. Quality evaluation methods and open research problems are also highlighted at the end of the chapter.

Towards designing a high intelligibility rule based Standard Malay text-to-speech synthesis system

2008 International Conference on Computer and Communication Engineering ◽

10.1109/iccce.2008.4580574 ◽

2008 ◽

Cited By ~ 3

Author(s):

Zakiah Hanim Ahmad ◽

Othman Khalifa

Keyword(s):

Speech Synthesis ◽

Text To Speech ◽

Synthesis System ◽

Rule Based ◽

Text To Speech Synthesis

A rule based perceptual intonation model for Turkish text-to-speech synthesis

2012 20th Signal Processing and Communications Applications Conference (SIU) ◽

10.1109/siu.2012.6204475 ◽

2012 ◽

Cited By ~ 1

Author(s):

Ibrahim Baran Uslu ◽

Hakki Gokhan Ilk

Keyword(s):

Speech Synthesis ◽

Text To Speech ◽

Rule Based ◽

Text To Speech Synthesis ◽

Turkish Text

A Rule-Based Concatenative Approach to Speech Synthesis in Indian Language Text-to-Speech Systems

Advances in Intelligent Systems and Computing - Intelligent Computing, Communication and Devices ◽

10.1007/978-81-322-2009-1_59 ◽

2014 ◽

pp. 523-531 ◽

Cited By ~ 1

Author(s):

Soumya Priyadarsini Panda ◽

Ajit Kumar Nayak

Keyword(s):

Speech Synthesis ◽

Text To Speech ◽

Indian Language ◽

Rule Based ◽

Language Text

Text‐to‐speech synthesis and statistical analysis of natural speech corpora

The Journal of the Acoustical Society of America ◽

10.1121/1.409116 ◽

1994 ◽

Vol 95 (5) ◽

pp. 2948-2948

Author(s):

Jan P. H. van Santen

Keyword(s):

Statistical Analysis ◽

Speech Synthesis ◽

Natural Speech ◽

Text To Speech ◽

Speech Corpora ◽

Text To Speech Synthesis

Text-to-Speech Synthesis

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.38 ◽

2018 ◽

Author(s):

Thierry Dutoit ◽

Yannis Stylianou

Keyword(s):

Speech Synthesis ◽

Markov Models ◽

Text To Speech ◽

Functional Perspective ◽

Formant Synthesis ◽

Engineering Costs ◽

Text To Speech Synthesis ◽

Major Shift ◽

Learning Architectures ◽

Real Challenge

Text-to-speech (TTS) synthesis is the art of designing talking machines. Seen from this functional perspective, the task looks simple, but this chapter shows that delivering intelligible, natural-sounding, and expressive speech, while also taking into account engineering costs, is a real challenge. Speech synthesis has made a long journey from the big controversy in the 1980s, between MIT’s formant synthesis and Bell Labs’ diphone-based concatenative synthesis. While unit selection technology, which appeared in the mid-1990s, can be seen as an extension of diphone-based approaches, the appearance of Hidden Markov Model (HMM) synthesis around 2005 resulted in a major shift back to models. More recently, statistical approaches, supported by advanced deep learning architectures, have been shown to advance text analysis and normalization as well as the generation of waveforms. Important recent milestones have been Google’s WaveNet (September 2016) and the sequence-to-sequence models referred to as Tacotron (I and II).

A Survey on “Text-to-Speech Systems for Real-Time Audio Synthesis”

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-1400 ◽

2021 ◽

pp. 375-379

Author(s):

Prof. Mrunalinee Patole ◽

Akhilesh Pandey ◽

Kaustubh Bhagwat ◽

Mukesh Vaishnav ◽

Salikram Chadar

Keyword(s):

Neural Network ◽

Real Time ◽

Speech Synthesis ◽

State Of The Art ◽

The State ◽

Text To Speech ◽

Convolutional Network ◽

Speech Structures ◽

Bounding Boxes ◽

Textual Content

Text to Speech (TTS) is a form of speech synthesis in which text is converted into spoken, human-like voice output. State-of-the-art TTS methods employ a neural-network-based approach. This work aims to examine some of the problems and limitations present in current work, especially Tacotron 2, and attempts to further improve its performance by modifying its architecture. Many papers published to date on these topics present a variety of TTS systems by developing new TTS products. The aim is to survey different text-to-speech systems. Compared to other text-to-speech systems, Tacotron 2 has multiple advantages. In alternative algorithms such as CNN and Fast-CNN, the algorithm may not examine the image fully, whereas in YOLO the algorithm examines the image completely, using a convolutional network to predict the bounding boxes and the probabilities for those boxes, and detects the image faster than alternative algorithms.
