Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems

1986 ◽  
Vol 18 (2) ◽  
pp. 100-107 ◽  
Author(s):  
Beth G. Greene ◽  
John S. Logan ◽  
David B. Pisoni
Author(s):  
Mahbubur R. Syed ◽  
Shuvro Chakrobartty ◽  
Robert J. Bignall

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech simulated by a machine in such a way that it sounds as if it was produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.


1992 ◽  
Vol 36 (2) ◽  
pp. 190-192 ◽  
Author(s):  
Janan Al-Awar Smither

This experiment investigated the demands synthetic speech places on short term memory by comparing performance of old and young adults on an ordinary short term memory task. Items presented were generated by a human speaker or by a text-to-speech computer synthesizer. Results were consistent with the idea that the comprehension of synthetic speech imposes increased resource demands on the short term memory system. Older subjects performed significantly more poorly than younger subjects, and both groups performed more poorly with synthetic than with human speech. Findings suggest that short term memory demands imposed by the processing of synthetic speech should be investigated further, particularly regarding the implementation of voice response systems in devices for the elderly.


2009 ◽  
Vol 103 (7) ◽  
pp. 403-414 ◽  
Author(s):  
Konstantinos Papadopoulos ◽  
Athanasios Koutsoklenis ◽  
Evangelia Katemidou ◽  
Areti Okalidou

This study investigated the intelligibility and comprehensibility of natural speech in comparison to synthetic speech. The results demonstrate the type of errors; the relationship between intelligibility and comprehensibility; and the correlation between intelligibility and comprehensibility and key factors, such as the frequency of use of text-to-speech systems.


2020 ◽  
Vol 11 (1) ◽  
pp. 2
Author(s):  
Jiří Přibil ◽  
Anna Přibilová ◽  
Jindřich Matoušek

The paper describes a system for the automatic evaluation of synthetic speech quality based on the Gaussian mixture model (GMM) classifier. The speech material originating from a real speaker is compared with synthesized material to determine similarities or differences between them. The final evaluation order is determined by distances in the Pleasure-Arousal (P-A) space between the original and synthetic speech using different synthesis and/or prosody manipulation methods implemented in the Czech text-to-speech system. The GMM models for continual 2D detection of P-A classes are trained using the sound/speech material from the databases without any relation to the original speech or the synthesized sentences. Preliminary and auxiliary analyses show a substantial influence of the number of mixtures, the number and type of the speech features used, the size of the processed speech material, as well as the type of the database used for the creation of the GMMs on the P-A classification process and on the final evaluation result. The main evaluation experiments confirm the functionality of the system developed. The objective evaluation results obtained are principally correlated with the subjective ratings of human evaluators; however, partial differences were indicated, so a subsequent detailed investigation must be performed.
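As a rough illustration of the ranking step described in this abstract, the sketch below orders synthesis methods by their distance from the original speech in a 2D P-A space. The abstract does not say which distance measure is used, so Euclidean distance is an assumption here; the function name, method names, and coordinates are hypothetical placeholders, not values from the paper.

```python
import math

def rank_by_pa_distance(original, synthetic):
    """Rank synthesis methods by distance to the original speech in P-A space.

    original:  (pleasure, arousal) point for the natural recording.
    synthetic: dict mapping a method name to its (pleasure, arousal) point.
    Euclidean distance is an assumption; the abstract does not name the metric.
    """
    distances = {name: math.dist(original, point) for name, point in synthetic.items()}
    # Smallest distance first: the method closest to the original ranks best.
    return sorted(distances.items(), key=lambda item: item[1])

# Made-up coordinates, for illustration only.
ranking = rank_by_pa_distance(
    (0.0, 0.0),
    {"method_a": (3.0, 4.0), "method_b": (1.0, 0.0)},
)
```

With these placeholder points, `method_b` ranks ahead of `method_a` because it lies closer to the original in the P-A plane.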


Author(s):  
Vo Quang Dieu Ha ◽  
Nguyen Manh Tuan ◽  
Cao Xuan Nam ◽  
Pham Minh Nhut ◽  
Vu Hai Quan

This paper presents a complete specification of the Vietnamese speech synthesis system named VOS (Voice of Southern Vietnam). Because current Vietnamese text-to-speech systems lack naturalness in their synthetic speech output, VOS is based on the unit selection approach, which aims to achieve maximum naturalness. VOS has three main parts: a corpus manager, a synthesizer, and a transliteration model. The corpus manager handles automated speech indexing and segmentation for the unit selection executed by the synthesizer, while the transliteration model deals with the pronunciation of words in foreign languages. A comparative experimental evaluation of VnSpeech, VietVoice, and VOS is conducted using the ITU-T P.85 standard. Results show that VOS outperforms the former two TTS systems.


1999 ◽  
Vol 29 (1) ◽  
pp. 51-61 ◽  
Author(s):  
Caroline Henton

There is widespread, immediate and enduring demand for high quality, natural, intelligible synthetic female voices in the expanding speech technology industry. Yet synthetic female voices are scarce, both in parametric text-to-speech (TTS) systems and in concatenative ones. Current female synthetic speech largely lacks naturalness, pleasantness and tolerability. Some acoustic specifications of female voices that are relevant to synthesis are discussed in detail. Recent research pertaining to female voice quality is reported and a ranking of these various considerations is proposed. This paper reviews the present situation and considers why there is a paucity of female voice synthesis.


Author(s):  
E. Moulines ◽  
F. Emerard ◽  
D. Larreur ◽  
J.L. Le Saint Milon ◽  
L. Le Faucheur ◽  
...  

1989 ◽  
Vol 33 ◽  
pp. 89-94
Author(s):  
Hugo Quené

Text-to-speech systems generally consist of two components. The first converts the input text to an abstract, linguistically relevant representation. Usually, this is a phoneme representation of the input text, with markers for (word, morpheme, syllable) boundaries, word stress, and sentence accent. The second component converts this transcription into a physical speech sound. Two aspects of natural speech are most important to imitate in this latter step: (a) natural prosody (speech rate, segment duration, pitch, etc.), and (b) representation of phonetic adjustment between phonemes. The resulting synthetic speech is mainly used in special-purpose applications, although a wider use is foreseen for the future.
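The two-component structure described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system discussed in the paper: the lexicon, the phoneme symbols, the `'` stress marker, the `#` boundary marker, and the duration rule are all hypothetical, and the second component stops at assigning segment durations rather than producing audio.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: word -> phoneme symbols; "'" marks a stressed segment.
LEXICON = {
    "hello": ["h", "@", "'l", "oU"],
    "world": ["'w", "3r", "l", "d"],
}
WORD_BOUNDARY = "#"  # assumed marker for word boundaries

def text_to_transcription(text):
    """Component 1: text -> phoneme symbols with stress and word-boundary markers."""
    symbols = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        symbols.extend(LEXICON[word])
        symbols.append(WORD_BOUNDARY)
    return symbols

@dataclass
class Segment:
    phoneme: str
    duration_ms: int
    stressed: bool

def transcription_to_segments(symbols, base_ms=80, stress_factor=1.5):
    """Component 2 (stand-in): attach a duration to each phoneme.

    A real back end would also model pitch and adjustments between
    neighbouring phonemes before generating a waveform.
    """
    segments = []
    for symbol in symbols:
        if symbol == WORD_BOUNDARY:
            continue  # boundaries carry no sound in this toy model
        stressed = symbol.startswith("'")
        duration = int(base_ms * stress_factor) if stressed else base_ms
        segments.append(Segment(symbol.lstrip("'"), duration, stressed))
    return segments

symbols = text_to_transcription("Hello world.")
segments = transcription_to_segments(symbols)
```

Keeping the intermediate transcription as plain data means the linguistic front end and the acoustic back end can be developed and tested separately, which is the point of the two-component split.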


1989 ◽  
Vol 33 (4) ◽  
pp. 239-241
Author(s):  
Margaret Thomas ◽  
Richard Gilson ◽  
Sharon Ziulkowski ◽  
Stephen Gibbons

The purpose of the present experiment was to investigate the demands placed on the short term memory system by synthetic speech. We compared performance in a typical auditory short term memory task as a function of whether the items were presented by a human voice or by a text-to-speech computer voice generator. Immediate serial recall of digit strings was significantly poorer when presented by synthetic speech than when presented by natural speech. The results are consistent with the idea that comprehension of synthetic speech imposes increased resource demands on the short term memory system.

