Listeners’ preference for computer-synthesized speech over natural speech of people with disabilities.

2014 ◽  
Vol 59 (3) ◽  
pp. 289-297 ◽  
Author(s):  
Steven E. Stern ◽  
Chelsea M. Chobany ◽  
Disha V. Patel ◽  
Justin J. Tressler


2020 ◽  
Vol 62 (2) ◽  
pp. 7-17
Author(s):  
Karolina Jankowska ◽  
Tomasz Kuczmarski ◽  
Grażyna Demenko

Abstract Shadowing of natural speech has been discussed in many studies and papers; however, little is known about human phonetic convergence to synthesized speech. To investigate this issue, an experiment was conducted in the Polish language using two types of stimuli: natural speech and synthesized speech. Five sets of sentences covering various Polish phonetic phenomena were prepared, and a group of twenty participants was recorded, yielding a total of 100 samples for each phenomenon. The results show convergence to both natural and synthesized speech in sets 1, 2, and 4, whereas no convergence was observed in sets 3 and 5. Baseline production showed that the great majority of participants preferred the ɛn/ɛm variant of the phonetic feature, which was reflected in 83 of 100 sentences. When shadowing natural speech, participants changed ɛn/ɛm to ɛw/ɛ̃ in 26 cases and ɛw/ɛ̃ to ɛn/ɛm in 4 cases; when shadowing synthesized speech, they shifted from ɛn/ɛm to ɛw/ɛ̃ in 18 sentences and from ɛw/ɛ̃ to ɛn/ɛm in 4. Intonation convergence was also observed in the perceptual analysis, although the analysis of F0 statistics did not show statistically significant differences.
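As a worked check of the shift counts reported in the abstract, the sketch below tallies how the shadowing shifts move the baseline variant counts. The dictionary keys and the `apply_shifts` helper are illustrative names, not from the study:

```python
# Baseline: 83 of 100 sentences used the ɛn/ɛm variant, 17 used ɛw/ɛ̃.
baseline = {"en_em": 83, "ew_nasal": 17}

def apply_shifts(counts, to_nasal, to_en):
    """Apply reported shadowing shifts: `to_nasal` tokens move from
    ɛn/ɛm to ɛw/ɛ̃, and `to_en` tokens move the other way."""
    return {
        "en_em": counts["en_em"] - to_nasal + to_en,
        "ew_nasal": counts["ew_nasal"] + to_nasal - to_en,
    }

natural = apply_shifts(baseline, to_nasal=26, to_en=4)  # shadowing natural speech
synth = apply_shifts(baseline, to_nasal=18, to_en=4)    # shadowing synthesized speech
print(natural)  # {'en_em': 61, 'ew_nasal': 39}
print(synth)    # {'en_em': 69, 'ew_nasal': 31}
```

The tally makes the reported asymmetry concrete: shadowing natural speech shifted more tokens toward ɛw/ɛ̃ (26) than shadowing synthesized speech did (18).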


1987 ◽  
Vol 30 (3) ◽  
pp. 425-431 ◽  
Author(s):  
Julia Hoover ◽  
Joe Reichle ◽  
Dianne Van Tasell ◽  
David Cole

The intelligibility of two speech synthesizers [ECHO II (Street Electronics, 1982) and VOTRAX (VOTRAX Division, 1981)] was compared to the intelligibility of natural speech in each of three contextual conditions: (a) single words, (b) "low-probability sentences" in which the last word could not be predicted from preceding context, and (c) "high-probability sentences" in which the last word could be predicted from preceding context. Additionally, the effect of practice on performance in each condition was examined. Natural speech was more intelligible than either type of synthesized speech regardless of word/sentence condition. In both sentence conditions, VOTRAX speech was significantly more intelligible than ECHO II speech. No practice effect was observed for VOTRAX, while an ascending linear trend occurred for ECHO II. Implications for the use of inexpensive speech synthesis units as components of augmentative communication aids for persons with severe speech and/or language impairments are discussed.


2002 ◽  
Vol 45 (4) ◽  
pp. 802-810 ◽  
Author(s):  
Mary E. Reynolds ◽  
Charlene Isaacs-Duvall ◽  
Michelle Lynn Haddox

This study examined the effect of listening practice on the ability of young adults to comprehend natural speech and DECtalk synthesized speech by having them perform a sentence verification task over a 5-day period. Results showed that participants' response latencies to sentences presented in both types of speech shortened at a similar rate across the 5-day period, although latencies remained significantly longer in response to DECtalk than to natural speech throughout. These results suggest that high-quality synthesized speech, such as DECtalk, can be useful in many human factors applications.


1988 ◽  
Vol 19 (4) ◽  
pp. 401-409 ◽  
Author(s):  
Holly J. Massey

The Token Test for Children was given in a synthesized-speech version and a natural-speech version to 11 language-impaired children aged 8 years, 9 months to 10 years, 1 month and to 11 control subjects matched for age and sex. The scores of the language-impaired children on the synthesized version were significantly lower than (a) the synthesized-speech scores of the control group and (b) their own scores on the natural-speech version. Task complexity was a significant factor for the experimental group. Language-impaired children may have difficulty understanding some synthesized voice commands.


Author(s):  
H. S. Venkatagiri

Speech generating devices (SGDs) – both dedicated devices and general-purpose computers with suitable hardware and software – are important to children and adults who might otherwise not be able to communicate adequately through speech. These devices generate speech in one of two ways: they play back speech that was recorded previously (digitized speech), or they synthesize speech from text (text-to-speech, or TTS, synthesis). This chapter places digitized and synthesized speech within the broader domain of digital speech technology. The technical requirements for digitized and synthesized speech are discussed, along with recent advances in improving the accuracy, intelligibility, and naturalness of synthesized speech. The factors to consider in selecting digitized and synthesized speech for augmenting expressive communication abilities in people with disabilities are also discussed. Finally, research needs in synthesized speech are identified.


Author(s):  
Phung Trung Nghia ◽  
Nguyen Van Tao ◽  
Pham Thi Mai Huong ◽  
Nguyen Thi Bich Diep ◽  
Phung Thi Thu Hien

The articulators typically move smoothly during speech production, so the speech features of natural speech are generally smooth. However, over-smoothing causes "muffledness" and reduces the identifiability of emotions, expressions, and styles in synthesized speech, which can affect the perceived naturalness of synthesized speech. In the literature, statistical variances of static spectral features have been used as a measure of smoothness in synthesized speech, but they are not sufficient. This paper proposes a speech smoothness measure that can be applied efficiently to evaluate the smoothness of synthesized speech. Experiments show that the proposed measures are reliable and efficient for measuring the smoothness of different kinds of synthesized speech.
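The abstract does not give the authors' formula, so the sketch below is only one plausible variance-based smoothness proxy under assumed conventions: features are a (T, D) array of per-frame spectral coefficients, and smoothness is judged by the variance of frame-to-frame deltas (a jittery trajectory has larger delta variance than a smooth one):

```python
import numpy as np

def delta_variance(frames):
    """Smoothness proxy: mean variance of frame-to-frame deltas of
    spectral features. Lower values indicate smoother trajectories.
    `frames` is a (T, D) array of per-frame features (e.g. MFCCs)."""
    deltas = np.diff(frames, axis=0)  # frame-to-frame change, shape (T-1, D)
    return float(np.mean(np.var(deltas, axis=0)))

# A smooth trajectory (slow sinusoid) vs. the same trajectory with added jitter:
t = np.linspace(0, 2 * np.pi, 200)
smooth = np.stack([np.sin(t), np.cos(t)], axis=1)
rng = np.random.default_rng(0)
jittery = smooth + rng.normal(scale=0.1, size=smooth.shape)
assert delta_variance(smooth) < delta_variance(jittery)
```

Note that this illustrates the general idea of a variance-based smoothness measure; the paper's proposed measure may differ in both the features used and the statistic computed.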


2011 ◽  
Vol 97 (5) ◽  
pp. 852-868 ◽  
Author(s):  
Peter Počta ◽  
Jan Holub

This paper investigates the impact of independent and dependent losses and coding on speech quality predictions provided by the PESQ (also known as ITU-T P.862) and P.563 models, when both naturally produced and synthesized speech are used. Two synthesized speech samples generated with two different text-to-speech systems and one naturally produced sample are investigated. In addition, we assess the variability of PESQ's and P.563's predictions with respect to the type of speech used (naturally produced or synthesized) and the loss conditions, as well as their accuracy, by comparing the predictions with subjective assessments. The results show that there is no difference between the impact of packet loss on naturally produced speech and on synthesized speech. On the other hand, the impact of coding differs for the two types of stimuli. In addition, synthesized speech seems to be insensitive to the degradations introduced by most of the codecs investigated here. The reasons for these findings are discussed in detail. Finally, it is concluded that both models are capable of predicting the quality of transmitted synthesized speech under the investigated conditions to a certain degree. As expected, PESQ achieves the best performance over almost all of the investigated conditions.
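Accuracy assessments of this kind typically correlate the objective model's predictions with subjective listening scores. A minimal sketch of that comparison is below; the two score lists are fabricated illustrative values, not data from the paper:

```python
import statistics

def pearson(x, y):
    """Pearson correlation between objective predictions and subjective scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Illustrative (fabricated) values: PESQ-style predictions vs. subjective MOS,
# both on the usual 1-5 quality scale.
predicted = [4.1, 3.6, 2.9, 2.2, 1.8]
subjective = [4.3, 3.8, 3.0, 2.5, 1.6]
print(round(pearson(predicted, subjective), 3))
```

A correlation close to 1.0 would indicate that the objective model ranks the degradation conditions much as listeners do, which is the sense in which the paper judges the models "capable of predicting quality to a certain degree."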


Author(s):  
Melissa A. Pierce

In countries other than the United States, the study and practice of speech-language pathology is little known or nonexistent. Recognition of professionals in the field is minimal. Speech-language pathologists in countries where speech-language pathology is a widely recognized and respected profession often seek to share their expertise in places where little support is available for individuals with communication disorders. The Peace Corps offers a unique, long-term volunteer opportunity to people with a variety of backgrounds, including speech-language pathologists. Though Peace Corps programs do not specifically focus on speech-language pathology, many are easily adapted to the profession because they support populations of people with disabilities. This article describes how the needs of local children with communication disorders are readily addressed by a Special Education Peace Corps volunteer.


1983 ◽  
Vol 26 (4) ◽  
pp. 516-524 ◽  
Author(s):  
Donald J. Sharf ◽  
Ralph N. Ohde

Adult and Child manifolds were generated by synthesizing 5 × 5 matrices of /Cej/-type utterances in which F2 and F3 frequencies were systematically varied. Manifold stimuli were presented to 11 graduate-level speech-language pathology students in two conditions: (a) a rating condition in which stimuli were rated on a 4-point scale between good /r/ and good /w/; and (b) a labeling condition in which stimuli were labeled as "R," "W," "distorted R," or "N" (for none of the previous choices). It was found that (a) stimuli with low F2 and high F3 frequencies were rated 1.0–1.4, those with high F2 and low F3 frequencies were rated 3.6–4.0, and those with intermediate values were rated 1.5–3.5; (b) stimuli rated 1.0–1.4 were labeled as "W" and stimuli rated 3.6–4.0 were labeled as "R"; (c) none of the Child manifold stimuli were labeled as distorted "R," and one of the Adult manifold stimuli approached the percentage of identification observed for "R" and "W"; and (d) the rating and labeling tasks were performed with a high degree of reliability.

