A Measure of Smoothness in Synthesized Speech

Author(s):  
Phung Trung Nghia ◽  
Nguyen Van Tao ◽  
Pham Thi Mai Huong ◽  
Nguyen Thi Bich Diep ◽  
Phung Thi Thu Hien

The articulators typically move smoothly during speech production. Therefore, the speech features of natural speech are generally smooth. However, over-smoothing causes "muffledness" and makes emotions, expressions, and styles harder to identify in synthesized speech, which can affect its perceived naturalness. In the literature, statistical variances of static spectral features have been used as a measure of smoothness in synthesized speech, but they are not sufficient. This paper proposes a speech smoothness measure that can be efficiently applied to evaluate the smoothness of synthesized speech. Experiments show that the proposed measure is reliable and efficient for measuring the smoothness of different kinds of synthesized speech.
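The variance-based baseline that the abstract refers to can be illustrated with a minimal sketch. This is not the measure proposed in the paper; the function name `per_dimension_variance`, the frame layout (one list of static spectral coefficients per analysis frame), and the toy values are all assumptions made for illustration only.

```python
from statistics import pvariance


def per_dimension_variance(frames):
    """Return the variance over time of each spectral coefficient.

    frames: list of equal-length lists, one list of static spectral
    coefficients per analysis frame (hypothetical layout).
    """
    if not frames:
        raise ValueError("need at least one frame")
    # zip(*frames) regroups the values by coefficient index instead of by frame.
    return [pvariance(track) for track in zip(*frames)]


# Toy, made-up trajectories (not data from the paper).
natural = [[1.0, 0.5], [3.0, -0.5], [1.0, 0.5], [3.0, -0.5]]
over_smoothed = [[1.8, 0.1], [2.2, -0.1], [1.8, 0.1], [2.2, -0.1]]

print(per_dimension_variance(natural))
print(per_dimension_variance(over_smoothed))
```

In this toy case the over-smoothed trajectories have a lower variance in every dimension than the natural ones, which is the intuition behind using variance as a smoothness indicator. The abstract notes that such variances alone are not sufficient.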

2021 ◽  
Vol 7 (1) ◽  
pp. 493-510
Author(s):  
Julien Meyer

Whistled forms of languages are distributed worldwide and survive only in some of the most remote villages on the planet. They are not limited to a given continent, language family, or language structure, but they have been detected only sporadically by researchers and travelers, partly because they can be taken for nonlinguistic phenomena, such as simple signaling. Whistled speech consists of speaking while whistling to communicate at a long distance. The result is a melody that imitates modal speech and that remains intelligible for the interlocutors. This review proposes a typology of this special, little-known, natural speech type and takes socio-environmental and linguistic aspects into consideration. The amazing potential of this phenomenon to provide an alternative point of view into language diversity and speech offers a unique occasion to revisit human language with original insights embracing the adaptive flexibility that characterizes speech production and perception.


2007 ◽  
Vol 1 (2) ◽  
pp. 139-163 ◽  
Author(s):  
Ralf W. Schlosser ◽  
Jeff Sigafoos ◽  
James K. Luiselli ◽  
Katie Angermeier ◽  
Ulana Harasymowyz ◽  
...  

2020 ◽  
Vol 62 (2) ◽  
pp. 7-17
Author(s):  
Karolina Jankowska ◽  
Tomasz Kuczmarski ◽  
Grażyna Demenko

The shadowing of natural speech has been discussed in many studies and papers. However, very little is known about human phonetic convergence to synthesised speech. To find out more about this issue, an experiment was conducted in the Polish language. Two types of stimuli were used: natural speech and synthesised speech. Five sets of sentences with various phonetic phenomena in Polish were prepared. Twenty participants were recorded, giving a total of 100 samples for each phenomenon. The summary of results shows convergence to both natural and synthesised speech in sets 1, 2, and 4, while in sets 3 and 5 no convergence was observed. The baseline production showed that the great majority of participants preferred the ɛn/ɛm version of the phonetic feature, which was reflected in 83 out of 100 sentences. When shadowing natural speech, participants changed ɛn/ɛm to ɛw/ɛ̃ in 26 cases and ɛw/ɛ̃ to ɛn/ɛm in 4. When shadowing synthesised speech, they shifted from ɛn/ɛm to ɛw/ɛ̃ in 18 sentences and from ɛw/ɛ̃ to ɛn/ɛm in 4. Intonation convergence was also observed in the perceptual analysis; however, the analysis of F0 statistics did not show statistically significant differences.


2021 ◽  
Vol 12 ◽  
Author(s):  
Simon David Stein ◽  
Ingo Plag

Recent evidence for the influence of morphological structure on the phonetic output goes unexplained by established models of speech production and by theories of the morphology-phonology interaction. Linear discriminative learning (LDL) is a recent computational approach in which such effects can be expected. We predict the acoustic duration of 4,530 English derivative tokens with the morphological functions DIS, NESS, LESS, ATION, and IZE in natural speech data by using predictors derived from a linear discriminative learning network. We find that the network is accurate in learning speech production and comprehension, and that the measures derived from it are successful in predicting duration. For example, words are lengthened when the semantic support of the word's predicted articulatory path is stronger. Importantly, differences between morphological categories emerge naturally from the network, even when no morphological information is provided. The results imply that morphological effects on duration can be explained without postulating theoretical units like the morpheme, and they provide further evidence that LDL is a promising alternative for modeling speech production.


1987 ◽  
Vol 30 (3) ◽  
pp. 425-431 ◽  
Author(s):  
Julia Hoover ◽  
Joe Reichle ◽  
Dianne Van Tasell ◽  
David Cole

The intelligibility of two speech synthesizers [ECHO II (Street Electronics, 1982) and VOTRAX (VOTRAX Division, 1981)] was compared to the intelligibility of natural speech in each of three different contextual conditions: (a) single words, (b) "low-probability sentences" in which the last word could not be predicted from preceding context, and (c) "high-probability sentences" in which the last word could be predicted from preceding context. Additionally, the effect of practice on performance in each condition was examined. Natural speech was more intelligible than either type of synthesized speech regardless of word/sentence condition. In both sentence conditions, VOTRAX speech was significantly more intelligible than ECHO II speech. No practice effect was observed for VOTRAX, while an ascending linear trend occurred for ECHO II. Implications for the use of inexpensive speech synthesis units as components of augmentative communication aids for persons with severe speech and/or language impairments are discussed.


2002 ◽  
Vol 45 (4) ◽  
pp. 802-810 ◽  
Author(s):  
Mary E. Reynolds ◽  
Charlene Isaacs-Duvall ◽  
Michelle Lynn Haddox

This study examined the effect of listening practice on the ability of young adults to comprehend natural speech and DECtalk synthesized speech by having them perform a sentence verification task over a 5-day period. Results showed that response latencies of participants shortened in a similar fashion to sentences presented in both types of speech across the 5-day period, with latencies remaining significantly longer in response to DECtalk than to natural speech across the days. These results suggest that high-quality synthesized speech, such as DECtalk, can be useful in many human factors applications.


NeuroImage ◽  
2017 ◽  
Vol 152 ◽  
pp. 628-638 ◽  
Author(s):  
Anna Maria Alexandrou ◽  
Timo Saarinen ◽  
Sasu Mäkelä ◽  
Jan Kujala ◽  
Riitta Salmelin

2017 ◽  
Vol 56 ◽  
pp. 217-232 ◽  
Author(s):  
Yogesh C.K. ◽  
M. Hariharan ◽  
Ruzelita Ngadiran ◽  
A.H. Adom ◽  
Sazali Yaacob ◽  
...  

1988 ◽  
Vol 19 (4) ◽  
pp. 401-409 ◽  
Author(s):  
Holly J. Massey

The Token Test for Children was given in a synthesized-speech version and a natural-speech version to 11 language-impaired children aged 8 years, 9 months to 10 years, 1 month and to 11 control subjects matched for age and sex. The scores of the language-impaired children on the synthesized version were significantly lower than (a) the synthesized-speech scores of the control group and (b) their own scores on the natural-speech version. Task complexity was a significant factor for the experimental group. Language-impaired children may have difficulty understanding some synthesized voice commands.

