Effects of Speech Rate and Pitch Contour on the Perception of Synthetic Speech

Author(s): Louisa M. Slowiaczek, Howard C. Nusbaum

The increased use of voice-response systems has resulted in a greater need for systematic evaluation of the role of segmental and suprasegmental factors in determining the intelligibility of synthesized speech. Two experiments were conducted to examine the effects of pitch contour and speech rate on the perception of synthetic speech. In Experiment 1, subjects transcribed sentences that were either syntactically correct and meaningful or syntactically correct but semantically anomalous. In Experiment 2, subjects transcribed sentences that varied in length and syntactic structure. In both experiments a text-to-speech system generated synthetic speech at either 150 or 250 words/min. Half of the test sentences were generated with a flat pitch (monotone) and half were generated with normally inflected clausal intonation. The results indicate that the identification of words in fluent synthetic speech is influenced by speaking rate, meaning, length, and, to a lesser degree, pitch contour. The results suggest that in many applied situations the perception of the segmental information in the speech signal may be more critical to the intelligibility of synthesized speech than are suprasegmental factors.

1992, Vol 36 (2), pp. 190-192
Author(s): Janan Al-Awar Smither

This experiment investigated the demands synthetic speech places on short-term memory by comparing the performance of older and younger adults on an ordinary short-term memory task. Items were presented either by a human speaker or by a text-to-speech computer synthesizer. Results were consistent with the idea that comprehending synthetic speech imposes increased resource demands on the short-term memory system. Older subjects performed significantly more poorly than younger subjects, and both groups performed more poorly with synthetic than with human speech. The findings suggest that the short-term memory demands imposed by processing synthetic speech should be investigated further, particularly with respect to implementing voice-response systems in devices for the elderly.


2020, Vol 29 (1), pp. 168-184
Author(s): Karen Hux, Jessica A. Brown, Sarah Wallace, Kelly Knollman-Porter, Anna Saylor, ...

Purpose Accessing auditory and written material simultaneously benefits people with aphasia; however, the extent of benefit as well as people's preferences and experiences may vary given different auditory presentation rates. This study's purpose was to determine how 3 text-to-speech rates affect comprehension when adults with aphasia access newspaper articles through combined modalities. Secondary aims included exploring time spent reviewing written texts after speech output cessation, rate preference, preference consistency, and participant rationales for preferences. Method Twenty-five adults with aphasia read and listened to passages presented at slow (113 words per minute [wpm]), medium (154 wpm), and fast (200 wpm) rates. Participants answered comprehension questions, selected most and least preferred rates following the 1st and 3rd experimental sessions and after receiving performance feedback, and explained rate preferences and reading and listening strategies. Results Comprehension accuracy did not vary significantly across presentation rates, but reviewing time after cessation of auditory content did. Visual data inspection revealed that, in particular, participants with substantial extra reviewing time took longer given fast than medium or slow presentation. Regardless of exposure amount or receipt of performance feedback, participants most preferred the medium rate and least preferred the fast rate; rationales centered on reading and listening synchronization, benefits to comprehension, and perceived normality of speaking rate. Conclusion As a group, people with aphasia most preferred and were most efficient given a text-to-speech rate around 150 wpm when processing dual modality content; individual differences existed, however, and mandate attention to personal preferences and processing strengths.


1989, Vol 33, pp. 89-94
Author(s): Hugo Quené

Text-to-speech systems generally consist of two components. The first converts the input text into an abstract, linguistically relevant representation. Usually, this is a phoneme representation of the input text, with markers for (word, morpheme, syllable) boundaries, word stress, and sentence accent. The second component converts this transcription into a physical speech sound. Two aspects of natural speech are most important to imitate in this latter step: (a) natural prosody (speech rate, segment duration, pitch, etc.), and (b) representation of phonetic adjustment between phonemes. The resulting synthetic speech is mainly used in special-purpose applications, although wider use is foreseen in the future.


1987, Vol 31 (9), pp. 961-965
Author(s): Monica A. Merva, Beverly H. Williges

Two studies were conducted to explore the effects of various parameters on rule-based synthetic speech intelligibility. Experiment I examined the effect of situational context clues and speech rate on synthesized speech intelligibility. Subjects who received pragmatic context information prior to each message had transcription error rates 50% lower than those who received no context information. Speech rates of 250 words per minute (wpm) yielded significantly more transcription errors than rates of 180 wpm. In Experiment II, the effects of speech rate, message repetition, and location of information in a message were examined. Transcription accuracy was best for messages spoken at 150 or 180 wpm and for messages repeated either twice or three times. Words at the end of messages were transcribed more accurately than words at the beginning of messages. Subjective ratings indicated that subjects were aware of errors when incorrectly transcribing a message even though no feedback was provided.


2019, Vol 28 (2S), pp. 875-886
Author(s): Jennifer M. Vojtech, Jacob P. Noordzij, Gabriel J. Cler, Cara E. Stepp

Purpose This study investigated how modulating fundamental frequency (f0) and speech rate differentially impact the naturalness, intelligibility, and communication efficiency of synthetic speech. Method Sixteen sentences of varying prosodic content were developed via a speech synthesizer. The f0 contour and speech rate of these sentences were altered to produce 4 stimulus sets: (a) normal rate with a fixed f0 level, (b) slow rate with a fixed f0 level, (c) normal rate with prosodically natural f0 variation, and (d) normal rate with prosodically unnatural f0 variation. Sixteen listeners provided orthographic transcriptions and judgments of naturalness for these stimuli. Results Sentences with f0 variation were rated as more natural than those with a fixed f0 level. Conversely, sentences with a fixed f0 level demonstrated higher intelligibility than those with f0 variation. Speech rate did not affect the intelligibility of stimuli with a fixed f0 level. Communication efficiency was highest for sentences produced at a normal rate and a fixed f0 level. Conclusions Sentence-level f0 variation increased naturalness ratings of synthesized speech, whether the variation was prosodically natural or not. However, these f0 variations reduced intelligibility. There is evidence of a trade-off in naturalness and intelligibility of synthesized speech, which may impact future speech synthesis designs. Supplemental Material https://doi.org/10.23641/asha.8847833


Author(s): Eni Maharsi

This paper examines the role of elements of English sentences using the approach of thematic role assignment. The emphasis is on how the positioning of words and phrases in syntactic structure helps determine the roles that the referents of NPs play in the situation described by the sentences. The results reveal that the position of an NP determines its thematic role. There is a relationship between deep syntactic structure and the assignment of thematic roles for every NP in the sentence.


2019, Vol 40 (6), pp. 1421-1454
Author(s): Tamar Kalandadze, Valentina Bambini, Kari-Anne B. Næss

Individuals with autism spectrum disorder (ASD) often experience difficulty in comprehending metaphors compared to individuals with typical development (TD). However, there is a large variation in the results across studies, possibly related to the properties of the metaphor tasks. This preregistered systematic review and meta-analysis (a) explored the properties of the metaphor tasks used in ASD research, and (b) investigated the group difference between individuals with ASD and TD on metaphor comprehension, as well as the relationship between the task properties and any between-study variation. A systematic search was undertaken in seven relevant databases. Fourteen studies fulfilled our predetermined inclusion criteria. Across tasks, we detected four types of response format and a great variety of metaphors in terms of familiarity, syntactic structure, and linguistic context. Individuals with TD outperformed individuals with ASD on metaphor comprehension (Hedges’ g = −0.63). Verbal explanation response format was utilized in the study showing the largest effect size in the group comparison. However, due to the sparse experimental manipulations, the role of task properties could not be established. Future studies should consider and report task properties to determine their role in metaphor comprehension, and to inform experimental paradigms as well as educational assessment.

