On cross-dialect and speaker-adaptation of speaking rate-dependent hierarchical prosodic model for a Hakka text-to-speech system

Author(s):  
Chen-Yu Chiang ◽  
Hsiu-Min Yu ◽  
Sin-Horng Chen
1996 ◽  
Vol 49 (3) ◽  
pp. 745-764 ◽  
Author(s):  
Jörgen Pind

Speech segments are highly context-dependent and acoustically variable. One factor that contributes heavily to the variability of speech is speaking rate. Some speech cues are temporal in nature; that is, the distinctions that they signify are defined over time. How can temporal speech cues keep their distinctiveness in the face of extrinsic transformations, such as those wrought by different speaking rates? This issue is explored with respect to the perception, in Icelandic, of Voice Onset Time as a cue for word-initial stop voicing, word-initial aspiration as a cue for [h], and Voice Offset Time as a cue for pre-aspiration. All the speech cues show rate-dependent perception, though to different degrees, with Voice Offset Time being most sensitive to rate changes and Voice Onset Time least sensitive. The differences in the behaviour of these speech cues are related to their different positions in the syllable.


2020 ◽  
Vol 29 (1) ◽  
pp. 168-184 ◽  
Author(s):  
Karen Hux ◽  
Jessica A. Brown ◽  
Sarah Wallace ◽  
Kelly Knollman-Porter ◽  
Anna Saylor ◽  
...  

Purpose: Accessing auditory and written material simultaneously benefits people with aphasia; however, the extent of benefit as well as people's preferences and experiences may vary given different auditory presentation rates. This study's purpose was to determine how 3 text-to-speech rates affect comprehension when adults with aphasia access newspaper articles through combined modalities. Secondary aims included exploring time spent reviewing written texts after speech output cessation, rate preference, preference consistency, and participant rationales for preferences. Method: Twenty-five adults with aphasia read and listened to passages presented at slow (113 words per minute [wpm]), medium (154 wpm), and fast (200 wpm) rates. Participants answered comprehension questions, selected most and least preferred rates following the 1st and 3rd experimental sessions and after receiving performance feedback, and explained rate preferences and reading and listening strategies. Results: Comprehension accuracy did not vary significantly across presentation rates, but reviewing time after cessation of auditory content did. Visual data inspection revealed that, in particular, participants with substantial extra reviewing time took longer given fast than medium or slow presentation. Regardless of exposure amount or receipt of performance feedback, participants most preferred the medium rate and least preferred the fast rate; rationales centered on reading and listening synchronization, benefits to comprehension, and perceived normality of speaking rate. Conclusion: As a group, people with aphasia most preferred and were most efficient given a text-to-speech rate around 150 wpm when processing dual modality content; individual differences existed, however, and mandate attention to personal preferences and processing strengths.


Author(s):  
Louisa M. Slowiaczek ◽  
Howard C. Nusbaum

The increased use of voice-response systems has resulted in a greater need for systematic evaluation of the role of segmental and suprasegmental factors in determining the intelligibility of synthesized speech. Two experiments were conducted to examine the effects of pitch contour and speech rate on the perception of synthetic speech. In Experiment 1, subjects transcribed sentences that were either syntactically correct and meaningful or syntactically correct but semantically anomalous. In Experiment 2, subjects transcribed sentences that varied in length and syntactic structure. In both experiments a text-to-speech system generated synthetic speech at either 150 or 250 words/min. Half of the test sentences were generated with a flat pitch (monotone) and half were generated with normally inflected clausal intonation. The results indicate that the identification of words in fluent synthetic speech is influenced by speaking rate, meaning, length, and, to a lesser degree, pitch contour. The results suggest that in many applied situations the perception of the segmental information in the speech signal may be more critical to the intelligibility of synthesized speech than are suprasegmental factors.


2021 ◽  
Author(s):  
Mayank Sharma ◽  
Yogesh Virkar ◽  
Marcello Federico ◽  
Roberto Barra-Chicote ◽  
Robert Enyedi
