Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing

Author(s):  
Mayank Sharma ◽  
Yogesh Virkar ◽  
Marcello Federico ◽  
Roberto Barra-Chicote ◽  
Robert Enyedi


1981 ◽  
Vol 46 (4) ◽  
pp. 398-404 ◽  
Author(s):  
Kathryn M. Yorkston ◽  
David R. Beukelman

Treatment programs for four improving ataxic dysarthric speakers are reviewed. Treatment sequences were based on two overall measures of speech performance: intelligibility and prosody. Increases in intelligibility were initially achieved by control of speaking rate. A hierarchy of rate control strategies, ranging from rigid imposition of rate through rhythmic cueing to self-monitored rate control, is discussed. As speakers improved their monitoring skills, a compromise was made between intelligibility and rate. Normal prosodic patterns were not achieved by the ataxic speakers because of difficulty in precisely coordinating the subtle fundamental frequency, loudness, and timing adjustments needed to signal stress. Three of the four subjects were taught to use only durational adjustments to signal stress. In this way, they were able to achieve stress on targeted words consistently and to minimize the bizarreness that resulted from sweeping changes in fundamental frequency and bursts of loudness. The need for further clinically oriented research is discussed.


1989 ◽  
Vol 32 (2) ◽  
pp. 419-431 ◽  
Author(s):  
Roger J. Ingham ◽  
Janis Costello Ingham ◽  
Mark Onslow ◽  
Patrick Finn

Using single-subject experiments with 3 adult stutterers, this study evaluated how instructing stutterers to rate and modify how natural their speech sounded affected experimenters' ratings of speech naturalness, stuttering frequency, and speaking rate. The study also investigated the reliability of stutterers' and listeners' naturalness ratings. The stutterers were partway through a therapy program using prolonged speech or rate control. Results showed that the stutterers could modify their speech so that their naturalness ratings increased or decreased, and that these changes were independent of stuttering or speaking rate. Experimenters' ratings of speech naturalness were unchanged in conditions where stutterers judged their speech to sound more natural, but paralleled the stutterers' ratings when they judged their speech to sound more unnatural. An attempt to determine whether stutterers rated how natural their speech sounded differently from how natural it felt showed differences for one stutterer. Reratings of randomized session recordings by experimenters and independent judges showed that their ratings were highly reliable. When the same randomized session recordings were rerated by the stutterers (1 and 3 months after the experiment), their judgments of changes in their speech naturalness, which were not found in the experimenters' ratings, remained consistent and reliable.


2020 ◽  
Vol 29 (1) ◽  
pp. 168-184 ◽  
Author(s):  
Karen Hux ◽  
Jessica A. Brown ◽  
Sarah Wallace ◽  
Kelly Knollman-Porter ◽  
Anna Saylor ◽  
...  

Purpose: Accessing auditory and written material simultaneously benefits people with aphasia; however, the extent of benefit, as well as people's preferences and experiences, may vary with the auditory presentation rate. This study's purpose was to determine how three text-to-speech rates affect comprehension when adults with aphasia access newspaper articles through combined modalities. Secondary aims included exploring time spent reviewing written texts after speech output ceased, rate preference, preference consistency, and participants' rationales for their preferences.

Method: Twenty-five adults with aphasia read and listened to passages presented at slow (113 words per minute [wpm]), medium (154 wpm), and fast (200 wpm) rates. Participants answered comprehension questions, selected their most and least preferred rates following the first and third experimental sessions and after receiving performance feedback, and explained their rate preferences and reading and listening strategies.

Results: Comprehension accuracy did not vary significantly across presentation rates, but reviewing time after cessation of auditory content did. Visual inspection of the data revealed that participants with substantial extra reviewing time, in particular, took longer after fast than after medium or slow presentation. Regardless of amount of exposure or receipt of performance feedback, participants most preferred the medium rate and least preferred the fast rate; rationales centered on synchronization of reading and listening, benefits to comprehension, and perceived normality of speaking rate.

Conclusion: As a group, people with aphasia most preferred, and were most efficient given, a text-to-speech rate around 150 wpm when processing dual-modality content; individual differences existed, however, and mandate attention to personal preferences and processing strengths.


Author(s):  
Louisa M. Slowiaczek ◽  
Howard C. Nusbaum

The increased use of voice-response systems has resulted in a greater need for systematic evaluation of the role of segmental and suprasegmental factors in determining the intelligibility of synthesized speech. Two experiments were conducted to examine the effects of pitch contour and speech rate on the perception of synthetic speech. In Experiment 1, subjects transcribed sentences that were either syntactically correct and meaningful or syntactically correct but semantically anomalous. In Experiment 2, subjects transcribed sentences that varied in length and syntactic structure. In both experiments a text-to-speech system generated synthetic speech at either 150 or 250 words/min. Half of the test sentences were generated with a flat pitch (monotone) and half were generated with normally inflected clausal intonation. The results indicate that the identification of words in fluent synthetic speech is influenced by speaking rate, meaning, length, and, to a lesser degree, pitch contour. The results suggest that in many applied situations the perception of the segmental information in the speech signal may be more critical to the intelligibility of synthesized speech than are suprasegmental factors.


2020 ◽  
Vol 63 (1) ◽  
pp. 59-73 ◽  
Author(s):  
Panying Rong

Purpose: The purpose of this article was to validate a novel acoustic analysis of oral diadochokinesis (DDK) in assessing bulbar motor involvement in amyotrophic lateral sclerosis (ALS).

Method: An automated acoustic DDK analysis was developed, which filtered out the voice features and extracted the envelope of the acoustic waveform reflecting the temporal pattern of syllable repetitions during an oral DDK task (i.e., repetitions of /tɑ/ at the maximum rate on one breath). Cycle-to-cycle temporal variability (cTV) of envelope fluctuations and syllable repetition rate (sylRate) were derived from the envelope and validated against two kinematic measures, tongue movement jitter (movJitter) and alternating tongue movement rate (AMR) during the DDK task, in 16 individuals with bulbar ALS and 18 healthy controls. After the validation, cTV, sylRate, movJitter, and AMR, along with an established clinical speech measure, speaking rate (SR), were compared in their ability to (a) differentiate individuals with ALS from healthy controls and (b) detect early-stage bulbar declines in ALS.

Results: cTV and sylRate were significantly correlated with movJitter and AMR, respectively, across individuals with ALS and healthy controls, confirming the validity of the acoustic DDK analysis in extracting the temporal DDK pattern. Among all the acoustic and kinematic DDK measures, cTV showed the highest diagnostic accuracy (0.87), with 80% sensitivity and 94% specificity in differentiating individuals with ALS from healthy controls, outperforming the SR measure. Moreover, cTV showed a large increase during the early disease stage, which preceded the decline of SR.

Conclusions: This study provided preliminary validation of a novel automated acoustic DDK analysis in extracting a useful measure, cTV, for early detection of bulbar ALS. This analysis overcame a major barrier in existing acoustic DDK analyses: continuous voicing between syllables that interferes with syllable structures. The approach has potential clinical applications as a novel bulbar assessment.
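The envelope-based pipeline described in this abstract (extract an amplitude envelope, detect syllable cycles, derive sylRate and cTV) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the smoothing method, the peak-detection threshold, and the exact cTV formula (here taken as the mean absolute difference between successive cycle durations, in milliseconds) are all assumptions for demonstration.

```python
import numpy as np

def amplitude_envelope(signal, fs, cutoff_hz=10.0):
    # Rectify, then smooth with a moving average roughly acting as a
    # low-pass filter near cutoff_hz (a simplification of the paper's
    # unspecified filtering step).
    rectified = np.abs(signal)
    win = max(1, int(fs / cutoff_hz))
    kernel = np.ones(win) / win
    return np.convolve(rectified, kernel, mode="same")

def syllable_onsets(envelope, fs, min_gap_s=0.1):
    # Treat rising crossings of half the envelope maximum as syllable
    # onsets, keeping only onsets separated by at least min_gap_s.
    above = envelope > 0.5 * envelope.max()
    onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    kept = [onsets[0]] if len(onsets) else []
    for i in onsets[1:]:
        if (i - kept[-1]) / fs >= min_gap_s:
            kept.append(i)
    return np.array(kept)

def ddk_measures(signal, fs):
    env = amplitude_envelope(signal, fs)
    onsets = syllable_onsets(env, fs)
    durations = np.diff(onsets) / fs          # cycle durations in seconds
    syl_rate = len(onsets) / (len(signal) / fs)   # syllables per second
    # Assumed cTV proxy: mean absolute successive difference of cycle
    # durations, expressed in milliseconds.
    ctv = float(np.mean(np.abs(np.diff(durations))) * 1000.0)
    return syl_rate, ctv

# Synthetic demo: ten amplitude bursts at 5 Hz modulating a 200 Hz carrier,
# mimicking regular /tɑ/ repetitions over 2 seconds.
fs = 8000
t = np.arange(0, 2.0, 1 / fs)
burst = (np.sin(2 * np.pi * 5 * t) > 0.95).astype(float)
sig = burst * np.sin(2 * np.pi * 200 * t)
rate, ctv = ddk_measures(sig, fs)
```

On this perfectly periodic input the detected rate is about 5 syllables per second and cTV is near zero; irregular (e.g., ALS-like) syllable timing would inflate cTV while leaving the mean rate relatively unchanged, which is the contrast the abstract's diagnostic comparison relies on.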


2012 ◽  
Vol 21 (2) ◽  
pp. 60-71 ◽  
Author(s):  
Ashley Alliano ◽  
Kimberly Herriger ◽  
Anthony D. Koutsoftas ◽  
Theresa E. Bartolotta

Using the iPad tablet for Augmentative and Alternative Communication (AAC) purposes can facilitate many communicative needs, is cost-effective, and is socially acceptable. Many individuals with communication difficulties can use iPad applications (apps) to augment communication, provide an alternative form of communication, or target receptive and expressive language goals. In this paper, we review a collection of iPad apps that can be used to address a variety of receptive and expressive communication needs. Based on recommendations from Gosnell, Costello, and Shane (2011), we describe the features of 21 apps that can serve as a reference guide for speech-language pathologists. We systematically identified 21 apps that use symbols only, symbols and text-to-speech, and text-to-speech only. We provide descriptions of the purpose of each app, along with the following feature descriptions: speech settings, representation, display, feedback features, rate enhancement, access, motor competencies, and cost. In this review, we describe these apps and how individuals with complex communication needs can use them for a variety of communication purposes and to target a variety of treatment goals. We present information in a user-friendly table format that clinicians can use as a reference guide.

