Phonemic-level Duration Control Using Attention Alignment for Natural Speech Synthesis

Author(s):
Jungbae Park, Kijong Han, Yuneui Jeong, Sang Wan Lee


Author(s):
Marvin Coto-Jiménez, John Goddard-Close

Recent developments in speech synthesis have produced systems capable of generating speech that closely resembles natural speech, and researchers now strive to create models that more accurately mimic human voices. One such development is the incorporation of multiple linguistic styles in various languages and accents. Speech synthesis based on Hidden Markov Models (HMMs) is of great interest to researchers because of its ability to produce sophisticated features with a small footprint. Despite some progress, its quality has not yet reached the level of the currently predominant unit-selection approaches, which select and concatenate recordings of real speech, and work has been conducted to improve HMM-based systems. In this paper, we present an application of long short-term memory (LSTM) deep neural networks as a postfiltering step in HMM-based speech synthesis. Our motivation stems from a similar desire to obtain characteristics that are closer to those of natural speech. The paper analyzes four types of postfilters obtained using five voices, ranging from a single postfilter that enhances all the parameters to a multi-stream proposal that enhances groups of parameters separately. The proposals are evaluated using three objective measures and compared statistically to determine whether the differences between them are significant. The results described in the paper indicate that HMM-based voices can be enhanced using this approach, especially with the multi-stream postfilters on the considered objective measures.
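As a rough illustration of the postfiltering idea described above, the sketch below runs a single-layer LSTM over a sequence of synthetic-speech parameter frames and projects each hidden state back to an "enhanced" frame. This is a minimal sketch, not the authors' implementation: the weight shapes, dimensions, and function names are illustrative, and a real postfilter would be trained to map HMM-generated parameters toward natural-speech parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. Gate order in the stacked weights: input, forget, cell, output."""
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # emit new hidden state
    return h, c

def lstm_postfilter(frames, W, U, b, V):
    """Run an LSTM over parameter frames (T x D) and project hidden states
    back to D-dimensional 'enhanced' frames via output matrix V."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    enhanced = []
    for x in frames:
        h, c = lstm_step(x, h, c, W, U, b)
        enhanced.append(V @ h)
    return np.stack(enhanced)

# Illustrative use with random weights: 5-dim parameter frames, 8 hidden units.
rng = np.random.default_rng(0)
dim, hidden, T = 5, 8, 10
W = rng.standard_normal((4 * hidden, dim)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
V = rng.standard_normal((dim, hidden)) * 0.1
frames = rng.standard_normal((T, dim))
out = lstm_postfilter(frames, W, U, b, V)
```

A multi-stream variant, in the spirit of the paper's proposal, would simply train and apply a separate postfilter of this form to each group of parameters (e.g. spectral and excitation streams) rather than one filter over the concatenated parameter vector.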


1992
Vol 86 (10)
pp. 426-428
Author(s):
E. Hjelmquist, U. Dahlstrand, L. Hedelin

Three groups of visually impaired persons (two middle-aged and one old) were investigated with respect to memory and understanding of texts presented with speech synthesis and natural speech, respectively. The results showed that speech synthesis generally yielded lower performance than natural speech. Experience had no effect on performance, and there were only marginal effects related to age. However, there were large differences among the groups with respect to the presentation speed chosen in the speech-synthesis condition.


1987
Vol 30 (3)
pp. 425-431
Author(s):
Julia Hoover, Joe Reichle, Dianne Van Tasell, David Cole

The intelligibility of two speech synthesizers [ECHO II (Street Electronics, 1982) and VOTRAX (VOTRAX Division, 1981)] was compared to the intelligibility of natural speech in each of three contextual conditions: (a) single words, (b) "low-probability sentences," in which the last word could not be predicted from the preceding context, and (c) "high-probability sentences," in which the last word could be predicted from the preceding context. Additionally, the effect of practice on performance in each condition was examined. Natural speech was more intelligible than either type of synthesized speech regardless of word/sentence condition. In both sentence conditions, VOTRAX speech was significantly more intelligible than ECHO II speech. No practice effect was observed for VOTRAX, while an ascending linear trend occurred for ECHO II. Implications for the use of inexpensive speech synthesis units as components of augmentative communication aids for persons with severe speech and/or language impairments are discussed.


2016
Vol 24 (6)
pp. 1052-1065
Author(s):
Yan-You Chen, Chung-Hsien Wu, Yi-Chin Huang, Shih-Lun Lin, Jhing-Fa Wang
