CAMNet: A controllable acoustic model for efficient, expressive, high-quality text-to-speech

2022 ◽  
Vol 186 ◽  
pp. 108439
Author(s):  
Jesus Monge Alvarez ◽  
Holly Francois ◽  
Hosang Sung ◽  
Seungdo Choi ◽  
Jonghoon Jeong ◽  
...  
2020 ◽  
Vol 10 (19) ◽  
pp. 6882
Author(s):  
Kostadin Mishev ◽  
Aleksandra Karovska Ristovska ◽  
Dimitar Trajanov ◽  
Tome Eftimov ◽  
Monika Simjanoska

This paper presents MAKEDONKA, the first open-source Macedonian language synthesizer based on deep learning. The paper provides an overview of the numerous attempts to achieve human-like, reproducible speech, which have unfortunately proven unsuccessful because the work had low visibility and lacked examples of integration with real software tools. Recent advances in machine learning, specifically deep learning-based methodologies, provide novel methods for feature engineering that allow for smooth transitions in the synthesized speech, making it sound natural and human-like. This paper presents a methodology for end-to-end speech synthesis based on Deep Voice 3, a fully-convolutional sequence-to-sequence acoustic model with a position-augmented attention mechanism. Our model synthesizes Macedonian speech directly from characters. We created a dataset that contains approximately 20 h of speech from a native Macedonian female speaker and used it to train the text-to-speech (TTS) model. The achieved mean opinion score (MOS) of 3.93 makes our model appropriate for any software that needs a text-to-speech service in the Macedonian language. Our TTS platform is publicly available for use and ready for integration.


Author(s):  
Scotty D. Craig ◽  
Erin K. Chiou ◽  
Noah L. Schroeder

The current study investigates whether a virtual human’s voice can affect the user’s trust when interacting with the virtual human in a learning setting. It was hypothesized that trust is a malleable factor influenced by the quality of the virtual human’s voice. A randomized alternative treatments design with a pretest assigned participants to one of three conditions: a low-quality Text-to-Speech (TTS) engine female voice (Microsoft speech engine), a high-quality TTS engine female voice (Neospeech voice engine), or a human voice (native female English speaker). All three treatments were paired with the same female virtual human. Assessments for the study included a self-report pretest on knowledge of meteorology, administered before participants viewed the instructional video, and a measure of system trust. The study found that voice type affects a user’s trust ratings, with the human voice resulting in higher ratings than the two synthetic voices.


Language ◽  
1995 ◽  
Vol 71 (2) ◽  
pp. 430
Author(s):  
Helen E. Karn ◽  
Vincent J. van Heuven ◽  
Louis C. W. Pols
