speech synthesizer
Recently Published Documents


TOTAL DOCUMENTS: 277 (FIVE YEARS: 17)

H-INDEX: 14 (FIVE YEARS: 1)

2021 ◽ Vol 20 (No. 4) ◽ pp. 489-510
Author(s): Izzad Ramli, Nursuriati Jamil, Noraini Seman

Intonation generation in expressive speech, such as storytelling, is essential for producing a high-quality Malay expressive speech synthesizer. Among intonation-generation approaches, explicit control has shown good intelligibility with reasonably natural speech, and was therefore selected in this research. This approach modifies prosodic features such as pitch contour, intensity, and duration to generate the intonation. However, modifying the pitch contour remains a problem because the desired contour is often not achieved. This paper formulates an improved pitch contour algorithm that produces a modified pitch contour resembling the natural one. In this work, the syllable pitch contours of nine storytellers were extracted from their storytelling speech to create an expressive speech syllable dataset called STORY_DATA. All pitch contour shapes in STORY_DATA were analyzed and grouped into the six standard pitch contour clusters for storytelling, using one minus the Pearson product-moment correlation as the distance measure. An improved iterative two-step sinusoidal pitch contour formulation was then introduced to modify the pitch contours of neutral speech into the expressive pitch contours of natural speech. Overall, the improved formulation achieved 93 percent highly correlated matches, indicating close resemblance to natural contours, compared to 15 percent for the previous formulation. The improved formula can therefore be used in a text-to-speech (TTS) synthesizer to produce more natural expressive speech. The paper also identifies unique expressive pitch contours in the Malay language that warrant further investigation.
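The abstract's clustering criterion, one minus the Pearson product-moment correlation, compares the *shape* of two pitch contours while ignoring their absolute pitch level. A minimal sketch of that distance (the contour values and any grouping routine around it are illustrative, not the paper's data):

```python
import numpy as np

def pearson_distance(a, b):
    # 1 - Pearson product-moment correlation:
    # 0 for identically shaped contours, up to 2 for mirror-image shapes.
    za = (a - a.mean()) / a.std()
    zb = (b - b.mean()) / b.std()
    return 1.0 - float(np.mean(za * zb))

# Toy syllable pitch contours in Hz (invented for illustration)
rising = np.array([100.0, 110.0, 120.0, 130.0, 140.0])
rising_high = np.array([200.0, 220.0, 240.0, 260.0, 280.0])  # same shape, higher register
falling = rising[::-1]

# Shape-based distance ignores absolute pitch level ...
assert pearson_distance(rising, rising_high) < 1e-9
# ... but maximally separates opposite contour shapes.
assert abs(pearson_distance(rising, falling) - 2.0) < 1e-9
```

Because the z-normalization removes level and range, a rising contour spoken by a low-pitched and a high-pitched storyteller lands in the same cluster, which is what a shape taxonomy of six contour classes requires.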


Bangla is a useful language for studying nasal vowels because every vowel has a corresponding nasal counterpart. Vowel nasality generation is an important task for artificial nasality production in a speech synthesizer, and researchers have employed various methods for it. Vowel nasality generation for a rule-based speech synthesizer has not yet been studied for Bangla. This study discusses several methods, using the full spectrum and the partial spectrum, for generating vowel nasality in a rule-based Bangla text-to-speech (TTS) system built on demisyllables. A demisyllable-based Bangla TTS needs 1400 demisyllables stored in its database; transforming the vowel part of a demisyllable into its nasal counterpart reduces the database to 700 demisyllables. Comparative study of the e
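The abstract names full-spectrum and partial-spectrum transformations but gives no formulas. The following is a hedged sketch of one plausible reading of a "partial spectrum" approach: replace only the low-frequency magnitude spectrum of an oral vowel with that of a nasal template, keeping the oral vowel's phase. The cutoff frequency, the magnitude-only swap, and the synthetic signals are assumptions for illustration, not the paper's method:

```python
import numpy as np

def partial_spectrum_transform(oral, nasal_template, sr, cutoff_hz=1000.0):
    """Illustrative partial-spectrum nasalization (assumed method):
    copy the nasal template's magnitude spectrum below cutoff_hz into
    the oral vowel's spectrum, preserving the oral vowel's phase."""
    n = len(oral)
    oral_spec = np.fft.rfft(oral)
    nasal_spec = np.fft.rfft(nasal_template[:n])
    k = int(cutoff_hz * n / sr)          # number of FFT bins below the cutoff
    mag = np.abs(oral_spec).copy()
    mag[:k] = np.abs(nasal_spec)[:k]     # partial-spectrum replacement
    phase = np.angle(oral_spec)
    return np.fft.irfft(mag * np.exp(1j * phase), n=n)

sr = 16000
t = np.arange(sr // 10) / sr                   # 100 ms frame
oral = np.sin(2 * np.pi * 700 * t)             # stand-in oral-vowel formant
nasal = 0.5 * np.sin(2 * np.pi * 250 * t)      # stand-in low nasal murmur
out = partial_spectrum_transform(oral, nasal, sr)
assert out.shape == oral.shape
```

A full-spectrum variant would replace the magnitude at every bin rather than only below the cutoff; the partial variant preserves the oral vowel's higher formants, which is why one would compare the two.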


2021 ◽ Vol 11 (3) ◽ pp. 1144
Author(s): Sung-Woo Byun, Seok-Pil Lee

Recently, researchers have developed text-to-speech models based on deep learning that produce results superior to those of previous approaches. However, because those systems only mimic the generic speaking style of the reference audio, it is difficult to assign user-defined emotional types to the synthesized speech. This paper proposes an emotional speech synthesizer built by embedding not only speaking styles but also emotional styles. We extend the speaker embedding to a multi-condition embedding by adding an emotional embedding in Tacotron, so that the synthesizer can generate emotional speech. An evaluation showed the superiority of the proposed model over a previous model in terms of emotional expressiveness.
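A common way to realize such a multi-condition embedding is to concatenate a speaker vector and an emotion vector and broadcast the result over the encoder's time axis before the decoder attends to it. The dimensions, lookup tables, and injection point below are assumptions for illustration; the paper does not specify where in Tacotron the conditioning is applied:

```python
import numpy as np

rng = np.random.default_rng(0)

num_speakers, num_emotions = 4, 3
spk_dim, emo_dim = 8, 4
T, enc_dim = 10, 16  # encoder time steps and feature size (illustrative)

# Lookup tables standing in for trained embedding layers.
speaker_table = rng.normal(size=(num_speakers, spk_dim))
emotion_table = rng.normal(size=(num_emotions, emo_dim))

def condition_encoder_outputs(enc_out, speaker_id, emotion_id):
    """Concatenate speaker and emotion embeddings onto every encoder frame,
    so the decoder attends over features carrying both speaking style and
    emotional style (the 'multi-condition embedding' idea)."""
    cond = np.concatenate([speaker_table[speaker_id],
                           emotion_table[emotion_id]])  # (spk_dim + emo_dim,)
    tiled = np.tile(cond, (enc_out.shape[0], 1))        # repeat over time axis
    return np.concatenate([enc_out, tiled], axis=1)

enc_out = rng.normal(size=(T, enc_dim))
conditioned = condition_encoder_outputs(enc_out, speaker_id=1, emotion_id=2)
assert conditioned.shape == (T, enc_dim + spk_dim + emo_dim)
```

Concatenation keeps the two conditions independently controllable at synthesis time, which is what allows a user-chosen emotion to be paired with any speaker identity.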


Electronics ◽ 2020 ◽ Vol 9 (2) ◽ pp. 267
Author(s): Fernando Alonso Martin, María Malfaz, Álvaro Castro-González, José Carlos Castillo, Miguel Ángel Salichs

The success of social robots is directly linked to their ability to interact with people. Humans possess both verbal and non-verbal communication skills, so both are essential for social robots to achieve natural human–robot interaction. This work focuses on the verbal channel, since the majority of social robots implement an interaction system endowed with verbal capacities. To do so, a social robot must be equipped with an artificial voice system; in robotics, a text-to-speech (TTS) system is the most common speech synthesis technique. The performance of a speech synthesizer is mainly evaluated by its similarity to the human voice in terms of intelligibility and expressiveness. In this paper, we present a comparative study of eight off-the-shelf TTS systems used in social robots. To carry out the study, 125 participants evaluated the performance of the following TTS systems: Google, Microsoft, Ivona, Loquendo, Espeak, Pico, AT&T, and Nuance. The evaluation was performed after watching videos in which a social robot communicates verbally using one TTS system. The participants then completed a questionnaire rating each TTS system on four features: intelligibility, expressiveness, artificiality, and suitability. Four research questions were posed to determine whether the TTS systems can be ranked on each evaluated feature or whether there are no significant differences between them. Our study shows that participants found differences between the evaluated TTS systems in terms of intelligibility, expressiveness, and artificiality. The experiments also indicated a relationship between the physical appearance of the robots (embodiment) and the suitability of the TTS systems.
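Producing a per-feature ranking from such questionnaire data amounts to aggregating participant ratings per system and sorting. A minimal sketch with invented ratings (the scores below are illustrative, not the study's results):

```python
import numpy as np

# Rows = participants, columns = TTS systems, values = 1-5 ratings on one
# feature (e.g., intelligibility). All numbers here are invented.
systems = ["Google", "Microsoft", "Ivona", "Loquendo"]
ratings = np.array([
    [5, 4, 4, 3],
    [4, 4, 5, 2],
    [5, 3, 4, 3],
])

means = ratings.mean(axis=0)                      # per-system mean rating
order = np.argsort(-means)                        # descending by mean
ranking = [systems[i] for i in order]
assert ranking[0] == "Google"
```

A ranking by mean only makes sense once a significance test (the study's "research questions") confirms the systems actually differ on that feature; otherwise the order is noise.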

