Thai speech synthesis with emotional tone: Based on Formant synthesis for Home Robot

Author(s):  
Chaiyong Khorinphan ◽  
Sukanya Phansamdaeng ◽  
Saiyan Saiyod

Author(s):  
Thierry Dutoit ◽  
Yannis Stylianou

Text-to-speech (TTS) synthesis is the art of designing talking machines. Seen from this functional perspective, the task looks simple, but this chapter shows that delivering intelligible, natural-sounding, and expressive speech, while also taking engineering costs into account, is a real challenge. Speech synthesis has come a long way since the great controversy of the 1980s between MIT’s formant synthesis and Bell Labs’ diphone-based concatenative synthesis. While unit-selection technology, which appeared in the mid-1990s, can be seen as an extension of diphone-based approaches, the appearance of Hidden Markov Model (HMM) synthesis around 2005 marked a major shift back to models. More recently, statistical approaches supported by advanced deep learning architectures have been shown to advance both text analysis and normalization and the generation of waveforms. Important recent milestones have been Google’s WaveNet (September 2016) and the sequence-to-sequence models referred to as Tacotron (I and II).


2013 ◽  
Vol 303-306 ◽  
pp. 1334-1337
Author(s):  
Zhi Ping Zhang ◽  
Xi Hong Wu

The authors proposed a trainable formant synthesis method based on the multi-channel Hidden Trajectory Model (HTM). In this method, phonetic targets, formant trajectories, and spectral states from the oral, nasal, voiceless, and background channels were designed to form hierarchical hidden layers, from which spectra were generated as observable features. In model training, the phonetic targets were learned from one hour of training speech, and the phoneme boundaries were aligned as well. The experimental results showed that the speech could be reconstructed from the trained formant model with a source-filter synthesizer.
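The source-filter reconstruction mentioned above can be illustrated with a minimal cascade formant synthesizer. This is a hedged sketch only: the formant frequencies, bandwidths, and Klatt-style DC normalisation below are illustrative textbook values, not the paper's trained HTM parameters.

```python
import numpy as np

fs = 16000                      # sample rate (Hz)
f0 = 120                        # fundamental frequency (Hz)
n = int(fs * 0.5)               # 0.5 s of audio

# Source: a glottal impulse train (flat-spectrum excitation)
source = np.zeros(n)
source[:: fs // f0] = 1.0

def resonator(x, f, bw, fs):
    """Second-order IIR resonator at formant f (Hz) with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / fs)
    c = 2.0 * r * np.cos(2.0 * np.pi * f / fs)
    b0 = 1.0 - c + r * r        # unity gain at 0 Hz (Klatt-style normalisation)
    y = np.zeros_like(x)
    for k in range(len(x)):
        y[k] = b0 * x[k]
        if k >= 1:
            y[k] += c * y[k - 1]
        if k >= 2:
            y[k] -= r * r * y[k - 2]
    return y

# Filter: cascade the first three formants of an /a/-like vowel
y = source
for f, bw in [(700, 130), (1220, 70), (2600, 160)]:
    y = resonator(y, f, bw, fs)
y /= np.max(np.abs(y))          # normalise amplitude
```

Writing `y` to a WAV file yields a buzzy but recognisably vowel-like sound; a full synthesizer would additionally vary the targets over time along the formant trajectories.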


2020 ◽  
Author(s):  
Zofia Malisz ◽  
Gustav Eje Henter ◽  
Cassia Valentini-Botinhao ◽  
Oliver Watts ◽  
Jonas Beskow ◽  
...  

Decades of gradual advances in speech synthesis have recently culminated in exponential improvements fuelled by deep learning. This quantum leap has the potential to finally deliver realistic, controllable, and robust synthetic stimuli for speech experiments. In this article, we discuss these and other implications for phonetic sciences. We substantiate our argument by evaluating classic rule-based formant synthesis against state-of-the-art synthesisers on a) subjective naturalness ratings and b) a behavioural measure (reaction times in a lexical decision task). We also differentiate between text-to-speech and speech-to-speech methods. Naturalness ratings indicate that all modern systems are substantially closer to natural speech than formant synthesis. Reaction times for several modern systems do not differ substantially from natural speech, meaning that the processing gap observed in older systems, and reproduced with our formant synthesiser, is no longer evident. Importantly, some speech-to-speech methods are nearly indistinguishable from natural speech on both measures.


2007 ◽  
Vol 19 (6) ◽  
pp. 646-655 ◽  
Author(s):  
Seiji Aoyagi ◽  
Takahiro Yamaguchi ◽  
Kazuo Tsunemine ◽  
Hiroshi Kinomoto ◽  
...  

A multipurpose robot that performs domestic tasks would meet a clear social need, but this type of robot requires sophisticated technologies. A humanoid robot is not yet practical for actual home or hospital use, considering its reliability and cost. To develop a practical multipurpose robot, we previously proposed the robot-environment compromise system (RECS) concept, in which the robot’s environment is modified to increase robot performance. The concept shares the technical difficulties between the robot and the environment so that robot tasks become possible and are facilitated. The present paper reports the development of an indoor mobile robot system based on the RECS concept that has a wheel mechanism for traversing steps. We propose a navigation system based on image recognition of landmarks on the ceiling and evaluate its effectiveness in experiments. We also propose a positioning system using a docking mechanism. We demonstrate the feasibility of our proposal with the domestic tasks of setting a meal on a table and clearing away the dishes. We also developed a human interface system based on speech synthesis and recognition.


2019 ◽  
Vol 28 (3) ◽  
pp. 660-672
Author(s):  
Suzanne H. Kimball ◽  
Toby Hamilton ◽  
Erin Benear ◽  
Jonathan Baldwin

Purpose The purpose of this study was to evaluate the emotional tone and verbal behavior of social media users who self-identified as having tinnitus and/or hyperacusis that caused self-described negative consequences on daily life or health. Research Design and Method An explanatory mixed-methods design was utilized. Two hundred “initial” and 200 “reply” Facebook posts were collected from members of a tinnitus group and a hyperacusis group. Data were analyzed via the LIWC 2015 software program and compared to typical bloggers. As this was an explanatory mixed-methods study, we used qualitative thematic analyses to explain, interpret, and illustrate the quantitative results. Results Overall, quantitative results indicated lower overall emotional tone for all categories (tinnitus and hyperacusis, initial and reply), which was mostly influenced by higher negative emotion. Higher levels of authenticity or truth were found in the hyperacusis sample but not in the tinnitus sample. Lower levels of clout (social standing) were indicated in all groups, and a lower level of analytical thinking style (concepts and complex categories rather than narratives) was found in the hyperacusis sample. Additional analysis of the language indicated higher levels of sadness and anxiety in all groups and lower levels of anger, particularly for initial replies. These data support prior findings indicating higher levels of anxiety and depression in this patient population based on the actual words in blog posts and not from self-report questionnaires. Qualitative results identified 3 major themes from both the tinnitus and hyperacusis texts: suffering, negative emotional tone, and coping strategies. Conclusions Results from this study suggest support for the predominant clinical view that patients with tinnitus and hyperacusis have higher levels of anxiety and depression than the general population. 
The extent of the suffering described and the patterns of coping strategies reported have implications for clinical practice and point to the need for research on implementing improved practice plans.


2009 ◽  
Author(s):  
Robert E. Remez ◽  
Kathryn R. Dubowski ◽  
Morgana L. Davids ◽  
Emily F. Thomas ◽  
Nina Paddu ◽  
...  

2020 ◽  
pp. 1-12
Author(s):  
Li Dongmei

English text-to-speech conversion is a key topic in modern computer technology research. Its difficulty lies in the large errors that arise during text-to-speech feature recognition, which make it hard to integrate an English text-to-speech conversion algorithm into a working system. To improve the efficiency of English text-to-speech conversion, this article builds on a machine learning approach: after the original speech waveform is labeled with pitch marks, prosody is modified with PSOLA (pitch-synchronous overlap-add), and the C4.5 algorithm is used to train a decision tree for judging the pronunciation of polyphones. To evaluate the performance of pronunciation discrimination based on part-of-speech rules and HMM-based prosodic hierarchy prediction in speech synthesis systems, this study constructed a system model. In addition, waveform concatenation and PSOLA are used to synthesize the sound. For words whose main stress cannot be determined from morphological structure, labels can be learned by machine learning methods. Finally, this study evaluates and analyzes the performance of the algorithm through controlled experiments. The results show that the proposed algorithm performs well and has practical value.
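The PSOLA step mentioned above can be sketched in its simplest time-domain form (TD-PSOLA): pitch-synchronous, Hann-windowed segments are extracted around analysis pitch marks and overlap-added at rescaled synthesis positions. This is a minimal illustration that assumes pitch marks are already available; the paper's actual pitch-marking, C4.5, and HMM components are not shown.

```python
import numpy as np

def td_psola(x, marks, factor):
    """Minimal TD-PSOLA sketch: scale F0 by `factor` (>1 raises pitch).
    `marks` are analysis pitch-mark sample positions, assumed already detected."""
    marks = np.asarray(marks)
    y = np.zeros_like(x, dtype=float)
    t = float(marks[0])                          # next synthesis mark position
    while t < marks[-1]:
        i = int(np.argmin(np.abs(marks - t)))    # nearest analysis mark
        # local pitch period from neighbouring marks
        if i < len(marks) - 1:
            T = int(marks[i + 1] - marks[i])
        else:
            T = int(marks[i] - marks[i - 1])
        # extract a two-period Hann-windowed segment centred on the mark
        c = int(marks[i])
        lo, hi = c - T, c + T
        seg = np.zeros(2 * T)
        a, b = max(lo, 0), min(hi, len(x))
        seg[a - lo:b - lo] = x[a:b]
        seg *= np.hanning(2 * T)
        # overlap-add the segment at the synthesis mark
        p = int(round(t))
        lo2, hi2 = p - T, p + T
        a2, b2 = max(lo2, 0), min(hi2, len(y))
        y[a2:b2] += seg[a2 - lo2:b2 - lo2]
        t += T / factor                          # compressed spacing raises F0
    return y

# usage: raise a 100 Hz sine to roughly 125 Hz
fs = 8000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)
marks = np.arange(0, fs, 80)                     # exact pitch marks for this sine
y = td_psola(x, marks, 1.25)
```

Placing the synthesis marks closer together (factor > 1) raises the fundamental while the windowed segments preserve the spectral envelope, which is why PSOLA can modify prosody without re-estimating formants.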

