Isarn Dialect Speech Synthesis using HMM with syllable-context features

Author(s):  
Pongsathon Janyoi ◽  
Pusadee Seresangtakul

This paper describes a speech synthesis system for Isarn, a regional dialect spoken in the Northeast of Thailand. In this study, we focus on improving the prosody generation of the system by using additional context features. To develop the system, the speech parameters (Mel-cepstrum and fundamental frequencies of phonemes within different phonetic contexts) were modelled using Hidden Markov Models (HMMs). Synthetic speech was generated by converting the input text into context-dependent phonemes. Speech parameters were generated from the trained HMMs according to the context-dependent phonemes and were then synthesized through a speech vocoder. Systems were trained using three different feature sets: basic contextual features, tonal features, and syllable-context features. Objective and subjective tests were conducted to determine the performance of the proposed system. The results indicated that the addition of the syllable-context features significantly improved the naturalness of the synthesized speech.
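As a rough illustration of what "context-dependent phonemes" with tonal and syllable-context features might look like, the sketch below builds one label per phoneme from a list of syllables. The label layout (`prev-cur+next/T:.../S:...`), the function name `context_labels`, and the example phoneme strings are all assumptions made for illustration; the abstract does not specify the actual feature set or label format used by the authors.

```python
def context_labels(syllables):
    """Build one context label per phoneme.

    syllables: list of (phonemes, tone) pairs, e.g. (["k", "a"], 1).
    The label layout is hypothetical, not the paper's format.
    """
    # Flatten to (phoneme, tone, syllable index, position in syllable, syllable length).
    flat = []
    for s_idx, (phones, tone) in enumerate(syllables):
        for p_idx, ph in enumerate(phones):
            flat.append((ph, tone, s_idx, p_idx, len(phones)))

    labels = []
    for i, (ph, tone, s_idx, p_idx, s_len) in enumerate(flat):
        # Basic context: neighbouring phonemes ("sil" at utterance edges).
        prev_ph = flat[i - 1][0] if i > 0 else "sil"
        next_ph = flat[i + 1][0] if i + 1 < len(flat) else "sil"
        # Tonal context: tones of the previous, current, and next syllable.
        prev_tone = syllables[s_idx - 1][1] if s_idx > 0 else "x"
        next_tone = syllables[s_idx + 1][1] if s_idx + 1 < len(syllables) else "x"
        # Syllable context: phoneme position in syllable, syllable position in utterance.
        labels.append(
            f"{prev_ph}-{ph}+{next_ph}"
            f"/T:{prev_tone}_{tone}_{next_tone}"
            f"/S:{p_idx + 1}_{s_len}_{s_idx + 1}_{len(syllables)}"
        )
    return labels


if __name__ == "__main__":
    # Two made-up syllables with tones 1 and 3.
    for label in context_labels([(["k", "a"], 1), (["m", "i", "n"], 3)]):
        print(label)
```

In an HMM-based system, labels of this kind are typically what the decision-tree clustering questions are asked about, so adding the `/T:` and `/S:` fields is one plausible way the three feature sets could differ.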

Author(s):  
Mahbubur R. Syed ◽  
Shuvro Chakrobartty ◽  
Robert J. Bignall

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech simulated by a machine in such a way that it sounds as if it was produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.


2020 ◽  
Vol 17 (6) ◽  
pp. 906-915
Author(s):  
Pongsathon Janyoi ◽  
Pusadee Seresangtakul

The generation of the fundamental frequency (F0) plays an important role in speech synthesis and directly influences the naturalness of synthetic speech. In conventional parametric speech synthesis, F0 is predicted frame-by-frame. This method is insufficient to represent F0 contours in larger units, especially tone contours of syllables in tonal languages that deviate as a result of long-term context dependency. This work proposes a syllable-level F0 model that represents F0 contours within syllables, using syllable-level F0 parameters that comprise sampled F0 points and dynamic features. A Deep Neural Network (DNN) was used to represent the relationships between syllable-level contextual features and syllable-level F0 parameters. The proposed model was examined using an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms the baseline systems based on hidden Markov models and DNNs that predict F0 values at the frame level.
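One possible reading of "syllable-level F0 parameters" is sketched below: a syllable's frame-level F0 contour is resampled to a fixed number of points, and first-order differences between those points are appended as dynamic features. The number of points, the linear interpolation, the central-difference delta, and the function names are assumptions for illustration; the abstract does not define how the authors compute these parameters.

```python
def sample_f0(contour, n_points=5):
    """Resample a frame-level F0 contour to n_points (>= 2) by linear interpolation."""
    if len(contour) == 1:
        return [float(contour[0])] * n_points
    out = []
    for k in range(n_points):
        # Fractional frame position of the k-th sample point.
        pos = k * (len(contour) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def deltas(points):
    """First-order dynamic features: central difference with repeated edge values."""
    padded = [points[0]] + list(points) + [points[-1]]
    return [(padded[i + 2] - padded[i]) / 2 for i in range(len(points))]


def syllable_f0_params(contour, n_points=5):
    """Fixed-length vector for one syllable: sampled F0 points followed by their deltas."""
    static = sample_f0(contour, n_points)
    return static + deltas(static)


if __name__ == "__main__":
    # A made-up rising contour in Hz, reduced to 3 points plus 3 deltas.
    print(syllable_f0_params([100, 110, 120, 130, 140], n_points=3))
```

A fixed-length vector per syllable is what makes it possible to train a network that maps syllable-level contextual features directly to a whole contour instead of to individual frames.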


Author(s):  
Vo Quang Dieu Ha ◽  
Nguyen Manh Tuan ◽  
Cao Xuan Nam ◽  
Pham Minh Nhut ◽  
Vu Hai Quan

This paper presents a complete specification of the Vietnamese speech synthesis system named VOS (Voice of Southern Vietnam). Because current Vietnamese text-to-speech systems lack naturalness in their synthetic speech output, VOS is based on the unit selection approach, which aims to achieve maximum naturalness. VOS consists of three main parts: a corpus manager, a synthesizer, and a transliteration model. The corpus manager handles automated speech indexing and segmentation for the unit selection executed by the synthesizer, while the transliteration model deals with the pronunciation of words in foreign languages. A comparative experimental evaluation of VnSpeech, VietVoice, and VOS is conducted using the ITU-T P.85 standard. Results show that VOS outperforms the former two TTS systems.
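The unit selection approach is commonly formulated as a search for the sequence of corpus units that minimizes the sum of target costs (how well a unit matches what is wanted) and join costs (how smoothly adjacent units concatenate). The sketch below shows that generic dynamic-programming search; it is not taken from VOS, and the function names and the toy pitch-based costs are assumptions for illustration only.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search over candidate units.

    candidates[i] is the list of corpus units available for targets[i].
    Returns the unit sequence with the lowest total target + join cost.
    """
    # best[i][j] = (cost of cheapest path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            # Cheapest way to reach unit u from any unit in the previous position.
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], u), prev_j))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


if __name__ == "__main__":
    # Toy example: units and targets are just pitch values in Hz (hypothetical).
    targets = [100, 120]
    candidates = [[98, 130], [125, 119]]
    chosen = select_units(
        targets,
        candidates,
        target_cost=lambda t, u: abs(t - u),
        join_cost=lambda p, u: 0.5 * abs(p - u),
    )
    print(chosen)
```

In a real system the units would be indexed speech segments from the corpus and the costs would combine spectral, prosodic, and contextual distances; the search structure stays the same.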


1992 ◽  
Vol 91 (4) ◽  
pp. 2305-2305
Author(s):  
Bathsheba J. Malsheen ◽  
Gabriel F. Groner ◽  
Linda D. Williams

Author(s):  
Berthold Crysmann ◽  
Philipp Von Böselager

In this paper, we report on an experiment showing how the introduction of prosodic information from detailed syntactic structures into synthetic speech leads to better disambiguation of structurally ambiguous sentences. Using modifier attachment (MA) ambiguities and subject/object fronting (OF) in German as test cases, we show that prosody which is automatically generated from deep syntactic information provided by an HPSG generator can lead to considerable disambiguation effects, and can even override a strong semantics-driven bias. The architecture used in the experiment, consisting of the LKB generator running a large-scale grammar for German, a syntax-prosody interface module, and the speech synthesis system MARY, is shown to be a valuable platform for testing hypotheses in intonation studies.


2014 ◽  
Vol 571-572 ◽  
pp. 858-862
Author(s):  
Zhi Qiang Wu ◽  
Hong Zhi Yu ◽  
Shu Hui Wan

Pronunciation conversion is a prerequisite for building a speech synthesis system, and its accuracy directly affects the quality of the synthetic speech. By studying the characteristics of Tibetan words and Lhasa pronunciation, together with current methods of pronunciation conversion for the Lhasa dialect of Tibetan, and taking into account the needs of speech synthesis research, we designed and implemented a pronunciation conversion system that can be applied to speech synthesis for the Lhasa dialect of Tibetan. In tests, the system achieved an accuracy of 95.3 percent, and the conversion results are basically able to meet the needs of the Tibetan speech synthesis system.


Author(s):  
S.J. Eady ◽  
T.M.S. Hemphill ◽  
J.R. Woolsey ◽  
J.A.W. Clayards
