Isarn Dialect Speech Synthesis using HMM with syllable-context features

This paper describes a speech synthesis system for Isarn, a regional dialect spoken in Northeast Thailand. In this study, we focus on improving the prosody generation of the system by using additional context features. To develop the system, the speech parameters (Mel-cepstrum and fundamental frequency of phonemes within different phonetic contexts) were modelled using Hidden Markov Models (HMMs). Synthetic speech was generated by converting the input text into context-dependent phonemes. Speech parameters were then generated from the trained HMMs according to the context-dependent phonemes and synthesized through a speech vocoder. Systems were trained using three different feature sets: basic contextual features, tonal features, and syllable-context features. Objective and subjective tests were conducted to determine the performance of the proposed system. The results indicated that the addition of the syllable-context features significantly improved the naturalness of synthesized speech.
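
The conversion of text into context-dependent phonemes can be illustrated with a minimal sketch. The label layout below (neighbouring phonemes, then tone context, then position within the syllable and utterance) is a hypothetical format chosen for illustration; the abstract does not specify the actual feature set or label syntax, and the `Syllable` class and `context_labels` function are assumed names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    phonemes: List[str]  # phoneme symbols in this syllable
    tone: int            # tone index (0 is reserved here for "no syllable")


def context_labels(syllables: List[Syllable]) -> List[str]:
    """Build one context-dependent label per phoneme (hypothetical format)."""
    # Flatten to (phoneme, syllable index, position in syllable).
    flat = [
        (ph, s_idx, p_idx)
        for s_idx, syl in enumerate(syllables)
        for p_idx, ph in enumerate(syl.phonemes)
    ]
    labels = []
    for i, (ph, s_idx, p_idx) in enumerate(flat):
        # Basic contextual features: previous and next phoneme.
        prev_ph = flat[i - 1][0] if i > 0 else "sil"
        next_ph = flat[i + 1][0] if i + 1 < len(flat) else "sil"
        # Tonal features: tone of previous, current, and next syllable.
        syl = syllables[s_idx]
        prev_tone = syllables[s_idx - 1].tone if s_idx > 0 else 0
        next_tone = syllables[s_idx + 1].tone if s_idx + 1 < len(syllables) else 0
        # Syllable-context features: positions within syllable and utterance.
        labels.append(
            f"{prev_ph}-{ph}+{next_ph}"
            f"/T:{prev_tone}_{syl.tone}_{next_tone}"
            f"/S:{p_idx + 1}_{len(syl.phonemes)}"
            f"/U:{s_idx + 1}_{len(syllables)}"
        )
    return labels


example = [Syllable(["k", "a"], 1), Syllable(["m", "i", "n"], 3)]
labels = context_labels(example)
```

In an HMM-based system, labels of this kind would typically index the context-dependent models from which the speech parameters are generated.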

Text-to-Speech Synthesis

Encyclopedia of Multimedia Technology and Networking ◽

10.4018/978-1-59140-561-0.ch135 ◽

2011 ◽

pp. 957-963

Author(s):

Mahbubur R. Syed ◽

Shuvro Chakrobartty ◽

Robert J. Bignall

Keyword(s):

Speech Production ◽

Speech Synthesis ◽

Synthetic Speech ◽

Practical Application ◽

Text To Speech ◽

Synthesis System ◽

System A ◽

Vocal System ◽

Text To Speech Synthesis ◽

Computer Based

Speech synthesis is the process by which a machine produces natural-sounding, highly intelligible synthetic speech that sounds as if it were produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.

F0 Modeling for Isarn Speech Synthesis using Deep Neural Networks and Syllable-level Feature Representation

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/6/9 ◽

2020 ◽

Vol 17 (6) ◽

pp. 906-915

Author(s):

Pongsathon Janyoi ◽

Pusadee Seresangtakul

Keyword(s):

Speech Synthesis ◽

Deep Neural Networks ◽

Markov Models ◽

Feature Representation ◽

Context Dependency ◽

Dynamic Features ◽

Synthesis System ◽

Proposed Model ◽

Training Sets

The generation of the fundamental frequency (F0) plays an important role in speech synthesis and directly influences the naturalness of synthetic speech. In conventional parametric speech synthesis, F0 is predicted frame by frame. This method is insufficient to represent F0 contours in larger units, especially the tone contours of syllables in tonal languages, which deviate as a result of long-term context dependency. This work proposes a syllable-level F0 model that represents F0 contours within syllables, using syllable-level F0 parameters that comprise sampled F0 points and dynamic features. A Deep Neural Network (DNN) was used to represent the relationships between syllable-level contextual features and syllable-level F0 parameters. The proposed model was examined using an Isarn speech synthesis system with both large and small training sets. For all training sets, the results of objective and subjective tests indicate that the proposed approach outperforms the baseline systems based on hidden Markov models and DNNs that predict F0 values at the frame level.
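
A syllable-level F0 parameter vector of the kind described above might be built as in the following sketch. It assumes a fixed number of equally spaced sample points per syllable and a simple first-difference delta as the dynamic feature; the abstract does not give the number of points or the delta definition, so `num_points` and both helper functions are illustrative assumptions.

```python
from typing import List


def resample_contour(contour: List[float], num_points: int) -> List[float]:
    """Linearly interpolate a frame-level F0 contour at equally spaced points."""
    n = len(contour)
    if n == 1 or num_points == 1:
        return [contour[0]] * num_points
    sampled = []
    for k in range(num_points):
        pos = k * (n - 1) / (num_points - 1)  # fractional frame index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        sampled.append(contour[lo] * (1.0 - frac) + contour[hi] * frac)
    return sampled


def syllable_f0_params(contour: List[float], num_points: int = 5) -> List[float]:
    """Return sampled F0 points followed by their first-difference deltas."""
    sampled = resample_contour(contour, num_points)
    # Assumed dynamic feature: backward difference, with 0 for the first point.
    deltas = [0.0] + [sampled[k] - sampled[k - 1] for k in range(1, num_points)]
    return sampled + deltas


# A rising contour over nine frames (values in Hz, made up for illustration).
frames = [100.0, 105.0, 110.0, 115.0, 120.0, 125.0, 130.0, 135.0, 140.0]
params = syllable_f0_params(frames, num_points=5)
```

Because every syllable maps to a vector of the same length regardless of its duration, such vectors can serve directly as regression targets for a network that takes syllable-level contextual features as input.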

Context features based pre-selection and weight prediction in concatenation speech synthesis system

The 9th International Symposium on Chinese Spoken Language Processing ◽

10.1109/iscslp.2014.6936611 ◽

2014 ◽

Author(s):

Shanfeng Liu ◽

Zhengqi Wen ◽

Ya Li ◽

Jianghua Tao ◽

Bin Liu

Keyword(s):

Speech Synthesis ◽

Synthesis System ◽

Weight Prediction ◽

Context Features

VOS: the Corpus-Based Vietnamese Text-to-Speech System

Research and Development on Information and Communication Technology ◽

10.32913/mic-ict-research.v3.n7.285 ◽

2010 ◽

Author(s):

Vo Quang Dieu Ha ◽

Nguyen Manh Tuan ◽

Cao Xuan Nam ◽

Pham Minh Nhut ◽

Vu Hai Quan

Keyword(s):

Experimental Evaluation ◽

Speech Synthesis ◽

Foreign Languages ◽

Synthetic Speech ◽

Text To Speech ◽

Synthesis System ◽

Unit Selection ◽

Southern Vietnam ◽

Selection Approach ◽

Complete Specification

This paper presents a complete specification of the Vietnamese speech synthesis system named VOS (Voice of Southern Vietnam). Because current Vietnamese text-to-speech systems lack naturalness in their synthetic speech output, VOS is based on the unit selection approach, which aims to achieve maximum naturalness. VOS has three main parts: a corpus manager, a synthesizer, and a transliteration model. The corpus manager handles automated speech indexing and segmentation for the unit selection executed by the synthesizer, while the transliteration model deals with the pronunciation of words in foreign languages. A comparative experimental evaluation of VnSpeech, VietVoice, and VOS is conducted using the ITU-T P.85 standard. Results show that VOS outperforms the former two TTS systems.
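
The unit selection approach is commonly framed as a search for the candidate sequence that minimizes the sum of target costs and join (concatenation) costs. The sketch below shows that generic formulation with a Viterbi-style dynamic program; it is not the VOS implementation, and the cost functions, the `select_units` name, and the toy pitch values are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def select_units(
    candidates: List[List[float]],
    target_cost: Callable[[int, float], float],
    join_cost: Callable[[float, float], float],
) -> Tuple[List[float], float]:
    """Pick one unit per position, minimizing total target plus join cost."""
    # best[i][j] holds (cumulative cost, index of best predecessor).
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][j][0] + join_cost(prev, u), j)
                for j, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)
    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example: each unit is represented only by a pitch value in Hz.
targets = [100.0, 120.0, 110.0]
candidates = [[95.0, 130.0], [118.0, 100.0], [111.0, 140.0]]
path, total = select_units(
    candidates,
    target_cost=lambda i, u: abs(u - targets[i]),  # distance to desired pitch
    join_cost=lambda a, b: 0.1 * abs(a - b),       # pitch jump at the join
)
```

In a real system the units would be indexed speech segments from the corpus, and both costs would combine several acoustic and linguistic features rather than a single pitch value.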

Text to speech synthesis system and method using context dependent vowel allophones

The Journal of the Acoustical Society of America ◽

10.1121/1.403608 ◽

1992 ◽

Vol 91 (4) ◽

pp. 2305-2305

Author(s):

Bathsheba J. Malsheen ◽

Gabriel F. Groner ◽

Linda D. Williams

Keyword(s):

Speech Synthesis ◽

Text To Speech ◽

Synthesis System ◽

Text To Speech Synthesis ◽

Context Dependent

Using an HPSG grammar for the generation of prosody

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar ◽

10.21248/hpsg.2007.5 ◽

2007 ◽

Author(s):

Berthold Crysmann ◽

Philipp Von Böselager

Keyword(s):

Speech Synthesis ◽

Large Scale ◽

Synthetic Speech ◽

Test Cases ◽

Synthesis System ◽

Testing Hypotheses ◽

Syntactic Structures ◽

Interface Module ◽

Syntactic Information

In this paper, we report on an experiment showing how the introduction of prosodic information from detailed syntactic structures into synthetic speech leads to better disambiguation of structurally ambiguous sentences. Using modifier attachment (MA) ambiguities and subject/object fronting (OF) in German as test cases, we show that prosody which is automatically generated from deep syntactic information provided by an HPSG generator can lead to considerable disambiguation effects, and can even override a strong semantics-driven bias. The architecture used in the experiment, consisting of the LKB generator running a large-scale grammar for German, a syntax-prosody interface module, and the speech synthesis system MARY, is shown to be a valuable platform for testing hypotheses in intonation studies.

Building HMM based unit-selection speech synthesis system using synthetic speech naturalness evaluation score

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2011.5947567 ◽

2011 ◽

Cited By ~ 3

Author(s):

Heng Lu ◽

Zhen-Hua Ling ◽

Li-Rong Dai ◽

Ren-Hua Wang

Keyword(s):

Speech Synthesis ◽

Synthetic Speech ◽

Synthesis System ◽

Unit Selection ◽

Evaluation Score ◽

Speech Naturalness

Research and Realization the Method of Pronunciation Conversion for Speech Synthesis of the Lhasa Dialect of Tibetan

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.571-572.858 ◽

2014 ◽

Vol 571-572 ◽

pp. 858-862

Author(s):

Zhi Qiang Wu ◽

Hong Zhi Yu ◽

Shu Hui Wan

Keyword(s):

Speech Synthesis ◽

Synthetic Speech ◽

Synthesis System ◽

Conversion System ◽

Synthesis Research

Pronunciation conversion is a prerequisite for building a speech synthesis system, and its accuracy directly affects the quality of the synthetic speech. Based on a study of the characteristics of Tibetan words and Lhasa pronunciation, of current methods of pronunciation conversion for the Lhasa dialect of Tibetan, and of the needs of speech synthesis research, a pronunciation conversion system was designed and implemented that can be applied to speech synthesis for the Lhasa dialect of Tibetan. In tests the system reached an accuracy of 95.3 percent, and the conversion results largely meet the needs of a Tibetan speech synthesis system.
