Using an HPSG grammar for the generation of prosody

Author(s):  
Berthold Crysmann ◽  
Philipp von Böselager

In this paper, we report on an experiment showing how the introduction of prosodic information from detailed syntactic structures into synthetic speech leads to better disambiguation of structurally ambiguous sentences. Using modifier attachment (MA) ambiguities and subject/object fronting (OF) in German as test cases, we show that prosody which is automatically generated from deep syntactic information provided by an HPSG generator can lead to considerable disambiguation effects, and can even override a strong semantics-driven bias. The architecture used in the experiment, consisting of the LKB generator running a large-scale grammar for German, a syntax-prosody interface module, and the speech synthesis system MARY, is shown to be a valuable platform for testing hypotheses in intonation studies.

2021 ◽  
Vol 14 (3) ◽  
pp. 1-26
Author(s):  
Danielle Bragg ◽  
Katharina Reinecke ◽  
Richard E. Ladner

As conversational agents and digital assistants become increasingly pervasive, understanding their synthetic speech becomes increasingly important. Simultaneously, speech synthesis is becoming more sophisticated and manipulable, providing the opportunity to optimize speech rate to save users time. However, little is known about people’s abilities to understand fast speech. In this work, we provide an extension of the first large-scale study on human listening rates, enlarging the prior study run with 453 participants to 1,409 participants and adding new analyses on this larger group. Run on LabintheWild, it used volunteer participants, was screen reader accessible, and measured listening rate by accuracy at answering questions spoken by a screen reader at various rates. Our results show that people who are visually impaired, who often rely on audio cues and access text aurally, generally have higher listening rates than sighted people. The findings also suggest a need to expand the range of rates available on personal devices. These results demonstrate the potential for users to learn to listen to faster rates, expanding the possibilities for human-conversational agent interaction.


Author(s):  
Mahbubur R. Syed ◽  
Shuvro Chakrobartty ◽  
Robert J. Bignall

Speech synthesis is the process of producing natural-sounding, highly intelligible synthetic speech simulated by a machine in such a way that it sounds as if it were produced by a human vocal system. A text-to-speech (TTS) synthesis system is a computer-based system where the input is text and the output is a simulated vocalization of that text. Before the 1970s, most speech synthesis was achieved with hardware, but this was costly and it proved impossible to properly simulate natural speech production. Since the 1970s, the use of computers has made the practical application of speech synthesis more feasible.


Author(s):  
Vo Quang Dieu Ha ◽  
Nguyen Manh Tuan ◽  
Cao Xuan Nam ◽  
Pham Minh Nhut ◽  
Vu Hai Quan

This paper presents a complete specification of the Vietnamese speech synthesis system named VOS (Voice of Southern Vietnam). Because current Vietnamese text-to-speech systems lack naturalness in their synthetic speech, VOS is based on the unit selection approach, which aims to achieve maximum naturalness. VOS consists of three main parts: a corpus manager, a synthesizer, and a transliteration model. The corpus manager handles automated speech indexing and segmentation for the unit selection executed by the synthesizer, while the transliteration model deals with the pronunciation of words from foreign languages. A comparative experimental evaluation of VnSpeech, VietVoice, and VOS was conducted using the ITU-T P.85 standard. The results show that VOS outperforms the former two TTS systems.
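The unit selection approach that the VOS abstract refers to is conventionally framed as choosing, for each target phone, one candidate unit from a recorded inventory so as to minimize a combined target cost (mismatch with the desired specification) and concatenation cost (join smoothness), typically with a Viterbi-style search. A minimal illustrative sketch, not VOS's actual implementation; the cost functions, unit attributes, and inventory below are toy stand-ins:

```python
# Illustrative unit-selection search: pick one candidate unit per target
# position, minimizing target cost + concatenation cost (Viterbi-style DP).
# Costs and the candidate inventory are hypothetical placeholders.

def target_cost(unit, target):
    # Mismatch penalty between a candidate unit and the desired target phone.
    return 0.0 if unit["phone"] == target else 1.0

def concat_cost(prev_unit, unit):
    # Join penalty: here, a simple pitch-discontinuity proxy.
    return abs(prev_unit["f0"] - unit["f0"]) / 100.0

def select_units(targets, inventory):
    """Return the lowest-cost unit sequence for the target phone string."""
    layers = [inventory[t] for t in targets]
    # history[i][j] = (cumulative cost of best path ending in unit j, backpointer)
    history = [[(target_cost(u, targets[0]), None) for u in layers[0]]]
    for i in range(1, len(layers)):
        current = []
        for u in layers[i]:
            tc = target_cost(u, targets[i])
            cost, back = min(
                (history[-1][j][0] + concat_cost(p, u) + tc, j)
                for j, p in enumerate(layers[i - 1])
            )
            current.append((cost, back))
        history.append(current)
    # Backtrack from the cheapest final candidate.
    idx = min(range(len(history[-1])), key=lambda j: history[-1][j][0])
    path = []
    for i in range(len(layers) - 1, -1, -1):
        path.append(layers[i][idx])
        idx = history[i][idx][1]
    return list(reversed(path))

inventory = {
    "a": [{"phone": "a", "f0": 120}, {"phone": "a", "f0": 180}],
    "b": [{"phone": "b", "f0": 125}, {"phone": "b", "f0": 200}],
}
units = select_units(["a", "b"], inventory)
print([u["f0"] for u in units])  # picks the smoother join: [120, 125]
```

Real systems add many more cost terms (spectral, duration, prosodic context) and search over much larger inventories, but the dynamic-programming shape is the same.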


Author(s):  
Pongsathon Janyoi ◽  
Pusadee Seresangtakul

This paper describes a speech synthesis system for Isarn, a regional dialect spoken in northeastern Thailand. In this study, we focus on improving the system's prosody generation by using additional context features. To develop the system, the speech parameters (mel-cepstra and fundamental frequencies of phonemes within different phonetic contexts) were modelled using hidden Markov models (HMMs). Synthetic speech was generated by converting the input text into context-dependent phonemes. Speech parameters were generated from the trained HMMs, according to the context-dependent phonemes, and were then synthesized through a speech vocoder. In this study, systems were trained using three different feature sets: basic contextual features, tonal features, and syllable-context features. Objective and subjective tests were conducted to determine the performance of the proposed system. The results indicated that the addition of the syllable-context features significantly improved the naturalness of the synthesized speech.
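The HMM-based pipeline the abstract describes runs text through context-dependent phoneme labels, looks up trained models to emit parameter frames, and hands those to a vocoder. A minimal sketch of that flow, assuming invented label formats and toy "trained" models (real systems use rich quinphone labels with tone and syllable context, state-level durations, and a signal-processing vocoder):

```python
# Skeleton of the HMM-based parametric synthesis flow: text-derived phones ->
# context-dependent labels -> parameter frames from trained models -> vocoder.
# MODELS, the label scheme, and the frame contents are hypothetical.

# Per-label mean mel-cepstrum vector, mean log-F0, and a fixed frame count
# standing in for trained HMM output and duration models.
MODELS = {
    "sil": {"mcep": [0.0, 0.0], "lf0": 0.0, "frames": 2},
    "k-a+t": {"mcep": [1.2, -0.3], "lf0": 4.8, "frames": 3},
}

def to_context_labels(phones):
    """Build triphone-style labels: left-center+right for each phone."""
    padded = ["sil"] + phones + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

def generate_parameters(labels):
    """Emit (mcep, lf0) frames per label; unseen labels back off to 'sil'."""
    frames = []
    for label in labels:
        model = MODELS.get(label, MODELS["sil"])
        frames.extend([(model["mcep"], model["lf0"])] * model["frames"])
    return frames

labels = to_context_labels(["k", "a", "t"])
print(labels)                              # ['sil-k+a', 'k-a+t', 'a-t+sil']
print(len(generate_parameters(labels)))    # 7 frames: 2 (backoff) + 3 + 2
```

The backoff to a default model mirrors, very loosely, how real systems handle unseen contexts via decision-tree clustering of HMM states rather than a single fallback.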


2014 ◽  
Vol 571-572 ◽  
pp. 858-862
Author(s):  
Zhi Qiang Wu ◽  
Hong Zhi Yu ◽  
Shu Hui Wan

Pronunciation conversion is a prerequisite for building a speech synthesis system, and its accuracy directly affects the quality of the synthetic speech. By studying the characteristics of Tibetan words and Lhasa pronunciation, together with current methods of pronunciation conversion for the Lhasa dialect of Tibetan and the needs of speech synthesis research, we designed and implemented a pronunciation conversion system that can be applied to Lhasa-dialect Tibetan speech synthesis. In tests, the system achieved 95.3% accuracy, and the conversion results are largely able to meet the needs of a Tibetan speech synthesis system.
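Pronunciation conversion systems of this kind are commonly structured as an exception dictionary consulted first, with rule-based grapheme-to-pronunciation mapping as a fallback. A toy sketch of that general shape; the syllables and mappings below are invented placeholders, not actual Tibetan orthography or Lhasa pronunciation rules:

```python
# Dictionary-plus-rules pronunciation conversion sketch. Whole-syllable
# exceptions are looked up first; otherwise the longest matching onset
# rule is applied. All entries here are hypothetical examples.

EXCEPTIONS = {"bkra": "tra"}           # whole-syllable lookup first
ONSET_RULES = {"bk": "k", "br": "tr"}  # fallback onset rewriting

def convert_syllable(syl):
    if syl in EXCEPTIONS:
        return EXCEPTIONS[syl]
    # Try longer onsets before shorter ones so "bkr..." isn't cut short.
    for onset in sorted(ONSET_RULES, key=len, reverse=True):
        if syl.startswith(onset):
            return ONSET_RULES[onset] + syl[len(onset):]
    return syl

def convert(word):
    # Syllables are assumed to be pre-segmented, joined by ".".
    return " ".join(convert_syllable(s) for s in word.split("."))

print(convert("bkra.shis"))  # → "tra shis"
```

A production system would also need tone assignment and context-sensitive rules across syllable boundaries, which is where most of the reported accuracy effort would lie.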


Author(s):  
S.J. Eady ◽  
T.M.S. Hemphill ◽  
J.R. Woolsey ◽  
J.A.W. Clayards
