An Iterated Two-Step Sinusoidal Pitch Contour Formulation for Expressive Speech Synthesis

2021 ◽  
Vol 20 (No.4) ◽  
pp. 489-510
Author(s):  
Izzad Ramli ◽  
Nursuriati Jamil ◽  
Noraini Seman

Intonation generation in expressive speech such as storytelling is essential for producing a high-quality Malay-language expressive speech synthesizer. One approach to intonation generation, explicit control, has shown good performance in terms of intelligibility with reasonably natural speech; thus, it was selected in this research. This approach modifies prosodic features, such as pitch contour, intensity, and duration, to generate the intonation. However, modification of the pitch contour remains a problem because the desired pitch contour is not achieved. This paper formulated an improved pitch contour algorithm to produce a modified pitch contour resembling the natural pitch contour. In this work, the syllable pitch contours of nine storytellers were extracted from their storytelling speeches to create an expressive speech syllable dataset called STORY_DATA. All the pitch contour shapes in STORY_DATA were analyzed and clustered into the six standard main pitch contour clusters for storytelling. The clustering was performed using one minus the Pearson product-moment correlation as the distance measure. Then, an improved iterative two-step sinusoidal pitch contour formulation was introduced to modify the pitch contours of neutral speech into the expressive pitch contours of natural speech. Overall, the improved pitch contour formulation achieved 93 percent highly correlated matches, compared with 15 percent for the previous pitch contour formulation, indicating a much closer resemblance to the natural contours. Therefore, the improved formula can be used in a text-to-speech (TTS) synthesizer to produce more natural expressive speech. The paper also discovered unique expressive pitch contours in the Malay language that need further investigation.
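The distance measure named in the abstract, one minus the Pearson product-moment correlation, can be illustrated with a short sketch. This is not the authors' implementation: the function name `pearson_distance` and the sample contours are hypothetical, and the abstract does not say how contours were sampled or normalized before clustering.

```python
import math


def pearson_distance(a, b):
    """Return 1 - Pearson r for two equal-length pitch contours.

    0 means the contours have the same shape (up to offset and scale),
    2 means they are perfectly inverted.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("contours must have equal length of at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return 1.0 - cov / math.sqrt(var_a * var_b)


# Hypothetical syllable pitch contours in Hz (illustrative values only).
rising = [100.0, 110.0, 120.0, 130.0]
rising_shifted = [200.0, 220.0, 240.0, 260.0]
falling = [130.0, 120.0, 110.0, 100.0]

same_shape = pearson_distance(rising, rising_shifted)
opposite_shape = pearson_distance(rising, falling)
```

Because the correlation ignores offset and scale, two rising contours at different pitch levels are at distance 0, which is why this measure suits clustering by contour shape rather than by absolute pitch.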

Author(s):  
Tejinder Kaur ◽  
Charanjiv Singh

Text-to-speech (TTS) is the generation of synthesized speech from text. Language is the ability to express one's thoughts by means of a set of signs (text), gestures, and sounds. It is a distinctive feature of human beings, who are the only creatures to use such a system. Speech is the oldest means of communication between people and it is also the most widely used. "Speech synthesis", also called "text-to-speech synthesis", is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer and can be implemented in software. A text-to-speech (TTS) system converts text to speech. The proposed Enhanced Transcriptions Method is developed using Microsoft Visual Studio in the VB.Net language. First, word indexing is performed for the predefined words; then the corresponding speech signal is detected, and errors in words are calculated using Euclidean distance. The results show that the Enhanced Transcriptions Method has higher accuracy (89%) than the previous Transcriptions Method (79%). The specificity of the proposed method is 0.89, and that of the previous method is 0.79.
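The abstract says word errors are calculated using Euclidean distance but does not say which features are compared. As a minimal sketch only, the distance between two equal-length feature vectors could be computed as follows; the function name and the sample vectors are hypothetical, and Python is used here for illustration although the paper's system was written in VB.Net.

```python
import math


def euclidean_distance(u, v):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical feature vectors for a reference word and a detected word.
reference = [0.0, 0.0]
detected = [3.0, 4.0]

error = euclidean_distance(reference, detected)
```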


1992 ◽  
Vol 36 (3) ◽  
pp. 232-236
Author(s):  
Hiroshi Hamada ◽  
Jin'ichi Chiba

To design a method for controlling the main speech parameters for keyword emphasis in a text-to-speech synthesizer, the relation between speech parameters and emphasis level is determined from experiments. Twelve subjects are instructed to modify keyword emphasis to achieve natural-sounding speech in three sentences. An interactive speech editor with a graphical user interface is developed for the experiments. The editor allows the subjects to control the speech intensity, speech rate, and average fundamental frequency of the keyword and of the other sentence components. Subjects can also control the pause (silence) duration preceding and following the keyword. The extracted relations between prosodic feature parameters and emphasis level show that speech intensity and speech rate are independent of sentence content. Speech intensity increases linearly and speech rate decreases linearly with emphasis level. On the other hand, average fundamental frequency and pause duration depend on sentence content, and relatively large changes are required to strongly emphasize keywords using pause insertion and increased fundamental frequency.


2020 ◽  
Vol 49 (4) ◽  
pp. 521-552 ◽  
Author(s):  
Elizabeth Couper-Kuhlen

This study explores the link between prosody and other-repetition in a moderately large collection from everyday English talk-in-interaction (n = 200). British English and North American English cases were analysed separately in order to track possible varietal differences. Of initial interest was the question whether focal pitch accents might disambiguate among other-repetition actions, both those related to repair and those that go beyond repair. The results indicate that only two out of six possible other-repetition actions are associated with distinct focal pitch contours in the two varieties. For all other repair and beyond-repair actions speakers use many of the same pitch contours nondistinctively. Overall, falling contours appear more frequently in British other-repetitions, while rising contours are more frequent in North American other-repetitions. In conclusion, it is argued that in addition to pitch contour, prosodic features such as pitch span, loudness, and timing are crucial in distinguishing other-repetition actions, as are nonprosodic factors such as epistemic access (often reflected in oh-prefacing) and visible behavior. (Repair initiation, surprise, challenge, registering, pitch accents, oh-preface, epistemics)

