A model of F0 contour for Vietnamese questions, applied in speech synthesis

Author(s):  
Anh-Tu Le ◽  
Do-Dat Tran ◽  
Thu-Trang Thi Nguyen
Keyword(s):  
2020 ◽  
Vol 10 (18) ◽  
pp. 6381 ◽  
Author(s):  
Pongsathon Janyoi ◽  
Pusadee Seresangtakul

The modeling of fundamental frequency (F0) in speech synthesis is a critical factor affecting the intelligibility and naturalness of synthesized speech. In this paper, we focus on improving the modeling of F0 for Isarn speech synthesis. We propose an F0 model based on a recurrent neural network (RNN). F0 values sampled at the syllable level of continuous Isarn speech are combined with their dynamic features to represent the supra-segmental properties of the F0 contour. Different deep RNN architectures and different combinations of linguistic features are analyzed to find the conditions that give the best performance. To assess the proposed method, we compared it with several RNN-based baselines. The results of objective and subjective tests indicate that the proposed model significantly outperformed both the baseline RNN model that predicts F0 values at the frame level and the baseline RNN model that represents the F0 contours of syllables using the discrete cosine transform.
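
The syllable-level representation described in this abstract can be illustrated with a minimal sketch. It assumes a fixed number of equally spaced F0 samples per syllable and a central-difference delta as the dynamic feature; the abstract states neither the sample count nor the delta formula, so the function names, the default of five points, and the delta window are all hypothetical choices and not the authors' method.

```python
from typing import List, Sequence


def sample_contour(f0: Sequence[float], n_points: int = 5) -> List[float]:
    """Resample one syllable's F0 track (Hz) to n_points equally spaced
    values by linear interpolation. n_points=5 is an assumed default."""
    if not f0:
        raise ValueError("empty F0 track")
    if n_points < 2 or len(f0) == 1:
        return [float(f0[0])] * n_points
    last = len(f0) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional frame index
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(f0[lo] * (1.0 - frac) + f0[hi] * frac)
    return out


def deltas(values: Sequence[float]) -> List[float]:
    """First-order dynamic features: central difference over the sampled
    points, with the edge values replicated (an assumed convention)."""
    padded = [values[0]] + list(values) + [values[-1]]
    return [0.5 * (padded[i + 2] - padded[i]) for i in range(len(values))]


def syllable_features(f0: Sequence[float], n_points: int = 5) -> List[float]:
    """Concatenate the sampled F0 values with their deltas."""
    static = sample_contour(f0, n_points)
    return static + deltas(static)
```

Calling `syllable_features(track, 3)` on a syllable's frame-level F0 track returns six numbers: three sampled values followed by their three deltas.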


2019 ◽  
Vol 28 (2S) ◽  
pp. 875-886 ◽  
Author(s):  
Jennifer M. Vojtech ◽  
Jacob P. Noordzij ◽  
Gabriel J. Cler ◽  
Cara E. Stepp

Purpose: This study investigated how modulating fundamental frequency (f0) and speech rate differentially impact the naturalness, intelligibility, and communication efficiency of synthetic speech.
Method: Sixteen sentences of varying prosodic content were developed via a speech synthesizer. The f0 contour and speech rate of these sentences were altered to produce 4 stimulus sets: (a) normal rate with a fixed f0 level, (b) slow rate with a fixed f0 level, (c) normal rate with prosodically natural f0 variation, and (d) normal rate with prosodically unnatural f0 variation. Sixteen listeners provided orthographic transcriptions and judgments of naturalness for these stimuli.
Results: Sentences with f0 variation were rated as more natural than those with a fixed f0 level. Conversely, sentences with a fixed f0 level demonstrated higher intelligibility than those with f0 variation. Speech rate did not affect the intelligibility of stimuli with a fixed f0 level. Communication efficiency was highest for sentences produced at a normal rate and a fixed f0 level.
Conclusions: Sentence-level f0 variation increased naturalness ratings of synthesized speech, whether the variation was prosodically natural or not. However, these f0 variations reduced intelligibility. There is evidence of a trade-off in naturalness and intelligibility of synthesized speech, which may impact future speech synthesis designs.
Supplemental Material: https://doi.org/10.23641/asha.8847833


2020 ◽  
Vol 63 (11) ◽  
pp. 3855-3864
Author(s):  
Wanting Huang ◽  
Lena L. N. Wong ◽  
Fei Chen ◽  
Haihong Liu ◽  
Wei Liang

Purpose: Fundamental frequency (F0) is the primary acoustic cue for lexical tone perception in tonal languages but is processed in a limited way in cochlear implant (CI) systems. The aim of this study was to evaluate the importance of F0 contours in sentence recognition in Mandarin-speaking children with CIs and to determine whether it is similar to or different from that in age-matched normal-hearing (NH) peers.
Method: Age-appropriate sentences, with F0 contours manipulated to be either natural or flattened, were randomly presented to preschool children with CIs and their age-matched peers with NH under three test conditions: in quiet, in white noise, and with competing sentences at 0 dB signal-to-noise ratio.
Results: The neutralization of F0 contours resulted in a significant reduction in sentence recognition. While this was seen only in noise conditions among NH children, it was observed throughout all test conditions among children with CIs. Moreover, the F0 contour-induced accuracy reduction ratios (i.e., the reduction in sentence recognition resulting from the neutralization of F0 contours compared to the normal F0 condition) were significantly greater in children with CIs than in NH children in all test conditions.
Conclusions: F0 contours play a major role in sentence recognition in both quiet and noise among pediatric implantees, and the contribution of the F0 contour is even more salient than that in age-matched NH children. These results also suggest that there may be differences between children with CIs and NH children in how F0 contours are processed.
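
The two quantities this abstract relies on, a flattened F0 contour and an accuracy reduction ratio, can be made concrete with a short sketch. It assumes that flattening sets every voiced frame to the utterance's mean F0 and that the ratio is the accuracy drop divided by the natural-contour accuracy. The abstract specifies neither detail, so both formulas and the function names are illustrative assumptions.

```python
from typing import List, Sequence


def flatten_f0(f0: Sequence[float]) -> List[float]:
    """Replace every voiced frame (F0 > 0) with the mean voiced F0.
    Unvoiced frames, marked here by 0, are left unvoiced.
    Flattening to the mean is an assumption, not the study's procedure."""
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return [0.0] * len(f0)
    mean_f0 = sum(voiced) / len(voiced)
    return [mean_f0 if v > 0 else 0.0 for v in f0]


def reduction_ratio(acc_natural: float, acc_flat: float) -> float:
    """Relative drop in recognition accuracy caused by flattening,
    taken relative to the natural-contour accuracy (assumed definition)."""
    if acc_natural <= 0:
        raise ValueError("natural-contour accuracy must be positive")
    return (acc_natural - acc_flat) / acc_natural
```

Under these assumptions, a listener scoring 80% with natural contours and 60% with flattened contours has a reduction ratio of 0.25.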


2009 ◽  
Author(s):  
Robert E. Remez ◽  
Kathryn R. Dubowski ◽  
Morgana L. Davids ◽  
Emily F. Thomas ◽  
Nina Paddu ◽  
...  

2020 ◽  
pp. 1-12
Author(s):  
Li Dongmei

English text-to-speech conversion is a key topic in modern computer technology research. Its difficulty is that feature recognition during conversion produces large errors, and the conversion algorithm is hard to apply in a working system. To improve the efficiency of English text-to-speech conversion, this article takes a machine learning approach: after the original voice waveform is labeled with the pitch, the rhythm is modified through PSOLA, and the C4.5 algorithm is used to train a decision tree for judging the pronunciation of polyphones. To evaluate the performance of the pronunciation discrimination method based on part-of-speech rules and of HMM-based prosody hierarchy prediction in speech synthesis systems, this study constructed a system model. In addition, the waveform stitching method and PSOLA are used to synthesize the sound. For words whose main stress cannot be determined from morphological structure, stress labels can be learned by machine learning methods. Finally, this study evaluates and analyzes the performance of the algorithm through controlled experiments. The results show that the proposed algorithm performs well and has some practical value.
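
C4.5 selects each decision-tree split by gain ratio, that is, information gain divided by the split information of the candidate feature. The sketch below computes that criterion for a single categorical feature, such as a part-of-speech tag used to choose a polyphone's pronunciation. The function names and any example data are hypothetical, and this is only the split criterion, not the article's full training procedure.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Shannon entropy (bits) of the empirical distribution of items."""
    total = len(items)
    counts = Counter(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """C4.5 gain ratio of one categorical feature with respect to labels."""
    total = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A feature with a single value carries no split information.
    return info_gain / split_info if split_info > 0 else 0.0
```

A feature that separates the pronunciations perfectly, for example a tag list `["noun", "noun", "verb", "verb"]` against labels `["A", "A", "B", "B"]`, yields a gain ratio of 1.0, while a feature that takes the same value on every example yields 0.0.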

