Parsing Hierarchical Prosodic Structure for Mandarin Speech Synthesis

As speech synthesis technology matures, it is used more and more widely in daily life, and the requirements placed on speech synthesis systems keep rising. Advanced techniques are therefore used to improve and update the accent recognition system. This paper introduces a word stress annotation technique combined with neural-network speech synthesis. In Chinese speech synthesis, prosodic structure prediction strongly influences naturalness, so predicting the prosodic structure accurately has become an important problem, and it is the aim of this paper. Experimental data show that the average sample error during network training is |e|/85 and that the minimum training error after 500 steps is 0.0013127, so the final sample average error is |e| = 85 × 0.0013127 ≈ 0.112 < 0.5. A deep neural network (DNN) is trained on different parameters to obtain conversion models, which are then combined to improve the quality of the synthesized speech.
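The error figure quoted in the abstract can be checked with a few lines of arithmetic. This is only a sketch: it assumes that 85 is the number of training samples (the abstract does not say so explicitly) and uses the per-sample value that appears in the product, 0.0013127. The variable names are illustrative and not taken from the paper.

```python
# Assumption: 85 is the sample count; 0.0013127 is the minimum training
# error after 500 steps, as it appears in the product in the abstract.
n_samples = 85
min_training_error = 0.0013127
threshold = 0.5

# Scale the per-sample error by the sample count, as the abstract does.
scaled_error = n_samples * min_training_error

print(f"|e| = {n_samples} x {min_training_error} = {scaled_error:.3f}")
print("below threshold:", scaled_error < threshold)
```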

A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2

International Journal of Machine Learning and Cybernetics, 2021. DOI: 10.1007/s13042-021-01365-x

Author(s): Junmin Liu, Zhuangzhuang Xie, Chunxia Zhang, Guang Shi

Keyword(s): Structure Prediction, Speech Synthesis, Prosodic Structure, Novel Method

Speech synthesis from natural models by hand and by algorithm

PsycEXTRA Dataset, 2009. DOI: 10.1037/e520562012-289

Author(s): Robert E. Remez, Kathryn R. Dubowski, Morgana L. Davids, Emily F. Thomas, Nina Paddu, ...

Keyword(s): Speech Synthesis

Design of English text-to-speech conversion algorithm based on machine learning

Journal of Intelligent & Fuzzy Systems, 2020, pp. 1-12. DOI: 10.3233/jifs-189238

Author(s): Li Dongmei

Keyword(s): Machine Learning, Speech Synthesis, Feature Recognition, Learning Algorithm, Morphological Structure, English Text, Text To Speech, Part Of Speech, Modern Computer, Conversion Algorithm

English text-to-speech conversion is a key topic in modern computer technology research. Its difficulty is that feature recognition during conversion produces large errors, which makes the conversion algorithm hard to apply in a working system. To improve the efficiency of English text-to-speech conversion, this article takes a machine learning approach: after the original voice waveform is labeled with the pitch, the rhythm is modified with PSOLA, and the C4.5 algorithm is used to train a decision tree that judges the pronunciation of polyphones. To evaluate the pronunciation discrimination method based on part-of-speech rules and the HMM-based prosody hierarchy prediction in speech synthesis systems, the study constructs a system model. Speech is synthesized with waveform concatenation (stitching) and PSOLA. For words whose main stress cannot be determined from morphological structure, stress labels can be learned with machine learning methods. Finally, the algorithm's performance is evaluated and analyzed through controlled experiments. The results show that the proposed algorithm performs well and has practical value.
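To make the C4.5 step concrete, the sketch below computes the gain ratio, the criterion C4.5 uses to choose which feature to split on, for a toy polyphone example. Everything in it is an assumption made for illustration: the word ("record" as noun or verb), the feature names `pos` and `sentence_initial`, and the six samples are invented and are not the paper's data or features. It shows only the split-selection step, not a full tree learner.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(rows, feature, target):
    """C4.5 gain ratio: information gain divided by split information."""
    base = entropy([r[target] for r in rows])
    # Group the target labels by the value of the candidate feature.
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([r[feature] for r in rows])
    if split_info == 0:  # feature has a single value: no useful split
        return 0.0
    return (base - remainder) / split_info

# Hypothetical samples for the polyphone "record" (invented data).
rows = [
    {"pos": "noun", "sentence_initial": "no",  "reading": "REC-ord"},
    {"pos": "noun", "sentence_initial": "no",  "reading": "REC-ord"},
    {"pos": "verb", "sentence_initial": "no",  "reading": "re-CORD"},
    {"pos": "verb", "sentence_initial": "yes", "reading": "re-CORD"},
    {"pos": "noun", "sentence_initial": "no",  "reading": "REC-ord"},
    {"pos": "verb", "sentence_initial": "yes", "reading": "re-CORD"},
]

features = ["pos", "sentence_initial"]
scores = {f: gain_ratio(rows, f, "reading") for f in features}
best = max(scores, key=scores.get)
print(scores, "-> split on", best)
```

In this toy set the part-of-speech feature separates the two readings completely, so it receives the higher gain ratio and would become the root split; a full C4.5 learner would then recurse on each branch and prune.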
