HMM-Based Vietnamese Speech Synthesis

2015 ◽  
Vol 3 (4) ◽  
pp. 33-47
Author(s):  
Son Trinh ◽  
Kiem Hoang

This paper describes how to improve the naturalness of HMM-based speech synthesis for the Vietnamese language. In this synthesis method, trajectories of speech parameters are generated from trained hidden Markov models, and the final speech waveform is synthesized from those parameters. The main objective of the development is to achieve maximum naturalness in the output speech through three key points. First, the system uses a high-quality recorded Vietnamese speech database suitable for training, particularly in the statistical parametric modeling approach. Second, prosodic information such as tone, part of speech (POS), and features based on characteristics of the Vietnamese language is added to ensure the quality of the synthetic speech. Third, the system uses STRAIGHT, which has shown its ability to produce high-quality voice manipulation and has been successfully incorporated into HMM-based speech synthesis. The results show that the speech produced by our system achieves the best results when compared with the other Vietnamese TTS systems trained on the same speech data.

Author(s):  
Hiroyuki Segi

Various unit-selection speech-synthesis systems have been proposed. In most of them, the search units are rather short, such as syllables, phonemes, and diphones. However, when applied to large speech databases, shorter units produce more voice-waveform candidates, so a larger speech database cannot be used in practice without narrow pruning, and narrow pruning impairs the quality of the synthesized speech. Here the author examined the possibility of using words as search units. Subjective evaluations indicated that 70% of the speech synthesized by the proposed method sounded more natural than that synthesized by a conventional method. The five-point mean opinion score of the synthesized speech was 3.5, and 21% was judged to sound as natural as human speech. These results demonstrate the effectiveness of unit-selection speech synthesis using words as search units.


Biomimetics ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. 12
Author(s):  
Marvin Coto-Jiménez

Statistical parametric speech synthesis based on hidden Markov models (HMMs) has been an important technique for the production of artificial voices, due to its ability to produce results with high intelligibility and sophisticated features such as voice conversion and accent modification with a small footprint, particularly for low-resource languages where deep learning-based techniques remain unexplored. Despite this progress, the quality of HMM-based results does not reach that of the predominant approaches, which are based on unit selection of speech segments or on deep learning. One of the proposals to improve the quality of HMM-based speech has been to incorporate postfiltering stages, which aim to increase the quality while preserving the advantages of the process. In this paper, we present a new approach to postfiltering synthesized voices with the application of discriminative postfilters, using several long short-term memory (LSTM) deep neural networks. Our motivation stems from modeling a specific mapping from synthesized to natural speech on those segments corresponding to voiced or unvoiced sounds, due to the different qualities of those sounds and how HMM-based voices can present distinct degradation on each one. The paper analyses the discriminative postfilters obtained using five voices, evaluated using three objective measures, Mel cepstral distance and subjective tests. The results indicate the advantages of the discriminative postfilters in comparison with the HTS voice and the non-discriminative postfilters.
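The abstract above names Mel cepstral distance as one of its evaluation measures but does not give a formula. The following is a minimal sketch of the commonly used mel-cepstral distortion in decibels, assuming that the two sequences are already time-aligned frame by frame and that coefficient 0 (energy) is excluded. The function name and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean mel-cepstral distortion (dB) between two aligned frame sequences.

    Each frame is a list of mel-cepstral coefficients. Coefficient 0 is
    skipped, which is a common convention and an assumption here.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and time-aligned")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over coefficients 1..D.
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


# Toy example with made-up three-coefficient frames.
natural = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]]
synthetic = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
print(mel_cepstral_distortion(natural, synthetic))
```

A lower value means the synthesized cepstra are closer to the natural ones. In practice the frames would first need alignment (for example by dynamic time warping), which this sketch leaves out.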


Author(s):  
Tetsuo Kosaka ◽  
Takashi Kusama ◽  
Masaharu Kato ◽  
Masaki Kohda

The aim of this work is to improve the recognition performance of spontaneous speech. To this end, the authors of this chapter propose new approaches to unsupervised adaptation for spontaneous speech and evaluate the methods using diagonal-covariance and full-covariance hidden Markov models. In the adaptation procedure, language model (LM) adaptation and acoustic model (AM) adaptation are used iteratively. Several combination methods are tested to find the optimal approach. In the LM adaptation, a word trigram model and a part-of-speech (POS) trigram model are combined to build a more task-specific LM. In addition, the authors propose an unsupervised speaker adaptation technique based on adaptation data weighting, in which the weighting depends on POS class. In Japan, the large-scale spontaneous speech database “Corpus of Spontaneous Japanese (CSJ)” has been used as the common evaluation database for spontaneous speech, and the authors used it for their recognition experiments. In the results, the proposed methods demonstrated a significant advantage on that task.


2012 ◽  
Vol 7 ◽  
Author(s):  
Ines Rehbein ◽  
Hagen Hirschmann ◽  
Anke Lüdeling ◽  
Marc Reznicek

Parsing learner data poses a great challenge for standard tools, since non-canonical and unusual structures may lead to wrong interpretations on the part of the taggers and parsers. It is well known that providing a statistical parser with perfect part-of-speech (POS) tags is of great benefit for parsing accuracy, and that parsing results can decrease considerably when the parser has to predict its own POS tags. Therefore one might expect that even small improvements in POS accuracy have a positive effect on parsing performance. In this paper we test this assumption and assess the impact of POS tag accuracy on constituency parsing for German learner language. We compare different strategies for manually correcting the learner text and specific POS tags, and we measure the time requirements for each strategy. We show that tagging a canonical equivalent of the non-canonical learner text substantially improves POS tag accuracy. Correcting only selected POS tags can lead to parsing results comparable to a setting where all POS tags are corrected, while reducing annotation time substantially. However, the manual corrections of the POS tags do not result in a statistically significant improvement for parsing, giving evidence for the high quality of the automatically predicted parts of speech for the corrected learner data.


Author(s):  
Duy Ninh Khánh

This paper describes the development and evaluation of a Vietnamese statistical speech synthesis system using the average voice approach. Although speaker-dependent systems have been applied extensively, no average-voice-based system has been developed for Vietnamese so far. We have collected speech data from several native Vietnamese speakers and employed state-of-the-art speech analysis, model training and speaker adaptation techniques to develop the system. We have also performed perceptual experiments to compare the quality of speaker-adapted (SA) voices built on the average voice model with that of speaker-dependent (SD) voices built on SD models, and to confirm the effects of contextual features including word boundary (WB) and part-of-speech (POS) on the quality of synthetic speech. Evaluation results show that SA voices have significantly higher naturalness than SD voices when the same limited contextual feature set excluding WB and POS is used. In addition, SA voices trained with limited contextual features excluding WB and POS still have better quality than SD voices trained with full contextual features including WB and POS. These results show the robustness of the average voice method over the speaker-dependent approach for Vietnamese statistical speech synthesis.


2020 ◽  
pp. 1-12
Author(s):  
Li Dongmei

English text-to-speech conversion is a key topic in modern computer technology research. Its difficulty is that text-to-speech feature recognition introduces large errors during conversion, which makes it hard to apply English text-to-speech conversion algorithms in a working system. To improve the efficiency of English text-to-speech conversion, this article takes a machine learning approach: after the original voice waveform is labeled with pitch, the rhythm is modified through PSOLA, and the C4.5 algorithm is used to train a decision tree for judging the pronunciation of polyphones. To evaluate the performance of the pronunciation discrimination method based on part-of-speech rules and of HMM-based prosody hierarchy prediction in speech synthesis systems, this study constructed a system model. In addition, the waveform stitching method and PSOLA are used to synthesize the sound. For words whose main stress cannot be determined from morphological structure, labels can be learned by machine learning methods. Finally, this study evaluates and analyzes the performance of the algorithm through control experiments. The results show that the proposed algorithm performs well and has a certain practical effect.
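The abstract above mentions training a C4.5 decision tree to judge the pronunciation of polyphones. As an illustration only, the sketch below computes the gain ratio, the criterion C4.5 uses to choose a split attribute, for a single categorical feature. The feature (a part-of-speech tag) and the tiny data set for the word "record" are hypothetical and are not drawn from the article.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by its split information."""
    total = len(labels)
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)

    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder

    # Split information is the entropy of the feature's own distribution.
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical samples: POS tag of "record" and its stress pattern.
pos_tags = ["noun", "noun", "verb", "verb"]
stress = ["first", "first", "second", "second"]
print(gain_ratio(pos_tags, stress))
```

In a full C4.5 implementation this score would be computed for every candidate attribute at each node, with the highest-scoring attribute chosen for the split. Pruning and the handling of continuous attributes are omitted here.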


Author(s):  
A. T. Kunakbaeva ◽  
A. M. Stolyarov ◽  
M. V. Potapova

Free-cutting steel gains its specific working properties from a high content of sulfur and phosphorus. These elements, especially sulfur, have a rather high tendency to segregate, so segregation defects can develop significantly in continuously cast billets of free-cutting steel. The aim of the work was to study the influence of the chemical composition of free-cutting steel and of casting process parameters on the macrostructure quality of continuously cast billets. The research methods were a metallographic assessment of the internal structure of cast free-cutting steel and processing of the data by correlation and regression analysis. Production data from 43 heats of free-cutting steel of grade A12 were studied. The steel was cast on a five-strand radial continuous casting machine using different methods of pouring metal from the tundish into the molds: 19 heats were poured with an open stream, and 24 heats with a closed stream through submerged nozzles with a vertical hole. High-quality billets had a cross-sectional size of 150×150 mm. The macrostructure of high-quality square billets made of A12 free-cutting steel is characterized by central porosity, axial segregation and peripheral point contamination, each developed to a degree of 1.5 to 2.0 points, and by segregation cracks and strips developed to about 1.0 point. With open-stream casting, almost all of these defects are more developed than with closed-stream casting. Correlation and regression analysis established linear dependences of the degree of development of segregation cracks and strips, both axial and angular, on the sulfur content of the steel and on the ratio of manganese content to sulfur content. The degree of development of these defects increases with increasing sulfur content in A12 steel. The defects were especially pronounced when the sulfur content of the steel exceeded 0.10%. To improve the quality of the cast metal, the ratio of manganese content to sulfur content in the metal should be greater than eight.


2020 ◽  
pp. 52-58 ◽  
Author(s):  
A. A. Eryomenko ◽  
N. V. Rostunova ◽  
S. A. Budagyan ◽  
V. V. Stets

This article describes the experience of clinical testing of the personal telemedicine system ‘Obereg’ for remote monitoring of patients in the intensive care units of leading Russian clinics. The testing confirmed the high quality of communication with doctors' remote receiving devices, the accuracy of measurements, resistance to interference from various hospital equipment, and the absence of any impact of the system itself on such equipment. The system has significant advantages over stationary patient monitors, in particular for in-hospital and out-of-hospital transportation of patients.


2018 ◽  
pp. 26-35
Author(s):  
Z. A. Agaeva ◽  
K. B. Baghdasaryan

Transthoracic echocardiography performed with multifrequency probes that support second harmonic imaging is a competitive method for visualizing the main coronary arteries and allows high-quality estimation of coronary blood flow. The method does have considerable limitations, the most important of which is its low spatial resolution, caused by the small acoustic window. For this reason, transthoracic visualization of the coronary arteries will probably not become the leading method for anatomical reconstruction of an individual coronary artery, let alone the entire coronary artery system. However, the unique and indisputable advantage of this method is the ability to estimate coronary blood flow noninvasively, both on a single occasion and over time.


2020 ◽  
Vol 18 (4) ◽  
pp. 739-752 ◽  
Author(s):  
R.M. Sadykov

Subject. This article deals with the issues of social justice and a high quality of life, and with creating favorable economic and social conditions. Objectives. The article aims to assess the rate of and changes in poverty in Russia and the Republic of Bashkortostan and to develop complementary measures to reduce it. Methods. For the study, I used the methods of logical, comparative, economic and statistical analyses, the results of sociological studies, and official statistics. Results. The article highlights additional measures to reduce poverty in the region, including the establishment of a minimum social standard of living in each particular region that determines the poverty rate. Conclusions. Various factors, such as economic sanctions, economic slowdowns, and territorial and regional imbalances, lead to declining living standards and rising poverty.

