Experiment with Evaluation of Quality of the Synthetic Speech by the GMM Classifier

Author(s):  
Jiří Přibil ◽  
Anna Přibilová ◽  
Jindřich Matoušek
2014 ◽  
Vol 8 (14) ◽  
pp. 1691-1694
Author(s):  
Lau Chee Yong ◽  
Tan Tian Swee ◽  
Mohd Nizam Mazenan

2014 ◽  
Vol 1030-1032 ◽  
pp. 1638-1641
Author(s):  
Yan Ming ◽  
Li Zhen Wang ◽  
Xu Jiu Xia

This paper presents a 4 kbps vocoder based on mixed-excitation linear prediction (MELP). It uses parametric encoding and mixed excitation to preserve speech quality. By adopting scalar quantization of the line spectrum frequencies (LSF), the algorithm reduces storage and computational complexity. The vocoder also adds a new frame type, the transition frame, and the resulting classifier reduces voiced/unvoiced (U/V) decision errors and avoids excessive switching between voiced and unvoiced frames. A modified bit allocation table is introduced, and PESQ-MOS and coding-time tests show that the synthetic speech quality improves and reaches communication quality.
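The transition-frame idea can be illustrated with a small classifier. This is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes each frame already carries a voicing strength in [0, 1] (for example, a normalized autocorrelation peak), and the thresholds `lo=0.4` and `hi=0.7`, the `FrameType` enum, and the rule that turns a direct voiced/unvoiced jump into a transition frame are all hypothetical.

```python
from enum import Enum


class FrameType(Enum):
    """Hypothetical three-way frame classes (voiced, unvoiced, plus a transition frame)."""
    UNVOICED = 0
    TRANSITION = 1
    VOICED = 2


def classify_frames(voicing, lo=0.4, hi=0.7):
    """Label frames from per-frame voicing strengths in [0, 1].

    Strengths at or above `hi` are voiced, at or below `lo` unvoiced, and
    anything in between is a transition frame. `lo` and `hi` are assumed values.
    """
    abrupt = {FrameType.VOICED, FrameType.UNVOICED}
    types = []
    for v in voicing:
        if v >= hi:
            t = FrameType.VOICED
        elif v <= lo:
            t = FrameType.UNVOICED
        else:
            t = FrameType.TRANSITION
        # Soften a direct voiced<->unvoiced jump by emitting a transition frame.
        if types and {types[-1], t} == abrupt:
            t = FrameType.TRANSITION
        types.append(t)
    return types


def count_direct_switches(types):
    """Count adjacent frame pairs that jump straight between voiced and unvoiced."""
    abrupt = {FrameType.VOICED, FrameType.UNVOICED}
    return sum(1 for a, b in zip(types, types[1:]) if {a, b} == abrupt)
```

With voicing strengths `[0.9, 0.9, 0.1, 0.1, 0.9]` the sketch returns voiced, voiced, transition, unvoiced, transition, so no frame switches directly between voiced and unvoiced, whereas a single binary threshold would switch twice on the same input.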


2020 ◽  
Vol 11 (1) ◽  
pp. 2
Author(s):  
Jiří Přibil ◽  
Anna Přibilová ◽  
Jindřich Matoušek

The paper describes a system for the automatic evaluation of synthetic speech quality based on a Gaussian mixture model (GMM) classifier. Speech material from a real speaker is compared with synthesized material to determine the similarities and differences between them. The final evaluation order is determined by the distances in the Pleasure-Arousal (P-A) space between the original speech and the synthetic speech produced by different synthesis and/or prosody manipulation methods implemented in the Czech text-to-speech system. The GMMs for continuous 2D detection of P-A classes are trained on sound/speech material from databases that have no relation to the original speech or the synthesized sentences. Preliminary and auxiliary analyses show that the P-A classification process and the final evaluation result are substantially influenced by the number of mixtures, the number and type of speech features used, the size of the processed speech material, and the type of database used to create the GMMs. The main evaluation experiments confirm the functionality of the developed system. The objective evaluation results largely correlate with the subjective ratings of human evaluators; however, some differences were observed, so a more detailed investigation is needed.
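The classification-and-distance step can be sketched as follows. This is a simplified illustration, not the authors' system: it assumes diagonal-covariance GMMs with one model per P-A class, assigns each utterance to the class with the highest summed frame log-likelihood, and uses a hypothetical table `class_coords` giving the (pleasure, arousal) coordinates of each class. Synthesis methods are then ranked by the Euclidean distance between their mean P-A position and that of the original speech. All names here are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class DiagGMM:
    """Gaussian mixture with diagonal covariances (parameters assumed already trained)."""
    weights: list
    means: list
    variances: list

    def log_likelihood(self, x):
        """log p(x) = log sum_k w_k N(x; mu_k, diag(var_k)), computed with log-sum-exp."""
        comps = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                ll -= 0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            comps.append(ll)
        top = max(comps)
        return top + math.log(sum(math.exp(c - top) for c in comps))


def classify(models, frames):
    """Return the P-A class whose GMM gives the highest summed frame log-likelihood."""
    scores = {label: sum(g.log_likelihood(f) for f in frames)
              for label, g in models.items()}
    return max(scores, key=scores.get)


def pa_position(models, class_coords, utterances):
    """Mean (pleasure, arousal) point of the classes assigned to a set of utterances."""
    points = [class_coords[classify(models, u)] for u in utterances]
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def rank_methods(models, class_coords, original, synthetic):
    """Order synthesis methods by P-A distance to the original speech (closest first)."""
    ref = pa_position(models, class_coords, original)
    dist = {name: math.dist(ref, pa_position(models, class_coords, utts))
            for name, utts in synthetic.items()}
    return sorted(dist, key=dist.get)
```

Here an utterance is a list of feature frames and `synthetic` maps a method name to its utterances. Training the per-class GMMs (typically by expectation-maximization on the reference databases) is outside this sketch; their parameters are taken as given.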


Author(s):  
Marvin Coto-Jiménez

Several researchers have explored deep-learning-based post-filters to improve the quality of statistical parametric speech synthesis. These post-filters map synthetic speech to natural speech, treating the different parameters separately and trying to reduce the gap between them. Long short-term memory (LSTM) neural networks have been applied successfully for this purpose, but both the results and the process itself leave room for improvement. In this paper, we introduce a new pre-training approach for the LSTM that aims to enhance the quality of the synthesized speech, particularly its spectrum, more efficiently. Our approach begins with the auto-associative training of one LSTM network, which is then used to initialize the post-filters. We show the advantages of this initialization for enhancing the mel-frequency cepstral parameters of synthetic speech. The results show that, in most cases, this initialization enhances the statistical parametric speech spectrum better than the common approach of initializing the networks randomly.
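The two-stage training schedule described above can be sketched in a few lines. To stay self-contained, this sketch swaps the LSTM for a plain linear mapping trained by per-sample gradient descent, so it shows only the schedule, not the network: a model is first trained auto-associatively (input equals target), and its weights then initialize the post-filter that maps synthetic frames to natural frames. The frame data, learning rate, epoch count, and which frames feed the auto-associative stage are all assumptions.

```python
import random


def predict(w, x):
    """Apply the linear post-filter w (a d x d list of rows) to one frame x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mse(w, pairs):
    """Mean squared error of w over (input, target) frame pairs."""
    errors = [(p - t) ** 2 for x, y in pairs for p, t in zip(predict(w, x), y)]
    return sum(errors) / len(errors)


def sgd(w, pairs, lr=0.05, epochs=200):
    """Per-sample gradient descent on the squared error; updates w in place."""
    for _ in range(epochs):
        for x, y in pairs:
            pred = predict(w, x)
            for i, row in enumerate(w):
                err = pred[i] - y[i]
                for j, xj in enumerate(x):
                    row[j] -= lr * err * xj
    return w


def pretrain_autoassociative(frames, seed=0):
    """Stage 1: start from random weights and learn to reproduce the input frames."""
    rng = random.Random(seed)
    d = len(frames[0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
    return sgd(w, [(f, f) for f in frames])


def fine_tune(w_init, synthetic, natural):
    """Stage 2: copy the pre-trained weights and train them to map synthetic -> natural."""
    w = [row[:] for row in w_init]
    return sgd(w, list(zip(synthetic, natural)))
```

The only difference from the usual random-initialization baseline is where stage 2 starts: from the auto-associatively trained weights rather than from fresh random values.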


Author(s):  
Alexander L. Francis ◽  
Howard C. Nusbaum

Biomimetics ◽  
2019 ◽  
Vol 4 (2) ◽  
pp. 39 ◽  
Author(s):  
Marvin Coto-Jiménez



2013 ◽  
Vol 401-403 ◽  
pp. 1282-1286
Author(s):  
Qiang Li ◽  
Li Zhen Wang ◽  
Xu Jiu Xia

This paper presents a low-complexity 3.6 kb/s speech coding algorithm based on mixed excitation. It uses parametric encoding and mixed excitation to preserve speech quality. By adopting scalar quantization of the line spectrum frequencies (LSF), the algorithm reduces storage and computational complexity. In addition, an improved frame type with dynamic unvoiced/voiced (U/V) thresholds reduces both the conventional U/V decision errors and abrupt switching between unvoiced and voiced frames. A modified bit allocation table is introduced, and PESQ-MOS tests show that the synthetic speech quality improves and reaches communication quality, especially for high-pitched female speakers when the new frame type is used.
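Scalar quantization of the LSFs codes each coefficient independently with its own small uniform quantizer, which avoids the codebook storage and search of vector quantization. The sketch below is illustrative only: the per-coefficient ranges, the bit allocation, the mid-point reconstruction, and the 50 Hz minimum spacing used to keep the decoded LSFs in ascending order are assumptions, not the paper's bit allocation table.

```python
def quantize_lsf(lsf, ranges, bits):
    """Return one index per LSF using an independent uniform quantizer for each."""
    indices = []
    for f, (lo, hi), b in zip(lsf, ranges, bits):
        levels = 1 << b                              # 2**b reconstruction levels
        step = (hi - lo) / levels
        i = int((f - lo) // step)
        indices.append(min(max(i, 0), levels - 1))   # clip out-of-range values
    return indices


def dequantize_lsf(indices, ranges, bits, min_gap=50.0):
    """Rebuild LSFs (Hz) from indices, then enforce ascending order with a minimum gap."""
    out = [lo + (i + 0.5) * (hi - lo) / (1 << b)     # mid-point of each cell
           for i, (lo, hi), b in zip(indices, ranges, bits)]
    for k in range(1, len(out)):
        out[k] = max(out[k], out[k - 1] + min_gap)
    return out


if __name__ == "__main__":
    ranges = [(100, 900), (300, 1500), (700, 2300), (1500, 3500)]  # hypothetical, Hz
    bits = [3, 3, 4, 4]                                            # 14 bits in total
    idx = quantize_lsf([400, 900, 1500, 2500], ranges, bits)
    print(idx)                                  # [3, 4, 8, 8]
    print(dequantize_lsf(idx, ranges, bits))    # [450.0, 975.0, 1550.0, 2562.5]
```

Each reconstructed value lies within half a quantizer step of its input, and changing the entries of `bits` is the knob a bit allocation table would control.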

