Improving voice quality of HMM-based speech synthesis using voice conversion method

Author(s):  
Yishan Jiao ◽  
Xiang Xie ◽  
Xingyu Na ◽  
Ming Tu
2019 ◽  
Vol 78 (23) ◽  
pp. 33549-33572
Author(s):  
Mohammed Salah Al-Radhi ◽  
Tamás Gábor Csapó ◽  
Géza Németh

Abstract: In this paper, a novel vocoder is proposed for a Statistical Voice Conversion (SVC) framework using a deep neural network, where multiple features from the speech of two speakers (source and target) are converted acoustically. Traditional conversion methods focus on the prosodic feature represented by the discontinuous fundamental frequency (F0) and the spectral envelope. Studies have shown that speech analysis/synthesis solutions play an important role in the overall quality of the converted voice. Recently, we have proposed a new continuous vocoder, originally for statistical parametric speech synthesis, in which all parameters are continuous. Therefore, this work introduces a new method that uses a continuous F0 (contF0) in SVC to avoid alignment errors that may occur in voiced and unvoiced segments and can degrade the converted speech. Our contributions are as follows. (1) We integrate into the SVC framework the continuous vocoder, which provides an advanced model of the excitation signal, by converting its contF0, maximum voiced frequency, and spectral features. (2) We show that the feed-forward deep neural network (FF-DNN) using our vocoder yields high quality conversion. (3) We apply a geometric approach to spectral subtraction (GA-SS) in the final stage of the proposed framework, to improve the signal-to-noise ratio of the converted speech. Our experimental results, using two male speakers and one female speaker, show that the converted speech produced by the proposed SVC technique is similar to the target speaker and gives state-of-the-art performance as measured by objective evaluation and subjective listening tests.
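
To make the contrast between a discontinuous and a continuous F0 track concrete, here is a minimal sketch. It assumes that unvoiced frames are marked with 0 and fills them by linear interpolation between neighbouring voiced frames. This is a deliberate simplification for illustration: the abstract does not describe how the vocoder estimates contF0, and the function name `interpolate_f0` is hypothetical.

```python
def interpolate_f0(f0):
    """Return a continuous F0 track from a discontinuous one.

    Assumption (not from the paper): unvoiced frames are stored as 0.
    Gaps between voiced frames are filled by linear interpolation;
    leading and trailing unvoiced frames copy the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        raise ValueError("track contains no voiced frames")

    out = [float(value) for value in f0]

    # Hold the first and last voiced values at the edges of the track.
    for i in range(voiced[0]):
        out[i] = out[voiced[0]]
    for i in range(voiced[-1] + 1, len(out)):
        out[i] = out[voiced[-1]]

    # Interpolate linearly across each interior unvoiced gap.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            out[i] = (1.0 - weight) * out[left] + weight * out[right]
    return out


if __name__ == "__main__":
    # Hypothetical track in Hz; zeros are unvoiced frames.
    print(interpolate_f0([0, 100, 0, 0, 130, 0]))
```

With a track like this, every frame carries an F0 value, so source and target frames can be aligned without a separate voiced/unvoiced decision.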


Biomimetics ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. 12
Author(s):  
Marvin Coto-Jiménez

Statistical parametric speech synthesis based on Hidden Markov Models (HMMs) has been an important technique for the production of artificial voices, due to its ability to produce results with high intelligibility and sophisticated features such as voice conversion and accent modification with a small footprint, particularly for low-resource languages where deep learning-based techniques remain unexplored. Despite this progress, the quality of HMM-based results does not reach that of the predominant approaches, which are based on unit selection of speech segments or on deep learning. One of the proposals to improve the quality of HMM-based speech has been to incorporate postfiltering stages, which aim to increase the quality while preserving the advantages of the process. In this paper, we present a new approach to postfiltering synthesized voices with the application of discriminative postfilters, using several long short-term memory (LSTM) deep neural networks. Our motivation stems from modeling a specific mapping from synthesized to natural speech on those segments corresponding to voiced or unvoiced sounds, due to the different qualities of those sounds and how HMM-based voices can present distinct degradation on each one. The paper analyses the discriminative postfilters obtained using five voices, evaluated using three objective measures, Mel cepstral distance and subjective tests. The results indicate the advantages of the discriminative postfilters in comparison with the HTS voice and the non-discriminative postfilters.
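
As an illustration of the Mel cepstral distance mentioned above, the following sketch uses the commonly used definition: for each pair of time-aligned frames, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), with the 0th (energy) coefficient skipped, averaged over frames. The abstract does not state the exact variant or cepstral order used in the paper, so the function name and the sample frames here are assumptions.

```python
import math


def mel_cepstral_distance(ref_frames, syn_frames):
    """Mean Mel cepstral distance in dB over time-aligned frames.

    Each frame is a sequence of Mel cepstral coefficients. The 0th
    coefficient (energy) is excluded, which is a common convention
    and an assumption here, not a detail taken from the paper.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame lists")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


if __name__ == "__main__":
    # Hypothetical 3-coefficient frames: natural vs. synthesized speech.
    natural = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.2]]
    synthetic = [[5.0, 0.0, 0.0], [1.0, 0.5, 0.2]]
    print(round(mel_cepstral_distance(natural, synthetic), 3))
```

A lower value means the synthesized cepstra are closer to the natural ones, which is how a measure of this kind can rank postfilters against the unfiltered HTS voice.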


2015 ◽  
Vol 58 (3) ◽  
pp. 535-549 ◽  
Author(s):  
Mara R. Kapsner-Smith ◽  
Eric J. Hunter ◽  
Kimberly Kirkham ◽  
Karin Cox ◽  
Ingo R. Titze

Purpose: Although there is a long history of use of semi-occluded vocal tract gestures in voice therapy, including phonation through thin tubes or straws, the efficacy of phonation through tubes has not been established. This study compares results from a therapy program on the basis of phonation through a flow-resistant tube (FRT) with Vocal Function Exercises (VFE), an established set of exercises that utilize oral semi-occlusions. Method: Twenty subjects (16 women, 4 men) with dysphonia and/or vocal fatigue were randomly assigned to 1 of 4 treatment conditions: (a) immediate FRT therapy, (b) immediate VFE therapy, (c) delayed FRT therapy, or (d) delayed VFE therapy. Subjects receiving delayed therapy served as a no-treatment control group. Results: Voice Handicap Index (Jacobson et al., 1997) scores showed significant improvement for both treatment groups relative to the no-treatment group. Comparison of the effect sizes suggests FRT therapy is noninferior to VFE in terms of reduction in Voice Handicap Index scores. Significant reductions in Roughness on the Consensus Auditory-Perceptual Evaluation of Voice (Kempster, Gerratt, Verdolini Abbott, Barkmeier-Kraemer, & Hillman, 2009) were found for the FRT subjects, with no other significant voice quality findings. Conclusions: VFE and FRT therapy may improve voice quality of life in some individuals with dysphonia. FRT therapy was noninferior to VFE in improving voice quality of life in this study.


2002 ◽  
Vol 45 (4) ◽  
pp. 689-699 ◽  
Author(s):  
Donald G. Jamieson ◽  
Vijay Parsa ◽  
Moneca C. Price ◽  
James Till

We investigated how standard speech coders, currently used in modern communication systems, affect the quality of the speech of persons who have common speech and voice disorders. Three standardized speech coders (GSM 6.10 RPELTP, FS1016 CELP, and FS1015 LPC) and two speech coders based on subband processing were evaluated for their performance. Coder effects were assessed by measuring the quality of speech samples both before and after processing by the speech coders. Speech quality was rated by 10 listeners with normal hearing on 28 different scales representing pitch and loudness changes, speech rate, laryngeal and resonatory dysfunction, and coder-induced distortions. Results showed that (a) nine scale items were consistently and reliably rated by the listeners; (b) all coders degraded speech quality on these nine scales, with the GSM and CELP coders providing the better quality speech; and (c) interactions between coders and individual voices did occur on several voice quality scales.


2021 ◽  
Author(s):  
Chelsea Greene ◽  
Jesse Frey ◽  
William Magrogan ◽  
Cara O’Malley ◽  
Jaden Pieper

Logopedija ◽  
2018 ◽  
Vol 8 (1) ◽  
pp. 1-5 ◽  
Author(s):  
Anđela Bučević ◽  
Ana Bonetti ◽  
Luka Bonetti

The aim of this research paper was to examine the voice quality of sports coaches using the objective (acoustic) method. A total of 28 sports coaches (mean age 28.58, SD=5.08), from the City of Zagreb participated in this research. Recordings of the phonation of the vowel /a/ before and after one training session were obtained and analyzed using the PRAAT Program. Mean, minimal and maximal values of fundamental frequency, shimmer, jitter and harmonics-to-noise ratio were observed. The statistical analyses showed no statistically significant difference in acoustic voice quality of male and female coaches before and after the training session, or between male and female coaches. However, intra-individual differences among participants were observed, which may be significant in terms of their potential to affect the quality of their voices in the future.
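
The abstract above reports jitter and shimmer but does not say which variants were measured. As a rough illustration only, the sketch below assumes the commonly cited "local" definitions: the mean absolute difference between consecutive values divided by the mean value, applied to glottal periods for jitter and to peak amplitudes for shimmer. The function name and the sample values are hypothetical and are not data from the study.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values over their mean.

    With glottal periods (seconds) this gives local jitter; with
    per-cycle peak amplitudes it gives local shimmer. Both are
    returned as ratios (multiply by 100 for percent).
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


if __name__ == "__main__":
    # Hypothetical cycle-by-cycle measurements of a sustained /a/.
    periods = [0.010, 0.011, 0.010, 0.011]   # seconds
    amplitudes = [1.0, 0.9, 1.0, 0.9]        # arbitrary linear units
    print("jitter (local):", local_perturbation(periods))
    print("shimmer (local):", local_perturbation(amplitudes))
```

Comparing such values before and after a training session, per speaker, is one way to look at the intra-individual differences the abstract refers to.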

