Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance

Author(s): Songxiang Liu, Jinghua Zhong, Lifa Sun, Xixin Wu, Xunying Liu, ...

2021, Vol. 11 (16), pp. 7489
Author(s): Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

Voice conversion (VC) transforms the speaking style of a source speaker into that of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences: earlier approaches mainly learn a mapping between a given source–target speaker pair from pairs of similar utterances spoken by the two speakers. However, parallel data are expensive and difficult to collect, and non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that performs non-parallel many-to-many voice conversion using a generative adversarial network (GAN). To the best of the authors’ knowledge, our study is the first to employ a sinusoidal model with continuous parameters to generate converted speech signals. Our method requires only a few minutes of training examples, with no parallel utterances or time-alignment procedures, and the source and target speakers can be entirely unseen in the training dataset. Moreover, an empirical study is carried out on the publicly available CSTR VCTK corpus. Our conclusions indicate that the proposed method achieves state-of-the-art results in speaker similarity to utterances produced by the target speaker, while pointing to structural aspects that warrant further analysis by experts.
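The abstract describes the system only at a high level, but the adversarial training loop at the heart of such non-parallel, many-to-many VC can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the network shapes, the speaker-embedding conditioning, and the plain BCE adversarial losses are all assumptions, and the continuous sinusoidal vocoder parameters are replaced by random stand-in tensors.

```python
# Minimal sketch of one adversarial training step for non-parallel VC.
# All module names, dimensions, and losses are illustrative assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 80   # per-frame acoustic feature dimension (assumed)
SPK_DIM = 64    # speaker embedding dimension (assumed)

class Generator(nn.Module):
    """Maps source frames + target-speaker embedding to converted frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + SPK_DIM, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM))
    def forward(self, x, spk):
        return self.net(torch.cat([x, spk], dim=-1))

class Discriminator(nn.Module):
    """Scores whether a frame is a real target-speaker frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))
    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Stand-ins for real data: unpaired source and target frames.
src = torch.randn(32, FEAT_DIM)
tgt = torch.randn(32, FEAT_DIM)
tgt_spk = torch.randn(32, SPK_DIM)

# Discriminator step: real target frames vs. converted frames.
fake = G(src, tgt_spk).detach()
loss_d = bce(D(tgt), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to fool the discriminator.
loss_g = bce(D(G(src, tgt_spk)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

A full system would add cycle-consistency or identity losses so that linguistic content survives conversion; only the adversarial core is shown here.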


2019, Vol. 10 (1), pp. 151
Author(s): Xiaokong Miao, Meng Sun, Xiongwei Zhang, Yimin Wang

This paper presents a noise-robust voice conversion method with high-quefrency boosting, based on sub-band cepstrum conversion and fusion with bidirectional long short-term memory (BLSTM) neural networks, which converts the vocal tract parameters of a source speaker into those of a target speaker. With state-of-the-art machine learning methods, voice conversion has achieved good performance given abundant clean training data. However, the quality and similarity of the converted voice are significantly degraded compared to those of a natural target voice due to factors such as limited training data and noisy input speech from the source speaker. To address the problem of noisy input speech, a voice conversion architecture with statistical filtering and sub-band cepstrum conversion and fusion is introduced. The impact of noise on the converted voice is reduced by the accurate reconstruction of the sub-band cepstrum and the subsequent statistical filtering. By normalizing the mean and variance of the converted cepstrum to those of the target cepstrum in the training phase, a cepstrum filter is constructed to further improve the quality of the converted voice. The experimental results show that the proposed method significantly improves the naturalness and similarity of the converted voice compared to the baselines, even with noisy inputs from source speakers.
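The mean-variance matching behind the statistical filtering, together with a BLSTM conversion model, can be sketched as follows. The sub-band split and fusion stage is omitted, and all layer sizes, dimensions, and data are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: BLSTM cepstrum conversion plus a mean-variance
# "statistical filter" that matches converted cepstra to target statistics.
import numpy as np
import torch
import torch.nn as nn

class BLSTMConverter(nn.Module):
    """Bidirectional LSTM mapping source cepstrum sequences to target cepstra."""
    def __init__(self, dim=40, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, dim)
    def forward(self, x):          # x: (batch, frames, dim)
        h, _ = self.blstm(x)
        return self.proj(h)

def statistical_filter(converted, target_mean, target_std, eps=1e-8):
    """Normalize per-dimension mean/variance of converted cepstra
    to the statistics of the target speaker's training cepstra."""
    mu = converted.mean(axis=0)
    sigma = converted.std(axis=0)
    return (converted - mu) / (sigma + eps) * target_std + target_mean

# Usage with stand-in data:
model = BLSTMConverter()
src = torch.randn(1, 200, 40)                    # source cepstrum sequence
conv = model(src).squeeze(0).detach().numpy()    # converted cepstra
tgt_train = np.random.randn(5000, 40)            # target training cepstra
filtered = statistical_filter(conv, tgt_train.mean(axis=0),
                              tgt_train.std(axis=0))
```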


Author(s): Fangkun Liu, Hui Wang, Renhua Peng, Chengshi Zheng, Xiaodong Li

Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. Recently, one-shot voice conversion has become a hot topic because of its potentially wide range of applications: it can convert the voice of any source speaker to that of any other target speaker, even when both the source and target speakers are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve this naturalness, this paper proposes a two-level nested U-structure (U2-Net) voice conversion algorithm called U2-VC. The U2-Net extracts both local and multi-scale features of the log-mel spectrogram, which helps to learn the time-frequency structures of the source and target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, which retains more content information of the source speech while maintaining speaker similarity between the converted and target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art approaches, including AGAIN-VC and AdaIN-VC, in terms of both objective and subjective measurements.
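The building block underlying SaAdaIN is adaptive instance normalization (AdaIN), which can be sketched as follows. This is plain AdaIN under assumed layer names and dimensions, not the paper's exact sandwich variant: content features are normalized per channel to strip source-speaker statistics, then rescaled with statistics predicted from the target-speaker embedding.

```python
# Minimal sketch of adaptive instance normalization (AdaIN), the operation
# that SaAdaIN builds on. Layer names and sizes are assumptions.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, channels, spk_dim):
        super().__init__()
        self.to_scale = nn.Linear(spk_dim, channels)
        self.to_shift = nn.Linear(spk_dim, channels)

    def forward(self, content, spk):
        # content: (batch, channels, frames); spk: (batch, spk_dim)
        mu = content.mean(dim=2, keepdim=True)
        sigma = content.std(dim=2, keepdim=True) + 1e-8
        normed = (content - mu) / sigma           # remove source speaker stats
        scale = self.to_scale(spk).unsqueeze(-1)  # inject target speaker stats
        shift = self.to_shift(spk).unsqueeze(-1)
        return scale * normed + shift

adain = AdaIN(channels=80, spk_dim=64)
out = adain(torch.randn(2, 80, 100), torch.randn(2, 64))  # (2, 80, 100)
```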


2020, Vol. 5 (3), pp. 229-233
Author(s): Olaide Ayodeji Agbolade

This research presents a neural-network-based voice conversion model. While it is known that voiced sounds and prosody are the most important components of the voice conversion framework, their objective contributions, particularly in a noisy and uncontrolled environment, are not. This model uses a three-layer feedforward neural network to map the linear prediction analysis coefficients of a source speaker to the acoustic vector space of the target speaker, with a view to objectively determining the contributions of the voiced, unvoiced, and supra-segmental components of sounds to the voice conversion model. Results showed that the vowels “a”, “i”, and “o” make the most significant contributions to conversion success. The voiceless sounds were also found to be the most affected by noisy training data. An average noise level of 40 dB above the noise floor was found to degrade voice conversion success by 55.14 percent relative to the voiced sounds. The results also show that for cross-gender voice conversion, prosody conversion is more significant in scenarios where a female is the target speaker.
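A three-layer feedforward mapping of this kind can be sketched as follows. The LPC order, layer sizes, activations, and training loop are illustrative assumptions, and time-aligned source/target LPC frames are replaced by random stand-ins.

```python
# Minimal sketch of a three-layer feedforward network mapping source LPC
# vectors to target LPC vectors. All hyperparameters are assumptions.
import torch
import torch.nn as nn

LPC_ORDER = 16  # assumed linear prediction analysis order

model = nn.Sequential(
    nn.Linear(LPC_ORDER, 64), nn.Tanh(),   # hidden layer 1
    nn.Linear(64, 64), nn.Tanh(),          # hidden layer 2
    nn.Linear(64, LPC_ORDER))              # output: target LPC vector

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-ins for time-aligned source/target LPC frames.
src_lpc = torch.randn(1024, LPC_ORDER)
tgt_lpc = torch.randn(1024, LPC_ORDER)

for _ in range(100):  # a few training iterations
    opt.zero_grad()
    loss = loss_fn(model(src_lpc), tgt_lpc)
    loss.backward()
    opt.step()
```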


2018, Vol. 2018, pp. 1-14
Author(s): XueTing Wang, Cong Jin, Wei Zhao

Speech synthesis is an important research topic in the field of human-computer interaction and has a wide range of applications. As one of its branches, singing synthesis plays an important role. Beijing Opera is a famous traditional Chinese opera, often called the quintessence of China. The singing of Beijing Opera carries some features of speech, but it has its own unique pronunciation rules and rhythms that differ from ordinary speech and singing. In this paper, we propose three models for the synthesis of Beijing Opera. First, the speech signals of the source and target speakers are analyzed using the STRAIGHT algorithm. Then, by training a Gaussian mixture model (GMM), we build a voice control model that takes the voice to be converted as input and outputs the converted voice. Finally, by modeling the fundamental frequency, duration, and frequency separately, a melodic control model is constructed using a GAN to synthesize Beijing Opera fragments. We connect the fragments and superimpose the background music to achieve the synthesis of Beijing Opera. The experimental results show that the synthesized Beijing Opera is reasonably audible and can essentially complete the composition of Beijing Opera. We also extend our models to human-AI cooperative music generation: given a target human voice, we can generate Beijing Opera sung in that new target voice.
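The GMM-based voice control step follows the classic joint-density conversion recipe, which can be sketched as follows. Feature extraction (e.g., STRAIGHT spectral parameters) is omitted, and the dimensions, component count, and data are stand-ins rather than the paper's configuration.

```python
# Minimal sketch of classic GMM-based spectral conversion: fit a GMM on
# joint [source; target] feature vectors, then convert a source frame via
# the posterior-weighted conditional mean E[y | x].
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

D = 24                                       # per-speaker feature dimension
X = np.random.randn(2000, D)                 # aligned source frames (stand-in)
Y = X @ np.random.randn(D, D) * 0.1 + 0.5    # aligned target frames (stand-in)

gmm = GaussianMixture(n_components=8, covariance_type='full', random_state=0)
gmm.fit(np.hstack([X, Y]))                   # joint density over [x; y]

def convert(x):
    """Map one source frame x (shape (D,)) to E[y | x] under the joint GMM."""
    post = np.zeros(gmm.n_components)
    cond = np.zeros((gmm.n_components, D))
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :D], gmm.means_[k, D:]
        S = gmm.covariances_[k]
        Sxx, Syx = S[:D, :D], S[D:, :D]
        # p(k | x) up to normalization, and E[y | x, k]
        post[k] = gmm.weights_[k] * multivariate_normal.pdf(x, mu_x, Sxx)
        cond[k] = mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x)
    post /= post.sum()
    return post @ cond                       # sum_k p(k|x) * E[y | x, k]

converted = convert(X[0])
```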


Author(s): S.K. Streiffer, C.B. Eom, J.C. Bravman, T.H. Geballe

The study of very thin (<15 nm) YBa2Cu3O7−δ (YBCO) films is necessary both for investigating the nucleation and growth of films of this material and for achieving a better understanding of multilayer structures incorporating such thin YBCO regions. We have used transmission electron microscopy to examine ultra-thin films grown on MgO substrates by single-target, off-axis magnetron sputtering; details of the deposition process have been reported elsewhere. Briefly, polished MgO substrates were attached to a block placed at 90° to the sputtering target and heated to 650 °C. The sputtering was performed in 10 mtorr oxygen and 40 mtorr argon with an rf power of 125 watts. After deposition, the chamber was vented to 500 torr oxygen and allowed to cool to room temperature. Because of YBCO’s susceptibility to environmental degradation and oxygen loss, the technique of Xi et al. was followed and a protective overlayer of amorphous YBCO was deposited on the just-grown films.


2011
Author(s): Axel Zinkernagel, Wilhelm Hofmann, Friederike X. R. Dislich, Tobias Gschwendner, Manfred Schmitt
