Voice Conversion from Arbitrary Speakers Based on Deep Neural Networks with Adversarial Learning

Author(s):  
Sou Miyamoto ◽  
Takashi Nose ◽  
Suzunosuke Ito ◽  
Harunori Koike ◽  
Yuya Chiba ◽  
...  

Domain Generalization Using a Mixture of Multiple Latent Domains
2020 ◽  
Vol 34 (07) ◽  
pp. 11749-11756 ◽  
Author(s):  
Toshihiko Matsuura ◽  
Tatsuya Harada

When domains, which represent underlying data distributions, differ between training and testing, deep neural networks suffer a drop in performance. Domain generalization improves performance on unseen target domains by training on multiple source domains. Conventional methods assume that the domain to which each sample belongs is known during training. However, many datasets, such as those collected via web crawling, contain a mixture of multiple latent domains, where the domain of each sample is unknown. This paper introduces domain generalization using a mixture of multiple latent domains as a novel and more realistic scenario, in which we aim to train a domain-generalized model without using domain labels. To address this scenario, we propose a method that iteratively divides samples into latent domains via clustering and trains a domain-invariant feature extractor shared among the divided latent domains via adversarial learning. We assume that the latent domain of an image is reflected in its style and therefore use style features for clustering. Using these features, the proposed method successfully discovers latent domains and achieves domain generalization even when domain labels are not given. Experiments show that the proposed method can train a domain-generalized model without domain labels and that it outperforms conventional domain generalization methods, including those that use domain labels.
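The abstract outlines two mechanisms: discovering pseudo-domains by clustering style features, and adversarial training of a feature extractor shared across the discovered domains. The sketch below illustrates how those pieces could fit together in PyTorch, assuming that style features are channel-wise statistics of early convolutional feature maps and that the adversarial objective uses a gradient-reversal layer; the class names, network sizes, and per-batch clustering are illustrative simplifications, not the authors' implementation.

```python
# Minimal sketch (assumed names, simplified to per-batch clustering): discover
# pseudo-domain labels by clustering style statistics, then train a shared
# feature extractor adversarially against a domain discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class Extractor(nn.Module):
    """Toy feature extractor that also exposes early conv maps as a style source."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.head = nn.Sequential(nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, x):
        conv_maps = self.conv(x)                # style is assumed to live in these maps
        return conv_maps, self.head(conv_maps)  # (style source, task embedding)


def style_stats(conv_maps):
    """Per-sample style features: channel-wise mean and std of conv feature maps."""
    return torch.cat([conv_maps.mean(dim=(2, 3)), conv_maps.std(dim=(2, 3))], dim=1)


def assign_pseudo_domains(style_feats, n_domains=3):
    """Cluster style features into pseudo-domains (no true domain labels used)."""
    labels = KMeans(n_clusters=n_domains, n_init=10).fit_predict(
        style_feats.detach().cpu().numpy())
    return torch.as_tensor(labels, dtype=torch.long)


extractor = Extractor()
task_classifier = nn.Linear(16, 10)     # 10 object classes, for illustration
domain_classifier = nn.Linear(16, 3)    # 3 assumed latent domains


def training_loss(images, labels, lambd=1.0):
    """Task loss plus an adversarial loss against the discovered pseudo-domains."""
    conv_maps, z = extractor(images)
    pseudo_domains = assign_pseudo_domains(style_stats(conv_maps))
    class_loss = F.cross_entropy(task_classifier(z), labels)
    z_rev = GradReverse.apply(z, lambd)  # discriminator learns domains; extractor unlearns them
    domain_loss = F.cross_entropy(domain_classifier(z_rev), pseudo_domains)
    return class_loss + domain_loss
```

In the iterative procedure the abstract describes, pseudo-domain assignments would more plausibly be recomputed over the whole training set at intervals rather than per batch; the sketch only shows how clustered labels can stand in for true domain labels in a standard domain-adversarial objective.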


2020 ◽  
Vol 17 (1) ◽  
pp. 316-321
Author(s):  
V. Naveena ◽  
Susmitha Vekkot ◽  
K. Jeeva Priya

The paper focuses on the use of deep neural networks to convert one person’s voice into another’s, analogous to mimicry. It introduces the underlying neural-network concepts and deploys multi-layer deep neural networks to build a voice conversion framework. Spectral Mel-Frequency Cepstral Coefficients (MFCCs) are converted using a 10-layer deep network, while fundamental frequency (F0) conversion is accomplished by a logarithmic Gaussian normalized transformation. The converted MFCCs are subjected to inverse cepstral filtering, and the F0 changes are incorporated using the Pitch Synchronous Overlap-Add (PSOLA) algorithm for re-synthesis. Results are evaluated objectively with Mel Cepstral Distortion (MCD) and subjectively with an ABX listening test. A maximum MCD improvement of 13.87% is obtained for female-to-male conversion, and the ABX test indicates that female-to-male conversion is closest to the target, with 76.2% agreement. The method achieves reasonably good performance compared to the state of the art while using modest resources and avoiding the need for highly complex computations.
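Of the components listed above, the logarithmic Gaussian normalized F0 transformation and the Mel Cepstral Distortion metric have standard closed forms, sketched below in Python. The DNN spectral mapping and PSOLA re-synthesis are omitted, and the function names and statistics arguments are illustrative rather than taken from the paper; the log-F0 means and standard deviations would be estimated from voiced frames of each speaker's training data.

```python
# Hedged sketch of two standard formulas named in the abstract; function names
# are illustrative and the DNN/PSOLA stages are not included.
import numpy as np


def convert_f0_log_gaussian(f0_src, src_stats, tgt_stats):
    """Log-Gaussian normalized F0 transformation.

    src_stats and tgt_stats are (mean, std) of log F0 over voiced frames of
    the source and target speakers; unvoiced frames (F0 == 0) are left at 0.
    """
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    f0_src = np.asarray(f0_src, dtype=float)
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    f0_conv[voiced] = np.exp(
        (np.log(f0_src[voiced]) - mu_s) * (sigma_t / sigma_s) + mu_t)
    return f0_conv


def mel_cepstral_distortion(mc_target, mc_converted):
    """Frame-averaged MCD in dB between two time-aligned mel-cepstral sequences.

    Both arrays have shape (frames, dims); the 0th (energy) coefficient is
    typically excluded before calling this.
    """
    diff = np.asarray(mc_target) - np.asarray(mc_converted)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)
```

The 13.87% figure quoted in the abstract presumably refers to the relative reduction of such an MCD value for the female-to-male pair, measured against a baseline conversion.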


Author(s):  
Michael Gian V. Gonzales ◽  
Crisron Rudolf G. Lucas ◽  
Michael Gringo Angelo R. Bayona ◽  
Franz A. De Leon
