Acquisition of Lip-Sync Expressions Using Transfer Learning for Text-to-Speech Emotional Expression Agents

Author(s):  
Shintaro Kondo ◽  
Seiichi Harata ◽  
Takuto Sakuma ◽  
Shohei Kato

Author(s):  
Zolzaya Byambadorj ◽  
Ryota Nishimura ◽  
Altangerel Ayush ◽  
Kengo Ohta ◽  
Norihide Kitaoka

Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
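The training recipe above centres on initialising a target-language spectrogram predictor from a model pretrained on a high-resource language and then fine-tuning it on the small target-language set. The following PyTorch sketch illustrates that transfer step under illustrative assumptions; the model class, phoneme-set sizes, and checkpoint path are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of cross-lingual transfer learning for a low-resource
# spectrogram predictor. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TinySpectrogramPredictor(nn.Module):
    """Stand-in for a Tacotron-style phoneme-to-mel network."""
    def __init__(self, n_phonemes: int, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids):                      # (B, T) int64
        x = self.phoneme_embedding(phoneme_ids)          # (B, T, d_model)
        h, _ = self.encoder(x)
        return self.mel_head(h)                          # (B, T, n_mels)

# 1) Load (or pretrain) a model on the high-resource language.
source_model = TinySpectrogramPredictor(n_phonemes=70)   # e.g. English phoneme set
# source_model.load_state_dict(torch.load("english_pretrained.pt"))  # hypothetical path

# 2) Re-initialise the phoneme embedding table for the target-language inventory,
#    transferring the encoder and mel-head weights as the initialisation.
target_model = TinySpectrogramPredictor(n_phonemes=55)   # hypothetical target phoneme set
target_model.encoder.load_state_dict(source_model.encoder.state_dict())
target_model.mel_head.load_state_dict(source_model.mel_head.state_dict())

# 3) Fine-tune on the small target-language paired data (plus augmented data).
optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)
criterion = nn.L1Loss()
phonemes = torch.randint(0, 55, (4, 120))                # toy batch standing in for real data
target_mels = torch.randn(4, 120, 80)
for step in range(3):
    optimizer.zero_grad()
    loss = criterion(target_model(phonemes), target_mels)
    loss.backward()
    optimizer.step()
```

In practice the fine-tuning batches would also include the augmented text-speech pairs described in approaches (2) and (3).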


Author(s):  
Ishita Satija ◽  
Vina Lomte ◽  
Yash Wani ◽  
Digisha Kaneria ◽  
Shubham Yadav

We describe a neural network-based system for text-to-speech (TTS) synthesis that can generate speech audio in the voices of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network; (2) a sequence-to-sequence synthesis network based on Tacotron 2; and (3) an autoregressive WaveNet-based vocoder network. We demonstrate that the proposed model can transfer the knowledge of speaker variability learned by the discriminatively trained speaker encoder to the multi-speaker TTS task, and can synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voices of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
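The key mechanism in this system is conditioning the Tacotron-2-style synthesizer on a fixed-dimensional embedding produced by the separately trained speaker encoder. The sketch below shows one minimal way such conditioning can be wired up; the class names, dimensions, and concatenation scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: reference audio -> speaker embedding -> conditioned text states.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stand-in for a discriminatively trained speaker-verification encoder."""
    def __init__(self, n_mels: int = 40, d_embed: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, d_embed, batch_first=True)

    def forward(self, ref_mels):                           # (B, T_ref, n_mels)
        _, (h, _) = self.lstm(ref_mels)
        e = h[-1]                                           # (B, d_embed)
        return nn.functional.normalize(e, dim=-1)           # L2-normalised embedding

def condition_on_speaker(text_states, speaker_embedding):
    """Concatenate the speaker embedding onto every text-encoder timestep."""
    B, T, _ = text_states.shape
    tiled = speaker_embedding.unsqueeze(1).expand(B, T, -1)
    return torch.cat([text_states, tiled], dim=-1)          # (B, T, d_text + d_embed)

# Toy usage: unseen-speaker reference audio -> embedding -> conditioned encoder states
speaker_encoder = SpeakerEncoder()
ref_mels = torch.randn(2, 150, 40)                          # reference utterances
text_states = torch.randn(2, 60, 512)                       # synthesizer text-encoder output
embedding = speaker_encoder(ref_mels)
conditioned = condition_on_speaker(text_states, embedding)  # fed to the attention/decoder
```

Because the embedding is computed from reference audio alone, the same pathway supports voices unseen during training, or even randomly sampled embeddings.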


Kybernetes ◽  
2014 ◽  
Vol 43 (8) ◽  
pp. 1165-1182 ◽  
Author(s):  
Ricardo Leandro Parreira Duarte ◽  
Abdennour El Rhalibi ◽  
Madjid Merabti

Purpose – The purpose of this paper is to present a novel coarticulation and speech synchronization framework compliant with MPEG-4 facial animation (FA). Design/methodology/approach – The system the authors have developed uses the MPEG-4 FA standard and related developments to enable the creation, editing and playback of high-resolution 3D models and MPEG-4 animation streams, and is compatible with well-known related systems such as Greta and Xface. It supports text-to-speech for dynamic speech synchronization. The framework enables real-time model simplification using quadric-based surfaces. Findings – The preliminary experiments show that the coarticulation technique the authors have developed gives good and promising results overall when compared to related techniques. Originality/value – The coarticulation approach provides realistic and high-performance lip-sync animation, based on Cohen-Massaro's model of coarticulation adapted to the MPEG-4 FA specification.
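Cohen-Massaro coarticulation represents each phoneme segment with a time-varying dominance function and blends the segments' articulation targets in proportion to their dominance. The small Python sketch below illustrates that blending idea in isolation; the constants, timings, and mouth-opening targets are invented for illustration and are not the paper's MPEG-4 FAP parameterisation.

```python
# Dominance-weighted blending of per-phoneme articulation targets
# (the core idea of Cohen-Massaro coarticulation). Values are illustrative only.
import math

def dominance(t, center, alpha=1.0, theta=4.0, c=1.0):
    """Exponential dominance of a segment centred at `center`, evaluated at time t."""
    return alpha * math.exp(-theta * abs(t - center) ** c)

def blended_parameter(t, segments):
    """Dominance-weighted blend of segment targets at time t.

    `segments` is a list of (center_time, target_value) pairs, e.g. a
    mouth-opening target per phoneme.
    """
    weights = [dominance(t, center) for center, _ in segments]
    total = sum(weights)
    return sum(w * target for w, (_, target) in zip(weights, segments)) / total

# Toy example: /m/ (closed lips), then /a/ (open), then /u/ (rounded, nearly closed)
segments = [(0.10, 0.05), (0.25, 0.90), (0.40, 0.20)]
for t in (0.10, 0.18, 0.25, 0.33, 0.40):
    print(f"t={t:.2f}s  mouth-opening={blended_parameter(t, segments):.2f}")
```

In an MPEG-4 FA pipeline, the blended value at each frame would drive the corresponding facial animation parameters rather than a single scalar.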

