Stochastic Talking Face Generation Using Latent Distribution Matching

Author(s): Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Author(s): Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang

Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because the appearance variation of the face and the semantics of the speech are coupled together in the subtle movements of the talking face regions. Existing works either construct a specific face appearance model for specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning a disentangled audio-visual representation. We find that the talking face sequence is actually a composition of subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. An advantage of this disentangled representation is that both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate that the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.
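
To make the disentanglement idea concrete, here is a minimal, hypothetical PyTorch sketch of the general recipe the abstract describes: a subject (identity) encoder and a speech-content encoder feed a shared decoder, while an adversarial classifier penalizes identity information leaking into the speech code. All module names, dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of audio-visual disentanglement with an
# adversarial classifier (illustrative; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a flattened face frame to a latent code."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a frame from concatenated identity and speech codes."""
    def __init__(self, latent_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, z_id, z_speech):
        return self.net(torch.cat([z_id, z_speech], dim=-1))

frame_dim, latent_dim, num_subjects = 64 * 64, 128, 100
id_enc = Encoder(frame_dim, latent_dim)       # subject-related space
speech_enc = Encoder(frame_dim, latent_dim)   # speech-related space
decoder = Decoder(latent_dim, frame_dim)
adversary = nn.Linear(latent_dim, num_subjects)  # guesses identity from the speech code

frames = torch.randn(8, frame_dim)               # toy batch of face frames
subject = torch.randint(0, num_subjects, (8,))   # toy identity labels

z_id, z_sp = id_enc(frames), speech_enc(frames)
recon_loss = F.mse_loss(decoder(z_id, z_sp), frames)

# Encoder step: reconstruct well while FOOLING the adversary, so that
# identity information is pushed out of the speech code. (A full loop
# would alternate this with an adversary step that minimizes adv_loss.)
adv_loss = F.cross_entropy(adversary(z_sp), subject)
encoder_objective = recon_loss - 0.1 * adv_loss
print(float(encoder_objective))
```

In practice such schemes operate on convolutional features rather than flattened frames and alternate the adversary and encoder updates, but the division of labor between the two latent spaces is as sketched.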


Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning

Author(s): Hao Zhu, Huaibo Huang, Yi Li, Aihua Zheng, Ran He

Talking face generation aims to synthesize a face video with precise lip synchronization and a smooth transition of facial motion over the entire video, given a speech clip and a facial image. Most existing methods focus on either disentangling the information in a single image or learning temporal information between frames; however, cross-modality coherence between the audio and video has not been well addressed during synthesis. In this paper, we propose a novel arbitrary talking face generation framework that discovers audio-visual coherence via the proposed Asymmetric Mutual Information Estimator (AMIE). In addition, we propose a Dynamic Attention (DA) block that selectively focuses on the lip area of the input image during training to further enhance lip synchronization. Experimental results on the benchmark LRW and GRID datasets surpass state-of-the-art methods on prevalent metrics, with robust high-resolution synthesis across gender and pose variations.
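
As a rough illustration of the coherence objective, the sketch below estimates a lower bound on the mutual information between paired audio and video embeddings with a generic MINE-style statistics network (the Donsker-Varadhan bound), scoring true pairs against within-batch shuffles. This is a hedged stand-in: the network layout and dimensions are assumptions, and neither the paper's asymmetric estimator nor its Dynamic Attention block is reproduced here.

```python
# Hypothetical MINE-style mutual-information lower bound between audio
# and video embeddings (illustrative; not the paper's AMIE).
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Scores an (audio, video) embedding pair; higher for true pairs."""
    def __init__(self, audio_dim, video_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, a, v):
        return self.net(torch.cat([a, v], dim=-1))

def mi_lower_bound(stat_net, audio, video):
    """Donsker-Varadhan bound: E[T(a, v)] - log E[exp(T(a, v'))],
    where v' is video shuffled within the batch (marginal samples)."""
    joint = stat_net(audio, video).mean()
    shuffled = video[torch.randperm(video.size(0))]
    scores = stat_net(audio, shuffled).squeeze(-1)
    marginal = torch.logsumexp(scores, dim=0) - math.log(video.size(0))
    return joint - marginal

audio_dim, video_dim, batch = 64, 64, 16
stat_net = StatisticsNetwork(audio_dim, video_dim)
a = torch.randn(batch, audio_dim)   # toy audio embeddings
v = torch.randn(batch, video_dim)   # toy video embeddings, paired with `a`

# Maximizing this bound (w.r.t. both the statistics network and the
# upstream encoders) rewards embeddings whose pairing is informative,
# i.e. audio-visual coherence.
print(float(mi_lower_bound(stat_net, a, v)))
```

Maximizing such a bound during training pushes the generator toward audio-visual pairs whose correspondence is detectable, which is the intuition behind using mutual information as a synchronization signal.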

