Talking Face Generation by Conditional Recurrent Adversarial Network

Author(s):  
Yang Song ◽  
Jingwen Zhu ◽  
Dawei Li ◽  
Andy Wang ◽  
Hairong Qi

Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generate a talking face video with accurate lip synchronization. Existing works either do not consider temporal dependency across video frames, thus yielding abrupt facial and lip movement, or are limited to generating talking face videos for a specific person, thus lacking generalization capacity. We propose a novel conditional recurrent generation network that incorporates both image and audio features in the recurrent unit to model temporal dependency. To achieve both image- and video-realism, a pair of spatial-temporal discriminators is included in the network for better image/video quality. Since accurate lip synchronization is essential to the success of talking face video generation, we also construct a lip-reading discriminator to boost the accuracy of lip synchronization. We further extend the network to model the natural pose and expression of the talking face on the Obama Dataset. Extensive experimental results demonstrate the superiority of our framework over state-of-the-art methods in terms of visual quality, lip sync accuracy, and smooth transitions in both lip and facial movement.
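To make the described architecture concrete, the following is a minimal sketch, not the authors' implementation, of a conditional recurrent generator that fuses an identity-image embedding with per-frame audio features inside a GRU cell; all module names, feature sizes, and the 64x64 output resolution are illustrative assumptions.

```python
# Minimal sketch (assumed sizes, not the authors' code) of a conditional recurrent
# generator: image and audio features are fused inside a GRU cell so that each
# frame depends on the previous hidden state, giving smooth temporal transitions.
import torch
import torch.nn as nn

class RecurrentTalkingFaceGenerator(nn.Module):
    def __init__(self, img_feat_dim=256, audio_feat_dim=128, hidden_dim=512):
        super().__init__()
        self.img_encoder = nn.Sequential(                    # encodes the still identity face
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, img_feat_dim))
        self.audio_encoder = nn.GRU(80, audio_feat_dim, batch_first=True)  # e.g. mel frames
        self.rnn = nn.GRUCell(img_feat_dim + audio_feat_dim, hidden_dim)   # temporal dependency
        self.frame_decoder = nn.Sequential(                  # maps hidden state to a coarse frame
            nn.Linear(hidden_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)), nn.Upsample(scale_factor=8),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, face_img, mel_chunks):
        # face_img: (B, 3, H, W); mel_chunks: (B, T, L, 80) audio windows, one per output frame
        B, T = mel_chunks.shape[:2]
        img_feat = self.img_encoder(face_img)
        h = torch.zeros(B, self.rnn.hidden_size, device=face_img.device)
        frames = []
        for t in range(T):
            _, a = self.audio_encoder(mel_chunks[:, t])      # final GRU state: (1, B, audio_feat_dim)
            h = self.rnn(torch.cat([img_feat, a.squeeze(0)], dim=1), h)
            frames.append(self.frame_decoder(h))
        return torch.stack(frames, dim=1)                    # (B, T, 3, 64, 64)
```

In the paper's full setting, the generated sequence would additionally be scored by the spatial-temporal discriminators and the lip-reading discriminator; those losses are omitted from this sketch.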

Author(s):  
Hao Zhu ◽  
Huaibo Huang ◽  
Yi Li ◽  
Aihua Zheng ◽  
Ran He

Talking face generation aims to synthesize a face video with precise lip synchronization and a smooth transition of facial motion over the entire video from a given speech clip and facial image. Most existing methods focus on either disentangling the information in a single image or learning temporal information between frames; however, cross-modality coherence between audio and video information has not been well addressed during synthesis. In this paper, we propose a novel arbitrary talking face generation framework that discovers audio-visual coherence via the proposed Asymmetric Mutual Information Estimator (AMIE). In addition, we propose a Dynamic Attention (DA) block that selectively focuses on the lip area of the input image during the training stage to further enhance lip synchronization. Experimental results on the benchmark LRW and GRID datasets surpass state-of-the-art methods on prevalent metrics, with robust high-resolution synthesis across gender and pose variations.
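The abstract does not spell out AMIE's internals, so the sketch below shows only a generic MINE-style neural lower bound on the mutual information between paired audio and visual embeddings, the kind of estimator an audio-visual coherence loss can be built on; the class name and dimensions are assumptions.

```python
# Hedged sketch: a Donsker-Varadhan (MINE-style) lower bound on I(audio; video),
# illustrative of how audio-visual coherence can be estimated and maximized.
import math
import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    def __init__(self, audio_dim=128, video_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, a, v):
        # a, v: (B, D) paired audio / visual embeddings from the same clips
        joint = self.net(torch.cat([a, v], dim=1)).mean()
        v_shuf = v[torch.randperm(v.size(0), device=v.device)]   # break the pairing
        marginal = torch.logsumexp(self.net(torch.cat([a, v_shuf], dim=1)), dim=0) \
                   - math.log(v.size(0))
        return joint - marginal   # maximize this bound to tighten audio-visual coherence
```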


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Chunyu Li ◽  
Lei Wang

Along with urban renewal and development, the urban living environment has given rise to various problems that need to be solved. With an eye on future development models for residential communities, we carry out an experimental preliminary design of architectural space, public space, and landscape space based on people's actual needs, in an attempt to address the increasingly urgent symbiotic relationship between people and the urban environment. To this end, this paper proposes a planning and design generation framework for the constructed external spatial environment of building groups based on a recursive double-adversarial network model. First, we extract in-depth features of the constructed external spatial environment of the building group and generate an expression feature map, which serves as a supervisory signal to generate an expression seed image of that environment. We then use the generated seed image, together with the constructed external spatial environment of the original target building group, as input to generate a feature-holding image as the output of the current frame; the feature-holding image is also used as the input for the next frame. Finally, the seed-image generation network and the feature-holding-image generation network are applied recursively to generate subsequent frames, yielding a video sequence of the constructed external spatial environment of the building group that preserves the expression features of the original input. Experimental results on the building group database show that the proposed method can generate clear and natural video frames of the constructed external spatial environment, allowing the design to progress gradually from individual building units to the construction of the building group and to extend into the planning and design of the external spatial environment, thereby comprehensively improving the living environment of the urban population and providing a design method and theoretical support for the design of future urban residential communities.
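The recursive generation loop described above can be summarized as follows; `seed_generator`, `frame_generator`, and `feature_extractor` are placeholders standing in for the paper's adversarially trained networks, not the authors' implementation.

```python
# Illustrative sketch of the recursive two-network generation loop: a seed image is
# produced from extracted expression features, a feature-holding frame is generated
# from the seed plus the original input, and that frame seeds the next iteration.
import torch

@torch.no_grad()
def generate_sequence(seed_generator, frame_generator, feature_extractor,
                      target_image, num_frames):
    # target_image: (1, 3, H, W) image of the building group's external spatial environment
    frames, current = [], target_image
    for _ in range(num_frames):
        feat_map = feature_extractor(current)            # supervisory expression feature map
        seed = seed_generator(feat_map)                  # expression seed image
        current = frame_generator(seed, target_image)    # feature-holding image = current frame
        frames.append(current)
    return torch.stack(frames, dim=1)                    # (1, T, 3, H, W) video sequence
```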


2020 ◽  
Vol 9 (3) ◽  
pp. 1015-1023 ◽  
Author(s):  
Muhammad Fuad ◽  
Ferda Ernawan

Steganography is a technique of concealing a message in multimedia data. Multimedia data such as videos are often compressed to reduce storage for limited bandwidth. Video provides additional hiding space in the object motion of image sequences. This research proposes a video steganography scheme based on object motion and DCT-psychovisual effects for concealing the message. The proposed hiding technique embeds a secret message along the object motion of the video frames, and motion analysis is used to determine the embedding regions. The proposed scheme selects six middle-frequency DCT coefficients for hiding the message based on DCT-psychovisual effects. A message is embedded by modifying the middle DCT coefficients using the proposed algorithm. The middle frequencies offer a large hiding capacity while having relatively little effect on video reconstruction. The performance of the proposed video steganography is evaluated in terms of video quality and robustness against MPEG compression. The experimental results show minimal distortion of video quality. Our scheme hides messages robustly against MPEG-4 compression, with an average NC value of 0.94. The proposed video steganography achieves less perceptual distortion to the human eye and remains resistant when video storage is reduced by compression.
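As a rough illustration of the embedding step, the snippet below quantizes six mid-band DCT coefficients of an 8x8 block to encode message bits; the specific coefficient positions and quantization step are assumptions for illustration, since the paper derives its selection from DCT-psychovisual analysis.

```python
# Hedged sketch: hide bits in six mid-frequency DCT coefficients of an 8x8 block
# by forcing the parity of the quantized coefficient. Positions and step size are
# illustrative assumptions, not the paper's exact values.
import numpy as np
from scipy.fftpack import dct, idct

MID_FREQ = [(1, 4), (2, 3), (3, 2), (4, 1), (2, 4), (4, 2)]   # assumed positions
STEP = 16.0                                                   # assumed embedding strength

def dct2(block):  return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
def idct2(block): return idct(idct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def embed_block(block, bits):
    """Embed up to six bits into one 8x8 luminance block."""
    c = dct2(block.astype(np.float64))
    for (u, v), bit in zip(MID_FREQ, bits):
        q = np.round(c[u, v] / STEP)
        if int(q) % 2 != bit:        # force coefficient parity to encode the bit
            q += 1
        c[u, v] = q * STEP
    return idct2(c)

def extract_block(block):
    c = dct2(block.astype(np.float64))
    return [int(np.round(c[u, v] / STEP)) % 2 for u, v in MID_FREQ]
```

In the proposed scheme, the blocks chosen for embedding would be restricted to the motion regions identified by the motion analysis step.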


Electronics ◽  
2019 ◽  
Vol 8 (11) ◽  
pp. 1370 ◽  
Author(s):  
Tingzhu Sun ◽  
Weidong Fang ◽  
Wei Chen ◽  
Yanxin Yao ◽  
Fangming Bi ◽  
...  

Although image inpainting based on generative adversarial networks (GANs) has made great breakthroughs in accuracy and speed in recent years, these methods can only process low-resolution images because of memory limitations and training difficulty. For high-resolution images, the inpainted regions become blurred and unpleasant boundaries become visible. Building on current advanced image generation networks, we propose a novel high-resolution image inpainting method based on a multi-scale neural network. The method is a two-stage network comprising content reconstruction and texture detail restoration. After obtaining a visually believable coarse texture, we further restore finer details to produce a smoother, clearer, and more coherent inpainting result. We then propose a special application scenario of image inpainting: removing redundant pedestrians from an image while ensuring realistic background restoration. It involves pedestrian detection, identification of redundant pedestrians, and filling them in with plausible content. To improve the accuracy of image inpainting in this application scenario, we propose a new mask dataset that collects person instances from the COCO dataset as masks. Finally, we evaluated our method on the COCO and VOC datasets. The experimental results show that our method produces clearer and more coherent inpainting results, especially for high-resolution images, and that the proposed mask dataset yields better inpainting results in the special application scenario.
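The two-stage pipeline can be pictured as the following forward pass, where `coarse_net` and `refine_net` are placeholders for the content-reconstruction and texture-restoration networks; this is a structural sketch under assumed interfaces, not the authors' code.

```python
# Minimal sketch of a coarse-to-fine two-stage inpainting forward pass: stage 1 fills
# the hole at reduced resolution, stage 2 restores texture detail at full resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageInpainter(nn.Module):
    def __init__(self, coarse_net: nn.Module, refine_net: nn.Module):
        super().__init__()
        self.coarse_net = coarse_net      # stage 1: content reconstruction
        self.refine_net = refine_net      # stage 2: texture detail restoration

    def forward(self, image, mask):
        # image: (B, 3, H, W) high-resolution input; mask: (B, 1, H, W), 1 marks the hole
        masked = image * (1 - mask)
        small = F.interpolate(torch.cat([masked, mask], dim=1), scale_factor=0.25)
        coarse = self.coarse_net(small)                           # blurry but plausible content
        coarse_up = F.interpolate(coarse, size=image.shape[-2:],
                                  mode='bilinear', align_corners=False)
        merged = masked + coarse_up * mask                        # paste coarse result into hole
        fine = self.refine_net(torch.cat([merged, mask], dim=1))  # assumed 3-channel output
        return masked + fine * mask                               # keep known pixels untouched
```

For the pedestrian-removal scenario, the mask would come from a pedestrian detector rather than being supplied by the user.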


2021 ◽  
Vol 8 ◽  
Author(s):  
Rodrigo F. Cádiz ◽  
Agustín Macaya ◽  
Manuel Cartagena ◽  
Denis Parra

Deep learning, one of the fastest-growing branches of artificial intelligence, has become one of the most relevant research and development areas of recent years, especially since 2012, when a neural network surpassed the most advanced image classification techniques of the time. This spectacular development has not been alien to the world of the arts, as recent advances in generative networks have made possible the artificial creation of high-quality content such as images, movies, or music. We believe that these novel generative models pose a great challenge to our current understanding of computational creativity. If a robot can now create music that an expert cannot distinguish from music composed by a human, or create novel musical entities that were not known at training time, or exhibit conceptual leaps, does that mean the machine is creative? We believe that the emergence of these generative models clearly signals that much more research needs to be done in this area. We would like to contribute to this debate with two case studies of our own: TimbreNet, a variational auto-encoder network trained to generate audio-based musical chords, and StyleGAN Pianorolls, a generative adversarial network capable of creating short musical excerpts despite being trained on images rather than musical data. We discuss and assess these generative models in terms of their creativity and show that, in practice, they are capable of learning musical concepts that are not obvious from the training data. Based on our current understanding of creativity in robots and machines, we hypothesize that these deep models can, in fact, be considered creative.
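For readers unfamiliar with the model family, the following is a generic variational auto-encoder sketch of the kind TimbreNet belongs to; the architecture and all dimensions are placeholders, not the actual TimbreNet design.

```python
# Hedged illustration: a generic VAE over flattened chord spectrogram frames.
# All dimensions are placeholders; TimbreNet's real architecture is not given here.
import torch
import torch.nn as nn

class ChordVAE(nn.Module):
    def __init__(self, input_dim=1024, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.decoder(z)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return recon, kl   # train with reconstruction loss + beta * kl; sample by decoding z ~ N(0, I)
```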

