Extracting Dynamic Facial Expressions from Naturalistic Videos
Facial expressions play a critical role in our daily interactions. Studying how humans recognize dynamic facial expressions is an important area of research in social perception, but progress is hampered by the difficulty of creating well-controlled stimuli. Research on the perception of static faces has advanced considerably thanks to techniques for generating synthetic face stimuli. Synthetic dynamic expressions, however, are harder to produce: methods that yield realistic dynamics typically rely on infrared markers applied to the face, which makes it expensive to create datasets covering large numbers of different expressions, and the markers themselves may interfere with facial dynamics. In this paper, we contribute a new method for generating large numbers of realistic, well-controlled facial expression videos. We use a deep convolutional neural network with attention and an asymmetric loss to extract the dynamics of facial action units from videos, and demonstrate that this approach outperforms a baseline convolutional neural network without attention on the same stimuli. We then develop a pipeline that uses the extracted action unit dynamics to render realistic synthetic videos. This pipeline makes it possible to generate large-scale, naturalistic, and controllable facial expression datasets to facilitate future research in social cognitive science.
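The abstract does not spell out the asymmetric loss, but the motivation is a standard one: action unit labels are heavily imbalanced, since most AUs are inactive in most frames, and a symmetric binary cross-entropy lets easy negatives dominate training. Below is a minimal sketch of one well-known asymmetric formulation (Ben-Baruch et al.'s asymmetric loss for multi-label classification), written in PyTorch; the class name, hyperparameter defaults, and the toy AU data are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn as nn


class AsymmetricLoss(nn.Module):
    """Asymmetric multi-label loss: focuses harder on negatives than positives.

    With gamma_neg > gamma_pos, well-classified negatives (the common case for
    sparse AU labels) are strongly down-weighted; `clip` additionally shifts
    negative probabilities so very easy negatives contribute ~zero loss.
    """

    def __init__(self, gamma_pos=0.0, gamma_neg=4.0, clip=0.05, eps=1e-8):
        super().__init__()
        self.gamma_pos = gamma_pos  # focusing exponent for positive labels
        self.gamma_neg = gamma_neg  # stronger focusing exponent for negatives
        self.clip = clip            # probability margin for negatives
        self.eps = eps              # numerical floor inside log()

    def forward(self, logits, targets):
        # logits, targets: (batch, num_AUs); targets are binary {0, 1}
        p = torch.sigmoid(logits)
        # Probability shifting: treat negatives with p < clip as fully solved
        p_neg = (p - self.clip).clamp(min=0.0)

        loss_pos = targets * (1 - p) ** self.gamma_pos \
            * torch.log(p.clamp(min=self.eps))
        loss_neg = (1 - targets) * p_neg ** self.gamma_neg \
            * torch.log((1 - p_neg).clamp(min=self.eps))
        return -(loss_pos + loss_neg).mean()


# Toy check: 8 frames, 12 action units, sparse positives as with typical AU data
logits = torch.randn(8, 12)
targets = (torch.rand(8, 12) > 0.8).float()
loss = AsymmetricLoss()(logits, targets)
print(loss.item())
```

In a training loop, `logits` would come from the attention-based convolutional network described above; the asymmetric weighting keeps the gradient signal from being swamped by the many frames in which a given action unit is absent.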