Mode Variational LSTM Robust to Unseen Modes of Variation: Application to Facial Expression Recognition

Author(s):  
Wissam J. Baddar ◽  
Yong Man Ro

Spatio-temporal feature encoding is essential for capturing the dynamics in video sequences. Recurrent neural networks, particularly long short-term memory (LSTM) units, have been popular as an efficient tool for encoding spatio-temporal features in sequences. In this work, we investigate the effect of mode variations on the spatio-temporal features encoded by LSTMs. We show that the LSTM retains information related to the mode variation in the sequence, which is irrelevant to the task at hand (e.g., classifying facial expressions). In fact, the LSTM forget mechanism is not robust to mode variations and preserves information that can degrade the encoded spatio-temporal features. We propose the mode variational LSTM to encode spatio-temporal features that are robust to unseen modes of variation. The mode variational LSTM modifies the original LSTM structure by adding a cell state that focuses on encoding the mode variation in the input sequence. To regulate which features are stored in this additional cell state, additional gating functionality is introduced. The effectiveness of the proposed mode variational LSTM is verified on the facial expression recognition task. Comparative experiments on publicly available datasets show that the proposed mode variational LSTM outperforms existing methods. Moreover, a new dynamic facial expression dataset with different modes of variation, including pose and illumination, was collected to comprehensively evaluate the proposed mode variational LSTM. Experimental results verify that the proposed mode variational LSTM encodes spatio-temporal features robust to unseen modes of variation.
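
The abstract describes the architecture only at a high level. The PyTorch sketch below shows one plausible reading: a standard LSTM cell extended with a second ("mode") cell state regulated by one extra gate. The update equations, in particular how the mode state interacts with the task cell, are assumptions for illustration, not the paper's equations.

```python
import torch
import torch.nn as nn

class ModeVariationalLSTMCell(nn.Module):
    """Illustrative sketch: an LSTM cell with an extra 'mode' cell state
    and an extra gate, per the abstract's high-level description."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # i, f, o, g of the standard LSTM plus one assumed mode gate m
        self.linear = nn.Linear(input_size + hidden_size, 5 * hidden_size)

    def forward(self, x, state):
        h, c, c_mode = state                       # hidden, task cell, mode cell
        z = self.linear(torch.cat([x, h], dim=-1))
        i, f, o, g, m = z.chunk(5, dim=-1)
        i, f, o, m = map(torch.sigmoid, (i, f, o, m))
        g = torch.tanh(g)
        # Assumed update: the mode cell slowly absorbs mode information...
        c_mode = m * c_mode + (1.0 - m) * g
        # ...so the task cell keeps the mode-free expression dynamics.
        c = f * c + i * (g - c_mode)
        h = o * torch.tanh(c)
        return h, (h, c, c_mode)

# Usage: one step on a batch of 2 with 128-d inputs and a 64-d hidden state.
cell = ModeVariationalLSTMCell(128, 64)
h = c = cm = torch.zeros(2, 64)
y, state = cell(torch.randn(2, 128), (h, c, cm))
```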

2013 ◽  
Vol 347-350 ◽  
pp. 3780-3785
Author(s):  
Jing Jie Yan ◽  
Ming Han Xin

Although spatio-temporal (ST) features have recently been developed and shown to be effective for facial expression recognition and behavior recognition in videos, they represent each cuboid by directly flattening it into a feature vector, which makes the resulting vector potentially sensitive to small cuboid perturbations or noise. To overcome this drawback, we propose a novel fused spatio-temporal features (FST) method that uses separable linear filters to detect interest points and fuses two cuboid representations, a local histogrammed gradient descriptor and the flattened cuboid vector, into a single descriptor. The proposed FST method is more robust to small cuboid perturbations or noise while preserving both spatial and temporal positional information. Experimental results on two video-based facial expression databases demonstrate the effectiveness of the proposed method.
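
To make the fusion idea concrete, the NumPy sketch below concatenates the two representations named in the abstract: the flattened cuboid (positional information) and a histogram of gradient orientations (perturbation robustness). The specific choices here, 8 orientation bins and gradients on the time-averaged frame, are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def fused_cuboid_descriptor(cuboid, n_bins=8):
    """Fuse (a) the flattened cuboid and (b) a histogram of gradient
    orientations into one descriptor. Parameters are assumed, not the paper's."""
    flat = cuboid.ravel().astype(np.float64)
    flat /= (np.linalg.norm(flat) + 1e-8)          # L2-normalize raw intensities

    gy, gx = np.gradient(cuboid.mean(axis=0))      # spatial gradients, time-averaged
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                       # orientations in [-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    hist /= (hist.sum() + 1e-8)                    # normalized gradient histogram

    return np.concatenate([flat, hist])            # fused FST-style descriptor

# Example: a 5x16x16 spatio-temporal cuboid (t, y, x) around an interest point.
desc = fused_cuboid_descriptor(np.random.rand(5, 16, 16))
```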


2021 ◽  
Vol 2 (4) ◽  
pp. 1-26
Author(s):  
Peining Zhen ◽  
Hai-Bao Chen ◽  
Yuan Cheng ◽  
Zhigang Ji ◽  
Bin Liu ◽  
...  

Mobile devices usually suffer from limited computation and storage resources, which seriously hinders them from running deep neural network applications. In this article, we introduce a deeply tensor-compressed long short-term memory (LSTM) neural network for fast video-based facial expression recognition on mobile devices. First, a spatio-temporal facial expression recognition LSTM model is built by extracting time-series feature maps from facial clips. The LSTM-based spatio-temporal model is then deeply compressed by means of quantization and tensorization for mobile device implementation. On the Extended Cohn-Kanade (CK+), MMI, and Acted Facial Expressions in the Wild 7.0 datasets, experimental results show that the proposed method achieves 97.96%, 97.33%, and 55.60% classification accuracy, respectively, while compressing the network model by up to 221× and reducing the training time per epoch by 60%. Our work is further implemented on the RK3399Pro mobile device with a Neural Process Engine. With the leveraged compression methods, the on-board latency of the feature extractor and the LSTM predictor is reduced by 30.20× and 6.62×, respectively. Furthermore, the spatio-temporal model costs only 57.19 MB of DRAM and 5.67 W of power when running on the board.
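
The abstract does not spell out the tensorization, so the PyTorch sketch below shows the general idea with a tensor-train (TT) factorized linear layer, the kind of layer typically substituted for an LSTM's dense weight matrices. The mode shapes (256 = 4·8·8, 1024 = 8·8·16) and TT-rank 4 are arbitrary illustrative choices.

```python
import math
import torch
import torch.nn as nn

class TTLinear(nn.Module):
    """Linear layer (256 -> 1024 here) whose dense weight is never stored:
    it is assembled on the fly from small tensor-train cores."""

    def __init__(self, in_modes=(4, 8, 8), out_modes=(8, 8, 16), rank=4):
        super().__init__()
        ranks = (1, rank, rank, 1)
        self.in_modes, self.out_modes = in_modes, out_modes
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k] * out_modes[k],
                                           ranks[k + 1]))
            for k in range(3)])

    def weight(self):
        # Contract the chain of TT cores into the full (in, out) matrix.
        w = self.cores[0]                               # (1, m1*n1, r)
        for core in self.cores[1:]:
            w = torch.einsum('amr,rns->amns', w, core)  # attach next core
            a, m, n, s = w.shape
            w = w.reshape(a, m * n, s)
        # Unfold interleaved (m1,n1,m2,n2,m3,n3) axes, then group ins/outs.
        w = w.reshape(*sum(zip(self.in_modes, self.out_modes), ()))
        w = w.permute(0, 2, 4, 1, 3, 5)
        return w.reshape(math.prod(self.in_modes), -1)  # (256, 1024)

    def forward(self, x):                               # x: (batch, 256)
        return x @ self.weight()
```

With these shapes the three cores hold 1,664 parameters in place of the 262,144 entries of the dense 256×1024 matrix; a production implementation would also contract the input directly with the cores (and quantize them) rather than re-assembling the dense weight each step.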


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2003 ◽  
Author(s):  
Xiaoliang Zhu ◽  
Shihao Ye ◽  
Liang Zhao ◽  
Zhicheng Dai

Improving performance on the AFEW (Acted Facial Expressions in the Wild) dataset, a sub-challenge of EmotiW (the Emotion Recognition in the Wild challenge), is a popular benchmark for emotion recognition under real-world constraints, including uneven illumination, head deflection, and facial posture. In this paper, we propose a convenient facial expression recognition cascade network comprising spatial feature extraction, hybrid attention, and temporal feature extraction. First, faces are detected in each frame of a video sequence, and the corresponding face ROI (region of interest) is extracted to obtain the face images. The face images in each frame are then aligned based on the positions of the facial landmark points in the images. Second, the aligned face images are fed into a residual neural network to extract the spatial features of the corresponding facial expressions, and the spatial features are passed to the hybrid attention module to obtain fused facial expression features. Finally, the fused features are fed into a gated recurrent unit (GRU) to extract the temporal features of the facial expressions, and the temporal features are passed to a fully connected layer to classify and recognize the expressions. Experiments on the CK+ (Extended Cohn-Kanade), Oulu-CASIA (Institute of Automation, Chinese Academy of Sciences), and AFEW datasets yield recognition accuracies of 98.46%, 87.31%, and 53.44%, respectively. This demonstrates that the proposed method not only achieves performance competitive with state-of-the-art methods but also improves on them by more than 2% on the AFEW dataset, showing its effectiveness for facial expression recognition in natural environments.
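
The cascade lends itself to a compact PyTorch sketch: per-frame ResNet features, a frame-level attention weighting standing in for the hybrid attention module, a GRU for temporal features, and a fully connected classifier. Layer sizes and the simple softmax attention are assumptions, and face detection/alignment are presumed done upstream.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CascadeFER(nn.Module):
    """Sketch of the cascade: ResNet -> attention -> GRU -> classifier."""

    def __init__(self, n_classes=7, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # 512-d per frame
        self.attn = nn.Linear(512, 1)             # assumed frame-wise attention
        self.gru = nn.GRU(512, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, clips):                     # clips: (batch, T, 3, 224, 224)
        b, t = clips.shape[:2]
        f = self.cnn(clips.flatten(0, 1)).flatten(1)   # spatial features (b*T, 512)
        f = f.view(b, t, -1)
        w = torch.softmax(self.attn(f), dim=1)         # attention over frames
        f = f * w                                      # fused expression features
        out, _ = self.gru(f)                           # temporal features
        return self.fc(out[:, -1])                     # classify last hidden state
```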


2010 ◽  
Author(s):  
M. Fischer-Shofty ◽  
S. G. Shamay-Tsoory ◽  
H. Harari ◽  
Y. Levkovitz

Symmetry ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 52 ◽  
Author(s):  
Xianzhang Pan ◽  
Wenping Guo ◽  
Xiaoying Guo ◽  
Wenshu Li ◽  
Junjie Xu ◽  
...  

The proposed method has 30 streams: 15 spatial streams and 15 temporal streams, with each spatial stream paired with a corresponding temporal stream; this pairing relates the work to the concept of symmetry. Classifying video-based facial expressions is difficult owing to the gap between visual descriptors and emotions. To bridge this gap, a new video descriptor for facial expression recognition is presented that aggregates spatial and temporal convolutional features across the entire extent of a video. The designed framework integrates the 30 state-of-the-art streams with a trainable spatial-temporal feature aggregation layer and is end-to-end trainable for video-based facial expression recognition. The framework can thus effectively avoid overfitting to the limited emotional video datasets, and the trainable aggregation strategy learns to better represent an entire video. Different schemes for pooling spatial-temporal features are investigated, and the spatial and temporal streams are best aggregated by the proposed method. Extensive experiments on two public databases, BAUM-1s and eNTERFACE05, show that the framework achieves promising performance and outperforms state-of-the-art strategies.
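
As a sketch of a trainable aggregation layer over the 30 streams, the PyTorch module below learns one softmax-normalized weight per stream and pools the per-stream features into a single video descriptor. The feature dimensionality and the scalar weighting scheme are assumptions for illustration, not the paper's layer.

```python
import torch
import torch.nn as nn

class StreamAggregation(nn.Module):
    """Learns, end to end, how much each of the 15 spatial and 15 temporal
    streams contributes to the final video descriptor."""

    def __init__(self, n_streams=30, feat_dim=512, n_classes=6):
        super().__init__()
        self.stream_logits = nn.Parameter(torch.zeros(n_streams))  # trainable weights
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, stream_feats):              # (batch, 30, feat_dim)
        w = torch.softmax(self.stream_logits, dim=0)        # one weight per stream
        video_desc = (stream_feats * w[None, :, None]).sum(dim=1)
        return self.fc(video_desc)                # classify the pooled descriptor

# Usage: aggregate 30 stream features of 512 dims for a batch of 4 videos.
logits = StreamAggregation()(torch.randn(4, 30, 512))
```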


2005 ◽  
Vol 50 (9) ◽  
pp. 525-533 ◽  
Author(s):  
Benoit Bediou ◽  
Pierre Krolak-Salmon ◽  
Mohamed Saoud ◽  
Marie-Anne Henaff ◽  
Michael Burt ◽  
...  

Background: Impaired facial expression recognition in schizophrenia patients contributes to abnormal social functioning and may predict functional outcome in these patients. Facial expression processing involves individual neural networks that have been shown to malfunction in schizophrenia. Whether these patients have a selective deficit in facial expression recognition or a more global impairment in face processing remains controversial. Objective: To investigate whether patients with schizophrenia exhibit a selective impairment in facial emotional expression recognition, compared with patients with major depression and healthy control subjects. Methods: We studied performance in facial expression recognition and facial sex recognition paradigms, using original morphed faces, in a population with schizophrenia (n = 29) and compared their scores with those of depression patients (n = 20) and control subjects (n = 20). Results: Schizophrenia patients achieved lower scores than both other groups in the expression recognition task, particularly in fear and disgust recognition. Sex recognition was unimpaired. Conclusion: Facial expression recognition is impaired in schizophrenia, whereas sex recognition is preserved, which highly suggests an abnormal processing of changeable facial features in this disease. A dysfunction of the top-down retrograde modulation coming from limbic and paralimbic structures on visual areas is hypothesized.

