Mode Variational LSTM Robust to Unseen Modes of Variation: Application to Facial Expression Recognition

Author(s):  
Wissam J. Baddar ◽  
Yong Man Ro

Spatio-temporal feature encoding is essential for capturing the dynamics in video sequences. Recurrent neural networks, particularly long short-term memory (LSTM) units, have been popular as an efficient tool for encoding spatio-temporal features in sequences. In this work, we investigate the effect of mode variations on the spatio-temporal features encoded by LSTMs. We show that the LSTM retains information related to the mode variation in the sequence, which is irrelevant to the task at hand (e.g., classifying facial expressions). In fact, the LSTM forget mechanism is not robust to mode variations and preserves information that can degrade the encoded spatio-temporal features. We propose the mode variational LSTM to encode spatio-temporal features that are robust to unseen modes of variation. The mode variational LSTM modifies the original LSTM structure by adding a cell state that focuses on encoding the mode variation in the input sequence. To regulate which features are stored in this additional cell state, additional gating functionality is introduced. The effectiveness of the proposed mode variational LSTM is verified on the facial expression recognition task. Comparative experiments on publicly available datasets show that the proposed mode variational LSTM outperforms existing methods. Moreover, a new dynamic facial expression dataset with different modes of variation, including pose and illumination, was collected to comprehensively evaluate the proposed mode variational LSTM. Experimental results verify that the proposed mode variational LSTM encodes spatio-temporal features robust to unseen modes of variation.
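
The abstract describes the architecture only at a high level. The PyTorch sketch below shows one plausible reading: a standard LSTM cell extended with a second ("mode") cell state regulated by one extra gate. The update equations, in particular how the mode state interacts with the task cell, are assumptions for illustration, not the paper's equations.

```python
import torch
import torch.nn as nn

class ModeVariationalLSTMCell(nn.Module):
    """Illustrative sketch: an LSTM cell with an extra 'mode' cell state
    and an extra gate, per the abstract's high-level description."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # i, f, o, g of the standard LSTM plus one assumed mode gate m
        self.linear = nn.Linear(input_size + hidden_size, 5 * hidden_size)

    def forward(self, x, state):
        h, c, c_mode = state                       # hidden, task cell, mode cell
        z = self.linear(torch.cat([x, h], dim=-1))
        i, f, o, g, m = z.chunk(5, dim=-1)
        i, f, o, m = map(torch.sigmoid, (i, f, o, m))
        g = torch.tanh(g)
        # Assumed update: the mode cell slowly absorbs mode information...
        c_mode = m * c_mode + (1.0 - m) * g
        # ...so the task cell keeps the mode-free expression dynamics.
        c = f * c + i * (g - c_mode)
        h = o * torch.tanh(c)
        return h, (h, c, c_mode)

# Usage: one step on a batch of 2 with 128-d inputs and a 64-d hidden state.
cell = ModeVariationalLSTMCell(128, 64)
h = c = cm = torch.zeros(2, 64)
y, state = cell(torch.randn(2, 128), (h, c, cm))
```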

2013 ◽  
Vol 347-350 ◽  
pp. 3780-3785
Author(s):  
Jing Jie Yan ◽  
Ming Han Xin

Although spatio-temporal (ST) features have recently been developed and shown to be effective for facial expression recognition and behavior recognition in videos, they represent each cuboid by directly flattening it into a feature vector, which makes the resulting vector potentially sensitive to small cuboid perturbations or noise. To overcome this drawback, we propose a novel fused spatio-temporal features (FST) method that uses separable linear filters to detect interest points and fuses two cuboid representations, a local histogrammed gradient descriptor and the flattened cuboid vector, into a single descriptor. The proposed FST method is more robust to small cuboid perturbations or noise while preserving both spatial and temporal positional information. Experimental results on two video-based facial expression databases demonstrate the effectiveness of the proposed method.
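
To make the fusion idea concrete, the NumPy sketch below concatenates the two representations named in the abstract: the flattened cuboid (positional information) and a histogram of gradient orientations (perturbation robustness). The specific choices here, 8 orientation bins and gradients on the time-averaged frame, are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def fused_cuboid_descriptor(cuboid, n_bins=8):
    """Fuse (a) the flattened cuboid and (b) a histogram of gradient
    orientations into one descriptor. Parameters are assumed, not the paper's."""
    flat = cuboid.ravel().astype(np.float64)
    flat /= (np.linalg.norm(flat) + 1e-8)          # L2-normalize raw intensities

    gy, gx = np.gradient(cuboid.mean(axis=0))      # spatial gradients, time-averaged
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                       # orientations in [-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    hist /= (hist.sum() + 1e-8)                    # normalized gradient histogram

    return np.concatenate([flat, hist])            # fused FST-style descriptor

# Example: a 5x16x16 spatio-temporal cuboid (t, y, x) around an interest point.
desc = fused_cuboid_descriptor(np.random.rand(5, 16, 16))
```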


2021 ◽  
Vol 2 (4) ◽  
pp. 1-26
Author(s):  
Peining Zhen ◽  
Hai-Bao Chen ◽  
Yuan Cheng ◽  
Zhigang Ji ◽  
Bin Liu ◽  
...  

Mobile devices usually suffer from limited computation and storage resources, which seriously hinders them from running deep neural network applications. In this article, we introduce a deeply tensor-compressed long short-term memory (LSTM) neural network for fast video-based facial expression recognition on mobile devices. First, a spatio-temporal facial expression recognition LSTM model is built by extracting time-series feature maps from facial clips. The LSTM-based spatio-temporal model is then deeply compressed by means of quantization and tensorization for mobile device implementation. On the Extended Cohn-Kanade (CK+), MMI, and Acted Facial Expressions in the Wild 7.0 datasets, experimental results show that the proposed method achieves 97.96%, 97.33%, and 55.60% classification accuracy, respectively, while compressing the network model by up to 221× and reducing the training time per epoch by 60%. Our work is further implemented on the RK3399Pro mobile device with a Neural Process Engine. With the leveraged compression methods, the on-board latency of the feature extractor and the LSTM predictor is reduced by 30.20× and 6.62×, respectively. Furthermore, the spatio-temporal model costs only 57.19 MB of DRAM and 5.67 W of power when running on the board.
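
The abstract does not spell out the tensorization, so the PyTorch sketch below shows the general idea with a tensor-train (TT) factorized linear layer, the kind of layer typically substituted for an LSTM's dense weight matrices. The mode shapes (256 = 4·8·8, 1024 = 8·8·16) and TT-rank 4 are arbitrary illustrative choices.

```python
import math
import torch
import torch.nn as nn

class TTLinear(nn.Module):
    """Linear layer (256 -> 1024 here) whose dense weight is never stored:
    it is assembled on the fly from small tensor-train cores."""

    def __init__(self, in_modes=(4, 8, 8), out_modes=(8, 8, 16), rank=4):
        super().__init__()
        ranks = (1, rank, rank, 1)
        self.in_modes, self.out_modes = in_modes, out_modes
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k] * out_modes[k],
                                           ranks[k + 1]))
            for k in range(3)])

    def weight(self):
        # Contract the chain of TT cores into the full (in, out) matrix.
        w = self.cores[0]                               # (1, m1*n1, r)
        for core in self.cores[1:]:
            w = torch.einsum('amr,rns->amns', w, core)  # attach next core
            a, m, n, s = w.shape
            w = w.reshape(a, m * n, s)
        # Unfold interleaved (m1,n1,m2,n2,m3,n3) axes, then group ins/outs.
        w = w.reshape(*sum(zip(self.in_modes, self.out_modes), ()))
        w = w.permute(0, 2, 4, 1, 3, 5)
        return w.reshape(math.prod(self.in_modes), -1)  # (256, 1024)

    def forward(self, x):                               # x: (batch, 256)
        return x @ self.weight()
```

With these shapes the three cores hold 1,664 parameters in place of the 262,144 entries of the dense 256×1024 matrix; a production implementation would also contract the input directly with the cores (and quantize them) rather than re-assembling the dense weight each step.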


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2003 ◽  
Author(s):  
Xiaoliang Zhu ◽  
Shihao Ye ◽  
Liang Zhao ◽  
Zhicheng Dai

Improving performance on the AFEW (Acted Facial Expressions in the Wild) dataset, a sub-challenge of EmotiW (the Emotion Recognition in the Wild challenge), is a popular benchmark for emotion recognition under real-world constraints, including uneven illumination, head deflection, and facial posture. In this paper, we propose a convenient facial expression recognition cascade network comprising spatial feature extraction, hybrid attention, and temporal feature extraction. First, faces are detected in each frame of a video sequence, and the corresponding face ROI (region of interest) is extracted to obtain the face images. The face images in each frame are then aligned based on the positions of the facial landmark points in the images. Second, the aligned face images are fed into a residual neural network to extract the spatial features of the corresponding facial expressions, and the spatial features are passed to the hybrid attention module to obtain fused facial expression features. Finally, the fused features are fed into a gated recurrent unit (GRU) to extract the temporal features of the facial expressions, and the temporal features are passed to a fully connected layer to classify and recognize the expressions. Experiments on the CK+ (Extended Cohn-Kanade), Oulu-CASIA (Institute of Automation, Chinese Academy of Sciences), and AFEW datasets yield recognition accuracies of 98.46%, 87.31%, and 53.44%, respectively. This demonstrates that the proposed method not only achieves performance competitive with state-of-the-art methods but also improves on them by more than 2% on the AFEW dataset, showing its effectiveness for facial expression recognition in natural environments.
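
The cascade lends itself to a compact PyTorch sketch: per-frame ResNet features, a frame-level attention weighting standing in for the hybrid attention module, a GRU for temporal features, and a fully connected classifier. Layer sizes and the simple softmax attention are assumptions, and face detection/alignment are presumed done upstream.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CascadeFER(nn.Module):
    """Sketch of the cascade: ResNet -> attention -> GRU -> classifier."""

    def __init__(self, n_classes=7, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # 512-d per frame
        self.attn = nn.Linear(512, 1)             # assumed frame-wise attention
        self.gru = nn.GRU(512, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, clips):                     # clips: (batch, T, 3, 224, 224)
        b, t = clips.shape[:2]
        f = self.cnn(clips.flatten(0, 1)).flatten(1)   # spatial features (b*T, 512)
        f = f.view(b, t, -1)
        w = torch.softmax(self.attn(f), dim=1)         # attention over frames
        f = f * w                                      # fused expression features
        out, _ = self.gru(f)                           # temporal features
        return self.fc(out[:, -1])                     # classify last hidden state
```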


2010 ◽  
Author(s):  
M. Fischer-Shofty ◽  
S. G. Shamay-Tsoory ◽  
H. Harari ◽  
Y. Levkovitz

Symmetry ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 52 ◽  
Author(s):  
Xianzhang Pan ◽  
Wenping Guo ◽  
Xiaoying Guo ◽  
Wenshu Li ◽  
Junjie Xu ◽  
...  

The proposed method has 30 streams: 15 spatial streams and 15 temporal streams, with each spatial stream paired with a corresponding temporal stream; this pairing relates the work to the concept of symmetry. Classifying video-based facial expressions is difficult owing to the gap between visual descriptors and emotions. To bridge this gap, a new video descriptor for facial expression recognition is presented that aggregates spatial and temporal convolutional features across the entire extent of a video. The designed framework integrates the 30 state-of-the-art streams with a trainable spatial-temporal feature aggregation layer and is end-to-end trainable for video-based facial expression recognition. The framework can thus effectively avoid overfitting to the limited emotional video datasets, and the trainable aggregation strategy learns to better represent an entire video. Different schemes for pooling spatial-temporal features are investigated, and the spatial and temporal streams are best aggregated by the proposed method. Extensive experiments on two public databases, BAUM-1s and eNTERFACE05, show that the framework achieves promising performance and outperforms state-of-the-art strategies.
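
As a sketch of a trainable aggregation layer over the 30 streams, the PyTorch module below learns one softmax-normalized weight per stream and pools the per-stream features into a single video descriptor. The feature dimensionality and the scalar weighting scheme are assumptions for illustration, not the paper's layer.

```python
import torch
import torch.nn as nn

class StreamAggregation(nn.Module):
    """Learns, end to end, how much each of the 15 spatial and 15 temporal
    streams contributes to the final video descriptor."""

    def __init__(self, n_streams=30, feat_dim=512, n_classes=6):
        super().__init__()
        self.stream_logits = nn.Parameter(torch.zeros(n_streams))  # trainable weights
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, stream_feats):              # (batch, 30, feat_dim)
        w = torch.softmax(self.stream_logits, dim=0)        # one weight per stream
        video_desc = (stream_feats * w[None, :, None]).sum(dim=1)
        return self.fc(video_desc)                # classify the pooled descriptor

# Usage: aggregate 30 stream features of 512 dims for a batch of 4 videos.
logits = StreamAggregation()(torch.randn(4, 30, 512))
```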


2005 ◽  
Vol 50 (9) ◽  
pp. 525-533 ◽  
Author(s):  
Benoit Bediou ◽  
Pierre Krolak-Salmon ◽  
Mohamed Saoud ◽  
Marie-Anne Henaff ◽  
Michael Burt ◽  
...  

Background: Impaired facial expression recognition in schizophrenia patients contributes to abnormal social functioning and may predict functional outcome in these patients. Facial expression processing involves individual neural networks that have been shown to malfunction in schizophrenia. Whether these patients have a selective deficit in facial expression recognition or a more global impairment in face processing remains controversial. Objective: To investigate whether patients with schizophrenia exhibit a selective impairment in facial emotional expression recognition, compared with patients with major depression and healthy control subjects. Methods: We studied performance in facial expression recognition and facial sex recognition paradigms, using original morphed faces, in a population with schizophrenia (n = 29) and compared their scores with those of depression patients (n = 20) and control subjects (n = 20). Results: Schizophrenia patients achieved lower scores than both other groups in the expression recognition task, particularly in fear and disgust recognition. Sex recognition was unimpaired. Conclusion: Facial expression recognition is impaired in schizophrenia, whereas sex recognition is preserved, which highly suggests an abnormal processing of changeable facial features in this disease. A dysfunction of the top-down retrograde modulation coming from limbic and paralimbic structures on visual areas is hypothesized.

