Connectionist Temporal Classification Model for Dynamic Hand Gesture Recognition using RGB and Optical flow Data

2020 ◽  
Vol 17 (4) ◽  
pp. 497-506
Author(s):  
Sunil Patel ◽  
Ramji Makwana

Automatic classification of dynamic hand gestures is challenging because of the large diversity among gesture classes, the low resolution of the input, and the fact that gestures are performed with the fingers. These challenges have drawn many researchers to the area. Recently, deep neural networks have been used for implicit feature extraction, with a softmax layer for classification. In this paper, we propose a method based on a two-dimensional convolutional neural network that performs detection and classification of hand gestures simultaneously from multimodal Red, Green, Blue, Depth (RGB-D) and optical flow data, and passes the resulting features to a Long Short-Term Memory (LSTM) recurrent network for frame-to-frame probability generation, with a Connectionist Temporal Classification (CTC) network for loss calculation. We compute optical flow from the Red, Green, Blue (RGB) data to capture the motion information present in the video. The CTC model efficiently evaluates all possible alignments of a hand gesture via dynamic programming and checks the frame-to-frame consistency of the visual similarity of the hand gesture in the unsegmented input stream. The CTC network finds the most probable sequence of frames for a gesture class. The frame with the highest probability value is selected from the CTC network by max decoding. The entire CTC network is trained end-to-end by computing the CTC loss for gesture recognition. We use the challenging Vision for Intelligent Vehicles and Applications (VIVA) dataset for dynamic hand gesture recognition, captured with RGB and depth data. On this VIVA dataset, our proposed hand gesture recognition technique outperforms competing state-of-the-art algorithms and achieves an accuracy of 86%.
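
The max decoding step mentioned in the abstract can be illustrated with the standard CTC best-path rule: take the most probable label at each frame, merge consecutive repeats, then drop the blank label. The sketch below is only an illustration under assumptions; the function name, the blank index 0, and the toy probabilities are hypothetical, and the abstract does not give the authors' exact decoding procedure.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (not stated in the abstract)


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Best-path decoding: arg-max per frame, merge repeats, drop blanks."""
    # Step 1: most probable label index for every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    # Step 2: keep a label only when it differs from the previous frame's
    # label and is not the blank symbol.
    decoded: List[int] = []
    prev = None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


# Toy per-frame probabilities over {blank, gesture 1, gesture 2}.
toy_probs = [
    [0.80, 0.10, 0.10],  # blank
    [0.10, 0.70, 0.20],  # gesture 1
    [0.20, 0.60, 0.20],  # gesture 1 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.20, 0.70],  # gesture 2
]
decoded_labels = ctc_greedy_decode(toy_probs)
```

Here the best path is blank, 1, 1, blank, 2, which collapses to the label sequence [1, 2]. A blank between two identical labels keeps them as separate gestures.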

Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 2106 ◽  
Author(s):  
Linchu Yang ◽  
Ji’an Chen ◽  
Weihang Zhu

Dynamic hand gesture recognition is one of the most significant tools for human–computer interaction. To improve the accuracy of dynamic hand gesture recognition, this paper proposes a two-layer Bidirectional Recurrent Neural Network for recognizing dynamic hand gestures from a Leap Motion Controller (LMC). In addition, an efficient way to capture dynamic hand gestures with the LMC is identified. Dynamic hand gestures are represented by sets of feature vectors from the LMC. The proposed system has been tested on two American Sign Language (ASL) datasets, with 360 and 480 samples, and on the Handicraft-Gesture dataset. On the ASL dataset with 360 samples, the system achieves accuracies of 100% and 96.3% on the training and testing sets, respectively. On the ASL dataset with 480 samples, the system achieves accuracies of 100% and 95.2%. On the Handicraft-Gesture dataset, the system achieves accuracies of 100% and 96.7%. In addition, 5-fold, 10-fold, and Leave-One-Out cross-validation are performed on these datasets. The resulting accuracies are 93.33%, 94.1%, and 98.33% on the ASL dataset with 360 samples; 93.75%, 93.5%, and 98.13% on the ASL dataset with 480 samples; and 88.66%, 90%, and 92% on the Handicraft-Gesture dataset. The developed system demonstrates similar or better performance compared to other approaches in the literature.
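
The 5-fold, 10-fold, and Leave-One-Out protocols named in the abstract differ only in how many folds the samples are split into. The following is a minimal sketch of the index splitting, assuming contiguous, unshuffled folds; the function name is hypothetical, and the abstract does not say whether the authors shuffled or stratified their data.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first `extra` folds.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Fold layouts for a 360-sample dataset, as in the abstract.
five_fold = list(k_fold_indices(360, 5))        # 5 folds of 72 test samples
ten_fold = list(k_fold_indices(360, 10))        # 10 folds of 36 test samples
leave_one_out = list(k_fold_indices(360, 360))  # 360 folds of 1 test sample
```

Leave-One-Out is simply the case where k equals the number of samples, so each sample serves as the test set exactly once.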


Dynamic hand gesture recognition is an important research topic in human-computer interaction. Recently, deep convolutional neural networks have given excellent performance and promising results in this area. However, researchers have paid less attention to the feature extraction process, frame unification, fusion schemes, and sequence-to-sequence prediction over frames. Therefore, in this paper, we present an effective 2D CNN architecture with a three-stream network and an advanced weighted feature fusion scheme, combined with a gated recurrent network, for dynamic hand gesture recognition. To obtain sufficient useful information, we convert each RGB-D video to 30-frame and 45-frame inputs. We compute frame-to-frame optical flow from the given RGB video and extract dense motion features. After finding the proper motion path, we assign more weight to the optical flow features, fuse this information into the next stage, and obtain comparable results. We also add a recent gated recurrent network for temporal recognition over frames, which reduces training time and improves accuracy. Our proposed architecture achieves 85% accuracy on the standard VIVA dataset.
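
The idea of giving the optical flow stream more weight during fusion can be sketched as a normalized weighted average of per-stream feature vectors. This is only an illustration under assumptions: the stream names (rgb, depth, flow), the weight values, and the choice of a weighted average are hypothetical, since the abstract does not specify the fusion formula.

```python
from typing import Dict, List


def weighted_fusion(streams: Dict[str, List[float]],
                    weights: Dict[str, float]) -> List[float]:
    """Fuse equal-length feature vectors with weights normalized to sum to 1."""
    if not streams:
        raise ValueError("at least one stream is required")
    total = sum(weights[name] for name in streams)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    length = len(next(iter(streams.values())))
    fused = [0.0] * length
    for name, features in streams.items():
        if len(features) != length:
            raise ValueError("all streams must have the same feature length")
        w = weights[name] / total  # normalized weight for this stream
        for i, value in enumerate(features):
            fused[i] += w * value
    return fused


# Hypothetical features from three streams; optical flow gets double weight.
stream_features = {
    "rgb": [1.0, 0.0],
    "depth": [0.0, 1.0],
    "flow": [1.0, 1.0],
}
stream_weights = {"rgb": 1.0, "depth": 1.0, "flow": 2.0}
fused_features = weighted_fusion(stream_features, stream_weights)
```

With these toy values the normalized weights are 0.25, 0.25, and 0.5, so the flow stream contributes half of each fused feature.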

