3D-CNN-Based Fused Feature Maps with LSTM Applied to Action Recognition

2019 ◽  
Vol 11 (2) ◽  
pp. 42 ◽  
Author(s):  
Sheeraz Arif ◽  
Jing Wang ◽  
Tehseen Ul Hassan ◽  
Zesong Fei

Human activity recognition is an active field of research in computer vision with numerous applications. Recently, deep convolutional networks and recurrent neural networks (RNNs) have received increasing attention in multimedia studies and have yielded state-of-the-art results. In this research work, we propose a new framework that intelligently combines 3D-CNN and LSTM networks. First, we integrate the discriminative information of a video into a map called a 'motion map' by using a deep 3-dimensional convolutional network (C3D). A motion map and the next video frame can be integrated into a new motion map; this scheme is trained by iteratively increasing the training video length, and the final network can then generate the motion map of an entire video. Next, a linear weighted fusion scheme fuses the network feature maps into spatio-temporal features. Finally, we use a long short-term memory (LSTM) encoder-decoder for the final predictions. This method is simple to implement and retains discriminative and dynamic information. Improved results on public benchmark datasets demonstrate the effectiveness and practicability of the proposed method.
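A rough sketch of the pipeline described in this abstract is given below, assuming PyTorch: a small 3D CNN stands in for C3D, its intermediate feature maps are combined by learnable linear fusion weights, and a single LSTM stands in for the encoder-decoder. All layer sizes, the fusion granularity, and the clip length are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class FusedC3DLSTM(nn.Module):
    def __init__(self, num_classes=10, hidden=256):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool3d((1, 2, 2)))
        self.block2 = nn.Sequential(nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool3d((2, 2, 2)))
        self.proj1 = nn.Conv3d(32, 64, 1)                  # match channels before fusing
        self.fusion_weights = nn.Parameter(torch.ones(2))  # learnable linear fusion weights
        self.pool = nn.AdaptiveAvgPool3d((None, 4, 4))
        self.lstm = nn.LSTM(64 * 4 * 4, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                               # clip: (B, 3, T, H, W)
        f1 = self.block1(clip)                             # (B, 32, T,   H/2, W/2)
        f2 = self.block2(f1)                               # (B, 64, T/2, H/4, W/4)
        f1 = nn.functional.adaptive_avg_pool3d(self.proj1(f1), f2.shape[2:])
        w = torch.softmax(self.fusion_weights, dim=0)
        fused = w[0] * f1 + w[1] * f2                      # linear weighted fusion of feature maps
        fused = self.pool(fused)                           # (B, 64, T/2, 4, 4)
        B, C, T, H, W = fused.shape
        seq = fused.permute(0, 2, 1, 3, 4).reshape(B, T, C * H * W)
        out, _ = self.lstm(seq)                            # temporal modelling over the fused maps
        return self.head(out[:, -1])

logits = FusedC3DLSTM()(torch.randn(2, 3, 16, 64, 64))     # two 16-frame RGB clips
```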

Information ◽  
2020 ◽  
Vol 12 (1) ◽  
pp. 3
Author(s):  
Shuang Chen ◽  
Zengcai Wang ◽  
Wenxin Chen

The effective detection of driver drowsiness is an important measure for preventing traffic accidents. Most existing drowsiness detection methods use only a single facial feature to identify fatigue status, ignoring the complex correlations between fatigue features and the temporal information they carry, which reduces recognition accuracy. To address these problems, we propose a driver sleepiness estimation model based on factorized bilinear feature fusion and a long short-term recurrent convolutional network to detect driver drowsiness efficiently and accurately. The proposed framework includes three modules: fatigue feature extraction, fatigue feature fusion, and driver drowsiness detection. First, we used a convolutional neural network (CNN) to extract deep representations of eye- and mouth-related fatigue features from the face area detected in each video frame. Then, based on the factorized bilinear feature fusion model, we performed a nonlinear fusion of the deep feature representations of the eyes and mouth. Finally, we fed the series of fused frame-level features into a long short-term memory (LSTM) unit to capture their temporal information and used a softmax classifier to detect sleepiness. The proposed framework was evaluated on the National Tsing Hua University drowsy driver detection (NTHU-DDD) video dataset. The experimental results showed that this method has better stability and robustness than other methods.
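The fusion step can be illustrated with the following PyTorch sketch of factorized bilinear pooling over two per-frame feature vectors (eye and mouth descriptors). The feature dimensions, factor size, and normalisation choices are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearFusion(nn.Module):
    def __init__(self, dim_eye, dim_mouth, out_dim=256, factor_k=4):
        super().__init__()
        self.k = factor_k
        # Project both modalities into a shared (out_dim * k) space.
        self.proj_eye = nn.Linear(dim_eye, out_dim * factor_k)
        self.proj_mouth = nn.Linear(dim_mouth, out_dim * factor_k)

    def forward(self, eye_feat, mouth_feat):                 # (B, dim_eye), (B, dim_mouth)
        joint = self.proj_eye(eye_feat) * self.proj_mouth(mouth_feat)   # element-wise interaction
        joint = joint.view(joint.size(0), -1, self.k).sum(dim=2)        # sum-pool over the factor dim
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)      # signed square-root normalisation
        return F.normalize(joint, dim=1)                                # L2 normalisation

fusion = FactorizedBilinearFusion(dim_eye=128, dim_mouth=128)
fused = fusion(torch.randn(8, 128), torch.randn(8, 128))                # one 256-d fused vector per frame
```

The fused frame-level vectors would then be stacked over time and passed to the LSTM described above.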


Author(s):  
Sophia Bano ◽  
Francisco Vasconcelos ◽  
Emmanuel Vander Poorten ◽  
Tom Vercauteren ◽  
Sebastien Ourselin ◽  
...  

Purpose: Fetoscopic laser photocoagulation is a minimally invasive surgery for the treatment of twin-to-twin transfusion syndrome (TTTS). Using a lens/fibre-optic scope inserted into the amniotic cavity, the abnormal placental vascular anastomoses are identified and ablated to regulate blood flow to both fetuses. A limited field of view, occlusions due to the fetus, and low visibility make it difficult to identify all vascular anastomoses. Automatic computer-assisted techniques may provide a better understanding of the anatomical structure during surgery for risk-free laser photocoagulation and may facilitate improved mosaicking of fetoscopic videos.
Methods: We propose FetNet, a combined convolutional neural network (CNN) and long short-term memory (LSTM) recurrent neural network architecture for the spatio-temporal identification of fetoscopic events. We adapt an existing CNN architecture for spatial feature extraction and integrate it with an LSTM network for end-to-end spatio-temporal inference. We introduce differential learning rates during model training to make effective use of the pre-trained CNN weights. This may support computer-assisted interventions (CAI) during fetoscopic laser photocoagulation.
Results: We performed a quantitative evaluation of our method using 7 in vivo fetoscopic videos captured from different human TTTS cases, with a total duration of 5551 s (138,780 frames). To test the robustness of the proposed approach, we performed 7-fold cross-validation in which each video is treated as the hold-out (test) set and training is performed on the remaining videos.
Conclusion: FetNet achieved superior performance compared with existing CNN-based methods and provided improved inference thanks to its spatio-temporal modelling. Online testing of FetNet on a Tesla V100-DGXS-32GB GPU achieved a frame rate of 114 fps. These results show that our method could potentially provide a real-time solution for CAI and for automating occlusion and photocoagulation identification during fetoscopic procedures.
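The "differential learning rates" idea can be expressed compactly with PyTorch optimizer parameter groups, as in the hedged sketch below: the pre-trained CNN backbone is fine-tuned with a small learning rate while the newly added LSTM head trains with a larger one. The backbone choice, feature size, class count, and rates are illustrative assumptions, not FetNet's values.

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet18(weights=None)   # stand-in for the pre-trained CNN (weights omitted here)
backbone.fc = nn.Identity()                            # expose 512-d frame features
lstm_head = nn.LSTM(512, 256, batch_first=True)        # temporal model over frame features
classifier = nn.Linear(256, 4)                         # e.g. 4 fetoscopic event classes (assumed)

optimizer = torch.optim.Adam([
    {"params": backbone.parameters(),   "lr": 1e-5},   # fine-tune the pre-trained weights gently
    {"params": lstm_head.parameters(),  "lr": 1e-3},   # train the new layers from scratch
    {"params": classifier.parameters(), "lr": 1e-3},
])
```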


Electronics ◽  
2020 ◽  
Vol 9 (9) ◽  
pp. 1458
Author(s):  
Xulong Zhang ◽  
Yi Yu ◽  
Yongwei Gao ◽  
Xi Chen ◽  
Wei Li

Singing voice detection, or vocal detection, is a classification task that determines whether a given audio segment contains singing voice. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and non-singing parts, it is still very difficult for machines to do so. Most existing methods focus on audio feature engineering with classifiers, which relies on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer audition. To extract essential features that reflect the audio content and characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) to perform vocal detection. The convolutional layers in the LRCN serve as the feature extractor, and the long short-term memory (LSTM) layer learns the temporal relationships. Preprocessing (separation of the singing voice from the accompaniment) and postprocessing (time-domain smoothing) were combined with the LRCN to form a complete system. Experiments on five public datasets investigated how the fused features, frame size, and block size affect the LRCN's temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reaches the state-of-the-art level on public datasets.
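A minimal LRCN sketch for frame-level singing-voice detection is shown below, assuming PyTorch: a small 2-D CNN embeds each spectrogram block and an LSTM models the temporal context across blocks. The input layout, layer widths, and the binary per-block output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VocalLRCN(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)                        # vocal / non-vocal score per block

    def forward(self, blocks):                                 # (B, T, n_mels, frames_per_block)
        B, T = blocks.shape[:2]
        feats = self.cnn(blocks.flatten(0, 1).unsqueeze(1))    # run the CNN on every block
        feats = feats.flatten(1).view(B, T, -1)
        seq, _ = self.lstm(feats)                              # temporal context across blocks
        return torch.sigmoid(self.out(seq)).squeeze(-1)        # (B, T) vocal probabilities

probs = VocalLRCN()(torch.randn(2, 50, 80, 25))                # 50 mel-spectrogram blocks per clip
```

Time-domain smoothing (the postprocessing step above) would then be applied to the per-block probabilities before thresholding.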


2021 ◽  
Vol 2 (4) ◽  
pp. 1-26
Author(s):  
Peining Zhen ◽  
Hai-Bao Chen ◽  
Yuan Cheng ◽  
Zhigang Ji ◽  
Bin Liu ◽  
...  

Mobile devices usually suffer from limited computation and storage resources, which seriously hinders their use for deep neural network applications. In this article, we introduce a deeply tensor-compressed long short-term memory (LSTM) neural network for fast video-based facial expression recognition on mobile devices. First, a spatio-temporal facial expression recognition LSTM model is built by extracting time-series feature maps from facial clips. The LSTM-based spatio-temporal model is then deeply compressed by means of quantization and tensorization for mobile-device implementation. On the Extended Cohn-Kanade (CK+), MMI, and Acted Facial Expressions in the Wild 7.0 datasets, experimental results show that the proposed method achieves 97.96%, 97.33%, and 55.60% classification accuracy, respectively, compresses the network model by up to 221×, and reduces the training time per epoch by 60%. Our work is further implemented on the RK3399Pro mobile device with a Neural Process Engine. With the leveraged compression methods, the latency of the feature extractor and the LSTM predictor is reduced by 30.20× and 6.62×, respectively, on the device. Furthermore, the spatio-temporal model costs only 57.19 MB of DRAM and 5.67 W of power when running on the board.
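As a hedged illustration of one of the two compression steps mentioned above, the sketch below applies post-training dynamic quantization to an LSTM-based predictor so its weights are stored as int8. The tensorization (tensor factorisation) step is not reproduced here, and the model is a stand-in, not the authors' network.

```python
import torch
import torch.nn as nn

class ExpressionLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                                      # x: (B, T, feat_dim) frame features
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])

model = ExpressionLSTM().eval()
quantized = torch.ao.quantization.quantize_dynamic(            # store LSTM/Linear weights as int8
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)
size_fp32 = sum(p.numel() for p in model.parameters()) * 4 / 1e6
print(f"fp32 weight storage ~ {size_fp32:.1f} MB; int8 roughly quarters it")
```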


2020 ◽  
Vol 40 (4) ◽  
pp. 655-662 ◽  
Author(s):  
Xianhe Wen ◽  
Heping Chen

Purpose: Human assembly process recognition in human–robot collaboration (HRC) has been studied recently. However, most existing work does not cover high-precision, long-timespan sub-assembly recognition; this paper aims to address that problem.
Design/methodology/approach: The authors propose a 3D long-term recurrent convolutional network (LRCN) that combines a 3D convolutional neural network (CNN) with long short-term memory (LSTM). A 3D CNN performs well in human action recognition, but when applied to human sub-assembly recognition its accuracy is very low and its number of model parameters is huge, which limits its use. LSTM, meanwhile, offers long-term memory and the ability to compress the time dimension. By combining the 3D CNN with LSTM, the new approach greatly improves recognition accuracy and reduces the number of model parameters.
Findings: Experiments were performed to validate the proposed method and favourable results were obtained: recognition accuracy increases from 82% to 99%, the recall ratio increases from 95% to 100%, and the number of model parameters is reduced by more than a factor of 8.
Originality/value: The authors address a new problem, high-precision and long-timespan sub-assembly recognition, in the area of human assembly process recognition. Compared with a plain 3D CNN, the 3D LRCN method offers high-precision, long-timespan recognition of human sub-assemblies. This is extraordinarily valuable for the robot in HRC, as it helps the robot understand which sub-assembly the human cooperator has completed.
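The following PyTorch sketch illustrates the 3D LRCN idea described above: a long assembly video is split into short clips, a small 3D CNN embeds each clip, and an LSTM aggregates the clip sequence, so the 3D CNN never has to cover the whole timespan. Clip length, layer widths, and the class count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AssemblyLRCN3D(nn.Module):
    def __init__(self, num_subassemblies=5, hidden=128):
        super().__init__()
        self.c3d = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d(1),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_subassemblies)

    def forward(self, video, clip_len=8):                      # video: (B, 3, T, H, W), T >> clip_len
        clips = video.unfold(2, clip_len, clip_len)            # (B, 3, n_clips, H, W, clip_len)
        clips = clips.permute(0, 2, 1, 5, 3, 4)                # (B, n_clips, 3, clip_len, H, W)
        B, N = clips.shape[:2]
        feats = self.c3d(clips.flatten(0, 1)).flatten(1).view(B, N, -1)
        out, _ = self.lstm(feats)                              # long-timespan context over clips
        return self.head(out[:, -1])

logits = AssemblyLRCN3D()(torch.randn(1, 3, 64, 56, 56))       # a 64-frame sequence split into 8 clips
```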


2018 ◽  
Vol 8 (10) ◽  
pp. 1785 ◽  
Author(s):  
Wahyu Wiratama ◽  
Jongseok Lee ◽  
Sang-Eun Park ◽  
Donggyu Sim

This paper presents a robust change detection algorithm for high-resolution panchromatic imagery using a proposed dual-dense convolutional network (DCN). In this work, a joint structure of two deep convolutional networks with dense connectivity in the convolution layers is designed to accomplish change detection for satellite images acquired at different times. The proposed network model detects pixel-wise temporal change based on local characteristics by incorporating information from neighboring pixels. The dense connections in the convolution layers reuse preceding feature maps by connecting them to all subsequent layers. The dual networks are combined by measuring the dissimilarity of the two temporal images. In the learning stage of the proposed change detection algorithm, a contrastive loss function is computed over multiple pairs of samples. According to our evaluation, the proposed framework achieves better detection performance than conventional algorithms, with an average area under the curve (AUC) of 0.97, percentage correct classification (PCC) of 99%, and Kappa of 69.
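A hedged sketch of the learning stage described above is given below: a convolutional branch (reused for both dates, standing in for the dense branches) embeds co-located patches from the two acquisition times, and a contrastive loss pulls unchanged pairs together and pushes changed pairs apart. The margin, patch size, and branch architecture are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(d, label, margin=1.0):
    """d: Euclidean distance between paired embeddings; label: 1 = changed, 0 = unchanged."""
    return ((1 - label) * d.pow(2) + label * F.relu(margin - d).pow(2)).mean()

encoder = nn.Sequential(                                       # stand-in for one dense branch
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

patch_t1 = torch.randn(32, 1, 15, 15)                          # panchromatic patches, date 1
patch_t2 = torch.randn(32, 1, 15, 15)                          # co-located patches, date 2
labels = torch.randint(0, 2, (32,)).float()                    # change / no-change supervision

dist = F.pairwise_distance(encoder(patch_t1), encoder(patch_t2))
loss = contrastive_loss(dist, labels)
loss.backward()
```

At inference time, the per-pixel dissimilarity itself can be thresholded to produce the change map.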


2021 ◽  
Vol 9 (3A) ◽  
Author(s):  
Sheeraz Arif ◽  
Jing Wang ◽  
Adnan Ahmed Siddiqui ◽  
Rashid Hussain ◽  
...  

Deep convolutional neural networks (DCNNs) and recurrent neural networks (RNNs) have proved to be an important research area in multimedia understanding and have achieved remarkable action recognition performance. However, videos contain rich motion information of varying dimensions, and existing recurrent pipelines fail to capture long-term motion dynamics in videos with various motion scales and complex actions performed by multiple actors. Considering contextual and salient features is more important than mapping a video frame into a static video representation. This research work provides a novel pipeline that analyzes and processes video information using a 3D convolutional (C3D) network and a newly introduced deep bidirectional LSTM. Like the popular two-stream ConvNet, we introduce a two-stream framework, with one modification: we replace the optical flow stream with a saliency-aware stream to avoid its computational complexity. First, we generate a saliency-aware video stream by applying a saliency-aware method. Second, a two-stream 3D convolutional network (C3D) is applied to the two streams, i.e., the RGB stream and the saliency-aware video stream, to collect both spatial and semantic temporal features. Next, a deep bidirectional LSTM network learns the sequential deep temporal dynamics. Finally, a time-series pooling layer and a softmax layer classify the human activity and behavior. The introduced system can learn long-term temporal dependencies and predict complex human actions. Experimental results demonstrate significant improvements in action recognition accuracy on different benchmark datasets.
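The back end of this pipeline can be sketched as follows, assuming PyTorch: clip-level features from the RGB and saliency-aware C3D streams are concatenated, passed through a bidirectional LSTM, pooled over time, and classified. The feature dimension, hidden size, and class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoStreamBiLSTMHead(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=101):
        super().__init__()
        self.bilstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, rgb_feats, sal_feats):                   # each: (B, T, feat_dim) from the C3D streams
        seq, _ = self.bilstm(torch.cat([rgb_feats, sal_feats], dim=-1))
        pooled = seq.mean(dim=1)                               # time-series pooling over all steps
        return self.classifier(pooled)                         # softmax is applied inside the loss

head = TwoStreamBiLSTMHead()
logits = head(torch.randn(4, 10, 512), torch.randn(4, 10, 512))   # 10 clip features per video
```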


Machines ◽  
2019 ◽  
Vol 7 (2) ◽  
pp. 24 ◽  
Author(s):  
Yiwei Fu ◽  
Devesh K. Jha ◽  
Zeyu Zhang ◽  
Zhenyuan Yuan ◽  
Asok Ray

This paper presents and experimentally validates a concept of end-to-end imitation learning for autonomous systems using a composite architecture of a convolutional neural network (ConvNet) and a long short-term memory (LSTM) neural network. In particular, a spatio-temporal deep neural network is developed that learns to imitate the policy used by a human supervisor to drive a car-like robot in a maze environment. The spatial and temporal components of the imitation model are learned using deep convolutional and recurrent neural network architectures, respectively. The imitation model learns the policy of a human supervisor as a function of laser light detection and ranging (LIDAR) data, which is then used in real time to drive a robot autonomously in a laboratory setting. The performance of the proposed model for imitation learning is compared with that of several other state-of-the-art methods for spatial and temporal modeling reported in the machine learning literature. The learned policy is deployed on a robot using an Nvidia Jetson TX2 board, which is then validated on test tracks. The proposed spatio-temporal model outperforms several other off-the-shelf machine learning techniques in learning the policy.
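A minimal sketch of such a ConvNet–LSTM imitation policy over LIDAR scan sequences is given below: a 1-D CNN embeds each scan, an LSTM adds temporal context, and a regression head outputs steering and speed commands. The number of beams, the history length, and the output format are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class LidarPolicy(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8), nn.Flatten(),             # 32 * 8 = 256-d scan embedding
        )
        self.lstm = nn.LSTM(256, hidden, batch_first=True)
        self.control = nn.Linear(hidden, 2)                    # [steering, speed]

    def forward(self, scans):                                  # scans: (B, T, beams)
        B, T, _ = scans.shape
        z = self.conv(scans.flatten(0, 1).unsqueeze(1)).view(B, T, -1)
        out, _ = self.lstm(z)                                  # temporal context over the scan history
        return self.control(out[:, -1])                        # command for the latest time step

cmd = LidarPolicy()(torch.randn(1, 5, 360))                    # 5 most recent 360-beam scans
```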


2019 ◽  
Vol 11 (2) ◽  
pp. 159 ◽  
Author(s):  
Bei Fang ◽  
Ying Li ◽  
Haokui Zhang ◽  
Jonathan Chan

Hyperspectral image (HSI) data are typically presented in a 3-D format, which offers an opportunity for 3-D networks to extract spectral and spatial features simultaneously. In this paper, we propose a novel end-to-end 3-D dense convolutional network with a spectral-wise attention mechanism (MSDN-SA) for HSI classification. The proposed MSDN-SA exploits 3-D dilated convolutions to simultaneously capture spectral and spatial features at different scales and densely connects all 3-D feature maps with each other. In addition, a spectral-wise attention mechanism is introduced to enhance the distinguishability of the spectral features, which improves the classification performance of the trained models. Experimental results on three HSI datasets demonstrate that our MSDN-SA achieves competitive performance for HSI classification.
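One common way to realise spectral-wise attention is a squeeze-and-excitation-style block applied along the spectral dimension, as in the hedged sketch below: spatial and channel dimensions are averaged away, a small bottleneck scores each spectral band, and the 3-D feature map is rescaled band by band. This is an interpretation of the mechanism with assumed sizes, not the MSDN-SA implementation.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    def __init__(self, bands, reduction=4):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(bands, bands // reduction), nn.ReLU(),
            nn.Linear(bands // reduction, bands), nn.Sigmoid(),
        )

    def forward(self, x):                                      # x: (B, C, bands, H, W) 3-D feature maps
        squeezed = x.mean(dim=(1, 3, 4))                       # (B, bands): average over channels and space
        weights = self.score(squeezed)                         # per-band attention weights in (0, 1)
        return x * weights[:, None, :, None, None]             # reweight every spectral band

attended = SpectralAttention(bands=100)(torch.randn(2, 16, 100, 9, 9))
```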


2020 ◽  
Vol 18 (S3) ◽  
pp. 34-45
Author(s):  
Zhingtang Zhao ◽  
Qingtao Wu

In intelligent computer-aided video abnormal behavior recognition, pedestrian behavior analysis can detect and handle abnormal behaviors in time, which has great practical value for public safety. We analyze a deep learning video behavior recognition network that performs well in current research. The network first sparsely samples the input video to obtain one frame from each video segment, then uses a two-dimensional convolutional network to extract features from each frame, and finally uses a three-dimensional network to fuse them, so that both long-term and short-term actions in the video are recognized at the same time. To overcome the heavy computation of the 3D convolution part of the network, this paper proposes an improved, mobile 3D convolution network structure for this module. To address the low utilization of long-term motion features in video sequences, this paper constructs a deep residual module by introducing long short-term memory networks and residual connections, making full and effective use of the long-term dynamic features in video sequences. To address the large intra-class differences and small inter-class differences among similar actions in abnormal behavior videos, this paper proposes a 2CSoftmax function based on a double center loss to optimize the network model, which maximizes the inter-class distance while minimizing the intra-class distance, so as to distinguish similar actions and improve recognition accuracy.
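The loss idea can be illustrated with the hedged sketch below: cross-entropy plus a centre term that pulls features toward their class centre (small intra-class distance) and a repulsion term that pushes class centres apart (large inter-class distance). This is one reading of a "double centre" objective, not the exact 2CSoftmax formulation; the margin and weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, lam=0.01, margin=5.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.lam, self.margin = lam, margin

    def forward(self, feats, logits, labels):
        ce = F.cross_entropy(logits, labels)
        intra = (feats - self.centers[labels]).pow(2).sum(1).mean()     # pull features to their own centre
        dists = torch.cdist(self.centers, self.centers)                 # centre-to-centre distances
        off_diag = dists[~torch.eye(len(self.centers), dtype=torch.bool)]
        inter = F.relu(self.margin - off_diag).mean()                   # push different centres apart
        return ce + self.lam * (intra + inter)

criterion = DoubleCenterLoss(num_classes=8, feat_dim=128)
loss = criterion(torch.randn(16, 128), torch.randn(16, 8), torch.randint(0, 8, (16,)))
```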

