StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Author(s):  
Dongliang He ◽  
Zhichao Zhou ◽  
Chuang Gan ◽  
Fu Li ◽  
Xiao Liu ◽  
...  

Despite the success of deep learning for static image understanding, it remains unclear what the most effective network architectures for spatial-temporal modeling in videos are. In this paper, in contrast to existing CNN+RNN or pure 3D-convolution-based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. In particular, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution to the super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution to the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches to action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
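
As a rough illustration of the two ideas above (not the authors' released code), the following minimal PyTorch sketch builds a super-image from N frames and applies separate channel-wise and temporal-wise convolutions; all module names and sizes are our own assumptions:

    import torch
    import torch.nn as nn

    # Local modeling sketch: N successive RGB frames are stacked into a
    # "super-image" with 3N channels, and a 2D convolution mixes them to
    # capture local spatial-temporal structure.
    class SuperImageConv(nn.Module):
        def __init__(self, n_frames=5, out_channels=64):
            super().__init__()
            self.conv = nn.Conv2d(3 * n_frames, out_channels, kernel_size=3, padding=1)

        def forward(self, frames):
            # frames: (batch, N, 3, H, W) -> super-image: (batch, 3N, H, W)
            b, n, c, h, w = frames.shape
            return self.conv(frames.reshape(b, n * c, h, w))

    # Global modeling sketch of the "temporal Xception" idea: a channel-wise
    # (depthwise) 1D convolution over time, followed by a point-wise 1D
    # convolution mixing channels.
    class TemporalXceptionSketch(nn.Module):
        def __init__(self, channels=64, kernel_size=3):
            super().__init__()
            self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                       padding=kernel_size // 2, groups=channels)
            self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

        def forward(self, seq):
            # seq: (batch, channels, T) -- per-snippet feature sequence
            return self.pointwise(self.depthwise(seq))

    x = torch.randn(2, 5, 3, 224, 224)   # 2 clips of 5 frames each
    feats = SuperImageConv()(x)          # (2, 64, 224, 224)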

2021 ◽  
Vol 11 (11) ◽  
pp. 4940
Author(s):  
Jinsoo Kim ◽  
Jeongho Cho

Research on video data is complicated by the need to extract not only spatial but also temporal features, and human action recognition (HAR) is a representative field that applies convolutional neural networks (CNNs) to video data. Action recognition performance has improved, but owing to model complexity, some limitations on real-time operation persist. Therefore, a lightweight CNN-based single-stream HAR model that can operate in real time is proposed. The proposed model extracts spatial feature maps by applying a CNN to the images that compose the video and uses the frame change rate of sequential images as temporal information. The spatial feature maps are weighted-averaged by frame change, transformed into spatiotemporal features, and input into a multilayer perceptron, which has relatively lower complexity than other HAR models; thus, our method has high utility in a single embedded system connected to CCTV. Evaluation of action recognition accuracy and data processing speed on the challenging action recognition benchmark UCF-101 showed higher accuracy than a HAR model using long short-term memory with a small number of video frames, and the fast data processing speed confirmed the possibility of real-time operation. In addition, the performance of the proposed weighted-mean-based HAR model was verified by testing it on a Jetson Nano to confirm its applicability to low-cost GPU-based embedded systems.
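
A minimal PyTorch sketch of the weighted-mean idea described above, assuming the frame change rate is measured as the mean absolute pixel difference between consecutive frames (the paper's exact definition may differ):

    import torch

    def weighted_mean_features(frames, feature_maps):
        # frames:       (T, 3, H, W)  sequential RGB images
        # feature_maps: (T, C)        per-frame spatial features from a CNN backbone
        # Frame change rate: mean absolute pixel difference between consecutive frames.
        diffs = (frames[1:] - frames[:-1]).abs().mean(dim=(1, 2, 3))  # (T-1,)
        w = torch.cat([diffs[:1], diffs])          # reuse the first diff for frame 0
        w = w / w.sum()                            # normalize the weights
        return (w[:, None] * feature_maps).sum(0)  # (C,) spatiotemporal feature

    T, C = 16, 512
    fused = weighted_mean_features(torch.rand(T, 3, 112, 112), torch.rand(T, C))
    # `fused` would then be fed to a small multilayer perceptron classifier.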


Sensors ◽  
2021 ◽  
Vol 21 (16) ◽  
pp. 5312
Author(s):  
Yanni Zhang ◽  
Yiming Liu ◽  
Qiang Li ◽  
Jianzhong Wang ◽  
Miao Qi ◽  
...  

Recently, deep-learning-based image deblurring and deraining have been well developed. However, most of these methods fail to distill the useful features. Moreover, exploiting detailed image features in a deep learning framework always requires a large number of parameters, which inevitably burdens the network with a high computational cost. We propose a lightweight fusion distillation network (LFDN) for image deblurring and deraining to solve the above problems. The proposed LFDN is designed as an encoder–decoder architecture. In the encoding stage, the image features are reduced to various small-scale spaces for multi-scale information extraction and fusion without much information loss. Then, a feature distillation normalization block is placed at the beginning of the decoding stage, which enables the network to continuously distill and screen valuable channel information from the feature maps. In addition, an information fusion strategy between distillation modules and feature channels is implemented via an attention mechanism. By fusing different information in the proposed approach, our network achieves state-of-the-art image deblurring and deraining results with a smaller number of parameters and outperforms existing methods in model complexity.
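
As a hedged sketch of the channel screening described above (the paper's actual block is more elaborate), the following squeeze-and-excitation-style gate weights each channel by its global statistics:

    import torch
    import torch.nn as nn

    # Channel screening via attention: global pooling summarizes each channel,
    # a small bottleneck predicts per-channel weights, and the feature map is
    # re-scaled so informative channels are emphasized.
    class ChannelScreen(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            return x * self.gate(x)

    y = ChannelScreen(64)(torch.randn(1, 64, 128, 128))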


Author(s):  
Qiang Yu ◽  
Feiqiang Liu ◽  
Long Xiao ◽  
Zitao Liu ◽  
Xiaomin Yang

Deep-learning (DL)-based methods are of growing importance in the field of single image super-resolution (SISR). Practical application of these DL-based models remains a problem, however, because they require heavy computation and huge storage resources. The powerful feature maps of the hidden layers in convolutional neural networks (CNNs) help the model learn useful information, but there is redundancy among feature maps that can be further exploited. To address these issues, this paper proposes a lightweight efficient feature generating network (EFGN) for SISR, built from efficient feature generating blocks (EFGBs). Specifically, the EFGB conducts plain operations on the original features to produce more feature maps with only a slight increase in parameters. With the help of these extra feature maps, the network can extract more useful information from low-resolution (LR) images to reconstruct the desired high-resolution (HR) images. Experiments on benchmark datasets demonstrate that the proposed EFGN outperforms other deep-learning-based methods in most cases while possessing relatively lower model complexity. Additionally, running time measurements indicate the feasibility of real-time monitoring.
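
A minimal sketch of the efficient-feature-generation idea, assuming the "plain operations" are cheap depthwise convolutions applied to a few primary maps (an illustrative reading, not the paper's exact EFGB):

    import torch
    import torch.nn as nn

    # A regular convolution produces a few primary maps, and cheap depthwise
    # operations derive extra maps from them, so the channel count grows with
    # only a small parameter increase.
    class CheapFeatureBlock(nn.Module):
        def __init__(self, in_ch, primary_ch, expand=2):
            super().__init__()
            self.primary = nn.Conv2d(in_ch, primary_ch, 3, padding=1)
            self.cheap = nn.Conv2d(primary_ch, primary_ch * (expand - 1), 3,
                                   padding=1, groups=primary_ch)  # depthwise

        def forward(self, x):
            p = self.primary(x)
            return torch.cat([p, self.cheap(p)], dim=1)  # primary + derived maps

    out = CheapFeatureBlock(3, 16)(torch.randn(1, 3, 48, 48))  # (1, 32, 48, 48)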


2021 ◽  
Vol 11 (12) ◽  
pp. 5563
Author(s):  
Jinsol Ha ◽  
Joongchol Shin ◽  
Hasil Park ◽  
Joonki Paik

Action recognition requires accurate analysis of action elements in the form of a video clip and a properly ordered sequence of those elements. To solve these two sub-problems, it is necessary to learn both spatio-temporal information and the temporal relationship between different action elements. Existing convolutional neural network (CNN)-based action recognition methods have focused on learning only spatial or temporal information, without considering the temporal relations between action elements. In this paper, we create short-term pixel-difference images from the input video and feed the difference images to a bidirectional exponential moving average sub-network to analyze the action elements and their temporal relations. The proposed method consists of: (i) generation of RGB and differential images, (ii) extraction of deep feature maps using an image classification sub-network, (iii) weight assignment to the extracted feature maps using a bidirectional exponential moving average sub-network, and (iv) late fusion with a three-dimensional convolutional (C3D) sub-network to improve the accuracy of action recognition. Experimental results show that the proposed method achieves higher performance than existing baseline methods. In addition, the proposed action recognition network takes only 0.075 seconds per action class, which enables various high-speed or real-time applications, such as abnormal action classification, human–computer interaction, and intelligent visual surveillance.
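
A minimal sketch of step (iii), the bidirectional exponential moving average, assuming per-frame feature vectors and a single smoothing factor alpha (illustrative values, not the paper's):

    import torch

    # One EMA pass runs forward in time and one backward, and the two are
    # averaged so each frame's feature reflects both past and future context.
    def bidirectional_ema(feats, alpha=0.5):
        # feats: (T, C) per-frame deep features; alpha is the smoothing factor.
        fwd, bwd = feats.clone(), feats.clone()
        for t in range(1, feats.shape[0]):
            fwd[t] = alpha * feats[t] + (1 - alpha) * fwd[t - 1]
            bwd[-t - 1] = alpha * feats[-t - 1] + (1 - alpha) * bwd[-t]
        return 0.5 * (fwd + bwd)

    smoothed = bidirectional_ema(torch.rand(16, 512))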


2021 ◽  
pp. 1-13
Author(s):  
R. Bhuvaneswari ◽  
S. Ganesh Vaidyanathan

Diabetic retinopathy (DR) is one of the most common diabetic diseases affecting the retina's blood vessels. Excess glucose in the blood leads to blockage of the blood vessels in the retina, weakening and damaging it. Automatic classification of diabetic retinopathy is a challenging task in medical research. This work proposes a Mixture of Ensemble Classifiers (MEC) to classify and grade diabetic retinopathy images using hierarchical features. We use an ensemble of classifiers, namely support vector machine, random forest, and AdaBoost classifiers, trained on the hierarchical feature maps obtained at every pooling layer of a convolutional neural network (CNN). The feature maps are generated by applying the filters to the output of the previous layer. Lastly, we predict the class label or grade for a given test diabetic retinopathy image by considering the class labels produced by all of the ensemble classifiers. We tested our approach on the E-ophtha dataset for the classification task and the Messidor dataset for the grading task, achieving accuracies of 95.8% and 96.2%, respectively. A comparison between prominent convolutional neural network architectures and the proposed approach is provided.
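
A minimal scikit-learn sketch of the ensemble step, assuming the hierarchical CNN features have already been extracted and flattened, and that the class labels are combined by majority vote (one plausible reading of "considering the class labels of all the ensemble classifiers"):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

    # Stand-ins for pooled CNN feature maps and binary DR labels.
    X_train = np.random.rand(100, 256)
    y_train = np.random.randint(0, 2, 100)
    X_test = np.random.rand(10, 256)

    # Each classifier is trained on the same hierarchical features.
    clfs = [SVC(), RandomForestClassifier(), AdaBoostClassifier()]
    preds = np.array([c.fit(X_train, y_train).predict(X_test) for c in clfs])

    # Majority vote across the three classifiers (the paper's exact
    # combination scheme may differ).
    final = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)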


Data ◽  
2020 ◽  
Vol 5 (4) ◽  
pp. 104
Author(s):  
Ashok Sarabu ◽  
Ajit Kumar Santra

The two-stream convolutional neural network (CNN) has proven a great success for action recognition in videos. The main idea is to train two CNNs to learn spatial and temporal features separately, and to combine the two scores to obtain the final score. In the literature, we observed that most methods use similar CNNs for the two streams. In this paper, we design a two-stream CNN architecture with different CNNs for the two streams to learn spatial and temporal features. Temporal Segment Networks (TSN) are applied to retrieve long-range temporal features and to differentiate similar types of sub-actions in videos. Data augmentation techniques are employed to prevent over-fitting. Advanced cross-modal pre-training is discussed and introduced into the proposed architecture to enhance the accuracy of action recognition. The proposed two-stream model is evaluated on two challenging action recognition datasets: HMDB-51 and UCF-101. The results show a significant performance increase, and the proposed architecture outperforms existing methods.
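
A minimal sketch of the score-combination step described above; the 1:1.5 spatial-to-temporal weighting is a common choice in the two-stream literature, not necessarily this paper's:

    import numpy as np

    # Late fusion: each stream produces per-class scores and the final
    # prediction is a weighted combination of the two.
    def fuse_scores(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=1.5):
        fused = w_spatial * spatial_scores + w_temporal * temporal_scores
        return fused.argmax(axis=-1)

    s = np.random.rand(4, 101)  # spatial-stream scores for 4 clips, 101 classes
    t = np.random.rand(4, 101)  # temporal-stream scores
    labels = fuse_scores(s, t)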


2020 ◽  
Vol 2020 (17) ◽  
pp. 4-1-4-7
Author(s):  
Jisu Kim ◽  
Deokwoo Lee

Activity recognition and pose estimation are in general closely related in practical applications, even though they are considered to be independent tasks. In this paper, we propose artificial 3D coordinates and a CNN for combining activity recognition and pose estimation with 2D and 3D static/dynamic images (dynamic images are composed of a set of video frames). In other words, we show that the proposed algorithm can be used to solve both problems, activity recognition and pose estimation. An end-to-end optimization process shows that the proposed approach is superior to one that performs activity recognition and pose estimation separately. Performance is evaluated by calculating the recognition rate. The proposed approach enables us to perform learning procedures using different datasets.


2014 ◽  
Vol 599-601 ◽  
pp. 1571-1574
Author(s):  
Jia Ding ◽  
Yang Yi ◽  
Ze Min Qiu ◽  
Jun Shi Liu

Human action recognition in videos plays an important role in the fields of computer vision and image understanding. A novel method combining a multi-channel bag of visual words with multiple kernel learning is proposed in this paper. The videos are described by a multi-channel bag of visual words, and a multiple kernel learning classifier is used for action classification, in which each kernel function of the classifier corresponds to one video channel in order to avoid noise interference from other channels. The proposed approach improves the ability to distinguish easily confused actions. Experiments on the KTH dataset show that the presented method achieves remarkable performance in terms of average recognition rate and obtains recognition rates comparable to state-of-the-art methods.
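
A minimal sketch of the one-kernel-per-channel idea, using uniform channel weights and a precomputed-kernel SVM; a real multiple kernel learning solver would learn the weights:

    import numpy as np
    from sklearn.svm import SVC

    # One RBF kernel per bag-of-visual-words channel, combined as a weighted sum.
    def rbf(A, B, gamma=1.0):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)

    channels = [np.random.rand(60, 200) for _ in range(3)]  # 3 BoVW histograms
    y = np.random.randint(0, 6, 60)                         # 6 action classes
    weights = np.ones(3) / 3                                # uniform, not learned
    K = sum(w * rbf(c, c) for w, c in zip(weights, channels))

    svm = SVC(kernel="precomputed").fit(K, y)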


2017 ◽  
Author(s):  
Ghislain St-Yves ◽  
Thomas Naselaris

We introduce the feature-weighted receptive field (fwRF), an encoding model designed to balance expressiveness, interpretability and scalability. The fwRF is organized around the notion of a feature map: a transformation of visual stimuli into visual features that preserves the topology of visual space (but not necessarily the native resolution of the stimulus). The key assumption of the fwRF model is that activity in each voxel encodes variation in a spatially localized region across multiple feature maps. This region is fixed for all feature maps; however, the contribution of each feature map to voxel activity is weighted. Thus, the model has two separable sets of parameters: "where" parameters that characterize the location and extent of pooling over visual features, and "what" parameters that characterize tuning to visual features. The "where" parameters are analogous to classical receptive fields, while the "what" parameters are analogous to classical tuning functions. By treating these as separable parameters, the fwRF model's complexity is independent of the resolution of the underlying feature maps. This makes it possible to estimate models with thousands of high-resolution feature maps from relatively small amounts of data. Once a fwRF model has been estimated from data, spatial pooling and feature tuning can be read off directly with no (or very little) additional post-processing or in-silico experimentation.

We describe an optimization algorithm for estimating fwRF models from data acquired during standard visual neuroimaging experiments. We then demonstrate the model's application to two distinct sets of features: Gabor wavelets and features supplied by a deep convolutional neural network. We show that when Gabor feature maps are used, the fwRF model recovers receptive fields and spatial frequency tuning functions consistent with known organizational principles of the visual cortex. We also show that a fwRF model can be used to regress entire deep convolutional networks against brain activity. The ability to use whole networks in a single encoding model yields state-of-the-art prediction accuracy. Our results suggest a wide variety of uses for the feature-weighted receptive field model, from retinotopic mapping with natural scenes, to regressing the activities of whole deep neural networks onto measured brain activity.
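
A minimal numpy sketch of the fwRF forward model described above: each voxel's predicted response is a weighted sum over feature maps, each pooled under a single shared Gaussian "where" field (parameter values here are illustrative stand-ins for the fitted model parameters):

    import numpy as np

    def fwrf_response(feature_maps, x0, y0, sigma, weights):
        # feature_maps: (K, H, W); one Gaussian pooling field shared by all K maps
        K, H, W = feature_maps.shape
        ys, xs = np.mgrid[0:H, 0:W]
        g = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
        g /= g.sum()
        pooled = (feature_maps * g).sum(axis=(1, 2))  # "where": one value per map
        return weights @ pooled                       # "what": feature weighting

    maps = np.random.rand(8, 32, 32)                  # e.g. Gabor or CNN features
    resp = fwrf_response(maps, x0=16, y0=16, sigma=4.0, weights=np.random.rand(8))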

