Multi-scale spatiotemporal information deep fusion network with temporal pyramid mechanism for video action recognition

2021 ◽  
pp. 1-13
Author(s):  
Hongshi Ou ◽  
Jifeng Sun

In deep learning-based video action recognition, the function of the neural network is to acquire spatial information, motion information, and the association between the two over an uneven time span. This paper puts forward a network that extracts semantic information from video sequences based on deep integration of local spatial-temporal information. The network uses a 2D Convolutional Neural Network (2DCNN) and a Multi Spatial-Temporal scale 3D Convolutional Neural Network (MST_3DCNN) to extract spatial information and motion information, respectively. Spatial and motion information from the same time interval is fused by 3D convolution to generate the temporary spatial-temporal information of a given moment. The spatial-temporal information of multiple single moments then enters a Temporal Pyramid Net (TPN) to generate local spatial-temporal information at multiple time scales. Finally, a bidirectional recurrent neural network acts on the spatial-temporal information of all parts to acquire context information spanning the entire video, which gives the network the ability to extract video context. Experiments on three common video action recognition data sets, UCF101, UCF11, and UCFSports, show that the proposed spatial-temporal information deep fusion network achieves a high recognition rate in the video action recognition task.
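As a rough illustration of the multi-time-scale idea only (this is not the authors' TPN, whose layer details the abstract does not give), the sketch below average-pools a sequence of per-moment feature vectors over non-overlapping windows of several lengths. The function name `temporal_pyramid`, the window lengths, and the sample numbers are all assumptions made for illustration.

```python
def temporal_pyramid(features, scales=(1, 2, 4)):
    """Average-pool per-moment feature vectors over windows of several lengths.

    features: list of equal-length lists, one per time step.
    scales:   hypothetical window lengths (in time steps), not taken from the paper.
    Returns a dict mapping each scale to its list of pooled vectors.
    """
    pyramid = {}
    for s in scales:
        pooled = []
        # Non-overlapping windows; a trailing partial window is pooled as-is.
        for start in range(0, len(features), s):
            window = features[start:start + s]
            dim = len(window[0])
            pooled.append(
                [sum(v[d] for v in window) / len(window) for d in range(dim)]
            )
        pyramid[s] = pooled
    return pyramid


# Four time steps of 2-dimensional fused features (made-up numbers).
feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
levels = temporal_pyramid(feats)
```

Each pyramid level summarizes the same sequence at a coarser temporal resolution; in a network of the kind described, the levels would then be passed to a sequence model such as a bidirectional recurrent network.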

2018 ◽  
Vol 4 (9) ◽  
pp. 107 ◽  
Author(s):  
Mohib Ullah ◽  
Ahmed Mohammed ◽  
Faouzi Alaya Cheikh

Articulation modeling, feature extraction, and classification are the important components of pedestrian segmentation. Usually, these components are modeled independently of each other and then combined sequentially. However, this approach is prone to poor segmentation if any individual component is weakly designed. To cope with this problem, we propose a spatio-temporal convolutional neural network named PedNet, which exploits temporal information for spatial segmentation. The backbone of PedNet consists of an encoder–decoder network for downsampling and upsampling the feature maps, respectively. The input to the network is a set of three frames, and the output is a binary mask of the segmented regions in the middle frame. Unlike classical deep models, where the convolution layers are followed by a fully connected layer for classification, PedNet is a Fully Convolutional Network (FCN). It is trained end-to-end, and segmentation is achieved without any pre- or post-processing. The main characteristic of PedNet is its design: it performs segmentation frame by frame but uses temporal information from the previous and the next frame to segment the pedestrian in the current frame. Moreover, to combine low-level features with the high-level semantic information learned by the deeper layers, we use long skip connections from the encoder to the decoder and concatenate the output of the low-level layers with that of the higher-level layers. This approach helps to produce segmentation maps with sharp boundaries. To show the potential benefits of temporal information, we also visualized different layers of the network. The visualization showed that the network learned different information from the consecutive frames and then combined the information optimally to segment the middle frame.
We evaluated our approach on eight challenging datasets in which humans are involved in different activities with severe articulation (football, road crossing, surveillance). On CamVid, the dataset most commonly used to measure the performance of segmentation algorithms, our approach is evaluated against seven state-of-the-art methods. Performance is reported in terms of precision/recall, F1, F2, and mIoU. The qualitative and quantitative results show that PedNet achieves promising results against state-of-the-art methods, with substantial improvement in all the performance metrics.
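For readers unfamiliar with the metrics named above, the following minimal sketch shows how precision, recall, F-beta (F1 for beta = 1, F2 for beta = 2), and a two-class mean IoU can be computed from binary masks. It follows the standard definitions and is not the authors' evaluation code; the function names and the flattened toy masks are assumptions for illustration.

```python
def segmentation_metrics(pred, truth, beta=1.0):
    """Precision, recall, F-beta, and foreground IoU for flat binary masks (0/1)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f_beta, iou


def mean_iou(pred, truth):
    """Mean of foreground IoU and background IoU (two-class mIoU)."""
    fg = segmentation_metrics(pred, truth)[3]
    # Background IoU is the foreground IoU of the inverted masks.
    bg = segmentation_metrics([1 - p for p in pred], [1 - t for t in truth])[3]
    return (fg + bg) / 2


# Toy flattened masks (hypothetical): 1 = pedestrian, 0 = background.
pred = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0]
precision, recall, f1, iou = segmentation_metrics(pred, truth, beta=1.0)
f2 = segmentation_metrics(pred, truth, beta=2.0)[2]
miou = mean_iou(pred, truth)
```

F2 weights recall more heavily than precision, which matters when missing a pedestrian pixel is costlier than a false positive.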


Author(s):  
Benhui Xia ◽  
Dezhi Han ◽  
Ximing Yin ◽  
Gao Na

To secure cloud computing and outsourced data while meeting the requirements of automation, many intrusion detection schemes based on deep learning have been proposed. Although the detection rate of many network intrusion detection solutions is now quite high, their identification accuracy on imbalanced abnormal network traffic remains low. Therefore, this paper proposes a ResNet- and Inception-based convolutional neural network (RICNN) model for abnormal traffic classification. RICNN can learn more traffic features through the Inception unit, and the degradation problem of the network is eliminated through the direct mapping unit of ResNet, which improves the model's generalization ability. In addition, to simplify the network, this paper also proposes an improved version of RICNN that reduces the number of parameters to be learnt without degrading identification accuracy. The experimental results on the CICIDS2017 dataset show that RICNN not only achieves an overall accuracy of 99.386% but also has a high detection rate across different categories, especially for small samples. The comparison experiments show that RICNN outperforms a variety of CNN and RNN models in recognition rate and achieves the best detection accuracy.


2020 ◽  
Author(s):  
Florian Dupuy ◽  
Olivier Mestre ◽  
Léo Pfitzner

Cloud cover is crucial information for many applications, such as planning land observation missions from space. However, cloud cover remains a challenging variable to forecast, and Numerical Weather Prediction (NWP) models suffer from significant biases, which justifies the use of statistical post-processing techniques. In our application, the ground truth is a gridded cloud cover product derived from satellite observations over Europe, and the predictors are spatial fields of various variables produced by ARPEGE (the Météo-France global NWP model) at the corresponding lead time.
In this study, ARPEGE cloud cover is post-processed using a convolutional neural network (CNN). CNNs are the most popular machine learning tool for dealing with images. In our case, a CNN makes it possible to integrate the spatial information contained in NWP outputs. We show that a simple U-Net architecture produces significant improvements over Europe. Compared to the raw ARPEGE forecasts, MAE drops from 25.1 % to 17.8 % and RMSE decreases from 37.0 % to 31.6 %. Considering the specific needs of earth observation, special attention was paid to forecasts of low cloud cover conditions (< 10 %). For this particular nebulosity class, we show that the hit rate jumps from 40.6 to 70.7 (which is the order of magnitude of what can be achieved using classical machine learning algorithms such as random forests) while false alarm decreases from 38.2 to 29.9. This is an excellent result, since improving hit rates by means of random forests usually also results in a slight increase of false alarms.
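The scores quoted above can be illustrated with a short sketch. It uses the standard definitions of MAE and RMSE and treats "cloud cover below 10 %" as a binary event. The abstract does not define its "false alarm" score, so the false alarm ratio used here (false alarms divided by all forecast events) is an assumption, as are the function names and the toy values.

```python
import math


def mae(forecast, observed):
    """Mean absolute error."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(observed)


def rmse(forecast, observed):
    """Root mean squared error."""
    return math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed)
    )


def low_cloud_scores(forecast, observed, threshold=10.0):
    """Hit rate and false alarm ratio for the event 'cloud cover < threshold' (%)."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f < threshold, o < threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    hit_rate = hits / (hits + misses) if hits + misses else 0.0
    # Assumed definition: share of forecast events that did not occur.
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else 0.0
    return hit_rate, far


# Toy cloud cover values in percent (hypothetical grid points).
forecast = [5.0, 20.0, 8.0, 50.0]
observed = [0.0, 5.0, 30.0, 50.0]
error_mae = mae(forecast, observed)
error_rmse = rmse(forecast, observed)
hit_rate, far = low_cloud_scores(forecast, observed)
```

A post-processing method is only an improvement for this use case if it raises the hit rate without raising the false alarm score, which is the trade-off the abstract highlights.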


2020 ◽  
Vol 17 (4) ◽  
pp. 572-578
Author(s):  
Mohammad Parseh ◽  
Mohammad Rahmanimanesh ◽  
Parviz Keshavarzi

Persian handwritten digit recognition is an important topic in image processing that has received significant attention from researchers due to its many applications. The most important challenge in Persian handwritten digit recognition is the variety of patterns in Persian digit writing, which complicates the feature extraction step. Since handcrafted feature extraction methods are complicated and their performance is not stable, most recent studies have concentrated on proposing a suitable method for automatic feature extraction. In this paper, an automatic method based on machine learning is proposed for high-level feature extraction from Persian digit images using a Convolutional Neural Network (CNN). A non-linear multi-class Support Vector Machine (SVM) classifier is then used for classification in place of the fully connected layer at the end of the CNN. The proposed method was applied to the HODA dataset and obtained a recognition rate of 99.56%. Experimental results are comparable with previous state-of-the-art methods.


2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Yu Wang ◽  
Xiaofei Wang ◽  
Junfan Jian

Landslides are a frequent and widespread type of natural disaster, and it is of great significance to extract landslide location information in time. At present, most articles still select a single band or the RGB bands as the features for landslide recognition. To improve the efficiency of landslide recognition, this study proposed a remote sensing recognition method based on a convolutional neural network using mixed spectral characteristics. Firstly, this paper tried to add NDVI (normalized difference vegetation index) and NIRS (near-infrared spectroscopy) to enhance the features. Then, remote sensing images (predisaster and postdisaster images) with the same spatial information but different time series information regarding the landslide were taken directly from the GF-1 satellite as input images. By combining the 4 bands (red + green + blue + near-infrared) of the prelandslide remote sensing images with the 4 bands of the postlandslide images and the NDVI images, images with 9 bands were obtained, and the band values reflecting the changing characteristics of the landslide were determined. Finally, a deep learning convolutional neural network (CNN) was introduced to solve the problem. The proposed method was tested and verified with remote sensing data from the 2015 large-scale landslide event in Shanxi, China, and the 2016 large-scale landslide event in Fujian, China. The results showed that the accuracy of the method was high. Compared with the traditional methods, the recognition efficiency was improved, proving the effectiveness and feasibility of the method.
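The band combination described above can be sketched per pixel as follows. NDVI is computed with its standard formula, (NIR - Red) / (NIR + Red). The abstract does not say which image the NDVI band is derived from, so computing it from the post-landslide pixel is an assumption here, as are the function names and the reflectance values.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel; 0.0 if both bands are zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def stack_nine_bands(pre_pixel, post_pixel):
    """Build a 9-value feature vector from two (R, G, B, NIR) pixels.

    Assumption: the single NDVI band is computed from the post-landslide pixel;
    the abstract does not state which image it comes from.
    """
    red, _, _, nir = post_pixel
    return list(pre_pixel) + list(post_pixel) + [ndvi(nir, red)]


# Hypothetical reflectance values for one pixel before and after a landslide.
pre = (0.10, 0.20, 0.10, 0.50)
post = (0.30, 0.25, 0.20, 0.30)
vector = stack_nine_bands(pre, post)
vegetated = ndvi(0.50, 0.10)
```

Healthy vegetation reflects strongly in the near-infrared, so a drop in NDVI between the two acquisitions is one plausible cue that vegetated slope has been replaced by bare soil.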


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Jie Shen ◽  
Mengxi Xu ◽  
Xinyu Du ◽  
Yunbo Xiong

Video surveillance is an important data source of urban computing and intelligence. The low resolution of many existing video surveillance devices affects the efficiency of urban computing and intelligence. Therefore, improving the resolution of video surveillance is one of the important tasks of urban computing and intelligence. In this paper, the resolution of video is improved by superresolution reconstruction based on a learning method. Different from the superresolution reconstruction of static images, the superresolution reconstruction of video is characterized by the application of motion information. However, there are few studies in this area so far. Aimed at fully exploring motion information to improve the superresolution of video, this paper proposes a superresolution reconstruction method based on an efficient subpixel convolutional neural network, where the optical flow is introduced in the deep learning network. Fusing the optical flow features between successive frames can compensate for information in frames and generate high-quality superresolution results. In addition, in order to improve the superresolution, a superpixel convolution layer is added after the deep convolution network. Finally, experimental evaluations demonstrate the satisfying performance of our method compared with previous methods and other deep learning networks; our method is more efficient.
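The rearrangement step at the heart of subpixel convolution (often called pixel shuffle or depth-to-space) can be sketched in plain Python. This shows only the commonly used single-output-channel rearrangement, not the authors' network or its optical flow fusion; the function name `pixel_shuffle` and the toy feature maps are assumptions for illustration.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r low-resolution channels into one channel upscaled by r.

    channels: list of r*r feature maps, each an H x W list of lists.
    Output pixel (h*r + i, w*r + j) is taken from channel i*r + j at (h, w).
    """
    if len(channels) != r * r:
        raise ValueError("expected r*r input channels")
    height, width = len(channels[0]), len(channels[0][0])
    out = [[0] * (width * r) for _ in range(height * r)]
    for i in range(r):
        for j in range(r):
            fmap = channels[i * r + j]
            for h in range(height):
                for w in range(width):
                    out[h * r + i][w * r + j] = fmap[h][w]
    return out


# Four 1x2 feature maps (made-up values) become one 2x4 map for r = 2.
maps = [[[1, 5]], [[2, 6]], [[3, 7]], [[4, 8]]]
upscaled = pixel_shuffle(maps, 2)
```

Because the convolutions run at low resolution and the upscaling is a pure rearrangement, this design keeps the computational cost low, which is consistent with the efficiency claim in the abstract.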


2020 ◽  
Vol 9 (2) ◽  
pp. 74
Author(s):  
Eric Hsueh-Chan Lu ◽  
Jing-Mei Ciou

With the rapid development of surveying and spatial information technologies, more and more attention has been given to positioning. In outdoor environments, people can easily obtain positioning services through global navigation satellite systems (GNSS). In indoor environments, the GNSS signal is often lost, while other positioning approaches, such as dead reckoning and wireless signals, face accumulated errors and signal interference. Therefore, this research uses images to realize a positioning service. The main concept of this work is to establish a model for an indoor field image and its coordinate information and to judge its position by image eigenvalue matching. Based on the architecture of PoseNet, the image is input into a 23-layer convolutional neural network according to various sizes to train end-to-end location identification tasks, and the three-dimensional position vector of the camera is regressed. The experimental data are taken from the underground parking lot and the Palace Museum. The preliminary experimental results show that our new method can effectively improve the accuracy of indoor positioning by about 20% to 30%. In addition, this paper also discusses other architectures, field sizes, camera parameters, and error corrections for this neural network system. The preliminary experimental results show that our angle error correction method can effectively improve positioning by about 20%.

