Multi-Level Two-Stream Fusion-Based Spatio-Temporal Attention Model for Violence Detection and Localization

Author(s):  
Mujtaba Asad ◽  
He Jiang ◽  
Jie Yang ◽  
Enmei Tu ◽  
Aftab A. Malik

Detection of violent human behavior is necessary for public safety and monitoring. However, human-based surveillance systems demand constant human observation and attention, which is a challenging task. Autonomous detection of violent human behavior is therefore essential for continuous, uninterrupted video surveillance. In this paper, we propose a novel method for violence detection and localization in videos based on the fusion of spatio-temporal features and an attention model. The model consists of a Fusion Convolutional Neural Network (Fusion-CNN), spatio-temporal attention modules, and Bi-directional Convolutional LSTMs (BiConvLSTM). The Fusion-CNN learns both spatial and temporal features by combining multi-level inter-layer features from RGB and optical-flow input frames. The spatial attention module generates an importance mask that focuses on the most relevant areas of each frame. The temporal attention part, based on BiConvLSTM, identifies the video frames most related to violent activity. The proposed model can also localize and discriminate prominent regions in both the spatial and temporal domains, given weakly supervised training with only video-level classification labels. Experimental results on publicly available benchmark datasets show the superior performance of the proposed model compared with existing methods. Our model achieves improved accuracies (ACC) of 89.1%, 99.1% and 98.15% on the RWF-2000, HockeyFight and Crowd-Violence datasets, respectively. For the CCTV-FIGHTS dataset, we use the mean average precision (mAP) metric, and our model obtains 80.7% mAP.
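
To make the spatial-attention step concrete, the sketch below reweights a fused RGB/optical-flow feature map with a single-channel importance mask. It is a minimal illustration in PyTorch, not the authors' exact architecture: the module name SpatialAttention, the 1×1 convolution, and all tensor sizes are illustrative assumptions.

```python
# Minimal sketch of a spatial-attention importance mask over fused RGB/flow
# features. Layer names and sizes are illustrative assumptions, not the
# authors' exact Fusion-CNN architecture.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Produce a single-channel mask in [0, 1] and reweight the feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.mask_conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fused):                          # fused: (B, C, H, W)
        mask = torch.sigmoid(self.mask_conv(fused))    # (B, 1, H, W)
        return fused * mask, mask                      # attended features, mask

# Toy usage: fuse RGB and optical-flow feature maps by channel concatenation.
rgb_feat  = torch.randn(2, 128, 28, 28)                # from the RGB stream
flow_feat = torch.randn(2, 128, 28, 28)                # from the optical-flow stream
fused = torch.cat([rgb_feat, flow_feat], dim=1)        # (2, 256, 28, 28)
attended, mask = SpatialAttention(256)(fused)
print(attended.shape, mask.shape)
```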

Author(s):  
Kaixuan Chen ◽  
Lina Yao ◽  
Dalin Zhang ◽  
Bin Guo ◽  
Zhiwen Yu

Multi-modality is an important feature of sensor-based activity recognition. In this work, we consider two inherent characteristics of human activities: the spatially and temporally varying salience of features, and the relations between activities and the corresponding body-part motions. Based on these, we propose a multi-agent spatial-temporal attention model. The spatial-temporal attention mechanism intelligently selects informative modalities and their active periods, and the multiple agents in the proposed model represent activities as collective motions across body parts by independently selecting the modalities associated with single motions. With a joint recognition goal, the agents share gained information and coordinate their selection policies to learn the optimal recognition model. Experimental results on four real-world datasets demonstrate that the proposed model outperforms state-of-the-art methods.
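
The paper's selection step is performed by multiple cooperating agents; as a rough, differentiable stand-in, the sketch below uses a soft attention layer that weights sensor modalities at each time step. Every name and size (ModalityTimeAttention, six modalities, 32-dimensional features) is an illustrative assumption.

```python
# Soft modality-selection attention over time, a simplified stand-in for the
# multi-agent selection policy described in the abstract. All sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class ModalityTimeAttention(nn.Module):
    def __init__(self, n_modalities: int, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)          # scalar score per modality

    def forward(self, x):                            # x: (B, T, M, D)
        w = torch.softmax(self.score(x).squeeze(-1), dim=2)   # (B, T, M) weights
        return (x * w.unsqueeze(-1)).sum(dim=2)      # (B, T, D) weighted features

x = torch.randn(4, 50, 6, 32)    # 4 sequences, 50 time steps, 6 modalities, 32-dim
pooled = ModalityTimeAttention(6, 32)(x)
print(pooled.shape)              # torch.Size([4, 50, 32])
```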


2020 ◽  
Vol 79 (37-38) ◽  
pp. 28329-28354
Author(s):  
Dong Huang ◽  
Zhaoqiang Xia ◽  
Joshua Mwesigye ◽  
Xiaoyi Feng

2019 ◽  
Vol 8 (11) ◽  
pp. 512 ◽  
Author(s):  
Li ◽  
Wu ◽  
Wu ◽  
Zhao

Spatio-temporal indexing is a key technique in spatio-temporal data storage and management. Indexing methods based on space-filling curves are popular in research on spatio-temporal indexing of vector data in non-relational (NoSQL) databases. However, existing methods mostly focus on spatial indexing, which makes it difficult to balance the efficiency of temporal and spatial queries. In addition, for non-point elements (line and polygon elements), it remains difficult to determine the optimal index level. To address these issues, this paper proposes an adaptive construction method for a hierarchical spatio-temporal index for vector data. Firstly, a joint spatio-temporal information coding based on a combination of partition-key and sort-key strategies is presented. Secondly, a multi-level expression structure for spatio-temporal elements, consisting of point and non-point elements in the joint coding, is given. Finally, an adaptive multi-level index tree is proposed to realize the spatio-temporal index (Multi-level Sphere 3, MLS3) based on the spatio-temporal characteristics of geographical entities. Comparison with the XZ3 index algorithm proposed by GeoMesa shows that the MLS3 indexing method not only reasonably expresses the spatio-temporal features of non-point elements and determines their optimal index level, but also avoids storage hotspots while achieving highly efficient spatio-temporal retrieval.
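
As an illustration of the partition-key plus sort-key idea, the sketch below builds a joint spatio-temporal row key from a coarse time bucket (partition) and an interleaved Z-order code of quantized longitude/latitude (sort key). Bit widths, bucket size and key layout are assumptions for demonstration, not the MLS3 specification.

```python
# Joint spatio-temporal key sketch: time bucket as partition key, Z-order
# (Morton) code of the coordinates as sort key. Parameters are illustrative.
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton/Z-order interleave of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def spatio_temporal_key(lon: float, lat: float, epoch_s: int,
                        bits: int = 16, bucket_s: int = 86400) -> str:
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))   # quantize longitude
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))    # quantize latitude
    partition = epoch_s // bucket_s                      # daily time bucket
    sort_key = interleave_bits(x, y, bits)               # spatial Z-order code
    return f"{partition:08d}_{sort_key:010d}"

print(spatio_temporal_key(108.95, 34.27, 1_577_836_800))
```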


Sensors ◽  
2019 ◽  
Vol 19 (23) ◽  
pp. 5142 ◽  
Author(s):  
Dong Liang ◽  
Jiaxing Pan ◽  
Han Sun ◽  
Huiyu Zhou

Foreground detection is an important theme in video surveillance. Conventional background-modeling approaches build sophisticated temporal statistical models to detect foreground from low-level features, while modern semantic/instance segmentation approaches generate high-level foreground annotations but ignore the temporal relevance among consecutive frames. In this paper, we propose a Spatio-Temporal Attention Model (STAM) for cross-scene foreground detection. To fill the semantic gap between low- and high-level features, appearance and optical-flow features are synthesized by attention modules during the feature-learning procedure. Experimental results on the CDnet 2014 benchmark show that STAM outperforms many state-of-the-art methods on seven evaluation metrics. With the attention modules and optical flow, its F-measure increases by 9% and 6%, respectively. Without any tuning, the model also generalizes across scenes on the Wallflower and PETS datasets. The processing speed is 10.8 fps at a frame size of 256 × 256.
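
A minimal sketch of attention-based synthesis of appearance and optical-flow features, in the spirit of STAM's fusion step: a learned per-pixel gate blends the two streams. Layer names and channel sizes below are assumptions, not the published architecture.

```python
# Gated per-pixel fusion of appearance and optical-flow feature maps.
# Illustrative stand-in for STAM's attention-based feature synthesis.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, appearance, flow):               # both (B, C, H, W)
        g = torch.sigmoid(self.gate(torch.cat([appearance, flow], dim=1)))
        return g * appearance + (1.0 - g) * flow       # convex per-pixel blend

app  = torch.randn(1, 64, 64, 64)
flow = torch.randn(1, 64, 64, 64)
print(GatedFusion(64)(app, flow).shape)                # torch.Size([1, 64, 64, 64])
```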


2020 ◽  
Vol 37 (5) ◽  
pp. 687-701
Author(s):  
Wafa Lejmi ◽  
Anouar Ben Khalifa ◽  
Mohamed Ali Mahjoub

In the current era, automated security video surveillance systems are particularly needed for human violence recognition. Nevertheless, this task faces various interlinked difficulties that require efficient solutions and feasible methods to reliably distinguish normal human actions from abnormal ones. In this paper, we present an overview of these issues and a literature review of related works and ongoing research efforts in this field, and we propose a novel prediction model for violence recognition based on preliminary spatio-temporal feature extraction using the material derivative, which describes the rate of change of a quantity for a particle in motion with respect to time. Classification is then carried out with a deep-learning LSTM network that assigns the generated features to eight specified violent and non-violent categories, and a prediction value is computed for each action class. The whole model is trained on a public dataset, and its classification capacity is evaluated with a confusion matrix that compares all predictions made by the system with their actual labels. The obtained results are promising and show that the proposed model can be useful for detecting human violence.
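
To illustrate the feature-and-classifier pipeline, the sketch below computes a material-derivative map (Df/Dt = ∂f/∂t + u ∂f/∂x + v ∂f/∂y) from consecutive intensity frames and a flow field, pools it into per-frame features, and feeds an LSTM classifier with eight output classes. The gradient approximation, pooling and network sizes are illustrative assumptions rather than the paper's exact pipeline.

```python
# Material-derivative features followed by an LSTM classifier over eight
# action classes. All sizes and the pooling scheme are illustrative.
import torch
import torch.nn as nn

def material_derivative(prev, curr, u, v):
    """Df/Dt = df/dt + u*df/dx + v*df/dy on (H, W) intensity frames."""
    dt = curr - prev                       # temporal difference
    dy, dx = torch.gradient(curr)          # spatial gradients (rows, cols)
    return dt + u * dx + v * dy

class ViolenceLSTM(nn.Module):
    def __init__(self, feat_dim: int = 64, n_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, seq):                # seq: (B, T, feat_dim)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])       # class scores from the last step

# Toy usage: per-frame features pooled from the material-derivative map.
frames = torch.rand(9, 64, 64)             # 9 grayscale frames
u = v = torch.zeros(64, 64)                # stand-in optical-flow components
feats = torch.stack([
    material_derivative(frames[t], frames[t + 1], u, v).mean(dim=1)  # (64,)
    for t in range(8)
]).unsqueeze(0)                            # (1, 8, 64)
print(ViolenceLSTM()(feats).shape)         # torch.Size([1, 8])
```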


Author(s):  
Amirhessam Tahmassebi ◽  
Katja Pinker-Domenig ◽  
Anke Meyer-Baese ◽  
Antonio Garcia ◽  
Diego P. Morales ◽  
...  
