Violence Detection Algorithm Based on Local Spatio-temporal Features and Optical Flow

Author(s):  
Yao Lyu ◽  
Yingyun Yang

Author(s):  
Qing Xia ◽  
Ping Zhang ◽  
JingJing Wang ◽  
Ming Tian ◽  
Chun Fei

Author(s):  
Mujtaba Asad ◽  
He Jiang ◽  
Jie Yang ◽  
Enmei Tu ◽  
Aftab A. Malik

Detection of violent human behavior is necessary for public safety and monitoring. In human-based surveillance systems, however, it demands constant human observation and attention, which is a challenging task. Autonomous detection of violent human behavior is therefore essential for continuous, uninterrupted video surveillance. In this paper, we propose a novel method for violence detection and localization in videos using the fusion of spatio-temporal features with an attention model. The model consists of a Fusion Convolutional Neural Network (Fusion-CNN), spatio-temporal attention modules and Bi-directional Convolutional LSTMs (BiConvLSTM). The Fusion-CNN learns both spatial and temporal features by combining multi-level inter-layer features from RGB and optical-flow input frames. The spatial attention module generates an importance mask to focus on the most important areas of the image frame. The temporal attention part, based on BiConvLSTM, identifies the video frames most relevant to violent activity. The proposed model can also localize and discriminate prominent regions in both the spatial and temporal domains, given weakly supervised training with only video-level classification labels. Experimental results on several publicly available benchmark datasets show the superior performance of the proposed model in comparison with existing methods. Our model achieves improved accuracies (ACC) of 89.1%, 99.1% and 98.15% on the RWF-2000, HockeyFight and Crowd-Violence datasets, respectively. For the CCTV-FIGHTS dataset we use mean average precision (mAP) as the performance metric, and our model obtains 80.7% mAP.
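
The spatial attention idea described in this abstract can be pictured with a minimal PyTorch-style sketch: a single module that produces a per-pixel importance mask and re-weights the fused feature map. The module name, channel layout and 1x1-convolution design below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generate a per-pixel importance mask and re-weight the feature map.
    A generic formulation; the paper's exact layer configuration may differ."""
    def __init__(self, in_channels):
        super().__init__()
        # 1x1 convolution collapses the channels into a single attention map
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):                    # x: (B, C, H, W) fused RGB/flow features
        mask = torch.sigmoid(self.conv(x))   # (B, 1, H, W) importance mask in [0, 1]
        return x * mask, mask                # re-weighted features + mask for localization
```

Returning the mask alongside the re-weighted features is what would allow spatial localization of violent regions under the weakly supervised, video-level labels mentioned above.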


Electronics ◽  
2021 ◽  
Vol 10 (13) ◽  
pp. 1601
Author(s):  
Fernando J. Rendón-Segador ◽  
Juan A. Álvarez-García ◽  
Fernando Enríquez ◽  
Oscar Deniz

Introducing efficient automatic violence detection in video surveillance or audiovisual content monitoring systems would greatly facilitate the work of closed-circuit television (CCTV) operators, rating agencies or those in charge of monitoring social network content. In this paper we present a new deep learning architecture, using a version of DenseNet adapted to three dimensions, a multi-head self-attention layer and a bidirectional convolutional long short-term memory (LSTM) module, that encodes relevant spatio-temporal features to determine whether a video is violent or not. Furthermore, an ablation study of the input frames is carried out, comparing dense optical flow with adjacent-frame subtraction and assessing the influence of the attention layer; it shows that the combination of optical flow and the attention mechanism improves results by up to 4.4%. Experiments conducted on four of the most widely used datasets for this problem match or exceed state-of-the-art results in some cases, while reducing the number of network parameters needed (4.5 million) and improving efficiency, with test accuracy ranging from 95.6% on the most complex dataset to 100% on the simplest one and inference times below 0.3 s for the longest clips. Finally, to check whether the generated model is able to generalize violence, a cross-dataset analysis is performed, which shows the difficulty of this task: training on three datasets and testing on the remaining one, accuracy drops to 70.08% in the worst case and 81.51% in the best case, which points to future work oriented towards anomaly detection on new datasets.
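
The two motion representations compared in the ablation can be sketched with OpenCV. The pre-processing choices below (grayscale conversion, Farneback parameters, normalization) are assumptions for illustration rather than the paper's exact pipeline.

```python
import cv2
import numpy as np

def motion_inputs(prev_bgr, curr_bgr):
    """Compute the two motion cues compared in the ablation:
    dense optical flow (Farneback) and adjacent-frame subtraction."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)

    # Dense optical flow: (H, W, 2) displacement field between the two frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Adjacent-frame subtraction: a cheaper motion cue, (H, W) in [0, 1]
    diff = cv2.absdiff(curr_gray, prev_gray).astype(np.float32) / 255.0
    return flow, diff
```

Either representation could then be stacked over a clip and fed to the 3D network stream alongside the RGB frames.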


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2052
Author(s):  
Xinghai Yang ◽  
Fengjiao Wang ◽  
Zhiquan Bai ◽  
Feifei Xun ◽  
Yulin Zhang ◽  
...  

In this paper, a deep learning-based traffic state discrimination method is proposed to detect traffic congestion at urban intersections. The method includes two parts: global speed detection and a traffic state discrimination algorithm. Firstly, the road intersection is selected as the region of interest (ROI) in the input image, and the You Only Look Once (YOLO) v3 object detection algorithm is applied for vehicle target detection. The Lucas-Kanade (LK) optical flow method is then employed to calculate the vehicle speed: the position information obtained by YOLOv3 is fed into the LK optical flow algorithm, which forms optical flow vectors to complete the speed estimation. The corresponding intersection state is then obtained from the vehicle speeds and the discrimination algorithm. Experimental results show that the algorithm detects vehicle speed reliably and that the traffic state discrimination method judges the traffic state accurately, with strong anti-interference ability, meeting practical application requirements.
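
A rough Python/OpenCV sketch of the speed-estimation step is given below. The box format, the calibration constant px_per_meter, and the use of box centres as the tracked points are assumptions for illustration, not the paper's exact procedure.

```python
import cv2
import numpy as np

def vehicle_speeds(prev_gray, curr_gray, boxes, fps, px_per_meter):
    """Estimate per-vehicle speed from LK optical flow applied to the
    centres of YOLO-detected boxes in two consecutive grayscale frames.
    boxes: list of (x, y, w, h) detections from the previous frame."""
    pts = np.array([[[x + w / 2.0, y + h / 2.0]] for (x, y, w, h) in boxes],
                   dtype=np.float32)                      # (N, 1, 2) box centres
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)

    speeds = []
    for p0, p1, ok in zip(pts, nxt, status.ravel()):
        if not ok:
            speeds.append(None)                           # tracking failed
            continue
        disp_px = np.linalg.norm(p1[0] - p0[0])           # displacement in pixels
        speeds.append(disp_px * fps / px_per_meter)       # metres per second
    return speeds
```

The discrimination step would then compare these per-vehicle speeds against the state criteria to label the intersection as congested or free-flowing.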


2021 ◽  
Author(s):  
Monir Torabian ◽  
Hossein Pourghassem ◽  
Homayoun Mahdavi-Nasab

2020 ◽  
Vol 34 (07) ◽  
pp. 10713-10720
Author(s):  
Mingyu Ding ◽  
Zhe Wang ◽  
Bolei Zhou ◽  
Jianping Shi ◽  
Zhiwu Lu ◽  
...  

A major challenge for video semantic segmentation is the lack of labeled data. In most benchmark datasets, only one frame of a video clip is annotated, which prevents most supervised methods from utilizing information in the remaining frames. To exploit the spatio-temporal information in videos, many previous works use pre-computed optical flow, which encodes temporal consistency to improve video segmentation. However, video segmentation and optical flow estimation are still treated as two separate tasks. In this paper, we propose a novel framework for joint video semantic segmentation and optical flow estimation. Semantic segmentation brings semantic information to handle occlusion for more robust optical flow estimation, while the non-occluded optical flow provides accurate pixel-level temporal correspondences to guarantee the temporal consistency of the segmentation. Moreover, our framework is able to utilize both labeled and unlabeled frames in the video through joint training, while no additional computation is required at inference. Extensive experiments show that the proposed model makes video semantic segmentation and optical flow estimation benefit from each other and outperforms existing methods under the same settings on both tasks.
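
One way to picture such a joint objective is a supervised segmentation loss on the annotated frame plus a flow-warped consistency term on unlabeled frames, applied only where the flow is non-occluded. The PyTorch sketch below is a generic formulation under that assumption, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def joint_loss(seg_logits_t, seg_logits_t1, labels_t, flow, occlusion_mask,
               lambda_con=1.0):
    """Supervised cross-entropy on the labeled frame t plus a consistency
    term between the flow-warped frame-t logits and the frame-(t+1) logits.
    flow: (B, 2, H, W); occlusion_mask: (B, 1, H, W) with 1 = non-occluded."""
    # Supervised loss on the single annotated frame
    sup = F.cross_entropy(seg_logits_t, labels_t)

    # Build a sampling grid: base pixel coordinates displaced by the flow
    b, _, h, w = seg_logits_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(flow.device)   # (H, W, 2)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)            # (B, H, W, 2)
    gx = 2.0 * grid[..., 0] / (w - 1) - 1.0                        # normalize to [-1, 1]
    gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(seg_logits_t, torch.stack((gx, gy), dim=-1),
                           align_corners=True)

    # Consistency penalty only on non-occluded pixels
    con = (occlusion_mask * (warped - seg_logits_t1) ** 2).mean()
    return sup + lambda_con * con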

