This paper explores the recognition of assault, fighting, shooting, and vandalism in video sequences using deep 2D and 3D convolutional neural networks (CNNs). The recent wave of extensive, unrestricted urbanization has raised the standard of living, but it has also threatened public safety, leading to an extraordinary rise in the crime rate. Although closed-circuit television (CCTV) footage provides a monitoring framework, it is of little use without an automatic volume crime detection system. The system proposed in this work is an effort to eradicate volume crimes through accurate real-time detection. First, a fine-grained annotated dataset, including instance and activity information, has been developed for real-world volume crimes. Second, 3D and 2D CNNs are compared on the task of identifying malicious events in video sequences, in order to explore the significance of the spatial and temporal information present in video for event recognition. The 2D CNN, even with fewer parameters, achieved a promising classification accuracy of 91.2% and an area under the curve (AUC) of 95.2% on four classes. The system also reduces the false alarm rate in comparison to state-of-the-art approaches.