The number of security cameras positioned within the surrounding area has expanded, increasing the demand for automatic activity recognition systems. In addition to offline assessment and the issuance of an ongoing alarm in the case of aberrant behaviour, automatic activity detection systems can be employed in conjunction with human operators. In the proposed research framework, an ensemble of Mask Region-based Convolutional Neural Networks for key-point detection scheme, and LSTM based Recurrent Neural Network is used to create a deep neural network model (Mask RCNN) for recognizing violent activities (i.e. kicking, punching, etc.) of a single person. First of all, the key-points locations and ground-truth masks of humans in an image are selected using the selected region; the temporal information is extracted. Experimental results show that the ensemble model outperforms individual models. The proposed technique has a reasonable accuracy rate of 77.4 percent, 95.7 percent, and 88.2 percent, respectively, on the Weizmann, KTH, and our custom datasets. As the proposed effort applies to industry and in terms of security, it is beneficial to society.