Security has always been of paramount importance to people. Without a sense of security at the workplace, at home, or anywhere else, people feel uneasy and vulnerable. As technology improves and available time shrinks, the need for fast, efficient, accurate, and low-cost security techniques is greater than ever. This paper proposes an image captioning system for video surveillance that generates contextual descriptions of the objects in images and uses these descriptions to convey information about the scene being secured. The proposed system captions the incoming video feed with a neural network model composed of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. The main contribution of this paper is the integration of the Discrete Wavelet Transform (DWT) into this system: the DWT is applied to the incoming video feed, and only the compressed LL band of each frame is transferred wirelessly to the model. Because these LL band frames are smaller than the original frames, transfer time is reduced and the model processes them faster.
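To illustrate why the LL band is smaller than the original frame, the following is a minimal sketch of a single-level 2D DWT that keeps only the LL (approximation) band. It assumes the Haar wavelet, a grayscale frame stored as a list of rows, and even frame dimensions; the abstract does not specify the wavelet or the decomposition level, and the function name `haar_ll_band` is hypothetical.

```python
def haar_ll_band(frame):
    """Return the LL (approximation) band of a single-level 2D Haar DWT.

    Assumption: `frame` is a grayscale image given as a list of
    equal-length rows, with even height and even width.
    """
    height, width = len(frame), len(frame[0])
    if height % 2 or width % 2:
        raise ValueError("frame height and width must be even")

    ll = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # Orthonormal Haar low-pass along rows and then columns:
            # each 2x2 block collapses to (sum of its 4 pixels) / 2.
            block_sum = (frame[r][c] + frame[r][c + 1]
                         + frame[r + 1][c] + frame[r + 1][c + 1])
            row.append(block_sum / 2.0)
        ll.append(row)
    return ll


# A toy 4x4 "frame"; a real surveillance frame would be far larger.
frame = [
    [10, 12, 20, 22],
    [14, 16, 24, 26],
    [30, 32, 40, 42],
    [34, 36, 44, 46],
]
ll = haar_ll_band(frame)
print(ll)  # [[26.0, 46.0], [66.0, 86.0]]
```

Under these assumptions, each decomposition level halves both the height and the width, so the LL band holds one quarter of the original coefficients. This is the size reduction that the shorter transfer time relies on.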