Violence of any kind is a disgrace to the world we live in, yet it still plays a significant role in society and claims many innocent lives. Guns are among the most common instruments of violence, and firearm-related fatalities are now widespread. They threaten public safety and pose difficulties for law enforcement. Most of these incidents take place in urban or semi-urban areas. CCTV-based surveillance is widely used by both public and commercial institutions for monitoring and prevention. However, human monitoring is error-prone and requires a large number of person-hours, so automated smart surveillance is better suited to large-scale, dependable monitoring of violent activity. The primary goal of this study is to demonstrate how deep learning-based methods can be used to detect weapons, especially rifles. The study applies several detection methods to human face and gun detection, including recent EfficientDet-based architectures and Faster Region-Based Convolutional Neural Networks (Faster R-CNN). Detection performance for weapons and human faces is further improved in post-processing with an ensemble (stacked) strategy that combines detector outputs using Non-Maximum Suppression, Non-Maximum Weighted, and Weighted Box Fusion techniques. The comparative results of the individual detection methods and their ensembles are evaluated experimentally. Such a system helps police gather information about a situation quickly and take prompt preventive action. The same method can also be used to recognize gun-related content in social media videos. With the Weighted Box Fusion-based ensemble detection scheme, the mean average precision values for mAP0.5, mAP0.75, and mAP[0.50:0.95] are 77.02%, 16.40%, and 29.73%, respectively. These are the best results among all the configurations tested.
The model has been tested extensively on movie clips and unseen test images. The resulting ensemble schemes consistently outperform the individual base models and deliver adequate detection performance.
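To illustrate the post-processing step described above, the following is a minimal sketch of two of the named box-combination ideas: Non-Maximum Suppression, which keeps only the highest-scoring box in each group of overlapping boxes, and a simplified form of Weighted Box Fusion, which replaces each group with a confidence-weighted average box. It is not the implementation used in the study. The function names (`iou`, `nms`, `fuse_boxes`), the single-class setting, the 0.5 IoU threshold, and the sample boxes are all assumptions made for illustration. The fusion routine also omits per-model weights and the confidence rescaling by number of models that the full Weighted Box Fusion algorithm applies.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thr: float = 0.5) -> List[int]:
    """Return indices of boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) < iou_thr for k in keep):
            keep.append(i)
    return keep


def fuse_boxes(
    boxes: List[Box], scores: List[float], iou_thr: float = 0.5
) -> List[Tuple[Box, float]]:
    """Simplified weighted box fusion for one class (illustrative only)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    clusters: List[List[int]] = []  # member indices of each cluster
    fused: List[Box] = []           # current fused box of each cluster
    for i in order:
        for c, fused_box in enumerate(fused):
            if iou(boxes[i], fused_box) >= iou_thr:
                clusters[c].append(i)
                total = sum(scores[j] for j in clusters[c])
                # Recompute the fused box as a confidence-weighted mean.
                fused[c] = tuple(
                    sum(scores[j] * boxes[j][k] for j in clusters[c]) / total
                    for k in range(4)
                )
                break
        else:
            # No existing cluster overlaps enough: start a new one.
            clusters.append([i])
            fused.append(boxes[i])
    # The fused confidence is the mean confidence of the cluster members.
    return [
        (fused[c], sum(scores[j] for j in clusters[c]) / len(clusters[c]))
        for c in range(len(fused))
    ]


# Hypothetical detections pooled from two detectors for a single image.
sample_boxes = [(10.0, 10.0, 50.0, 50.0), (12.0, 12.0, 52.0, 52.0), (100.0, 100.0, 140.0, 140.0)]
sample_scores = [0.9, 0.6, 0.8]
kept_indices = nms(sample_boxes, sample_scores)
fused_detections = fuse_boxes(sample_boxes, sample_scores)
```

The difference between the two approaches is visible on the sample data: suppression discards the lower-scoring of the two overlapping boxes outright, whereas fusion lets both contribute to the final coordinates. This is one plausible reason a fusion-based ensemble could localize objects better than suppression alone.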