Drone surveys have been rapidly adopted by natural disaster mitigation and management (NDMM) divisions to assess the state of affected regions. Manual video analysis by human observers is time-consuming and error-prone. Automated human detection in images captured by drones offers a practical way to save the lives of people trapped under debris during earthquakes, floods, and similar disasters. For research, security, and search and rescue (SAR), the drone should scan the affected area using a camera and a model deployed on unmanned aerial vehicles (UAVs) to identify the specific locations where assistance is required. Existing methods (Balmukund et al. 2020) used faster region-based convolutional neural networks (F-RCNNs), the single shot detector (SSD), and region-based fully convolutional networks (R-FCNs) for human detection and action recognition. Some of the existing methods used only 700 images with six classes, whereas the proposed model uses 1996 images with eight classes. The proposed model uses the YOLOv3 (you only look once) algorithm for detection and action recognition. In this study, we present the fundamental ideas underlying an object detection model. To find the most effective model for human detection and recognition, we trained the YOLOv3 algorithm on the image dataset and evaluated its performance. We compared the results with those of the existing algorithms F-RCNN, SSD, and R-FCN. The accuracies of F-RCNN, SSD, and R-FCN (existing algorithms) and YOLOv3 (proposed algorithm) are 53%, 73%, 93%, and 94.9%, respectively; YOLOv3 thus achieves the highest accuracy. The proposed work shows that existing models are inadequate for critical applications such as search and rescue, which motivates us to propose a model enhanced by a pyramidal feature-extracting SSD for human localization and action recognition.
The proposed model achieves 94.9% accuracy on the proposed dataset, which is an important contribution. The proposed model also reduces prediction time in comparison with state-of-the-art detection models and existing methods. The average time taken by our proposed technique to process an image is 0.40 milliseconds, which is considerably faster than the existing methods. The proposed model can also process video and can be used for real-time recognition. The SSD model can also be used to predict messages if they are present in the image.
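As an illustration of how an average per-image processing time such as the one reported above might be measured, the following is a minimal sketch and not the authors' benchmarking procedure. The names `average_inference_ms` and `dummy_detect` are hypothetical; `dummy_detect` is a placeholder standing in for a trained detector, and the warm-up step is an assumption about good practice rather than something stated in this work.

```python
import time


def average_inference_ms(detect, images, warmup=1):
    """Return the mean wall-clock time per image, in milliseconds.

    `detect` is any callable that takes one image and returns detections.
    """
    if not images:
        raise ValueError("images must be non-empty")
    # Warm-up calls are excluded so one-off setup cost does not skew the mean.
    for image in images[:warmup]:
        detect(image)
    start = time.perf_counter()
    for image in images:
        detect(image)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / len(images)


def dummy_detect(image):
    # Hypothetical stand-in for a trained detector; returns no boxes.
    return []


if __name__ == "__main__":
    frames = [None] * 100  # placeholder frames, not real drone imagery
    print(f"{average_inference_ms(dummy_detect, frames):.4f} ms per image")
```

In practice the figure depends heavily on hardware, input resolution, and batch size, so such a measurement is only comparable across models when those conditions are held fixed.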