Abstract

Surveillance video analysis plays an important role in modern security systems. Because manual analysis is tedious, automatic analytics of surveillance videos is preferred, and it therefore occupies its own place in video-processing research. This paper presents a meaningful text generation algorithm that detects the most salient or relevant areas/objects in the frames and considers the spatial, temporal, and semantic relationships between the objects present in the video. First, the input video is split into segments, and each segment is fragmented into individual frames. Each frame undergoes object detection to extract the details it contains as well as the spatial relationships between the objects in it. These details, together with the temporal features of the video segments, are used to generate textual representations of each segment using a transformer model from natural language processing (NLP). The developed algorithm is applied to three surveillance video data sets, BOSS, AVENUE, and VISIOCITY, demonstrating its effectiveness by generating good textual descriptions that highlight the semantics in the video. It also achieves good BLEU-4 and CIDEr scores of 40.8 and 48.1, respectively, on the MSR-VTT data set.

Keywords: Video captioning, Automatic surveillance video analysis, Text generation
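The pipeline the abstract describes (segment the video, detect objects per frame, then summarize each segment) can be sketched in minimal form as below. This is an illustrative outline only: the function names are hypothetical, and the stub detector and stub captioner stand in for the paper's actual object-detection and transformer-based captioning models.

```python
# Illustrative sketch of the described pipeline, not the paper's implementation.
# Each "frame" here is a toy list of (label, bounding_box) detections; a real
# system would decode video frames and run a learned detector and captioner.

def split_into_segments(frames, segment_len):
    """Split a frame sequence into fixed-length segments."""
    return [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]

def detect_objects(frame):
    """Stub detector: returns (label, bounding_box) pairs for one frame."""
    return frame  # a real system would run an object-detection model here

def describe_segment(segment):
    """Stub captioner: aggregates per-frame detections into one description.
    A real system would feed spatial/temporal features to a transformer."""
    labels = sorted({label for frame in segment
                     for (label, _box) in detect_objects(frame)})
    return "segment contains: " + ", ".join(labels)

# Toy video of four frames, captioned segment by segment.
video = [
    [("person", (0, 0, 10, 10))],
    [("person", (1, 0, 11, 10)), ("bag", (5, 5, 8, 8))],
    [("car", (20, 20, 40, 30))],
    [("car", (21, 20, 41, 30))],
]
segments = split_into_segments(video, segment_len=2)
captions = [describe_segment(s) for s in segments]
```

The key structural point is that captions are produced per segment, so temporal context within a segment (here, the union of detections across its frames) informs each description.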
