Abstract

In recent years, computer vision has attracted significant attention. Surveillance systems generate large volumes of video data, and analyzing this data in real time helps summarize the events taking place. Video captioning is the task of generating natural-language descriptions of a video. The transformer model has shown remarkable accuracy on machine translation tasks, and visual transformers have achieved strong accuracy in video classification. An existing method is used to extract entity, motion, and textual features from a given video; these extracted features are then used to generate captions. The custom dataset and captions used here are specific to the military domain, and the generated captions are further analyzed to detect suspicious activity in real time. This paper presents the implementation of a video captioning model for the military-specific domain. The generative model is based on a transformer combined with the PWC (pyramid, warping, cost volume) architecture pre-trained on the Kinetics dataset, together with pre-trained GloVe embeddings. Detailed results from the video captioning experiment are presented: the captions generated for the test videos are evaluated using BLEU (bilingual evaluation understudy) and METEOR scores.
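The evaluation step mentioned above can be illustrated with a minimal sketch using NLTK's reference implementations of BLEU and METEOR. The captions below are hypothetical placeholders, not the paper's data, and the snippet assumes NLTK is installed with the WordNet corpus downloaded (required by METEOR).

```python
# Minimal sketch of caption evaluation with NLTK's BLEU and METEOR.
# Assumes: pip install nltk, plus nltk.download("wordnet") for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

reference = "a soldier crosses the field carrying a rifle"   # hypothetical ground-truth caption
candidate = "a soldier walks across the field with a rifle"  # hypothetical generated caption

ref_tokens = reference.split()
cand_tokens = candidate.split()

# BLEU measures n-gram overlap; smoothing avoids zero scores on short captions.
bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)

# METEOR additionally matches stems and WordNet synonyms
# (recent NLTK versions expect pre-tokenized input).
meteor = meteor_score([ref_tokens], cand_tokens)

print(f"BLEU:   {bleu:.3f}")
print(f"METEOR: {meteor:.3f}")
```

BLEU rewards exact n-gram matches, while METEOR also credits synonym and stem matches, which is why the two scores are typically reported together for captioning tasks.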
