Abstract

In this information age, visual content is growing exponentially, and video captioning can address many real-life applications. Automatically generated video captions help viewers comprehend a video in a short time and assist in faster information retrieval, video analysis, indexing, report generation, etc. Captioning industrial videos is important for obtaining a visual and textual summary of the work ongoing in an industry. The generated captioned summary of a video can assist in remote monitoring of industries, and the captions can be utilized for video question answering, video segment extraction, productivity analysis, etc. Due to the presence of diverse events, processing industrial videos is more challenging than processing videos from other domains. In this paper, we address the real-life application of generating descriptions for the videos of a labor-intensive industry. We propose a keyframe-based approach to video caption generation. The framework produces a video summary by extracting keyframes, thereby reducing the video captioning task to image captioning. These keyframes are passed to an image captioning model for description generation. From these individual frame captions, a multi-caption description of the video is generated, with a unique start and end time for each caption. For image captioning, a merge encoder-decoder model with a stacked decoder is used. We performed experiments on a dataset specifically created for the small-scale industry. We also show that data augmentation on this small dataset can greatly benefit the generation of remarkably good video descriptions. Results of extensive experimentation with different image encoders, language encoders, and decoders in the merge encoder-decoder model are reported.
Apart from presenting results on domain-specific data, results on domain-independent datasets are also presented to show the general applicability of the technique. Performance comparisons on the existing OVSD, Flickr8k, and Flickr30k datasets are reported to demonstrate the scalability of our method.
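The overall pipeline the abstract describes (extract keyframes, caption each keyframe, then assemble a timed multi-caption description) can be sketched as follows. This is a minimal illustration only: the frame-difference keyframe criterion, the `threshold` and `fps` values, and the `caption_frame` callable are all placeholder assumptions, not the paper's actual keyframe selector or encoder-decoder captioning model.

```python
def extract_keyframes(frames, threshold=30.0):
    """Select keyframes by mean absolute pixel difference between
    consecutive frames (an illustrative stand-in for the paper's
    keyframe extraction step). Returns (frame_index, frame) pairs."""
    keyframes = [(0, frames[0])]  # always keep the first frame
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            keyframes.append((i, cur))
    return keyframes


def timed_captions(frames, caption_frame, fps=1.0, threshold=30.0):
    """Caption each keyframe with the supplied caption_frame callable
    (a stand-in for the merge encoder-decoder model) and merge
    consecutive identical captions into (start_sec, end_sec, caption)
    segments, giving each caption a unique start and end time."""
    keyframes = extract_keyframes(frames, threshold)
    duration = len(frames) / fps
    segments = []
    for i, (idx, frame) in enumerate(keyframes):
        start = idx / fps
        # a segment runs until the next keyframe, or to the video's end
        end = keyframes[i + 1][0] / fps if i + 1 < len(keyframes) else duration
        caption = caption_frame(frame)
        if segments and segments[-1][2] == caption:
            # same caption as the previous segment: extend it
            segments[-1] = (segments[-1][0], end, caption)
        else:
            segments.append((start, end, caption))
    return segments
```

For example, a 10-frame clip (at 1 fps) whose brightness jumps at frame 5 yields two timed captions, one per visually distinct span; in the paper this per-segment caption would come from the image captioning model rather than a rule.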
