Abstract

In the multimedia era, video data is produced at a massive scale owing to advances in computing and network infrastructure. Summarizing the events in a video traditionally requires watching it in full, which can take hours, and manually writing a summary of those events demands considerable human time and effort. In this work, a six-layer Convolutional Neural Network (CNN) is proposed to characterize the images fed to an Encoder-Decoder framework. The framework takes image sequences as input and summarizes video events in natural language rather than machine-generated text, producing a sentence-level synopsis for each sequence of images. Quantitative and qualitative analysis shows that the model outperforms the existing state-of-the-art model in terms of training loss, Metric for Evaluation of Translation with Explicit ORdering (METEOR) score, and BiLingual Evaluation Understudy (BLEU) score. Applications of the proposed model include visual storytelling and identifying critical events in an entire video.
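Since the abstract does not include an implementation, the following PyTorch sketch only illustrates the kind of architecture it describes: a six-layer CNN that encodes each frame, feeding a decoder that emits a natural-language sentence. The choice of an LSTM decoder, all layer widths, kernel sizes, and the vocabulary size are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn

class SixLayerCNN(nn.Module):
    """Six convolutional layers mapping an input frame to a feature vector
    (layer widths are assumptions; the paper specifies only the depth)."""
    def __init__(self, feature_dim=256):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (16, 32, 64, 128, 256, 256):  # six conv layers
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(256, feature_dim)

    def forward(self, x):                      # x: (batch, 3, H, W)
        f = self.pool(self.features(x)).flatten(1)
        return self.proj(f)                    # (batch, feature_dim)

class CaptionDecoder(nn.Module):
    """LSTM decoder (assumed) turning a frame feature into a word sequence."""
    def __init__(self, vocab_size, feature_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feature_dim)
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        # Prepend the image feature as the first "token" of the sequence.
        inputs = torch.cat([image_feat.unsqueeze(1),
                            self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                # logits over the vocabulary

# Example forward pass on random data.
encoder, decoder = SixLayerCNN(), CaptionDecoder(vocab_size=5000)
frames = torch.randn(4, 3, 224, 224)          # a batch of 4 video frames
tokens = torch.randint(0, 5000, (4, 12))      # 12-token reference captions
logits = decoder(encoder(frames), tokens)     # (4, 13, 5000)
```

In such a setup, the logits would be trained with cross-entropy against the reference captions, and the generated sentences scored with METEOR and BLEU as the abstract reports.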
