Multi-sentence video captioning using spatial saliency of video frames and content-oriented beam search algorithm

Masoomeh Nabati, Alireza Behrad

doi:10.1016/j.eswa.2023.120454

Abstract

Video captioning algorithms aim to express the information and activities contained in a video clip as natural-language sentences. Most existing video captioning approaches use only one sentence to describe the semantic content of a video. However, one sentence cannot convey all the semantic information of a video, especially for videos with highly informative content. A few studies have addressed multi-sentence video captioning, such as paragraph and dense captioning, but they produce several sentences by focusing on different activities, objects, or temporal parts of a video. A video clip with a single object or activity may nevertheless contain a great deal of information from different perspectives that cannot be described effectively by a single sentence. To address this problem, we propose a multi-sentence video captioning algorithm that uses the spatial saliency of video frames together with a content-oriented beam search algorithm. In the proposed algorithm, the spatial saliency of video frames is employed during the encoding stage to generate informative sentences by focusing on different parts of the video frames. Furthermore, a content-oriented beam search algorithm is employed during the decoding stage to generate informative sentences. A multi-stage filter is also employed to remove sentences with incorrect structure and sentences that are less relevant to the semantic content of the video. To evaluate the performance of the proposed algorithm, two well-known video description databases were used, and the results showed a significant improvement in the evaluation metrics, especially for the best-1 sentences. We also tested the proposed algorithm on several real-life movies.
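As an illustration of the decoding idea only, the following minimal Python sketch shows a standard beam search extended with an optional bonus for content words. The abstract does not specify how the content-oriented scoring is defined, so the additive `bonus`, the `content_words` set, the function `beam_search`, and the toy language model `toy_step` are all assumptions made for this example, not the authors' method.

```python
import math

def beam_search(step_fn, beam_width, max_len,
                content_words=frozenset(), bonus=0.0, eos="<eos>"):
    """Return up to beam_width (tokens, score) pairs, best first.

    step_fn(tokens) maps a token tuple to {next_token: probability}.
    The additive content bonus is a hypothetical stand-in for a
    content-oriented score; it is not taken from the paper.
    """
    beams = [((), 0.0)]   # (token tuple, accumulated score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                new_score = score + math.log(prob)
                # Reward the first occurrence of each content word (assumption).
                if token in content_words and token not in tokens:
                    new_score += bonus
                candidates.append((tokens + (token,), new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

def toy_step(tokens):
    """Hypothetical toy language model over a five-token vocabulary."""
    if not tokens:
        return {"a": 0.6, "the": 0.4}
    table = {
        "a": {"man": 0.7, "dog": 0.3},
        "the": {"man": 0.5, "dog": 0.5},
        "man": {"<eos>": 1.0},
        "dog": {"<eos>": 1.0},
    }
    return table[tokens[-1]]

# Plain beam search versus search with a bonus for the content word "dog".
plain = beam_search(toy_step, beam_width=2, max_len=3)
guided = beam_search(toy_step, beam_width=2, max_len=3,
                     content_words=frozenset({"dog"}), bonus=1.0)
```

In this toy setting the plain search ranks "a man" first, whereas the bonus shifts both retained hypotheses toward sentences that mention "dog". Returning several ranked hypotheses rather than only the best one is what allows a later filtering stage to select multiple sentences.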
