Abstract

Video summarization has attracted extensive attention because it facilitates video browsing. Despite notable progress, existing methods still fail to model contextual information within videos sufficiently and effectively, and the resulting lack of powerful contextual representations limits summarization performance. To address this limitation, we present a novel Attention-Guided Multi-Granularity Fusion Model (AMFM), which improves the modeling process from the perspectives of context capturing and context fusion. AMFM comprises three main components: a content-aware enhancement (CAE) module, a multi-granularity encoder (MGE), and a scale-adaptive fusion (SAF) module. Specifically, CAE dynamically enhances pre-trained visual features by learning the potential visual relationships between frame-level and video-level embeddings. MGE then models coarse-grained and fine-grained contextual information simultaneously in the same representation space by combining self-attention with temporal convolution. Finally, SAF adaptively fuses the multi-granularity representations, which differ significantly in semantic scale. By effectively modeling and processing rich temporal representations, our method can precisely pinpoint key segments. Extensive comparisons with state-of-the-art methods on standard datasets demonstrate the effectiveness of the proposed method, and ablation studies further verify the positive impact of each module in our model.
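
The two-branch idea described above (self-attention for coarse-grained context, temporal convolution for fine-grained context, followed by adaptive fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the absence of learned projections, the fixed smoothing kernel, and the scalar sigmoid gate are all assumptions chosen only to keep the example small and dependency-free.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    # Coarse-grained branch (assumed form): every frame attends to all
    # frames using scaled dot-product scores, with no learned weights.
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)])
    return out

def temporal_conv(frames, kernel=(0.25, 0.5, 0.25)):
    # Fine-grained branch (assumed form): 1D convolution along the time
    # axis with a fixed kernel and replicate padding at the boundaries.
    n, d = len(frames), len(frames[0])
    half = len(kernel) // 2
    out = []
    for t in range(n):
        row = [0.0] * d
        for i, w in enumerate(kernel):
            src = min(max(t + i - half, 0), n - 1)
            for j in range(d):
                row[j] += w * frames[src][j]
        out.append(row)
    return out

def gated_fusion(coarse, fine):
    # Hypothetical stand-in for scale-adaptive fusion: a per-frame scalar
    # gate in (0, 1) blends the two branches as a convex combination.
    fused = []
    for c, f in zip(coarse, fine):
        gate = 1.0 / (1.0 + math.exp(-sum(ci * fi for ci, fi in zip(c, f))))
        fused.append([gate * ci + (1.0 - gate) * fi for ci, fi in zip(c, f)])
    return fused

# Toy frame features: four frames, two feature dimensions each.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coarse = self_attention(frames)
fine = temporal_conv(frames)
fused = gated_fusion(coarse, fine)
```

In a real model each branch would carry learned parameters and the gate would be predicted from the features; the sketch only shows how global and local temporal context can be computed in the same feature space and then blended per frame.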
