Abstract

Video summarization remains a challenging task despite increasing research efforts. Traditional methods focus solely on long-range temporal modeling of video frames, overlooking important local motion information that cannot be captured by frame-level video representations. In this paper, we propose the Parameter-free Motion Attention Module (PMAM), which exploits the crucial motion cues contained in adjacent video frames using a multi-head attention architecture. The PMAM requires no additional trainable parameters, enabling an efficient and effective understanding of video dynamics. Moreover, we introduce the Multi-feature Motion Attention Network (MMAN), which integrates the parameter-free motion attention module with local and global multi-head attention over object-centric and scene-centric video representations. The synergistic combination of local motion information, extracted by the proposed PMAM, with the long-range interactions modeled by the local and global multi-head attention mechanisms significantly enhances video summarization performance. Extensive experimental results on the benchmark datasets SumMe and TVSum demonstrate that the proposed MMAN outperforms other state-of-the-art methods, yielding remarkable performance gains.
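To illustrate the core idea of attention without trainable parameters, the following is a minimal sketch (not the paper's actual implementation): multi-head scaled dot-product attention where queries, keys, and values are all the raw frame features themselves, so no learned projection matrices are needed. The function name, head count, and feature dimensions are illustrative assumptions.

```python
import numpy as np

def parameter_free_motion_attention(frames, num_heads=4):
    """Hypothetical sketch of parameter-free multi-head attention:
    Q = K = V = input frame features, with no learned projections.
    frames: (T, D) array of T frame-level feature vectors."""
    T, D = frames.shape
    assert D % num_heads == 0, "feature dim must divide evenly into heads"
    d = D // num_heads
    # Split the feature dimension into heads: (num_heads, T, d)
    heads = frames.reshape(T, num_heads, d).transpose(1, 0, 2)
    # Scaled dot-product similarity between all frame pairs: (num_heads, T, T)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key axis
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Attention-weighted aggregation, then merge heads back: (T, D)
    out = weights @ heads
    return out.transpose(1, 0, 2).reshape(T, D)

# Toy usage: 8 frames with 16-dimensional features
feats = np.random.default_rng(0).normal(size=(8, 16))
attended = parameter_free_motion_attention(feats)
print(attended.shape)  # (8, 16)
```

Because attention weights are derived directly from feature similarity rather than learned projections, the module adds no parameters to train, which is the efficiency property the abstract highlights.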
