Abstract

The objective of video summarization is to produce a concise, condensed summary that accurately captures the original video content. Current supervised video summarization methods treat the task as a sequence-to-sequence problem. However, modeling the order of long videos presents three challenges: (1) capturing local and global relationships simultaneously is difficult; (2) the boundaries of video highlight segments are often located incorrectly, so semantic integrity is incomplete; (3) efficient relation computation is hard to achieve. To address these limitations, we design a novel coarse-to-fine network (C2F) for video summarization that is adapted to the multi-level semantic structure of video. A multiscale representation scheme first captures temporal relationships at different scales to produce coarse classification results; an action-wise proposal module then makes fine predictions of importance scores and regresses the temporal locations of key frames. In addition, we propose a loss function that identifies local differences among frames, and we analyze combinations of various loss functions. Extensive experiments on two benchmark datasets demonstrate that the proposed C2F achieves significant performance gains over state-of-the-art methods and performs well in efficient relation computation. For example, on the TVSum dataset, C2F improves the F-score from 69.4% to 72.8%, a gain of 3.4 percentage points. Furthermore, C2F has only 4.7M parameters, just 10.7% of the parameters of the SASUM model.
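The abstract describes the C2F design only at a high level, so the following is a minimal PyTorch sketch of how a multiscale coarse branch and an action-wise proposal head might fit together. All names (MultiScaleBranch, CoarseToFineSummarizer), dimensions, and dilation choices are our assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: the exact C2F architecture is not specified in the
# abstract, so module names, channel sizes, and dilation scales are assumed.
import torch
import torch.nn as nn


class MultiScaleBranch(nn.Module):
    """Captures temporal relations at several receptive-field scales via
    parallel dilated 1-D convolutions over the frame-feature sequence."""

    def __init__(self, in_dim=1024, hid_dim=128, dilations=(1, 2, 4, 8)):
        super().__init__()
        # kernel_size=3 with padding=d and dilation=d preserves sequence length
        self.branches = nn.ModuleList([
            nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):             # x: (B, T, in_dim)
        x = x.transpose(1, 2)         # -> (B, in_dim, T) for Conv1d
        feats = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return feats.transpose(1, 2)  # (B, T, hid_dim * len(dilations))


class CoarseToFineSummarizer(nn.Module):
    def __init__(self, in_dim=1024, hid_dim=128, dilations=(1, 2, 4, 8)):
        super().__init__()
        fused = hid_dim * len(dilations)
        self.multiscale = MultiScaleBranch(in_dim, hid_dim, dilations)
        # Coarse head: per-frame keyframe/background classification.
        self.coarse_cls = nn.Linear(fused, 1)
        # Fine "action-wise proposal" head: per-frame importance score plus
        # regressed offsets to the start/end of the enclosing highlight segment.
        self.fine_score = nn.Linear(fused, 1)
        self.boundary_reg = nn.Linear(fused, 2)

    def forward(self, x):
        h = self.multiscale(x)                      # (B, T, fused)
        coarse = torch.sigmoid(self.coarse_cls(h))  # (B, T, 1)
        score = torch.sigmoid(self.fine_score(h))   # (B, T, 1)
        bounds = self.boundary_reg(h)               # (B, T, 2)
        return coarse.squeeze(-1), score.squeeze(-1), bounds


# Usage: a batch of 2 videos, 300 frames each, with 1024-d frame features
# (a common choice for TVSum/SumMe pipelines, assumed here).
frames = torch.randn(2, 300, 1024)
model = CoarseToFineSummarizer()
coarse, score, bounds = model(frames)
print(coarse.shape, score.shape, bounds.shape)  # (2, 300) (2, 300) (2, 300, 2)
```

In this sketch, the cheap convolutional coarse head filters candidate regions while the proposal head refines scores and boundaries, which is one plausible way a coarse-to-fine design keeps relation computation efficient compared with full attention over all frame pairs.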
