Abstract

Video summarization aims to create a concise and accurate summary that enables users to quickly grasp the key content of the original video, thereby facilitating efficient video browsing. Most existing video summarization methods mainly employ recurrent neural networks to capture long-term dependencies in videos, yielding remarkable results. Nevertheless, these methods overlook the potential spatial features inside the video when modeling it. To tackle this issue, we introduce a global–local spatio-temporal graph convolutional network for video summarization (GL-STGCN). Inspired by the concept of 3-D convolution, we first segment the video into non-overlapping segments to capture localized spatial features of consecutive frames. A spatial graph is then constructed for each segment, with a fixed time interval between neighboring spatial graphs, and a pooling step randomly removes redundant nodes from each graph. Next, a temporal gated convolutional network extracts the global temporal relationships within the video, and a spatial graph convolutional network operating on the spatial features captures the spatial connections among frames. As the graph node information evolves, the node features provide a more precise depiction of the video content, so we apply the temporal gated convolutional network once more to refine the global temporal relations within the video. Extensive experiments on two public datasets show that the proposed method outperforms most state-of-the-art video summarization methods, demonstrating the effectiveness of integrating global temporal and local spatial relationships.
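To make the described pipeline concrete, the following is a minimal PyTorch sketch of a GL-STGCN-style model, not the authors' implementation: all class names, the feature dimension (1024), the segment length, the cosine-similarity adjacency, and the top-15% frame selection are illustrative assumptions, and the pooling step that randomly removes redundant graph nodes is omitted for brevity.

```python
# Illustrative sketch (assumed names and sizes, not the paper's released code).
# Input: per-frame features of shape (T, D) from a pretrained CNN.
import torch
import torch.nn as nn


class TemporalGatedConv(nn.Module):
    """Gated 1-D convolution over the time axis (GLU-style gating)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                              # x: (T, D)
        h = self.conv(x.t().unsqueeze(0))              # (1, 2D, T)
        a, b = h.chunk(2, dim=1)                       # gate the activations
        return (a * torch.sigmoid(b)).squeeze(0).t()   # (T, D)


class SpatialGraphConv(nn.Module):
    """One graph-convolution step over a frame-similarity adjacency matrix."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x, adj):                         # x: (T, D), adj: (T, T)
        deg = adj.sum(-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.lin((adj / deg) @ x))   # row-normalized propagation


def segment_adjacency(x, seg_len=8):
    """Block-diagonal adjacency: frames are connected only within
    non-overlapping segments (local spatial graphs), by cosine similarity."""
    T = x.size(0)
    adj = torch.zeros(T, T)
    xn = torch.nn.functional.normalize(x, dim=-1)
    for s in range(0, T, seg_len):
        e = min(s + seg_len, T)
        adj[s:e, s:e] = xn[s:e] @ xn[s:e].t()
    return adj


class GLSTGCNSketch(nn.Module):
    """Global temporal gating -> local spatial GCN -> global temporal gating
    -> per-frame importance scores."""
    def __init__(self, dim=1024):
        super().__init__()
        self.tgc1 = TemporalGatedConv(dim)
        self.sgc = SpatialGraphConv(dim)
        self.tgc2 = TemporalGatedConv(dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                          # feats: (T, D)
        h = self.tgc1(feats)                           # global temporal relations
        h = self.sgc(h, segment_adjacency(h))          # local spatial relations
        h = self.tgc2(h)                               # refine temporal relations
        return torch.sigmoid(self.score(h)).squeeze(-1)  # (T,) frame scores


# Example: score 120 frames of 1024-D features, keep the top 15% as the summary.
scores = GLSTGCNSketch()(torch.randn(120, 1024))
summary_idx = scores.topk(int(0.15 * 120)).indices.sort().values
```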
