Abstract

Video data are growing exponentially, and extracting summarized information from them is challenging. Reducing the volume of video traffic is essential for efficient video storage, transmission, and retrieval. The aim of video summarization (VS) is to extract the most important content from video repositories effectively. Recent approaches have fed limited sets of representative features to recurrent networks to achieve VS. However, generating the desired summaries remains challenging because the extracted features have limited representativeness and feature refinement is rarely considered. In this article, we introduce a vision transformer (ViT)-assisted deep pyramidal refinement network that extracts and refines multi-scale features and predicts an importance score for each frame. The proposed network comprises four main modules. Initially, a dense prediction transformer with a ViT backbone is applied for the first time in this domain to acquire optimal representations from the input frames. Then, feature maps are obtained separately from various layers and processed individually to support multi-scale progressive feature fusion and refinement before being passed to the final prediction module. Next, a customized pyramidal refinement block refines the multi-level feature set before the importance scores are predicted. Finally, video summaries are produced by selecting keyframes based on the predictions. To evaluate the proposed network, extensive experiments are conducted on the TVSum and SumMe datasets, where it achieves F1-scores of 62.4% and 51.9%, respectively, outperforming state-of-the-art alternatives by 0.9% and 0.5%.
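
The following is a minimal, hypothetical sketch of the frame-scoring pipeline the abstract describes: a stand-in for multi-level backbone features, a progressive pyramidal refinement step, a per-frame importance head, and top-score keyframe selection. All module names, layer choices, and dimensions are assumptions made for illustration; the paper's actual DPT/ViT backbone and refinement block are not reproduced here.

```python
# Hypothetical sketch based only on the abstract; not the authors' implementation.
import torch
import torch.nn as nn


class PyramidalRefinement(nn.Module):
    """Progressively fuses and refines multi-scale frame features."""
    def __init__(self, dims, out_dim):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in dims)
        self.fuse = nn.ModuleList(
            nn.Sequential(nn.Linear(out_dim, out_dim), nn.ReLU()) for _ in dims
        )

    def forward(self, feats):                # feats: list of (T, d_i) tensors
        fused = torch.zeros_like(self.proj[0](feats[0]))
        for p, f, x in zip(self.proj, self.fuse, feats):
            fused = f(fused + p(x))          # refine the running fusion level by level
        return fused                         # (T, out_dim)


class FrameScorer(nn.Module):
    """Multi-level features (stand-in for a DPT/ViT backbone) -> refinement -> scores."""
    def __init__(self, in_dim=768, levels=(768, 768, 768), hidden=256):
        super().__init__()
        # Placeholder for the multi-level feature maps a ViT backbone would expose.
        self.levels = nn.ModuleList(nn.Linear(in_dim, d) for d in levels)
        self.refine = PyramidalRefinement(levels, hidden)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, frame_feats):          # frame_feats: (T, in_dim) per-frame features
        feats = [layer(frame_feats) for layer in self.levels]
        return self.head(self.refine(feats)).squeeze(-1)   # (T,) importance scores


def select_keyframes(scores, ratio=0.15):
    """Select the top-scoring fraction of frames as the summary."""
    k = max(1, int(len(scores) * ratio))
    return torch.topk(scores, k).indices.sort().values


if __name__ == "__main__":
    frames = torch.randn(120, 768)           # e.g. 120 frames of precomputed features
    scores = FrameScorer()(frames)
    print(select_keyframes(scores))
```

In practice, the predicted per-frame scores would be converted to shot-level selections before computing F1 against the TVSum/SumMe ground truth, but that evaluation step is outside the scope of this sketch.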
