Abstract

The tasks of scene detection and annotation have attracted considerable attention for understanding video content. The main challenges lie in mitigating the error propagation of shot detection, recognizing both cuts and gradual transitions, fusing hierarchical multi-modal cues, and solving the two tasks jointly. To address these challenges, we propose the Multi-modal Adaptive Context Network (MACN) to jointly learn scene detection and annotation from a window-partitioning perspective. As a shared, task-agnostic component, Window-based Cross-modal Representation (WCR) distills complex semantic correlations from multi-modal sources for each window. Considering the long-term temporal dependency of variable-length scenes, we further develop Adaptive Context-aware Representation (ACR) to improve task-specific performance. Unlike previous works, scene detection is formulated as locating the starting window of each scene together with its associated location offset and transition duration. Meanwhile, we assemble two multi-label sub-classifiers at different semantic levels to predict the labels for each scene candidate. Experimental comparisons with state-of-the-art algorithms on the TAVS and ClipShots datasets indicate that the proposed method yields promising performance on both tasks. Our code and test sample videos are released at MACN.
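
To make the detection and annotation formulation concrete, below is a minimal, hypothetical PyTorch sketch of the two prediction heads the abstract describes: a detection head that outputs, per window, a starting-window score, a location offset, and a transition duration, and an annotation head with two multi-label sub-classifiers at different levels. All class names, layer shapes, and activations are illustrative assumptions of ours, not the authors' released implementation.

    # Hypothetical sketch of MACN-style prediction heads; names and shapes
    # are illustrative assumptions, not the paper's actual code.
    import torch
    import torch.nn as nn

    class SceneDetectionHead(nn.Module):
        """Per window: a starting-window score, a location offset, and a
        transition duration (cuts vs. gradual transitions)."""
        def __init__(self, dim: int):
            super().__init__()
            self.boundary = nn.Linear(dim, 1)  # is this a scene's starting window?
            self.offset = nn.Linear(dim, 1)    # refinement offset within the window
            self.duration = nn.Linear(dim, 1)  # transition duration (0 ~ hard cut)

        def forward(self, x: torch.Tensor):
            # x: (batch, num_windows, dim) context-aware window features
            score = torch.sigmoid(self.boundary(x)).squeeze(-1)
            offset = self.offset(x).squeeze(-1)
            duration = torch.relu(self.duration(x)).squeeze(-1)
            return score, offset, duration

    class SceneAnnotationHead(nn.Module):
        """Two multi-label sub-classifiers at different semantic levels,
        each producing independent per-label probabilities via sigmoid."""
        def __init__(self, dim: int, num_coarse: int, num_fine: int):
            super().__init__()
            self.coarse = nn.Linear(dim, num_coarse)
            self.fine = nn.Linear(dim, num_fine)

        def forward(self, scene_feat: torch.Tensor):
            # scene_feat: (batch, num_candidates, dim) pooled scene features
            return (torch.sigmoid(self.coarse(scene_feat)),
                    torch.sigmoid(self.fine(scene_feat)))

Under these assumptions, a scene candidate would be decoded by thresholding the starting-window scores, shifting each selected window by its predicted offset, and extending it by the predicted transition duration, after which the two sub-classifiers assign multi-label annotations to the candidate.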
