Abstract
Scene detection and scene annotation have attracted considerable attention as tools for understanding video content. The main challenges lie in mitigating the propagation of shot-detection errors, recognizing both cuts and gradual transitions, fusing hierarchical multi-modal cues, and solving the two tasks simultaneously. To address these challenges, we propose the Multi-modal Adaptive Context Network (MACN), which jointly learns scene detection and annotation from a window-partitioning perspective. As a shared, task-agnostic component, Window-based Cross-modal Representation (WCR) distills complex semantic correlations from multi-modal sources for each window. To account for the long-term temporal dependencies of variable-length scenes, we further develop Adaptive Context-aware Representation (ACR) to improve performance on each specific task. Unlike previous work, we formulate scene detection as locating the starting window of a scene together with its associated location offset and transition duration. Meanwhile, we assemble two multi-label sub-classifiers at different levels to predict the labels of each scene candidate. Experimental comparisons with state-of-the-art algorithms on the TAVS and ClipShots datasets indicate that the proposed method yields promising performance on both tasks. Our code and test sample videos are released at MACN.
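As an illustration of the window-based formulation of scene detection, the following minimal sketch decodes a window-level prediction (starting window, location offset, transition duration) into a frame span. It is not the released MACN code: the names `WindowPrediction` and `decode_transition`, the use of non-overlapping fixed-size windows, the offset expressed as a fraction of the window, and the duration expressed in frames are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class WindowPrediction:
    """Hypothetical container for one window-level prediction."""
    window_index: int  # index of the window predicted to hold the scene start
    offset: float      # assumed: start position inside the window, as a fraction in [0, 1)
    duration: float    # assumed: transition length in frames (0 for a hard cut)


def decode_transition(pred: WindowPrediction, window_size: int) -> tuple[int, int]:
    """Map a window-level prediction to a (start_frame, end_frame) span.

    Assumes non-overlapping windows of `window_size` frames, so window k
    covers frames [k * window_size, (k + 1) * window_size).
    """
    start = pred.window_index * window_size + round(pred.offset * window_size)
    end = start + max(0, round(pred.duration))
    return start, end


# A hard cut in the middle of window 3, and a 10-frame gradual transition
# starting a quarter of the way into window 2 (window size of 16 frames).
cut = decode_transition(WindowPrediction(3, 0.5, 0), window_size=16)
gradual = decode_transition(WindowPrediction(2, 0.25, 10), window_size=16)
```

Under these assumptions a hard cut decodes to a zero-length span and a gradual transition to a span of `duration` frames, which is one way a single formulation could cover both transition types.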