Abstract
In activity-based image-to-video retrieval, dynamically consistent semantics are crucial for effective cross-modal search. Existing methods face significant challenges, particularly in addressing modality asymmetry, where images and videos exhibit differing semantic representations. A key solution to this challenge lies in enhancing the learning capacity of the image encoder by leveraging knowledge from video data. To this end, we propose a Cross-Modal Knowledge Transfer (CMKT) framework that improves the behavior modeling capability of the image encoder. This enhancement is achieved through both global and local information transmission: globally, the model assimilates rich semantic information from videos across a broad temporal span, while locally, it captures semantics from frames closely resembling the query image. Specifically, we design the Global Temporal Structure Transmission (GTST) Module to ensure temporal distribution consistency between query image objects and video content. Additionally, the Local Temporal Relation Enhancement (LTRE) Module is introduced to pinpoint the most relevant action information within the video. We evaluate the effectiveness of our method on two widely adopted action recognition datasets, THUMOS14 and ActivityNet, and provide comprehensive ablation studies to substantiate the efficacy of our approach.
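To make the global/local transfer idea concrete, the sketch below shows one way an image encoder could be pulled toward both a clip-level video embedding (global cue) and the embedding of the frame most similar to the query (local cue). This is only an illustrative reading of the abstract, not the paper's implementation: the class name `CMKTSketch`, the feature dimensions, the linear stand-in encoders, and the cosine-based losses are all assumptions, and the actual GTST/LTRE designs are not specified here.

```python
# Minimal, illustrative sketch of global/local cross-modal knowledge transfer.
# All names, dimensions, and loss choices are assumptions for illustration;
# they do not reproduce the paper's GTST/LTRE modules.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMKTSketch(nn.Module):
    """Hypothetical image encoder trained with global and local video cues."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Stand-ins for the real backbones (e.g., CNN/transformer encoders).
        self.image_encoder = nn.Linear(2048, dim)
        self.video_encoder = nn.Linear(2048, dim)

    def forward(self, image_feat: torch.Tensor, frame_feats: torch.Tensor):
        # image_feat:  (B, 2048) pre-extracted query-image features
        # frame_feats: (B, T, 2048) pre-extracted per-frame video features
        img = F.normalize(self.image_encoder(image_feat), dim=-1)   # (B, D)
        vid = F.normalize(self.video_encoder(frame_feats), dim=-1)  # (B, T, D)

        # Global transfer (GTST-like): align the image embedding with the
        # temporally pooled video embedding over the whole clip.
        global_vid = F.normalize(vid.mean(dim=1), dim=-1)           # (B, D)
        loss_global = (1.0 - (img * global_vid).sum(dim=-1)).mean()

        # Local transfer (LTRE-like): distill from the single frame that is
        # most similar to the query image.
        sim = torch.einsum("bd,btd->bt", img, vid)                  # (B, T)
        best = sim.argmax(dim=1)                                    # (B,)
        local_vid = vid[torch.arange(vid.size(0)), best]            # (B, D)
        loss_local = (1.0 - (img * local_vid).sum(dim=-1)).mean()

        return loss_global + loss_local


# Usage with random stand-in features: 4 queries, 16-frame clips.
model = CMKTSketch()
loss = model(torch.randn(4, 2048), torch.randn(4, 16, 2048))
loss.backward()
```

In this reading, the global term transfers clip-wide temporal context into the image encoder, while the local term focuses it on the frame carrying the most query-relevant action; the real modules presumably model temporal structure more explicitly than the mean pooling and argmax used here.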