SFD: Similar Frame Dataset for Content-Based Video Retrieval
Content-based video retrieval aims to retrieve near-duplicate entries of a given query video from a database. It plays an important role in combating video piracy. Robustness to temporal dynamics is crucial for a representation model in video retrieval, as frames extracted from two copies of a video are rarely temporally aligned in practice. However, existing image retrieval datasets are ill-suited to evaluating this robustness. To address this issue, we collect the Similar Frame Dataset (SFD), which consists of 32,923 query-target pairs and 128,240 distractor images. The task in SFD is, given a query frame, to retrieve its target frame from all items. SFD is constructed by sampling frames from the Kinetics-700 action classification dataset; an object detection model (Faster R-CNN) and a Multimodal Large Language Model (BLIP-2) are used during sampling to select valid frames. In addition, we propose the Adjacent Frames Contrastive Learning (AFCL) framework, in which adjacent frames sampled from unlabeled videos serve as positive pairs. An image representation model trained under AFCL is robust to temporal frame changes and achieves state-of-the-art performance on SFD. The code will be released at https://github.com/Chuan-shanjia/Similar-Frame-Dataset.
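The abstract does not specify the AFCL training objective. The sketch below illustrates the general idea under the assumption of an InfoNCE-style contrastive loss, with adjacent frames from the same video treated as positive pairs and other pairs in the batch as negatives; the function names and the `max_gap` parameter are hypothetical, not taken from the paper.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss treating (z_a[i], z_b[i]) as positive pairs.

    z_a, z_b: (N, D) L2-normalized embeddings of adjacent frames.
    All cross pairs (i != j) in the batch act as negatives.
    """
    logits = z_a @ z_b.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives lie on the diagonal

def sample_adjacent_pair(video, max_gap=4, rng=None):
    """Sample a (frame_t, frame_{t+g}) positive pair from an unlabeled video.

    video: (T, ...) array of frames; the gap g is drawn uniformly from
    [1, max_gap] (max_gap is an assumed hyperparameter, not from the paper).
    """
    rng = rng or np.random.default_rng()
    t = rng.integers(0, len(video) - max_gap)
    g = rng.integers(1, max_gap + 1)
    return video[t], video[t + g]
```

With this objective, the loss is minimized when each frame's embedding is closest to that of its temporal neighbor, which encourages representations that are stable under small temporal shifts.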