One-shot video object detection aims to locate and identify objects in video sequences given only a single support video per class. Exploration of this task is still in its infancy, and previous few-shot video object detection methods have limitations in this setting. In this paper, we propose the Self-supervised Feature Enhancement (SFE) framework for one-shot video object detection. SFE comprises two key modules: Hybrid Spatial Self-supervised Feature Enhancement (HSSFE) and Dynamic Temporal Self-supervised Feature Enhancement (DTSFE). HSSFE enhances features from a spatial perspective via self-supervised auxiliary tasks at the frame and instance levels. DTSFE, in turn, enhances features from a temporal perspective via a memory-based self-supervised constraint on the same object across different frames. Experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance.
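To make the DTSFE idea concrete, the sketch below illustrates one plausible form of a memory-based temporal constraint: each object keeps a memory feature aggregated from earlier frames, and the current frame's feature for that object is pulled toward it via a cosine-similarity loss. The function names, the momentum update, and the specific loss form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def temporal_consistency_loss(feat_t, memory_feat):
    # Illustrative DTSFE-style constraint (assumed form): penalize the
    # current-frame object feature for drifting from its memory feature,
    # using 1 - cosine similarity.
    a = feat_t / np.linalg.norm(feat_t)
    b = memory_feat / np.linalg.norm(memory_feat)
    return 1.0 - float(np.dot(a, b))

def update_memory(memory_feat, feat_t, momentum=0.9):
    # Hypothetical momentum update of the per-object memory entry,
    # blending the old memory with the newest observation.
    m = momentum * memory_feat + (1.0 - momentum) * feat_t
    return m / np.linalg.norm(m)
```

In such a scheme, the loss is zero when the current feature matches the memory exactly and grows as the two representations diverge, encouraging consistent features for the same object across frames.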