Abstract

Deep-learning-based object detectors require thousands of diverse examples annotated with bounding boxes and class labels. While image object detectors have progressed rapidly in recent years with the release of multiple large-scale static image datasets, object detection in videos remains an open problem due to the unavailability of annotated video frames. A robust video object detector is an essential component for video understanding and for curating large-scale automated annotations on videos. The domain gap between images and videos makes the direct transfer of image object detectors to videos sub-optimal. The most common remedy is weak supervision, in which each video frame is tagged for the presence or absence of object categories; this still demands manual effort. In this paper we take a step toward zero supervision in the video domain by adapting unsupervised adversarial image-to-image translation to perturb static high-quality images so that they become visually indistinguishable from a set of video frames. We assume the availability of a fully annotated static image dataset and an unannotated set of video frames. The object detector is then trained on the adversarially transformed images using the annotations of the original dataset. Experiments on the YouTube-Objects and YouTube-Objects-Subset datasets with two contemporary baseline object detectors show that such unsupervised pixel-level domain adaptation improves generalization to video frames compared to directly applying an image object detector, and we achieve competitive performance against recent weakly supervised baselines. This paper can thus be seen as an application of image translation to cross-domain object detection.
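The core pipeline described above (translate annotated images into the video-frame style, then train a detector on the translated pixels with the original box annotations) can be sketched in a few lines. The snippet below is a minimal illustration only: the generator `G_img2vid`, the `detector.loss(...)` interface, and all variable names are assumptions for exposition, not the paper's actual implementation, and the unsupervised generator (e.g. a CycleGAN-style network trained on unpaired images and video frames) is assumed to be trained already.

```python
import torch

def detector_train_step(detector, G_img2vid, images, boxes, labels, optimizer):
    """One detector update on adversarially translated images.

    `G_img2vid` is a hypothetical image-to-video-frame generator trained
    without supervision on unpaired data. Bounding boxes and class labels
    come from the *original* static-image annotations: only pixel
    appearance is perturbed, so box geometry remains valid.
    """
    with torch.no_grad():
        # Perturb static images to look like video frames.
        video_like = G_img2vid(images)

    # Standard supervised detection loss, but on translated pixels
    # paired with the untouched source annotations.
    loss = detector.loss(video_like, boxes, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the generator during detector training (the `torch.no_grad()` block) reflects the two-stage setup implied by the abstract: translation is learned first from unpaired data, and the detector is trained afterwards on the transformed dataset.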
