Abstract

We address the task of unsupervised domain adaptation (UDA) for videos with self-supervised learning. While UDA for images is a widely studied problem, UDA for videos remains relatively unexplored. In this paper, we propose a novel self-supervised loss for video UDA. The method is motivated by inverting a common observation: many works on video classification have shown success with representations based on events in videos, e.g., ‘reaching’, ‘picking’, and ‘drinking’ events for ‘drinking coffee’. We argue that if we have event-based representations, we should be able to predict the relative distances between clips in a video. Inverting this, we propose a self-supervised task that predicts the difference between the distance of two clips from a source video and the distance of two clips from a target video. We hope that such a task encourages learning event-based representations of videos, which are known to be beneficial for classification. Since we predict the difference of clip distances across source and target videos, we ‘tie’ the two domains and expect to obtain well-adapted representations. We combine this purely self-supervised loss with the source classification loss to learn the model parameters. We give extensive empirical results on challenging video UDA benchmarks, namely UCF-HMDB and EPIC-Kitchens. The presented qualitative and quantitative results support our motivation and method.
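To make the combined objective concrete, the following is a minimal sketch of how such a training step could look, assuming a PyTorch clip encoder, a small regression head over the four clip features, normalized temporal gaps as the clip ‘distance’, and an MSE loss on the distance difference. All names, the head architecture, and the choice of regression loss are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: source classification loss plus a self-supervised loss
# that predicts the difference between the temporal distance of two source
# clips and the temporal distance of two target clips.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoUDAModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # any clip encoder producing (B, feat_dim) features
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Head that sees the four clip features (2 source + 2 target) and
        # predicts the scalar difference of the two within-domain clip distances.
        self.distance_head = nn.Sequential(
            nn.Linear(4 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1)
        )

    def forward(self, clips):
        return self.backbone(clips)                   # -> (B, feat_dim)


def training_step(model, src_clips_a, src_clips_b, src_gap,
                  tgt_clips_a, tgt_clips_b, tgt_gap,
                  src_labels, lambda_ss=1.0):
    """One step of the combined objective (sketch).

    src_gap / tgt_gap: normalized temporal distances between the two clips
    sampled from the same source / target video (assumed to lie in [0, 1]).
    """
    f_sa, f_sb = model(src_clips_a), model(src_clips_b)
    f_ta, f_tb = model(tgt_clips_a), model(tgt_clips_b)

    # Supervised classification on labeled source clips only.
    cls_loss = F.cross_entropy(model.classifier(f_sa), src_labels)

    # Self-supervised target: difference of within-source and within-target
    # clip distances; predicting it from both domains 'ties' them together.
    joint = torch.cat([f_sa, f_sb, f_ta, f_tb], dim=1)
    pred_diff = model.distance_head(joint).squeeze(1)
    ss_loss = F.mse_loss(pred_diff, src_gap - tgt_gap)

    return cls_loss + lambda_ss * ss_loss
```

The weighting factor `lambda_ss` and the regression formulation are placeholders; a quantized difference with a classification loss would fit the same structure.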
