Abstract

Current few-shot action recognition approaches achieve impressive performance with only a few labeled examples. However, they usually assume that the base (training) and target (test) videos come from the same domain, which limits their broader application. In this paper, we introduce a new practical task, termed cross-domain few-shot action recognition, in which there is a domain shift between the base and target videos and unlabeled target videos are available. To address this task, we propose a Self-supervised learning Enhanced tEmporal Network (SEEN), which incorporates temporal modeling and self-supervised learning to learn more transferable representations. Concretely, the temporal modeling mechanism learns long-range temporal semantics from the features produced by the backbone, while the self-supervised learning objective exploits the underlying data patterns to reduce the domain shift under the few-shot setting, thereby improving generalization. As a result, the proposed SEEN captures broader variations of the feature distributions and is better suited to cross-domain few-shot action recognition. Extensive experiments on multiple cross-domain benchmarks show that SEEN consistently outperforms several strong baselines by a convincing margin.
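To make the episodic few-shot setting described above concrete, the following is a minimal sketch of a prototypical-network-style episode: class prototypes are computed from a handful of labeled support examples and queries are classified by nearest prototype. All function names and the toy data are illustrative assumptions; this is not the paper's actual SEEN architecture, which additionally employs temporal modeling and a self-supervised objective.

```python
import numpy as np

# Hypothetical sketch of one N-way K-shot episode (prototypical-network
# style); names and data are assumptions, not SEEN's actual formulation.

def prototypes(support_feats, support_labels, n_classes):
    """Mean embedding per class over the labeled support set."""
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(query_feats, protos):
    """Assign each query to the nearest prototype (Euclidean distance)."""
    d = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# Toy 2-way 2-shot episode with 4-dimensional features.
rng = np.random.default_rng(0)
support = np.concatenate([rng.normal(0.0, 0.1, (2, 4)),   # class 0
                          rng.normal(1.0, 0.1, (2, 4))])  # class 1
labels = np.array([0, 0, 1, 1])
queries = np.concatenate([rng.normal(0.0, 0.1, (3, 4)),
                          rng.normal(1.0, 0.1, (3, 4))])

protos = prototypes(support, labels, 2)
preds = classify(queries, protos)
print(preds.tolist())  # well-separated toy classes -> [0, 0, 0, 1, 1, 1]
```

In a cross-domain variant such as the one the abstract describes, the embedding function producing these features would additionally be trained with a self-supervised loss on unlabeled target videos to narrow the domain gap.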
