Abstract
We propose a novel few-shot learning framework that recognizes human–object interaction (HOI) classes from only a few labeled samples. We achieve this by leveraging a meta-learning paradigm in which human–object interactions are embedded into compact features for similarity computation. More specifically, the spatial and temporal relationships of HOIs in videos are modeled with transformers, which significantly boosts performance over the baseline. First, we present a spatial encoder that extracts spatial context and infers frame-level features of the human and objects in each frame. The video-level feature is then obtained by encoding the sequence of frame-level feature vectors with a temporal encoder. Experiments on two datasets, CAD-120 and Something-Else, validate that our approach achieves accuracy improvements of 7.8% and 15.2% on the 1-shot task and 4.7% and 15.7% on the 5-shot task, respectively, outperforming the state-of-the-art methods.
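To make the described pipeline concrete, the sketch below illustrates the two-stage encoding at a high level: a spatial transformer aggregates per-frame human/object features into a frame-level feature, a temporal transformer aggregates the frame sequence into a video-level embedding, and few-shot classification is done by similarity to class prototypes. This is a minimal illustration under assumed module names, shapes, and pooling choices, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class HOIFewShotEmbedder(nn.Module):
    """Hypothetical sketch: spatial encoder over per-frame human/object tokens,
    then temporal encoder over the resulting frame-level features.
    All names and hyperparameters here are illustrative assumptions."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers=layers)
        self.temporal_encoder = nn.TransformerEncoder(temporal_layer, num_layers=layers)

    def forward(self, tokens):
        # tokens: (batch, frames, entities, dim) -- human/object features per frame
        b, t, n, d = tokens.shape
        # Spatial encoding within each frame, pooled to one frame-level feature
        frame_feats = self.spatial_encoder(tokens.reshape(b * t, n, d)).mean(dim=1)
        # Temporal encoding over the frame sequence, pooled to one video-level feature
        video_feat = self.temporal_encoder(frame_feats.reshape(b, t, d)).mean(dim=1)
        return video_feat  # (batch, dim)

def classify(query_emb, prototypes):
    """Assign the query video to the class whose prototype embedding is most
    similar (cosine similarity) -- a standard metric-based few-shot step,
    shown here only as an assumed instantiation of the similarity calculation."""
    sims = torch.nn.functional.cosine_similarity(query_emb.unsqueeze(0), prototypes)
    return sims.argmax().item()
```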