Abstract
We propose a novel few-shot learning framework that recognizes human–object interaction (HOI) classes from only a few labeled samples. We achieve this by leveraging a meta-learning paradigm in which human–object interactions are embedded into compact features for similarity computation. More specifically, the spatial and temporal relationships of HOIs in videos are modeled with transformers, which significantly boosts performance over the baseline. First, we present a spatial encoder that extracts spatial context and infers frame-level features of the human and objects in each frame. The video-level feature is then obtained by encoding the sequence of frame-level feature vectors with a temporal encoder. Experiments on two datasets, CAD-120 and Something-Else, validate that our approach achieves accuracy improvements of 7.8% and 15.2% on the 1-shot task and 4.7% and 15.7% on the 5-shot task, respectively, outperforming the state-of-the-art methods.
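To make the described pipeline concrete, the sketch below illustrates the two-stage encoding at a high level: a spatial transformer aggregates per-frame human/object features into a frame-level feature, a temporal transformer aggregates the frame sequence into a video-level embedding, and few-shot classification is done by similarity to class prototypes. This is a minimal illustration under assumed module names, shapes, and pooling choices, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class HOIFewShotEmbedder(nn.Module):
    """Hypothetical sketch: spatial encoder over per-frame human/object tokens,
    then temporal encoder over the resulting frame-level features.
    All names and hyperparameters here are illustrative assumptions."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, num_layers=layers)
        self.temporal_encoder = nn.TransformerEncoder(temporal_layer, num_layers=layers)

    def forward(self, tokens):
        # tokens: (batch, frames, entities, dim) -- human/object features per frame
        b, t, n, d = tokens.shape
        # Spatial encoding within each frame, pooled to one frame-level feature
        frame_feats = self.spatial_encoder(tokens.reshape(b * t, n, d)).mean(dim=1)
        # Temporal encoding over the frame sequence, pooled to one video-level feature
        video_feat = self.temporal_encoder(frame_feats.reshape(b, t, d)).mean(dim=1)
        return video_feat  # (batch, dim)

def classify(query_emb, prototypes):
    """Assign the query video to the class whose prototype embedding is most
    similar (cosine similarity) -- a standard metric-based few-shot step,
    shown here only as an assumed instantiation of the similarity calculation."""
    sims = torch.nn.functional.cosine_similarity(query_emb.unsqueeze(0), prototypes)
    return sims.argmax().item()
```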