Abstract

Existing work on human-object interaction (HOI) detection usually relies on expensive, large-scale labeled image datasets. In real scenes, however, labeled data may be insufficient, and some rare HOI categories have very few samples, which poses great challenges for deep-learning-based HOI detection models. Prior methods tackle this by introducing compositional learning or word embeddings, but they still require large-scale labeled data or rely heavily on well-learned prior knowledge. In contrast, freely available unlabeled videos contain rich motion-relevant information that can help infer rare HOIs. In this article, we propose a multitask learning (MTL) perspective that assists HOI detection with motion-relevant knowledge learned from unlabeled videos. Specifically, we design an appearance reconstruction loss (ARL) and a sequential motion mining module, trained in a self-supervised manner, to learn more generalizable motion representations that promote the detection of rare HOIs. Moreover, to better transfer motion-related knowledge from unlabeled videos to HOI images, a domain discriminator is introduced to reduce the gap between the two domains. Extensive experiments on the HICO-DET dataset with rare categories and on the V-COCO dataset with minimal supervision demonstrate the effectiveness of the motion-aware knowledge implicit in unlabeled videos for HOI detection.
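The abstract does not give the architecture of the domain discriminator, but the underlying idea is standard adversarial domain confusion: a binary classifier is trained to tell video-domain features from image-domain features, while the feature extractor is updated with the reversed gradient so that the two feature distributions become indistinguishable. The sketch below is a minimal, self-contained illustration of the discriminator half of that game; all names, feature dimensions, and data are hypothetical assumptions, not the paper's actual model.

```python
import numpy as np

# Hypothetical sketch: a logistic-regression domain discriminator
# distinguishing video-domain features (label 1) from image-domain
# features (label 0). In adversarial training, the feature extractor
# would then be updated with the *reversed* gradient of this loss
# to shrink the domain gap. Shapes and data are illustrative only.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(w, b, feats, domain_labels):
    """Binary cross-entropy of the domain discriminator."""
    p = sigmoid(feats @ w + b)
    eps = 1e-9
    return -np.mean(domain_labels * np.log(p + eps)
                    + (1 - domain_labels) * np.log(1 - p + eps))

# Toy features: 8 video samples and 8 image samples, 4-D each,
# drawn from shifted Gaussians so the domains are separable.
video_feats = rng.normal(loc=1.0, size=(8, 4))
image_feats = rng.normal(loc=-1.0, size=(8, 4))
feats = np.vstack([video_feats, image_feats])
labels = np.concatenate([np.ones(8), np.zeros(8)])

# Plain gradient descent on the discriminator parameters.
w, b, lr = np.zeros(4), 0.0, 0.5
for _ in range(200):
    p = sigmoid(feats @ w + b)
    w -= lr * (feats.T @ (p - labels)) / len(labels)
    b -= lr * np.mean(p - labels)

# A low loss means the discriminator still separates the domains;
# the feature extractor's adversarial objective is to drive it back up.
final_loss = discriminator_loss(w, b, feats, labels)
print(final_loss)
```

In the full method, this minimax signal would be combined with the self-supervised ARL and motion-mining objectives in the multitask loss, so the shared features stay both motion-aware and domain-invariant.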
