Abstract
Human-Object Interaction (HOI) detection is a cornerstone of advanced visual understanding, aiming to identify relationships and interactions among different objects in images. Previous transformer-based methods commonly use traditional query embeddings to predict HOIs, but this approach suffers from slow training convergence. Although recent research defines HOI queries as reference points, the semantic information of these queries remains ambiguous and ignores differences in object scale. To address these issues, we propose, for the first time, using anchor boxes as queries for HOI detection, which significantly accelerates convergence. Furthermore, to enable anchor boxes to focus on HOI features efficiently, we design an end-to-end Specific Query Anchor Boxes (SQAB) network. Our method comprises a Hierarchical Detection Branch (HDB) and an Interaction Refinement Branch (IRB). HDB uses specific query anchor boxes for prediction on multi-scale feature maps and uses relation content queries to associate contextual information. IRB uses multi-scale body part masks to guide the model toward the key interaction regions between humans and objects, improving performance on interaction categories. Experimental results on the widely used HOI benchmark datasets (V-COCO, HICO-DET, and HOI-A) show that SQAB outperforms the baseline while requiring only 25 training epochs. On the HICO-DET and HOI-A datasets, mean average precision (mAP) increases by approximately 5.99% and 3.02%, respectively. On the V-COCO dataset, SQAB increases mAP by up to 10.57%.