Abstract

Detection transformers (DETR) offer a novel solution to human–object interaction detection in a set-prediction manner, thanks to expressive learnable queries. However, few studies have examined how queries implicitly shape model behavior, which may contain clues for model improvement. We therefore propose two dataset-based analysis tools: the score state space and the query preference map. They characterize the distribution of model predictions, revealing both overall-level and query-level model properties. Starting from our baseline model, we find that the model naturally regresses object boxes to overlap human boxes in the edge case of no-object verbs (stand, etc.), even without a related loss. We encourage this behavior by patching the supervision with virtual objects, resulting in more stable query preferences. Moreover, we show that two-stage decoders designed for cascade inference do not decouple tasks as intended. We infer that this is caused by the empty instances used as negative samples, which suggests a redesign of the matching scheme. Further, we reveal how adding an oracle-query-based teacher model affects query roles while yielding only a tiny gain, indicating room for refinement. Our findings demonstrate how a simple focus on query behavior can provide insights for improving models.
