Abstract

Human–object interaction (HOI) detection aims to localize and classify triplets of human, object and interaction from a given image. Earlier two-stage methods suffer both from mutually independent training processes and the interference of redundant negative human–object pairs. Prevailing one-stage transformer-based methods are free from the above problems by tackling HOI in an end-to-end manner. However, one-stage transformer-based methods carry the unnecessary entanglements of the query for different tasks, i.e., human–object detection and interaction classification, and thus bring in poor performance. In this paper, we propose a new transformer-based approach that parallelly disentangles human–object detection and interaction classification in a triplet-wise manner. To make each query focus on one specific task clearly, we exhaustively disentangle HOI by parallelly expanding the naive query in vanilla transformer as triple explicit queries. Then, we introduce a semantic communication layer to preserve the consistent semantic association of each HOI through mixing the feature representations of each query triplet of the correspondence constraint. Extensive experiments demonstrate that our proposed framework outperforms the existing methods and achieves the state-of-the-art performance, with significant reduction in parameters and FLOPs.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call