Abstract

Human–Object Interaction (HOI) detection has attracted considerable attention in computer vision because it involves identifying and describing the interactions between humans and objects. Numerous approaches, both sequential and end-to-end, have been proposed to tackle this problem, with recent work focusing on end-to-end systems. This study presents an enhanced end-to-end transformer-based HOI detector built on HOTR, with three improvements. The proposed model improves instance representation through a simple yet effective mechanism, uses semantic information to provide contextual understanding and additional knowledge, and incorporates a cross-attention mechanism for fusing multi-level high-level feature maps within the Transformer architecture. Experimental results demonstrate significant performance gains over the baseline HOTR model, making the proposed model competitive with other state-of-the-art models on two widely used datasets: V-COCO and HICO-DET.
