Abstract

Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects by inferring triplets of 〈human, verb, object〉. Recent HOI detection methods infer HOIs by directly extracting appearance features and spatial configurations from the related human and object targets, but neglect the powerful interactive semantic reasoning between these targets. Meanwhile, existing spatial encodings of visual targets are simply concatenated with appearance features, which cannot dynamically promote visual feature learning. To solve these problems, we first present a novel semantic-based Interactive Reasoning Block, in which the interactive semantics implied among visual targets are efficiently exploited. Beyond inferring HOIs from discrete instance features, we then design an HOI Inferring Structure to parse pairwise interactive semantics among visual targets at both the scene-wide and instance-wide levels. Furthermore, we propose a Spatial Guidance Model based on the locations of human body parts and the object, which serves as a geometric guidance to dynamically enhance visual feature learning. Based on the above modules, we construct a framework named Interactive-Net for HOI detection, which is fully differentiable and end-to-end trainable. Extensive experiments show that our proposed framework outperforms existing HOI detection methods on both the V-COCO and HICO-DET benchmarks, improving over the baseline by about 5.9% and 17.7% relative, respectively, validating its efficacy in detecting HOIs.
