To address the insufficient use of human–object interaction (HOI) information and spatial location information in images, we propose a human–object interaction detection network based on a graph structure and an improved cascade pyramid. The network consists of three branches: a graph branch, a human–object branch, and a human pose branch. In the graph branch, we propose a Graph-based Interactive Feature Generation Algorithm (GIFGA) to address the inadequate utilization of interaction information. GIFGA constructs an initial dense graph model with humans and objects as nodes and their interaction relationships as edges, then updates the graph by traversing each node to generate the final interaction features. In the human pose branch, we propose an Improved Cascade Pyramid Network (ICPN) to tackle the underutilization of spatial location information. ICPN extracts human pose features and maps both the object bounding boxes and the extracted human pose maps onto the global feature map to capture the most discriminative interaction-related region features within the global context. Finally, the features from the three branches are fused by a Multi-Layer Perceptron (MLP) and then classified to recognize interactions. Experimental results show that our network achieves an mAP of 54.93% on the V-COCO dataset and 28.69% on the HICO-DET dataset.
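As an illustration of the graph branch, the following is a minimal sketch of building a dense human–object graph and updating it by traversing each node. It is not the GIFGA implementation: the abstract does not specify the update rule, so the mean aggregation over neighbours, the equal-weight averaging, the concatenation of human and object features, and all function names (`build_dense_graph`, `update_nodes`, `interaction_features`) are assumptions made for illustration only.

```python
from itertools import product


def build_dense_graph(humans, objects):
    """Connect every human node to every object node.

    Assumption: 'dense' is read here as all human-object pairs;
    the paper may define the edge set differently.
    """
    return [(h, o) for h, o in product(humans, objects)]


def update_nodes(features, edges):
    """Traverse each node once and mix its feature with its neighbours' mean.

    Assumption: a simple mean aggregation stands in for the
    unspecified update rule.
    """
    neighbours = {node: [] for node in features}
    for h, o in edges:
        neighbours[h].append(o)
        neighbours[o].append(h)

    updated = {}
    for node, feat in features.items():
        nbrs = neighbours[node]
        if not nbrs:
            # Isolated node: keep its feature unchanged.
            updated[node] = list(feat)
            continue
        mean = [
            sum(features[n][i] for n in nbrs) / len(nbrs)
            for i in range(len(feat))
        ]
        updated[node] = [(a + b) / 2 for a, b in zip(feat, mean)]
    return updated


def interaction_features(features, edges):
    """Concatenate updated human and object features for each edge."""
    updated = update_nodes(features, edges)
    return {(h, o): updated[h] + updated[o] for h, o in edges}


# Toy example: one human, two objects, 2-dimensional features.
features = {"h1": [1.0, 0.0], "o1": [0.0, 1.0], "o2": [0.0, 3.0]}
edges = build_dense_graph(["h1"], ["o1", "o2"])
pair_features = interaction_features(features, edges)
```

In a full pipeline, each per-pair vector in `pair_features` would be combined with the features of the human–object and human pose branches before the MLP; that fusion step is omitted here.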