Abstract

Relations between objects drive our understanding of images. Modelling them poses several challenges due to the combinatorial nature of the problem and the complex structure of natural language. This paper tackles the task of predicting relationships in the form of (subject, relation, object) triplets from still images. To address these challenges, we propose a framework for learning relation prototypes that aims to capture the complex nature of relation distributions. Concurrently, a network is trained to define a space in which relationship triplets with similar spatial layouts, interacting objects and relations are clustered together. Finally, the network is compared to two models that explicitly tackle the problem of synonymy among relations. For this, two well-known scene-graph labelling benchmarks are used for training and testing: VRD and Visual Genome. Predicting relations based on their distance to the learned prototypes provides a significant increase in the diversity of predicted relations, improving the average relation recall from 40.3% to 41.7% on VRD and from 31.3% to 35.4% on Visual Genome.
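The abstract does not spell out the prediction rule, but the distance-to-prototype idea can be illustrated with a minimal sketch. Everything below (the toy relation vocabulary, the `embed_triplet` placeholder, the embedding dimension, the random projection) is a hypothetical stand-in rather than the paper's actual network or training procedure; the sketch only shows that each relation owns a prototype vector in a learned space and that candidate subject-object pairs are scored by their distance to those prototypes.

```python
import numpy as np

rng = np.random.default_rng(0)

RELATIONS = ["on", "above", "holding", "wearing"]  # toy relation vocabulary
FEATURE_DIM = 32   # dimensionality of the raw pair features (assumed)
EMBED_DIM = 16     # dimensionality of the shared embedding space (assumed)

# One prototype vector per relation; in the actual framework these would be
# learned jointly with the embedding network. Random placeholders here.
prototypes = rng.normal(size=(len(RELATIONS), EMBED_DIM))

# Placeholder for the learned embedding network: a fixed random projection
# from pair features (appearance + spatial layout) into the prototype space.
projection = rng.normal(size=(FEATURE_DIM, EMBED_DIM))

def embed_triplet(pair_features: np.ndarray) -> np.ndarray:
    """Map features of a candidate (subject, object) pair into the embedding space."""
    return pair_features @ projection

def rank_relations(pair_features: np.ndarray) -> list[tuple[str, float]]:
    """Rank candidate relations by Euclidean distance to their prototypes (closest first)."""
    z = embed_triplet(pair_features)
    dists = np.linalg.norm(prototypes - z, axis=1)
    order = np.argsort(dists)
    return [(RELATIONS[i], float(dists[i])) for i in order]

# Example: score one candidate pair (e.g. features extracted for person/horse boxes).
pair_features = rng.normal(size=(FEATURE_DIM,))
print(rank_relations(pair_features))
```

Ranking every relation by distance, rather than taking a single argmax over classifier logits, is what lets semantically close relations (synonyms with nearby prototypes) receive comparable scores, which is consistent with the reported gain in the diversity of predicted relations.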
