Abstract

Scene graph generation aims to describe the contents in scenes by identifying the objects and their relationships. In previous works, visual context is widely utilized in message passing networks to generate the representations for classification. However, the noisy estimation of visual context limits model performance. In this paper, we revisit the concept of incorporating visual context via a randomly ordered bidirectional Long Short Temporal Memory (biLSTM) based baseline, and show that noisy estimation is worse than random. To alleviate the problem, we propose a new method, dubbed Progressive Message Passing Network (PMP-Net) that better estimates the visual context in a coarse to fine manner. Specifically, we first estimate the visual context with a random initiated scene graph, then refine it with multi-head attention. The experimental results on the benchmark dataset Visual Genome show that PMP-Net achieves better or comparable performance on all three tasks: scene graph generation (SGGen), scene graph classification (SGCls), and predicate classification (PredCls).

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.