Abstract

Scene graphs connect individual objects through visual relationships and serve as a comprehensive scene representation for downstream multimodal tasks. However, by examining recent progress in Scene Graph Generation (SGG), we find that the performance of recent works is strongly limited by pairwise relationship modeling based on naive feature concatenation. Such pairwise features lack sufficient object interaction because the object parts are misaligned, resulting in non-discriminative pairwise features for visual relationship prediction. For example, a naively concatenated pairwise feature often makes the model fail to discriminate between "riding" and "feeding" for the object pair "person" and "horse". To this end, we design a meta-architecture, learning-to-align, for dynamic object feature concatenation, and we call our model Align R-CNN. Specifically, we introduce a novel attention-based multi-region alignment module that can be jointly optimized with SGG. Experiments on the large-scale SGG benchmark Visual Genome show that the proposed Align R-CNN can replace naive feature concatenation and thereby improve all existing SGG methods.
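The abstract contrasts naive feature concatenation with attention-based region alignment but gives no implementation details. The following is a minimal NumPy sketch, under our own assumptions, of the difference: naive concatenation simply stacks two pooled object vectors, while an alignment step lets each subject part attend over object parts before concatenation, so interacting parts (e.g. a person's legs and a horse's back for "riding") are paired up. All function names and shapes here are illustrative, not the paper's actual module.

```python
import numpy as np

def naive_concat(subj_feat, obj_feat):
    # Naive pairwise feature: fixed concatenation of two pooled
    # object feature vectors, with no part-level interaction.
    return np.concatenate([subj_feat, obj_feat])

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_aligned_concat(subj_regions, obj_regions):
    # Hypothetical attention-based alignment: subj_regions is (S, D),
    # obj_regions is (O, D), one row per part region.
    scores = subj_regions @ obj_regions.T        # (S, O) part affinities
    attn = softmax(scores, axis=1)               # align each subject part
    aligned_obj = attn @ obj_regions             # (S, D) attended object parts
    # Concatenate each subject part with its aligned object counterpart.
    return np.concatenate([subj_regions, aligned_obj], axis=1).reshape(-1)
```

In this sketch the alignment weights are differentiable functions of the region features, so such a module could in principle be trained jointly with the relationship classifier, as the abstract describes.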

