Abstract

Existing scene graph generation (SGG) methods are far from practical, primarily due to their poor performance in predicting zero-shot (i.e., unseen) subject-predicate-object triples. We observe that these SGG methods treat images, along with the triples in them, independently and thus fail to exploit the rich relational information implicit in the triples of other images. To this end, our paper proposes a novel encoder-decoder SGG framework that leverages the semantic correlations between the triples of different images for the prediction of a zero-shot triple. Specifically, the encoder aggregates the triples in each image of the training set into a large knowledge graph and learns entity embeddings that capture the features of their neighborhoods with a relational graph neural network. The neighborhood-aware embeddings are then fed into the vision-based decoder to predict the predicates in images. Extensive experiments on the popular Visual Genome benchmark demonstrate that our proposed method outperforms the state-of-the-art methods on popular zero-shot metrics (i.e., zR@N, ngzR@N) for all SGG tasks.
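The following is a minimal, illustrative sketch of the encoder-decoder idea described above, not the authors' implementation: a relational message-passing encoder over the aggregated triple knowledge graph (one transformation per predicate type) produces neighborhood-aware entity embeddings, and a decoder fuses them with visual features of a subject-object pair to score predicates. All class and parameter names (e.g., `RelationalGraphEncoder`, `PredicateDecoder`, `vis_dim`) are hypothetical.

```python
import torch
import torch.nn as nn


class RelationalGraphEncoder(nn.Module):
    """Sketch of relational message passing over the knowledge graph
    built from all training triples (one weight matrix per predicate)."""

    def __init__(self, num_entities, num_predicates, dim):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.rel_weight = nn.Parameter(torch.randn(num_predicates, dim, dim) * 0.01)
        self.self_loop = nn.Linear(dim, dim)

    def forward(self, triples):
        # triples: LongTensor of shape (T, 3) holding (subject, predicate, object) ids.
        h = self.entity_emb.weight                                  # (E, dim)
        s, p, o = triples[:, 0], triples[:, 1], triples[:, 2]
        # Message from each object to its subject through the predicate-specific matrix.
        msg = torch.bmm(h[o].unsqueeze(1), self.rel_weight[p]).squeeze(1)   # (T, dim)
        agg = torch.zeros_like(h).index_add_(0, s, msg)             # sum messages per entity
        deg = torch.zeros(h.size(0), 1, device=h.device).index_add_(
            0, s, torch.ones(s.size(0), 1, device=h.device)).clamp(min=1)
        # Neighborhood-aware embeddings combine self features and normalized messages.
        return torch.relu(self.self_loop(h) + agg / deg)


class PredicateDecoder(nn.Module):
    """Sketch of a vision-based decoder: fuses visual features of a
    subject-object pair with their neighborhood-aware embeddings."""

    def __init__(self, vis_dim, dim, num_predicates):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + 2 * dim, dim)
        self.classify = nn.Linear(dim, num_predicates)

    def forward(self, pair_visual_feat, subj_emb, obj_emb):
        x = torch.cat([pair_visual_feat, subj_emb, obj_emb], dim=-1)
        return self.classify(torch.relu(self.fuse(x)))              # predicate logits
```

In this sketch, a zero-shot triple can still receive a sensible score because its subject and object embeddings aggregate evidence from triples observed in other images of the training set.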
