Abstract

Existing scene graph generation (SGG) methods are far from practical, primarily due to their poor performance in predicting zero-shot (i.e., unseen) subject-predicate-object triples. We observe that these SGG methods treat images, along with the triples in them, independently and thus fail to exploit the rich relational information implicit in the triples of other images. To this end, our paper proposes a novel encoder-decoder SGG framework that leverages the semantic correlations between the triples of different images for the prediction of a zero-shot triple. Specifically, the encoder aggregates the triples in each image of the training set into a large knowledge graph and learns entity embeddings that capture the features of their neighborhoods with a relational graph neural network. The neighborhood-aware embeddings are then fed into the vision-based decoder to predict the predicates in images. Extensive experiments on the popular Visual Genome benchmark demonstrate that our proposed method outperforms the state-of-the-art methods on popular zero-shot metrics (i.e., zR@N, ngzR@N) for all SGG tasks.
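The following is a minimal, illustrative sketch of the encoder-decoder idea described above, not the authors' implementation: a relational message-passing encoder over the aggregated triple knowledge graph (one transformation per predicate type) produces neighborhood-aware entity embeddings, and a decoder fuses them with visual features of a subject-object pair to score predicates. All class and parameter names (e.g., `RelationalGraphEncoder`, `PredicateDecoder`, `vis_dim`) are hypothetical.

```python
import torch
import torch.nn as nn


class RelationalGraphEncoder(nn.Module):
    """Sketch of relational message passing over the knowledge graph
    built from all training triples (one weight matrix per predicate)."""

    def __init__(self, num_entities, num_predicates, dim):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.rel_weight = nn.Parameter(torch.randn(num_predicates, dim, dim) * 0.01)
        self.self_loop = nn.Linear(dim, dim)

    def forward(self, triples):
        # triples: LongTensor of shape (T, 3) holding (subject, predicate, object) ids.
        h = self.entity_emb.weight                                  # (E, dim)
        s, p, o = triples[:, 0], triples[:, 1], triples[:, 2]
        # Message from each object to its subject through the predicate-specific matrix.
        msg = torch.bmm(h[o].unsqueeze(1), self.rel_weight[p]).squeeze(1)   # (T, dim)
        agg = torch.zeros_like(h).index_add_(0, s, msg)             # sum messages per entity
        deg = torch.zeros(h.size(0), 1, device=h.device).index_add_(
            0, s, torch.ones(s.size(0), 1, device=h.device)).clamp(min=1)
        # Neighborhood-aware embeddings combine self features and normalized messages.
        return torch.relu(self.self_loop(h) + agg / deg)


class PredicateDecoder(nn.Module):
    """Sketch of a vision-based decoder: fuses visual features of a
    subject-object pair with their neighborhood-aware embeddings."""

    def __init__(self, vis_dim, dim, num_predicates):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + 2 * dim, dim)
        self.classify = nn.Linear(dim, num_predicates)

    def forward(self, pair_visual_feat, subj_emb, obj_emb):
        x = torch.cat([pair_visual_feat, subj_emb, obj_emb], dim=-1)
        return self.classify(torch.relu(self.fuse(x)))              # predicate logits
```

In this sketch, a zero-shot triple can still receive a sensible score because its subject and object embeddings aggregate evidence from triples observed in other images of the training set.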
