Abstract

Despite continuous performance improvements, contemporary Scene Graph (SG) systems tend to generate ‘fragmented’ graphs. A central problem is that standard metrics measure similarity to ground-truth graphs only at the triplet level and may not fully capture image relevance or semantic correctness. In particular, multiple triplet predictions are usually made for the same ground-truth regions, which amounts to a trivial way of inflating the standard evaluation metric, recall. The central purpose of our work is to reveal this inherent drawback of current SG evaluation methods and the resulting redundancy issue. We investigate different types of graph artifacts in SGs generated by existing models and propose two graph quality metrics to quantify the level of fragmentation. A detailed analysis shows how SG model architectures contribute to graph fragmentation. We study these problems in the context of graph semantic quality assessment. A qualitative human study is conducted to evaluate the consistency between the proposed metrics and human perception. To further validate this new source of error, we present a simple but effective method that targets graph fragmentation. Systematic experiments are conducted on the standard Visual Genome (VG) dataset and the Visual Relationship Detection (VRD) dataset. The results show that our proposed system significantly improves scene graph quality in terms of the new metrics as well as the traditional Top-N recall values.
