Scene Graph Generation (SGG) plays an important role in scene understanding because the objects and relations in an image can be abstracted into a concise topological graph. Owing to the complexity of visual scenes, including mutual occlusion between objects and semantic ambiguity, SGG remains a challenging task. Most existing models focus only on the context of a single object, while the context provided by paired objects is ignored. In this paper, we propose an Attention Redirection Transformer (ART) that specifically extracts pair-level context through two stages: an attention distraction stage and an attention integration stage. In this way, the model's attention is forced to be redirected, which uncovers implicit information in the background. In addition, to incorporate the semantic information of predicates, we design a Semantic Oriented Learning Module (SOL), which helps learn better textual semantics and promotes cross-modal information fusion. Finally, we design a self-diversity-driven Dual Translation Embedding Module (DTM), which refines the representations of the subject and object and makes them distinct. Experimental results on the Visual Genome dataset demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art methods on the mR@K metric. The source code is available on GitHub: https://github.com/Nora-Zhang98/ART-SOL.
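To make the two-stage idea concrete, below is a minimal sketch of attention redirection. It is an illustrative interpretation, not the paper's implementation: the module name, the top-k masking heuristic for the distraction stage, and the concatenation-based integration stage are all assumptions; the authoritative code is in the linked repository.

```python
import torch
import torch.nn as nn


class AttentionRedirectionSketch(nn.Module):
    """Hypothetical two-stage attention. A distraction stage masks the
    most-attended context tokens, forcing a second pass to redirect
    attention toward background tokens; an integration stage then fuses
    the focused and redirected views. Purely illustrative."""

    def __init__(self, dim: int, heads: int = 8, top_k: int = 5):
        super().__init__()
        self.attn_focus = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_redirect = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # integration stage
        self.top_k = top_k

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Ordinary attention pass: find where the model currently looks.
        focused, weights = self.attn_focus(queries, context, context)
        # Distraction stage: mask the top-k most-attended context tokens
        # so the second pass must attend to the remaining (background) ones.
        mean_w = weights.mean(dim=1)                       # (B, S)
        topk = mean_w.topk(self.top_k, dim=-1).indices
        mask = torch.zeros_like(mean_w, dtype=torch.bool)
        mask.scatter_(1, topk, True)                       # True = ignore
        redirected, _ = self.attn_redirect(
            queries, context, context, key_padding_mask=mask
        )
        # Integration stage: fuse the focused and redirected views.
        return self.fuse(torch.cat([focused, redirected], dim=-1))
```

The design choice sketched here is that redirection is enforced structurally (by masking) rather than learned implicitly, which matches the abstract's claim that attention is "forced" to be redirected toward background information.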
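As background for the DTM, translation embeddings in the TransE/VTransE tradition model a relation triplet <subject, predicate, object> as a translation in embedding space; DTM presumably refines this standard constraint so that subject and object representations stay distinct:

```latex
% Standard translation-embedding constraint (TransE/VTransE style):
\mathbf{v}_{s} + \mathbf{v}_{p} \approx \mathbf{v}_{o}
```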