Abstract

Scene graph generation (SGGen) is a challenging task due to the complex visual context of an image. Intuitively, the human visual system can volitionally focus on attended regions driven by salient stimuli associated with visual cues. For example, to infer the relationship between a man and a horse, the interaction between the human leg and the horseback provides strong visual evidence for predicting the predicate ride. In addition, the attended face region can help determine the object man. To date, most existing works have studied SGGen by extracting coarse-grained bounding-box features, while understanding fine-grained visual regions has received limited attention. To mitigate this drawback, this article proposes a region-aware attention learning method. The key idea is to explicitly construct an attention space to explore salient regions for object and predicate inference. First, we extract a set of regions in an image with the standard detection pipeline, where each region regresses to an object. Second, we propose the object-wise attention graph neural network (GNN), which incorporates attention modules into the graph structure to discover attended regions for object inference. Third, we build the predicate-wise co-attention GNN to jointly highlight the subject's and object's attended regions for predicate inference. In particular, each subject-object pair is connected with one of the latent predicates to form a triplet. The proposed intra-triplet and inter-triplet learning mechanism helps discover pair-wise attended regions for predicate inference. Extensive experiments on two popular benchmarks demonstrate the superiority of the proposed method. Additional ablation studies and visualizations further validate its effectiveness.
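The abstract describes attention modules embedded in a graph structure over detected regions. As a rough illustration only, the following is a minimal sketch of attention-weighted message passing over region features, in the spirit of the object-wise attention GNN; the function name, weight matrices, and scaled dot-product scoring are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_aggregate(node_feats, adj, w_query, w_key):
    """One round of attention-weighted message passing over region nodes.

    node_feats: (N, d) region features; adj: (N, N) graph adjacency;
    w_query, w_key: (d, d) learned projections (hypothetical names).
    """
    q = node_feats @ w_query                    # (N, d) query projections
    k = node_feats @ w_key                      # (N, d) key projections
    scores = q @ k.T / np.sqrt(q.shape[1])      # (N, N) pairwise compatibility
    scores = np.where(adj > 0, scores, -1e9)    # restrict attention to graph edges
    alpha = softmax(scores, axis=1)             # attention weights over neighbors
    return alpha @ node_feats                   # context features per region
```

Each region thus aggregates features from the neighbors it attends to most strongly, which is one common way attention modules are incorporated into a GNN layer.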
