Abstract

Scene Graph Generation (SGG) aims to parse an image into a set of semantics comprising objects and their relations. Current SGG methods stop at presenting intuitive detections in the image, such as the triplet "logo on board". Humans, however, can further refine such intuitive detections into rational descriptions like "flower painted on surfboard". Most existing methods formulate SGG as a one-shot task, limited to a single-pass pipeline that predicts all semantics at once. To address this limitation, we propose a novel multi-step reasoning scheme for SGG. Concretely, we split SGG into two explicit learning stages: an intuitive training stage (ITS) and a rational training stage (RTS). In the first stage, we follow the conventional SGG pipeline to detect objects and relationships, yielding an intuitive scene graph. In the second stage, we perform multi-step reasoning to refine this graph. Each reasoning step consists of two operations: mask and predict. Guided by the current predictions and their confidences, we repeatedly select and mask the low-confidence predictions, whose features are then re-optimized and re-predicted. After several iterations, the intuitive semantics are gradually revised with high confidence, yielding a rational scene graph. Extensive experiments on Visual Genome demonstrate the superiority of the proposed method, and additional ablation studies and visualization cases further validate its effectiveness.
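To make the mask-and-predict loop concrete, the sketch below shows one plausible reading of the rational training stage: at each step the least-confident relation predictions are replaced by a learned mask token, the features are re-encoded in context, and only the masked slots are re-classified. This is a minimal illustration, not the authors' implementation; the relation encoder, the names (RefinementStep, mask_token, mask_ratio), and the fixed number of steps are all assumptions.

```python
import torch
import torch.nn as nn

class RefinementStep(nn.Module):
    """One hypothetical mask-and-predict step: mask low-confidence relation
    features, re-encode them in context, and re-classify the masked slots."""
    def __init__(self, feat_dim: int, num_predicates: int):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))  # assumed learned mask embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.classifier = nn.Linear(feat_dim, num_predicates)

    def forward(self, rel_feats, probs, mask_ratio=0.3):
        # probs: (N, num_predicates) current predicate distributions
        conf, _ = probs.max(dim=-1)                        # confidence of each relation
        k = max(1, int(mask_ratio * conf.numel()))
        low_idx = conf.topk(k, largest=False).indices      # least-confident relations
        feats = rel_feats.clone()
        feats[low_idx] = self.mask_token                   # mask: hide weak predictions
        feats = self.encoder(feats.unsqueeze(0)).squeeze(0)  # re-encode in graph context
        new_probs = probs.clone()
        new_probs[low_idx] = self.classifier(feats[low_idx]).softmax(dim=-1)  # predict
        return feats, new_probs

# Iterating the step refines the intuitive graph toward a rational one (RTS).
feat_dim, num_pred, n_rel = 64, 51, 10
step = RefinementStep(feat_dim, num_pred)
rel_feats = torch.randn(n_rel, feat_dim)           # relation features from the ITS detector
probs = torch.rand(n_rel, num_pred).softmax(-1)    # intuitive (first-pass) predictions
with torch.no_grad():
    for _ in range(3):                             # multi-step reasoning
        rel_feats, probs = step(rel_feats, probs)
```

Under this reading, each iteration keeps the high-confidence triplets fixed and spends capacity only on the uncertain ones, which is why confidences tend to rise over successive steps.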
