Abstract

Detecting relationships between objects is important for complete understanding of visual scenes and benefits applications such as visual question answering, image search, and robotic interaction. It remains a challenging task, however, due to the high variation in object appearance and interactions, as well as often incomplete annotations. In this paper, we propose a fast method for visual relationship detection based on recurrent attention and negative sampling. First, to capture non-visual features, we use a Word2Vec model to extract semantic embeddings of object categories and binary masks to represent spatial locations. We then integrate a recurrent attention mechanism into the detection pipeline, enabling the network to focus on specific parts of an image when scoring predicates for a given object pair. Finally, we apply an undersampling technique to alleviate the influence of imbalanced annotations, which is particularly helpful for zero-shot detection. The proposed method is simple, yet experiments show that it is efficient and achieves state-of-the-art results on the VRD and Visual Genome (VG) benchmarks in most cases.
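As a rough illustration only (not the authors' implementation), the sketch below shows how the non-visual features and the undersampling step described above could be realized: a Word2Vec lookup for category embeddings, a binary mask rasterized from each bounding box, and a per-predicate cap on training triplets. The vector file path, the 32×32 mask resolution, and the annotation dictionary keys are assumptions made for this example.

```python
import random
from collections import defaultdict

import numpy as np
from gensim.models import KeyedVectors

# Assumed path to pretrained Word2Vec vectors; any word-embedding file works.
WORD_VECTORS = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)


def semantic_feature(category: str) -> np.ndarray:
    """Look up the Word2Vec embedding of an object category name (e.g. 'person')."""
    return WORD_VECTORS[category]


def spatial_mask(box, image_size, mask_size=32) -> np.ndarray:
    """Rasterize a bounding box (x1, y1, x2, y2) into a binary mask of
    mask_size x mask_size, scaled from the original image size (w, h)."""
    x1, y1, x2, y2 = box
    w, h = image_size
    mask = np.zeros((mask_size, mask_size), dtype=np.float32)
    c1, c2 = int(x1 / w * mask_size), int(np.ceil(x2 / w * mask_size))
    r1, r2 = int(y1 / h * mask_size), int(np.ceil(y2 / h * mask_size))
    mask[r1:r2, c1:c2] = 1.0
    return mask


def pair_features(subj, obj, image_size):
    """Non-visual features for one (subject, object) pair: concatenated
    category embeddings plus two stacked binary location masks."""
    semantic = np.concatenate([semantic_feature(subj["category"]),
                               semantic_feature(obj["category"])])
    spatial = np.stack([spatial_mask(subj["box"], image_size),
                        spatial_mask(obj["box"], image_size)])
    return semantic, spatial


def undersample(annotations, cap):
    """Keep at most `cap` triplets per predicate so that frequent predicates
    do not dominate training (a simple form of the balancing described above)."""
    by_predicate = defaultdict(list)
    for ann in annotations:
        by_predicate[ann["predicate"]].append(ann)
    balanced = []
    for anns in by_predicate.values():
        random.shuffle(anns)
        balanced.extend(anns[:cap])
    return balanced
```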
