Abstract

Detecting relationships between objects is important for complete understanding of visual scenes and benefits applications such as visual question answering, image search, and robotic interaction. It remains a challenging task, however, due to the high variation in object appearance and interactions, as well as often incomplete annotations. In this paper, we propose a fast method for visual relationship detection based on recurrent attention and negative sampling. First, to capture non-visual features, we use a Word2Vec model to extract semantic embeddings of object categories and binary masks to represent spatial locations. We then integrate a recurrent attention mechanism into the detection pipeline, enabling the network to focus on specific parts of an image when scoring predicates for a given object pair. Finally, we apply an undersampling technique to alleviate the influence of imbalanced annotations, which is particularly helpful for zero-shot detection. The proposed method is simple, yet experiments show that it is efficient and achieves state-of-the-art results on the VRD and Visual Genome (VG) benchmarks in most cases.
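As a rough illustration only (not the authors' implementation), the sketch below shows how the non-visual features and the undersampling step described above could be realized: a Word2Vec lookup for category embeddings, a binary mask rasterized from each bounding box, and a per-predicate cap on training triplets. The vector file path, the 32×32 mask resolution, and the annotation dictionary keys are assumptions made for this example.

```python
import random
from collections import defaultdict

import numpy as np
from gensim.models import KeyedVectors

# Assumed path to pretrained Word2Vec vectors; any word-embedding file works.
WORD_VECTORS = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)


def semantic_feature(category: str) -> np.ndarray:
    """Look up the Word2Vec embedding of an object category name (e.g. 'person')."""
    return WORD_VECTORS[category]


def spatial_mask(box, image_size, mask_size=32) -> np.ndarray:
    """Rasterize a bounding box (x1, y1, x2, y2) into a binary mask of
    mask_size x mask_size, scaled from the original image size (w, h)."""
    x1, y1, x2, y2 = box
    w, h = image_size
    mask = np.zeros((mask_size, mask_size), dtype=np.float32)
    c1, c2 = int(x1 / w * mask_size), int(np.ceil(x2 / w * mask_size))
    r1, r2 = int(y1 / h * mask_size), int(np.ceil(y2 / h * mask_size))
    mask[r1:r2, c1:c2] = 1.0
    return mask


def pair_features(subj, obj, image_size):
    """Non-visual features for one (subject, object) pair: concatenated
    category embeddings plus two stacked binary location masks."""
    semantic = np.concatenate([semantic_feature(subj["category"]),
                               semantic_feature(obj["category"])])
    spatial = np.stack([spatial_mask(subj["box"], image_size),
                        spatial_mask(obj["box"], image_size)])
    return semantic, spatial


def undersample(annotations, cap):
    """Keep at most `cap` triplets per predicate so that frequent predicates
    do not dominate training (a simple form of the balancing described above)."""
    by_predicate = defaultdict(list)
    for ann in annotations:
        by_predicate[ann["predicate"]].append(ann)
    balanced = []
    for anns in by_predicate.values():
        random.shuffle(anns)
        balanced.extend(anns[:cap])
    return balanced
```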
