Abstract

Referring expression comprehension aims to locate the target object described by a referring expression in an image, a task in which extracting semantic and discriminative visual information plays an important role. Most existing methods ignore either attribute information or context information during model learning, resulting in less effective visual features. In this paper, we propose a Multi-level Attention Network (MANet) that simultaneously extracts the attribute information of the target object and the context information of its surroundings: an Attribute Attention Module extracts the fine-grained visual information related to the referring expression, and a Context Attention Module merges the surrounding context information to learn more discriminative visual features. Experiments on common benchmark datasets show the effectiveness of our approach.
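Since the full text is not included here, the following is a minimal, hypothetical sketch of how the two modules described above might be structured and fused; the module names, feature shapes, additive attention form, and fusion scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeAttention(nn.Module):
    """Hypothetical attribute attention: weights fine-grained per-region
    visual features by their relevance to the referring-expression embedding."""

    def __init__(self, vis_dim, lang_dim, hidden_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, attr_feats, expr_feat):
        # attr_feats: (B, N, vis_dim) per-region attribute features
        # expr_feat:  (B, lang_dim)   referring-expression embedding
        q = self.lang_proj(expr_feat).unsqueeze(1)           # (B, 1, H)
        k = self.vis_proj(attr_feats)                        # (B, N, H)
        scores = self.score(torch.tanh(q + k)).squeeze(-1)   # (B, N)
        alpha = F.softmax(scores, dim=-1)                     # attention weights
        return torch.bmm(alpha.unsqueeze(1), attr_feats).squeeze(1)  # (B, vis_dim)


class ContextAttention(nn.Module):
    """Hypothetical context attention: aggregates features of the
    surrounding objects, again guided by the expression."""

    def __init__(self, vis_dim, lang_dim, hidden_dim=512):
        super().__init__()
        self.attn = AttributeAttention(vis_dim, lang_dim, hidden_dim)

    def forward(self, ctx_feats, expr_feat):
        # ctx_feats: (B, M, vis_dim) features of surrounding objects
        return self.attn(ctx_feats, expr_feat)


class MANetSketch(nn.Module):
    """Toy fusion of the two attended features into a matching score
    between a candidate object and the referring expression."""

    def __init__(self, vis_dim, lang_dim, hidden_dim=512):
        super().__init__()
        self.attr_attn = AttributeAttention(vis_dim, lang_dim, hidden_dim)
        self.ctx_attn = ContextAttention(vis_dim, lang_dim, hidden_dim)
        self.fuse = nn.Linear(2 * vis_dim + lang_dim, 1)

    def forward(self, attr_feats, ctx_feats, expr_feat):
        attr = self.attr_attn(attr_feats, expr_feat)          # target attribute feature
        ctx = self.ctx_attn(ctx_feats, expr_feat)             # surrounding context feature
        return self.fuse(torch.cat([attr, ctx, expr_feat], dim=-1))  # (B, 1) score
```

In this sketch, the candidate whose fused score is highest would be taken as the referred object; the actual scoring and training objective in MANet may differ.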
