Abstract

Referring expression comprehension aims to locate the target object described by a referring expression in an image, a task in which extracting semantic and discriminative visual information plays an important role. Most existing methods ignore either attribute information or context information during model learning, resulting in less effective visual features. In this paper, we propose a Multi-level Attention Network (MANet) that simultaneously extracts the attribute information of the target object and the context information of its surroundings: an Attribute Attention Module extracts the fine-grained visual information related to the referring expression, and a Context Attention Module merges the surrounding context information to learn more discriminative visual features. Experiments on common benchmark datasets show the effectiveness of our approach.
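Since the full text is not included here, the following is a minimal, hypothetical sketch of how the two modules described above might be structured and fused; the module names, feature shapes, additive attention form, and fusion scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeAttention(nn.Module):
    """Hypothetical attribute attention: weights fine-grained per-region
    visual features by their relevance to the referring-expression embedding."""

    def __init__(self, vis_dim, lang_dim, hidden_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, attr_feats, expr_feat):
        # attr_feats: (B, N, vis_dim) per-region attribute features
        # expr_feat:  (B, lang_dim)   referring-expression embedding
        q = self.lang_proj(expr_feat).unsqueeze(1)           # (B, 1, H)
        k = self.vis_proj(attr_feats)                        # (B, N, H)
        scores = self.score(torch.tanh(q + k)).squeeze(-1)   # (B, N)
        alpha = F.softmax(scores, dim=-1)                     # attention weights
        return torch.bmm(alpha.unsqueeze(1), attr_feats).squeeze(1)  # (B, vis_dim)


class ContextAttention(nn.Module):
    """Hypothetical context attention: aggregates features of the
    surrounding objects, again guided by the expression."""

    def __init__(self, vis_dim, lang_dim, hidden_dim=512):
        super().__init__()
        self.attn = AttributeAttention(vis_dim, lang_dim, hidden_dim)

    def forward(self, ctx_feats, expr_feat):
        # ctx_feats: (B, M, vis_dim) features of surrounding objects
        return self.attn(ctx_feats, expr_feat)


class MANetSketch(nn.Module):
    """Toy fusion of the two attended features into a matching score
    between a candidate object and the referring expression."""

    def __init__(self, vis_dim, lang_dim, hidden_dim=512):
        super().__init__()
        self.attr_attn = AttributeAttention(vis_dim, lang_dim, hidden_dim)
        self.ctx_attn = ContextAttention(vis_dim, lang_dim, hidden_dim)
        self.fuse = nn.Linear(2 * vis_dim + lang_dim, 1)

    def forward(self, attr_feats, ctx_feats, expr_feat):
        attr = self.attr_attn(attr_feats, expr_feat)          # target attribute feature
        ctx = self.ctx_attn(ctx_feats, expr_feat)             # surrounding context feature
        return self.fuse(torch.cat([attr, ctx, expr_feat], dim=-1))  # (B, 1) score
```

In this sketch, the candidate whose fused score is highest would be taken as the referred object; the actual scoring and training objective in MANet may differ.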
