Abstract

In recent years, multimodal tasks have been receiving increasing attention. Referring expression comprehension aims to localize the object referred to by a natural language sentence. The challenge of this task is that the system must both recognize image content and understand the text in order to determine the corresponding target. In this paper, an interpretable method is proposed to perform referring expression comprehension by decomposing the input sentence into semantic units. The proposed method learns joint text and visual features and generates an attention map over the target object. A visualization of the comprehension process can be obtained at the end to further analyze how the deep model interprets the query object.
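To make the idea of matching a decomposed semantic unit against visual features and producing an attention map more concrete, below is a minimal illustrative sketch (not the paper's actual architecture): a hypothetical PhraseVisualAttention module that projects a phrase embedding and spatial visual features into a joint space and computes a soft attention map over image locations. All names, dimensions, and design choices here are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseVisualAttention(nn.Module):
    """Scores each spatial location of a visual feature map against a
    phrase (semantic unit) embedding and returns a soft attention map.
    Illustrative sketch only; not the method described in the paper."""

    def __init__(self, visual_dim: int, text_dim: int, joint_dim: int = 256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)  # project visual features to joint space
        self.text_proj = nn.Linear(text_dim, joint_dim)      # project phrase embedding to joint space

    def forward(self, visual_feats: torch.Tensor, phrase_emb: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, H*W, visual_dim); phrase_emb: (B, text_dim)
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)           # (B, H*W, joint_dim)
        t = F.normalize(self.text_proj(phrase_emb), dim=-1).unsqueeze(2)  # (B, joint_dim, 1)
        scores = torch.bmm(v, t).squeeze(-1)                              # (B, H*W) cosine-style scores
        return F.softmax(scores, dim=-1)                                  # attention over spatial locations

# Example usage: a single image with a 7x7 feature grid and one phrase embedding.
attn = PhraseVisualAttention(visual_dim=2048, text_dim=300)
visual = torch.randn(1, 49, 2048)
phrase = torch.randn(1, 300)
attention_map = attn(visual, phrase).view(1, 7, 7)  # can be rendered as a heat map for interpretability
```

Reshaping the attention weights back to the spatial grid is what allows the comprehension process to be visualized as a heat map over the image.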
