Abstract

Natural language processing has achieved remarkable performance on numerous tasks, but the potential of textual information has not been fully explored in visual saliency detection. In this paper, we learn to detect salient objects from natural language by addressing two essential issues: finding semantic content that matches the corresponding linguistic concept, and recovering fine details without any pixel-level annotations. We first propose the Feature Matching Network (FMN) to explore the internal relation between the linguistic concept and the visual image in the semantic space. The FMN simultaneously establishes textual-visual pairwise affinities and generates a language-aware coarse saliency map. To refine the coarse map, the Recurrent Fine-tune Network (RFN) progressively enhances the prediction through self-supervision. Our approach leverages only the caption to provide cues about the salient object, yet generates a fine-detailed foreground map at 72 FPS without any post-processing. Extensive experiments demonstrate that our method takes full advantage of the textual information in natural language for saliency detection, and performs favorably against state-of-the-art approaches on most existing datasets.
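To make the matching step concrete, the following is a minimal PyTorch sketch of how textual-visual pairwise affinities could produce a language-aware coarse saliency map. All names, dimensions, and the cosine-affinity formulation here are illustrative assumptions, not the authors' implementation of the FMN.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureMatchingSketch(nn.Module):
        """Hypothetical sketch: project a caption embedding and a visual
        feature map into a shared semantic space, then score each spatial
        location by its affinity with the text."""
        def __init__(self, text_dim=300, vis_dim=512, embed_dim=256):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, embed_dim)
            self.vis_proj = nn.Conv2d(vis_dim, embed_dim, kernel_size=1)

        def forward(self, text_feat, vis_feat):
            # text_feat: (B, text_dim) sentence embedding of the caption
            # vis_feat:  (B, vis_dim, H, W) backbone feature map
            t = F.normalize(self.text_proj(text_feat), dim=1)   # (B, D)
            v = F.normalize(self.vis_proj(vis_feat), dim=1)     # (B, D, H, W)
            # Pairwise affinity: cosine similarity per spatial location.
            affinity = torch.einsum('bd,bdhw->bhw', t, v)       # (B, H, W)
            # Squash to (0, 1) to read as a coarse saliency map.
            return torch.sigmoid(affinity).unsqueeze(1)         # (B, 1, H, W)

    if __name__ == '__main__':
        model = FeatureMatchingSketch()
        caption_emb = torch.randn(2, 300)
        image_feat = torch.randn(2, 512, 28, 28)
        coarse_map = model(caption_emb, image_feat)
        print(coarse_map.shape)  # torch.Size([2, 1, 28, 28])

Under these assumptions, the coarse map would then be handed to a refinement stage such as the RFN, which the abstract describes as improving the prediction progressively through self-supervision rather than pixel-level labels.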
