Abstract
Natural language processing has achieved remarkable performance on a wide range of tasks, but the potential of textual information remains largely unexplored in visual saliency detection. In this paper, we learn to detect salient objects from natural language by addressing two essential issues: finding semantic content that matches the corresponding linguistic concept, and recovering fine details without any pixel-level annotations. We first propose the Feature Matching Network (FMN) to explore the internal relation between the linguistic concept and the visual image in a semantic space. The FMN simultaneously establishes textual-visual pairwise affinities and generates a language-aware coarse saliency map. To refine the coarse map, we propose the Recurrent Fine-tune Network (RFN), which progressively improves the prediction through self-supervision. Our approach leverages only the caption to provide cues about the salient object, yet generates a finely detailed foreground map at a detection speed of 72 FPS without any post-processing. Extensive experiments demonstrate that our method takes full advantage of the textual information in natural language for saliency detection, and performs favorably against state-of-the-art approaches on most existing datasets.
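The abstract describes the FMN's textual-visual matching only at a high level. The sketch below is a minimal illustration of one plausible form of such pairwise affinity matching, not the authors' implementation: it assumes a caption embedding from some sentence encoder and a CNN feature map, computes a cosine affinity between the text vector and every spatial location, and reads that affinity out as a coarse saliency map. All function and tensor names are illustrative assumptions.

```python
# Hypothetical sketch of textual-visual affinity matching; NOT the paper's FMN.
import torch
import torch.nn.functional as F

def coarse_saliency(text_emb: torch.Tensor, vis_feat: torch.Tensor) -> torch.Tensor:
    """text_emb: (B, C) caption embedding; vis_feat: (B, C, H, W) image features.
    Returns a (B, 1, H, W) coarse saliency map with values in [0, 1]."""
    B, C, H, W = vis_feat.shape
    # L2-normalize both modalities so their dot product is a cosine affinity.
    t = F.normalize(text_emb, dim=1)             # (B, C)
    v = F.normalize(vis_feat.flatten(2), dim=1)  # (B, C, H*W)
    # Pairwise textual-visual affinity at every spatial location.
    affinity = torch.einsum('bc,bcn->bn', t, v)  # (B, H*W)
    # Squash to [0, 1] and reshape into a coarse saliency map.
    return torch.sigmoid(affinity).view(B, 1, H, W)

# Toy usage with random tensors standing in for real caption/image encoders.
if __name__ == "__main__":
    text = torch.randn(2, 512)           # e.g. output of a sentence encoder
    feats = torch.randn(2, 512, 14, 14)  # e.g. a CNN backbone feature map
    print(coarse_saliency(text, feats).shape)  # torch.Size([2, 1, 14, 14])
```

Such a coarse map is low-resolution by construction, which is consistent with the paper's need for a second stage (the RFN) to progressively recover fine detail without pixel-level supervision.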