Abstract

The key to multimodal learning is learning semantic alignment, i.e., the correspondence between sub-elements of instances from different modalities. Attention mechanisms have proven powerful for semantic alignment learning because they densely associate sub-elements across modalities. However, for each sub-element, existing attention attends to all sub-elements from the other modality, even though most of them have no correspondence with it, i.e., they are irrelevant sub-elements. Attending to these irrelevant sub-elements distracts the semantic alignment. In this paper, we propose a novel focal attention mechanism that learns more accurate semantic alignment. Focal attention sparsely attends to a subset of sub-elements, which are identified as relevant according to their posterior probabilities given each sub-element from the other modality. Based on the observation that relevant sub-elements mostly describe the same semantics, the posterior probability can precisely distinguish relevant from irrelevant sub-elements by taking intra-modality interactions into account: relevant sub-elements receive higher and more similar posterior probabilities, while irrelevant ones receive lower probabilities. This design learns better semantic alignment by preventing interference from irrelevant sub-elements, which benefits subsequent multimodal tasks that demand semantic alignment. To validate its effectiveness, we conduct extensive experiments on image-text matching and text-to-image generation, proposing a bidirectional and a stacked version of focal attention for these tasks, respectively. Experimental results on benchmarks show that focal attention significantly and consistently outperforms the state of the art.
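The core idea can be sketched as follows. This is a hypothetical simplification, not the paper's implementation: scores over the other modality's sub-elements are normalized into "posterior probabilities" via a softmax (so sub-elements from the same modality compete with each other), and only sub-elements whose posterior exceeds the uniform prior 1/N are kept in the focal subset; the chosen threshold and scoring function are illustrative assumptions.

```python
import numpy as np

def focal_attention(query, keys, values):
    """Sketch of focal attention for a single query sub-element.

    Assumed simplification: keys compete through a softmax, and only
    keys whose posterior exceeds the uniform prior 1/N are attended;
    the rest are masked out before the weighted sum over values.
    """
    scores = keys @ query                      # similarity of each key to the query
    posterior = np.exp(scores - scores.max())  # numerically stable softmax
    posterior /= posterior.sum()               # competition within the key modality
    n = len(keys)
    mask = posterior > 1.0 / n                 # keep only "relevant" sub-elements
    weights = np.where(mask, posterior, 0.0)
    weights /= weights.sum()                   # renormalize over the focal subset
    return weights @ values                    # sparse attention output
```

In contrast to standard attention, which would still assign nonzero weight to every key, the masking step zeroes out sub-elements whose posterior falls below the uniform prior, so irrelevant sub-elements cannot interfere with the aligned representation.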
