Abstract
In this paper, we tackle the cross-lingual language-to-vision (CLLV) retrieval task. In the CLLV retrieval task, given the text query in one language, it seeks to retrieve the relevant images/videos from the database based on visual content in images/videos and their captions in another language. As the CLLV retrieval bridges the modal gap and the language gap, it makes many international cross-modal applications feasible. To tackle the CLLV retrieval, in this paper, we propose an assorted attention network (A2N) to synchronously overcome the language gap, bridge the modal gap and fuse features of two modals in an elegant and effective manner. It represents each text query as a set of word features and represents each image/video as a set of its caption's word features in another language and a set of its local visual features. In this case, the relevance between the text query and the image/video is obtained by the matching between the set of query's word features and two sets of image/video features. To enhance the effectiveness of the matching, A2N merges the query's word features and the image/video's visual and word features into an assorted set and further conducts the self-attention operation on items of the assorted set. On one hand, benefited from the attentions between the query's word features and the video/image's visual features, some important word features or visual features of the image/video can be emphasized. On the other hand, benefited from the attentions between the video/image's visual features and its caption word features, the image/video's visual content and the text information can be fused in a more effective manner. Systematic experiments conducted on four datasets demonstrate the effectiveness of the proposed A2N in the CLLV retrieval task.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.