Abstract

Image retrieval with text feedback has great potential when applied to product retrieval on e-commerce platforms. Given an input image and text feedback, the system needs to retrieve images that not only look visually similar to the input image but also incorporate the modified details described in the text feedback. This is a challenging task, as it requires a good understanding of the image, the text, and their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) with a Transformer-based combining function that preserves and transforms features of the input image according to the text feedback and captures important features of database images. By using image features extracted at different Convolutional Neural Network (CNN) depths, the combining function has access to multi-level visual information and produces a representation that is effective for image retrieval. We conduct quantitative and qualitative experiments on two datasets, CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can handle many types of text feedback, such as object attributes and natural language. We are also the first to observe a notable behavior of the attention mechanism in this task: it ignores input image regions that the text feedback asks to remove or change.
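The abstract's core idea, fusing multi-level CNN image features with text feedback through a Transformer-based combining function, can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' implementation: the backbone (ResNet-18), the GRU text encoder, the pooling strategy, and all dimensions are assumptions made purely for demonstration.

```python
# Hypothetical sketch of a multi-level image-text combining function.
# Not the ITMA implementation; module choices and sizes are illustrative only.
import torch
import torch.nn as nn
import torchvision.models as models


class MultiLevelImageTextCombiner(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, num_layers=2, vocab_size=10000):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Keep intermediate stages so features can be tapped at different CNN depths.
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        # Project each stage's channel dimension to a common embedding size.
        self.proj = nn.ModuleList([nn.Linear(c, embed_dim) for c in (64, 128, 256, 512)])
        # Placeholder text encoder: embedding + GRU.
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Transformer encoder attends jointly over image tokens and text tokens,
        # letting the text feedback decide which image information to keep or modify.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image, text_tokens):
        x = self.stem(image)
        img_tokens = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            # Global-average-pool each stage to one token, then project to embed_dim.
            img_tokens.append(proj(x.mean(dim=(2, 3))).unsqueeze(1))
        img_tokens = torch.cat(img_tokens, dim=1)                      # (B, 4, D)
        txt, _ = self.text_rnn(self.text_embed(text_tokens))           # (B, T, D)
        fused = self.transformer(torch.cat([img_tokens, txt], dim=1))  # (B, 4+T, D)
        # Mean-pool the fused sequence into a single query vector for retrieval.
        return fused.mean(dim=1)                                       # (B, D)
```

Under this sketch, the returned query vector would be compared (e.g. by cosine similarity) against embeddings of database images; the details of the retrieval loss and the database-image encoder are described in the paper itself.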
