Abstract

Image retrieval with text feedback has great potential when applied to product retrieval on e-commerce platforms. Given an input image and text feedback, the system needs to retrieve images that not only look visually similar to the input image but also incorporate the modified details described in the text feedback. This is a challenging task, as it requires a good understanding of the image, the text, and their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) with a Transformer-based combining function that preserves and transforms features of the input image according to the text feedback and captures important features of database images. By using image features extracted at different Convolutional Neural Network (CNN) depths, the combining function has access to multi-level visual information and produces a representation that is effective for image retrieval. We conduct quantitative and qualitative experiments on two datasets, CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can handle many types of text feedback, such as object attributes and natural language. We are also the first to observe a notable behavior of the attention mechanism in this task: it ignores input image regions that the text feedback asks to remove or change.
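The abstract's core idea, fusing multi-level CNN image features with text feedback through a Transformer-based combining function, can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' implementation: the backbone (ResNet-18), the GRU text encoder, the pooling strategy, and all dimensions are assumptions made purely for demonstration.

```python
# Hypothetical sketch of a multi-level image-text combining function.
# Not the ITMA implementation; module choices and sizes are illustrative only.
import torch
import torch.nn as nn
import torchvision.models as models


class MultiLevelImageTextCombiner(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, num_layers=2, vocab_size=10000):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Keep intermediate stages so features can be tapped at different CNN depths.
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        # Project each stage's channel dimension to a common embedding size.
        self.proj = nn.ModuleList([nn.Linear(c, embed_dim) for c in (64, 128, 256, 512)])
        # Placeholder text encoder: embedding + GRU.
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Transformer encoder attends jointly over image tokens and text tokens,
        # letting the text feedback decide which image information to keep or modify.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image, text_tokens):
        x = self.stem(image)
        img_tokens = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            # Global-average-pool each stage to one token, then project to embed_dim.
            img_tokens.append(proj(x.mean(dim=(2, 3))).unsqueeze(1))
        img_tokens = torch.cat(img_tokens, dim=1)                      # (B, 4, D)
        txt, _ = self.text_rnn(self.text_embed(text_tokens))           # (B, T, D)
        fused = self.transformer(torch.cat([img_tokens, txt], dim=1))  # (B, 4+T, D)
        # Mean-pool the fused sequence into a single query vector for retrieval.
        return fused.mean(dim=1)                                       # (B, D)
```

Under this sketch, the returned query vector would be compared (e.g. by cosine similarity) against embeddings of database images; the details of the retrieval loss and the database-image encoder are described in the paper itself.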
