Abstract
We present a retrieval framework for e-commerce applications that represents items multi-modally, combining textual descriptions and images of products, and indexes them with locality-sensitive hashing (LSH) for fast retrieval of potentially relevant products. We focus on a data-independent approach: the indexing mechanism does not depend on the specific dataset, while the multi-modal representation is learned beforehand. Specifically, we use CLIP, a multi-modal architecture, to learn a latent representation of items by combining text and images contrastively. The resulting item embeddings capture both the visual and textual information of the products and are then indexed with several types of LSH to balance result quality against retrieval speed. We report experiments on two real-world datasets sourced from e-commerce platforms, each comprising product images and textual descriptions. The results are promising in terms of both retrieval time and average precision, and were obtained with a specifically selected set of queries as well as synthetic queries generated using a Large Language Model.
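To illustrate the kind of data-independent indexing described above, the following is a minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity. The abstract does not state which LSH variants were used, so this choice is an assumption, as are all names here (`HyperplaneLSH`, `n_bits`, `n_tables`). The short vectors stand in for CLIP item embeddings, which would be computed separately.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed):
    """Draw n_bits random Gaussian hyperplanes; no dataset is consulted."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def cosine(a, b):
    """Cosine similarity, used to re-rank the candidates."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


class HyperplaneLSH:
    """Hypothetical multi-table index; more tables raise recall, more bits raise speed."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        self.tables = [
            (make_hyperplanes(dim, n_bits, seed + t), defaultdict(list))
            for t in range(n_tables)
        ]
        self.items = {}

    def add(self, item_id, vec):
        self.items[item_id] = vec
        for planes, buckets in self.tables:
            buckets[signature(vec, planes)].append(item_id)

    def query(self, vec, k=5):
        # Collect items sharing a bucket with the query in any table,
        # then rank only those candidates by exact cosine similarity.
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(signature(vec, planes), []))
        ranked = sorted(
            candidates, key=lambda i: cosine(vec, self.items[i]), reverse=True
        )
        return ranked[:k]


# Toy vectors standing in for item embeddings (assumption: real ones come from CLIP).
index = HyperplaneLSH(dim=4, n_bits=6, n_tables=3, seed=42)
index.add("a", [1.0, 0.0, 0.0, 0.0])
index.add("b", [0.0, 1.0, 0.0, 0.0])
index.add("c", [-1.0, 0.0, 0.0, 0.0])
top = index.query([2.0, 0.0, 0.0, 0.0], k=1)
```

In this sketch the number of bits per signature and the number of tables are the knobs that trade result quality against retrieval speed: longer signatures shrink the candidate set, while additional tables recover neighbours that a single table would miss.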