Abstract

In many visual domains (such as fashion and furniture), searching for products on online platforms requires matching textual queries to image content. For example, a user provides a search query in natural language (e.g., pink floral top) and the results returned are of a different modality (e.g., a set of images of pink floral tops). Recent work on multimodal representation learning enables such cross-modal matching by learning a common representation space for text and images. While such representations ensure that the n-dimensional representation of pink floral top is very close to the representations of the corresponding images, they do not ensure that the first k1 (< n) dimensions correspond to color, the next k2 (< n) dimensions correspond to style, and so on. In other words, they learn entangled representations in which individual dimensions do not correspond to specific attributes. We propose two simple variants that learn disentangled common representations for the fashion domain, wherein each dimension corresponds to a specific attribute (color, style, silhouette, etc.). Our proposed variants can be integrated with any existing multimodal representation learning method. We use a large fashion dataset of over 700K fashion items crawled from multiple fashion e-commerce portals to evaluate the learned representations on four different applications from the fashion domain: cross-modal image retrieval, visual search, image tagging, and query expansion. Our experimental results show that the proposed variants lead to better performance on each of these applications while learning disentangled representations.
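To make the partitioning idea concrete, below is a minimal sketch of how an attribute-wise aligned joint embedding could be trained. This is not the paper's method: the block sizes, the BLOCKS map, and the function disentangled_alignment_loss are all hypothetical, and the per-attribute supervision shown is one plausible choice among many.

    # Minimal sketch (not the authors' code): one way to encourage a
    # partitioned, attribute-disentangled joint embedding. Block sizes,
    # names, and the per-attribute supervision are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    # Hypothetical allocation of the n-dimensional joint space to attributes:
    # dims [0, 32) for color, [32, 64) for style, [64, 96) for silhouette.
    BLOCKS = {"color": (0, 32), "style": (32, 64), "silhouette": (64, 96)}

    def block(z, name):
        """Slice out the sub-space assigned to one attribute."""
        lo, hi = BLOCKS[name]
        return z[:, lo:hi]

    def disentangled_alignment_loss(z_img, z_txt, attr_labels):
        """Align image/text embeddings block by block.

        z_img, z_txt : (batch, n) joint embeddings from any multimodal encoder.
        attr_labels  : dict attr -> (batch,) integer labels; items sharing an
                       attribute value are pulled together only inside that
                       attribute's block, so each block specializes.
        """
        loss = 0.0
        for name in BLOCKS:
            zi = F.normalize(block(z_img, name), dim=1)
            zt = F.normalize(block(z_txt, name), dim=1)
            # Cross-modal alignment restricted to this attribute's block.
            loss = loss + (1 - (zi * zt).sum(dim=1)).mean()
            # Intra-modal pull: items with the same attribute value should be
            # close in this block, nudging it to encode only that attribute.
            y = attr_labels[name]
            same = (y[:, None] == y[None, :]).float()
            sim = zi @ zi.t()
            loss = loss + (same * (1 - sim)).sum() / same.sum().clamp(min=1)
        return loss

Because the loss is defined purely on slices of the final embedding, a sketch like this could in principle be added on top of any existing text and image encoders, which matches the paper's claim that the variants integrate with any multimodal representation learning method.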
