Learning Explainable Disentangled Representations of E-Commerce Data by Aligning Their Visual and Textual Attributes

Katrien Laenen, Marie-Francine Moens

doi:10.3390/computers11120182

Abstract

Understanding multimedia content remains a challenging problem in e-commerce search and recommendation applications. It is difficult to obtain item representations that capture the relevant product attributes, since these attributes are fine-grained and scattered across product images with huge visual variations and product descriptions that are noisy and incomplete. In addition, the interpretability and explainability of item representations have become more important in order to make e-commerce applications more intelligible to humans. Multimodal disentangled representation learning, where the independent generative factors of multimodal data are identified and encoded in separate subsets of features in the feature space, is an interesting research area to explore in an e-commerce context given the benefits of the resulting disentangled representations, such as generalizability, robustness and interpretability. However, the characteristics of real-world e-commerce data, such as the extensive visual variation, noisy and incomplete product descriptions, and complex cross-modal relations of vision and language, together with the lack of an automatic interpretation method to explain the contents of disentangled representations, mean that current approaches for multimodal disentangled representation learning do not suffice for e-commerce data. Therefore, in this work, we design an explainable variational autoencoder framework (E-VAE) which leverages visual and textual item data to obtain disentangled item representations by jointly learning to disentangle the visual item data and to infer a two-level alignment of the visual and textual item data in a multimodal disentangled space. As such, E-VAE tackles the main challenges in disentangling multimodal e-commerce data. Firstly, with the weak supervision of the two-level alignment, our E-VAE learns to steer the disentanglement process towards discovering the relevant factors of variation in the multimodal data and to ignore irrelevant visual variations, which are abundant in e-commerce data. Secondly, to the best of our knowledge, our E-VAE is the first VAE-based framework that has an automatic interpretation mechanism that can explain the components of the disentangled item representations with text. With our textual explanations we provide insight into the quality of the disentanglement. Furthermore, we demonstrate that with our explainable disentangled item representations we achieve state-of-the-art outfit recommendation results on the Polyvore Outfits dataset and report new state-of-the-art cross-modal search results on the Amazon Dresses dataset.
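
The abstract describes jointly learning a disentangled representation and a visual-textual alignment, but does not state the objective. The sketch below is not the E-VAE loss. It is a generic illustration, under our own assumptions, of how a joint objective of this kind is commonly assembled: a reconstruction error, a weighted KL term as in beta-VAE-style disentanglement, and a hinge-based alignment term between an image embedding and a matching and non-matching text embedding. All function names, the cosine similarity, and the `beta`, `lam` and `margin` values are hypothetical choices made for illustration.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_hinge(img, pos_txt, neg_txt, margin=0.2):
    """Hinge loss: the matching text should score at least `margin` above the non-matching text."""
    return max(0.0, margin - cosine(img, pos_txt) + cosine(img, neg_txt))


def joint_loss(recon_error, mu, logvar, img, pos_txt, neg_txt, beta=4.0, lam=1.0):
    """Illustrative joint objective (assumed form, not taken from the paper):
    reconstruction error + beta * KL + lam * alignment hinge."""
    kl = kl_diag_gaussian(mu, logvar)
    align = alignment_hinge(img, pos_txt, neg_txt)
    return recon_error + beta * kl + lam * align


if __name__ == "__main__":
    # Toy values only; real inputs would come from image and text encoders.
    loss = joint_loss(
        recon_error=1.5,
        mu=[0.1, -0.2],
        logvar=[0.0, 0.0],
        img=[1.0, 0.0],
        pos_txt=[0.9, 0.1],
        neg_txt=[0.0, 1.0],
    )
    print(f"joint loss: {loss:.4f}")
```

In a sketch like this, raising `beta` puts more pressure on the latent code to match the factorized prior, while `lam` controls how strongly the text supervision shapes the latent space. The actual weighting and the two-level alignment used by E-VAE are defined in the full paper.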

Journal: Computers	Publication Date: Dec 10, 2022
Citations: 3	License type: CC BY 4.0

Similar Papers

Mutual Information Regularization for Weakly-Supervised RGB-D Salient Object Detection
Aixuan Li ... Jing Zhang
IEEE Transactions on Circuits and Systems for Video Technology | VOL. 34
01 Jan 2024

A review of disentangled representation learning for visual data processing and analysis
Yating Li ... Wenyi Chen
Journal of Image and Graphics | VOL. 28
01 Jan 2023

Learning Effective Representations from Sparse Multimodal Data on Content Curation Social Networks
Lifang Wu ... Xiuzhen Zhang
01 Nov 2019

Deep Multi-modal Latent Representation Learning for Automated Dementia Diagnosis
Tao Zhou ... Ling Shao
01 Jan 2019
