Fashion compatibility modeling, which estimates how well a given set of fashion items matches, has received increasing attention in recent years. However, existing studies often fail to fully leverage multimodal information or overlook the semantic guidance that clothing categories provide for enhancing the reliability of multimodal information. In this paper, we propose a fashion compatibility modeling approach with a category-aware multimodal attention network, termed FCM-CMAN. In FCM-CMAN, we focus on enriching and aggregating multimodal representations of fashion items by jointly exploiting dynamic category representations and a contextual attention mechanism. Specifically, considering that category correlations are dynamic and vary across fashion items, we design a categorical dynamic graph convolutional network to adaptively learn the semantic correlations between categories. By combining these correlations with the multi-layer visual outputs of a convolutional neural network and the surrounding contextual information, we obtain content-aware category representations and context-aware attention weights that characterize fashion items from different aspects. On this basis, the two kinds of aware information are integrated via a multimodal factorized bilinear pooling strategy to generate visual-semantic embeddings, which are further refined by a multi-head self-attention mechanism to capture the elements most relevant to fashion compatibility. Extensive experiments conducted on the FashionVC and ExpFashion datasets demonstrate the superiority of FCM-CMAN over state-of-the-art methods.
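As a point of reference, the sketch below illustrates the multimodal factorized bilinear (MFB) pooling step named in the abstract, fusing a visual feature with a semantic (category) embedding. The layer names, dimensions, and normalization choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multimodal factorized bilinear (MFB) pooling for fusing
# a visual embedding with a category (semantic) embedding. All dimensions
# and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    def __init__(self, visual_dim=2048, semantic_dim=300, factor_k=5, out_dim=512):
        super().__init__()
        self.factor_k = factor_k
        self.out_dim = out_dim
        # Project each modality into a shared (k * out_dim)-dimensional factor space.
        self.proj_visual = nn.Linear(visual_dim, factor_k * out_dim)
        self.proj_semantic = nn.Linear(semantic_dim, factor_k * out_dim)

    def forward(self, visual, semantic):
        # Element-wise product in the factor space approximates bilinear interaction.
        fused = self.proj_visual(visual) * self.proj_semantic(semantic)
        # Sum-pool over the k factors, then apply power and L2 normalization.
        fused = fused.view(-1, self.out_dim, self.factor_k).sum(dim=2)
        fused = torch.sign(fused) * torch.sqrt(torch.abs(fused) + 1e-12)
        return F.normalize(fused, dim=1)

# Example: fuse a CNN visual feature with a category embedding for one item.
fusion = MFBFusion()
item_embedding = fusion(torch.randn(1, 2048), torch.randn(1, 300))  # shape (1, 512)
```

In the full model described by the abstract, such fused visual-semantic embeddings would then be passed through a multi-head self-attention layer over the outfit's items to weigh the elements most relevant to compatibility.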