Empowered by the continuous integration of social multimedia and artificial intelligence, the application scenarios of information retrieval (IR) are becoming increasingly diversified and personalized. User-Generated Content (UGC) systems now show great potential for handling interactions between large-scale user populations and massive media content. As an emerging multimedia IR task, Fashion Compatibility Modeling (FCM) aims to predict the matching degree of a given outfit and to recommend complementary items for user queries. Although existing studies explore the FCM task from a multimodal perspective with promising progress, they still fail to fully exploit the interactions among multimodal information or ignore the item-item contextual connectivity within an outfit. To address these issues, this paper proposes a novel fashion compatibility modeling scheme based on a Correlation-aware Cross-modal Attention Network, which enhances comprehensive multimodal representations of fashion items by integrating cross-modal collaborative content and uncovering contextual correlations. Since the multimodal information of fashion items delivers semantic clues from multiple aspects, a modality-driven collaborative learning module is presented to explicitly model modal consistency and complementarity via a co-attention mechanism. Treating the rich connections among the items in each outfit as contextual cues, a correlation-aware information aggregation module is further designed to adaptively capture significant item-item correlations and characterize content-aware outfit representations. Experiments conducted on two real-world fashion datasets demonstrate the superiority of our approach over state-of-the-art methods.
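To make the two modules described above more concrete, the following is a minimal PyTorch-style sketch, not the authors' released code: the class names, the specific gating form of the co-attention, and the use of multi-head self-attention over items are illustrative assumptions. It shows how a cross-modal fusion block and an intra-outfit correlation aggregator could be wired together to output a compatibility score.

```python
# Hypothetical sketch of the two components described in the abstract (assumed design, not the paper's exact architecture):
# (1) a co-attention style fusion of visual and textual item features,
# (2) self-attention over items to aggregate intra-outfit correlations into an outfit score.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    """Models consistency/complementarity between image and text features of each item."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)
        self.proj_t = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis, txt: (batch, n_items, dim)
        q_v, q_t = self.proj_v(vis), self.proj_t(txt)
        # Cross-modal affinity gate: how strongly the two modalities agree per item.
        affinity = torch.sigmoid((q_v * q_t).sum(-1, keepdim=True))  # (batch, n_items, 1)
        vis_enh = vis + affinity * txt   # text-guided visual enhancement
        txt_enh = txt + affinity * vis   # vision-guided textual enhancement
        return self.out(torch.cat([vis_enh, txt_enh], dim=-1))


class OutfitCompatibilityScorer(nn.Module):
    """Aggregates item-item correlations within an outfit and predicts a matching degree."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.fusion = CoAttentionFusion(dim)
        # Self-attention over items captures intra-outfit contextual correlations.
        self.item_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        items = self.fusion(vis, txt)                 # (batch, n_items, dim)
        ctx, _ = self.item_attn(items, items, items)  # correlation-aware aggregation
        outfit = ctx.mean(dim=1)                      # pooled outfit representation
        return torch.sigmoid(self.score(outfit)).squeeze(-1)  # matching degree in [0, 1]


if __name__ == "__main__":
    batch, n_items, dim = 2, 4, 256
    vis = torch.randn(batch, n_items, dim)   # e.g. image-encoder embeddings of items
    txt = torch.randn(batch, n_items, dim)   # e.g. text-encoder embeddings of item descriptions
    scores = OutfitCompatibilityScorer(dim)(vis, txt)
    print(scores.shape)  # torch.Size([2])
```

Such a model would typically be trained with a binary compatibility label per outfit (or a ranking loss against negative outfits); the pooling and gating choices above are placeholders for the attention formulations detailed in the full paper.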