Image aesthetics assessment (IAA) has attracted growing interest in recent years but remains challenging due to its highly abstract nature. Nowadays, more and more people comment on images shared on social networks, and these comments can provide rich, aesthetics-aware semantic information from different perspectives. User comments on an image can therefore be exploited as supplementary information to enhance aesthetic representation learning. Previous research has demonstrated that aesthetic attributes have a significant effect on image aesthetic quality and on human aesthetic perception. Typically, people comment on an image from the perspective of its aesthetic attributes, from which the aesthetic quality of the image can be inferred. Motivated by this observation, this paper presents an Attribute-assisted Multimodal Memory Network (AMM-Net) for IAA, which utilizes aesthetic attributes to model the interactions between the visual and textual modalities. Specifically, we design two memory networks to capture the attribute-aware information most relevant to the image and its associated comments, respectively. Further, through multiple memory hops, the attribute semantics shared by the two modalities are refined and cross-modal interactions are progressively enhanced, yielding more discriminative aesthetic representations for IAA. Experimental results and comparisons on two public multimodal IAA datasets demonstrate the superiority of the proposed model over state-of-the-art methods. The source code is available at https://github.com/zhutong0219/AMM-Net.
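To make the attribute-aware memory mechanism concrete, the following is a minimal PyTorch-style sketch of a single memory hop applied to both modalities, stacked for multi-hop refinement. It is an illustrative assumption only, not the authors' implementation: the module name `AttributeMemoryHop`, the feature dimension, the number of attribute slots, and the dot-product attention formulation are all hypothetical choices made for clarity.

```python
# Hypothetical sketch of an attribute-aware memory hop (not the AMM-Net code).
# Assumes d-dim image/text features and a learnable bank of K attribute slots.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeMemoryHop(nn.Module):
    """One memory hop: attend over the attribute memory with a modality
    query, then refine the query with the retrieved attribute summary."""
    def __init__(self, dim: int, num_attributes: int):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_attributes, dim))  # attribute slots
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) visual or textual feature
        attn = F.softmax(query @ self.memory.t(), dim=-1)  # (batch, K) attribute weights
        read = attn @ self.memory                          # (batch, dim) attribute summary
        return torch.tanh(self.update(torch.cat([query, read], dim=-1)))

# Multi-hop refinement over a shared attribute memory for both modalities.
hops = nn.ModuleList([AttributeMemoryHop(dim=512, num_attributes=10) for _ in range(3)])
img_feat = torch.randn(4, 512)  # e.g., a CNN image feature (placeholder)
txt_feat = torch.randn(4, 512)  # e.g., an encoded comment feature (placeholder)
for hop in hops:
    img_feat, txt_feat = hop(img_feat), hop(txt_feat)
fused = torch.cat([img_feat, txt_feat], dim=-1)  # would feed an aesthetics head
```

Sharing one attribute memory across both branches is the design point the abstract emphasizes: each hop lets image and comment features attend to the same attribute slots, so the semantics retained after several hops are those grounded in both modalities.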