Confidence-based dynamic cross-modal memory network for image aesthetic assessment

Xiaodan Zhang, Yuan Xiao, Jinye Peng, Xinbo Gao, Bo Hu

doi:10.1016/j.patcog.2023.110227

Abstract

Image aesthetic assessment (IAA) aims to design algorithms that make human-like aesthetic decisions. Because aesthetic judgment is highly subjective and complex, visual information alone is insufficient to fully predict the aesthetic quality of an image. A growing number of studies therefore exploit complementary information from user comments. However, user comments are not always available, for various technical and practical reasons. It is therefore necessary to reconstruct the missing textual information so that aesthetic prediction can rely on visual information only. This paper addresses this problem by proposing a Confidence-based Dynamic Cross-modal Memory Network (CDCM-Net). Specifically, CDCM-Net consists of two key components: a Visual and Textual Memory (VTM) network and a Confidence-based Dynamical Multi-modal Fusion (CDMF) module. VTM is based on the key–value memory network and consists of a visual key memory and a textual value memory. The visual key memory learns the visual information, while the textual value memory learns to remember the textual features and align them with the corresponding visual features. During inference, textual information can thus be reconstructed using only visual features. Furthermore, the CDMF module performs trustworthy fusion: it evaluates modality-level informativeness and then dynamically integrates reliable information. Extensive experiments demonstrate the superiority of the proposed method.
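The following is a minimal sketch of the two ideas named in the abstract, a key–value memory read and a confidence-weighted fusion, and not the authors' implementation. The function names, the cosine-similarity addressing, the `temperature` parameter, and the normalised-weight fusion rule are all assumptions made for illustration; the abstract does not specify these details, and in particular it does not say how the confidence scores are computed, so they are passed in as plain numbers here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b + 1e-12)


def read_memory(visual_feat, key_memory, value_memory, temperature=0.1):
    """Address visual key slots, then read the paired textual value slots.

    Assumption: addressing uses temperature-scaled cosine similarity.
    Returns the reconstructed textual feature and the addressing weights.
    """
    sims = [cosine(visual_feat, key) / temperature for key in key_memory]
    weights = softmax(sims)
    dim = len(value_memory[0])
    recon = [
        sum(w * value[d] for w, value in zip(weights, value_memory))
        for d in range(dim)
    ]
    return recon, weights


def fuse(visual_feat, text_feat, conf_visual, conf_text):
    """Scale each modality by its normalised confidence, then concatenate.

    Assumption: the confidences are non-negative scalars supplied by the
    caller; how the paper estimates them is not stated in the abstract.
    """
    total = conf_visual + conf_text
    w_visual = conf_visual / total
    w_text = conf_text / total
    return [w_visual * x for x in visual_feat] + [w_text * x for x in text_feat]


# Toy example: two memory slots, 2-d visual keys, 3-d textual values.
key_memory = [[1.0, 0.0], [0.0, 1.0]]
value_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual_feat = [1.0, 0.0]

text_recon, weights = read_memory(visual_feat, key_memory, value_memory)
fused = fuse(visual_feat, text_recon, conf_visual=0.75, conf_text=0.25)
```

In this toy setup the visual feature matches the first key, so nearly all of the addressing weight falls on the first slot and the reconstructed textual feature is close to the first stored value. At inference time only `visual_feat` is needed, which mirrors the abstract's point that textual information is reconstructed from visual features alone.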

Full Text