Abstract

News representation is critical for news recommendation. Most existing methods learn news representations from news texts alone, ignoring the visual information in news. In practice, users may click on a news article not only because they are interested in its title but also because they are attracted by its image, so images are useful for representing news and predicting clicks. Pre-trained visiolinguistic models are powerful at multimodal understanding and can represent news from both textual and visual content. In this paper, we propose a multimodal news recommendation method that incorporates both the textual and visual information of news to learn multimodal news representations. We first extract regions of interest (ROIs) from news images via object detection. We then use a pre-trained visiolinguistic model to encode both news texts and image ROIs, and model their inherent relatedness with co-attentional Transformers. In addition, we propose a crossmodal candidate-aware attention network that selects relevant historical clicked news for accurate modeling of user interest in the candidate news. Experiments validate that incorporating multimodal news information effectively improves the performance of news recommendation.
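To make the two attention mechanisms named above concrete, the following is a minimal PyTorch sketch: one co-attentional Transformer layer in which each modality's queries attend to the other modality's keys and values, and a candidate-aware attention module that weights clicked news by relevance to the candidate. All dimensions, the single-layer depth, the bilinear scoring function, and the dot-product click scorer are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Sketch of one co-attentional Transformer layer (ViLBERT-style):
    text queries attend to image ROIs, and ROI queries attend to text.
    Dimensions and layer choices are assumptions for illustration."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_norm = nn.LayerNorm(dim)
        self.img_norm = nn.LayerNorm(dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # txt: (batch, num_tokens, dim) word representations of the news text
        # img: (batch, num_rois, dim)   representations of detected image ROIs
        txt_out, _ = self.txt_attn(query=txt, key=img, value=img)  # text -> ROIs
        img_out, _ = self.img_attn(query=img, key=txt, value=txt)  # ROIs -> text
        return self.txt_norm(txt + txt_out), self.img_norm(img + img_out)


class CandidateAwareAttention(nn.Module):
    """Hypothetical sketch of crossmodal candidate-aware attention:
    clicked-news vectors are weighted by relevance to the candidate news,
    then pooled into a user-interest vector used to score the click."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)  # assumed bilinear scoring

    def forward(self, clicked: torch.Tensor, candidate: torch.Tensor):
        # clicked:   (batch, num_clicked, dim) multimodal clicked-news vectors
        # candidate: (batch, dim)              multimodal candidate-news vector
        scores = torch.einsum("bnd,bd->bn", self.proj(clicked), candidate)
        weights = torch.softmax(scores, dim=-1)  # relevance to the candidate
        user = torch.einsum("bn,bnd->bd", weights, clicked)
        # click score as the inner product of user and candidate vectors
        return torch.sum(user * candidate, dim=-1)
```

In this sketch, pooling the co-attended token and ROI sequences into the per-news vectors consumed by `CandidateAwareAttention` is left out; any standard pooling (e.g., attention pooling over tokens and ROIs) would fit there.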
