Textual response generation is a pivotal yet challenging task for multimodal task-oriented dialog systems: given the multimodal context, the system must generate an appropriate textual response. Although existing efforts have made remarkable advances, they overlook two signals: the domain information, which can reveal the key points of the user's intention, and the user's historical dialogs, which can indicate the user's characteristics. To address this issue, we propose a novel domain-aware multimodal dialog system with distribution-based user characteristic modeling (named DMDU). DMDU contains three vital components: context-knowledge embedding extraction, domain-aware response generation, and distribution-based user characteristic injection. Specifically, the context-knowledge embedding extraction component extracts the embeddings of the multimodal context and the related knowledge, following existing studies. The domain-aware response generation component conducts domain-aware fine-grained intention modeling based on the context and knowledge embeddings, and thereby generates the textual response. Moreover, the distribution-based user characteristic injection component first captures the user's characteristics and current intention with Gaussian distributions, and then conducts sampling-based contrastive semantic regularization to promote context representation learning. Experimental results on a public dataset demonstrate the effectiveness of DMDU. We release our code to facilitate further research.
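To make the idea of sampling-based contrastive regularization over Gaussian distributions more concrete, the following is a minimal illustrative sketch, not the DMDU implementation. It assumes a diagonal Gaussian parameterized by a mean and log-variance, reparameterized sampling, and an InfoNCE-style loss with cosine similarity; the function names (`sample_gaussian`, `contrastive_loss`), the temperature value, and the toy vectors are all hypothetical choices made for illustration.

```python
import math
import random


def sample_gaussian(mu, log_var, n_samples, rng):
    """Draw samples from a diagonal Gaussian N(mu, exp(log_var)).

    Uses the reparameterization mu + std * eps with eps ~ N(0, 1).
    (Assumed parameterization; the abstract only says "Gaussian distribution".)
    """
    return [
        [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
        for _ in range(n_samples)
    ]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's regularizer).

    For each positive sample, compute the negative log-softmax probability of
    that positive against all negatives, then average over the positives.
    """
    neg_logits = [cosine(anchor, n) / temperature for n in negatives]
    losses = []
    for p in positives:
        logits = [cosine(anchor, p) / temperature] + neg_logits
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[0])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    context = [1.0, 0.0, 0.5, -0.2]      # toy context representation
    user_mu = [0.9, 0.1, 0.4, -0.1]      # toy mean for the current user
    other_mu = [-1.0, 0.5, -0.3, 0.8]    # toy mean for an unrelated user
    log_var = [-4.0] * 4                 # small variance in every dimension

    positives = sample_gaussian(user_mu, log_var, 8, rng)
    negatives = sample_gaussian(other_mu, log_var, 8, rng)
    print(contrastive_loss(context, positives, negatives))
```

In this sketch the loss is small when the context representation lies close to samples from the matching user distribution and far from samples of other distributions, which is the sense in which such a term could regularize context representation learning.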