Abstract

Multimodal dialogue systems have recently gained importance in several domains such as retail, travel, and fashion. Several existing works have improved the understanding and generation of multimodal dialogues. However, there is still considerable room to improve the quality of the generated textual responses due to insufficient information infusion between visual and textual semantics. Moreover, existing dialogue systems often generate flawed knowledge-aware responses for tasks such as providing product attributes and celebrity endorsements. To address these issues, we present a Transformer-based Multimodal Infusion Dialogue (TMID) system that extracts visual and textual information from dialogues via a transformer-based multimodal context encoder and employs a cross-attention mechanism to infuse information between the images and text of each utterance. Furthermore, TMID uses adaptive decoders to generate appropriate multimodal responses based on the user intentions it has determined with a state classifier, and it enriches the output responses by incorporating domain knowledge into the decoders. The results of extensive experiments on a multimodal dialogue dataset demonstrate that TMID achieves state-of-the-art performance, improving the BLEU-4 score by 13.03, NIST by 2.77, and image selection Recall@1 by 1.84%.
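
To make the cross-attention infusion step concrete, the sketch below shows one way text tokens and image regions can attend to each other before fusion. It is a minimal, illustrative example only: the module name, hidden size, head count, and pooling/fusion choices are assumptions, not the paper's actual TMID configuration.

```python
import torch
import torch.nn as nn


class CrossModalInfusion(nn.Module):
    """Illustrative cross-attention block: text tokens attend to image
    regions and vice versa, then the two streams are fused. Dimensions
    and layer choices are assumptions, not the paper's exact setup."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.txt_to_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Text queries attend over image-region keys/values.
        txt_ctx, _ = self.txt_to_img(text_feats, image_feats, image_feats)
        # Image queries attend over text-token keys/values, then are pooled
        # and broadcast back to the text length so the streams align.
        img_ctx, _ = self.img_to_txt(image_feats, text_feats, text_feats)
        img_ctx = img_ctx.mean(dim=1, keepdim=True).expand_as(txt_ctx)
        # Concatenate and project: a simple stand-in for the infusion step.
        return self.fuse(torch.cat([txt_ctx, img_ctx], dim=-1))


# Usage: a batch of 2 utterances with 20 text tokens and 36 image regions each.
if __name__ == "__main__":
    text = torch.randn(2, 20, 512)
    image = torch.randn(2, 36, 512)
    fused = CrossModalInfusion()(text, image)
    print(fused.shape)  # torch.Size([2, 20, 512])
```

The fused representations would then feed the adaptive decoders, with the state classifier selecting which decoder (textual or image response) to use for a given user intention.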
