Abstract

Because dialogues naturally occur in multiple modalities (text, audio, and vision), textual response generation in dialogue should draw on multi-modal contexts rather than text alone. However, most existing studies ignore the rich information carried by other modalities, such as audio. To investigate the importance of acoustic contexts, we explore a multi-modal dialogue scenario with aligned text and audio temporal sequences for the textual response generation of an assumed system, which we call the RGMD task. To this end, we construct a new multi-modal dataset for this task based on TV shows, containing 84.9K utterances. Considering that response diversity in RGMD is limited by the context and by the interactions between modalities, we adopt a split pre-generation (SPG) strategy and a cross-modal contrastive learning (CCL) strategy in multi-modal pre-training for better response generation. On the one hand, SPG yields diverse responses without being constrained by an overly long history of mixed multi-modal contexts. On the other hand, CCL captures the interactions between text and audio. Extensive experiments demonstrate that our BART-based approach consistently outperforms the state-of-the-art textual approach DP by 4.17%, 8.96%, 2.43%, 1.04%, and 7.54% on the BLEU, DIST, ROUGE, METEOR, and NIST metrics, respectively. Moreover, our GPT-based approach outperforms the state-of-the-art multi-modal approach RLM by 6.79%, 9.25%, 7.49%, 9.31%, and 13.75% on the same metrics. In addition, we conduct in-depth analyses that show the necessity of audio for response generation and further verify the effectiveness of our approach.
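The abstract does not spell out the exact form of the CCL objective. As a rough, illustrative sketch only, the snippet below shows a generic InfoNCE-style cross-modal contrastive loss over paired text and audio embeddings; the function name, batch pairing scheme, temperature value, and embedding dimensions are assumptions for illustration, not the paper's formulation.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    # text_emb, audio_emb: (batch_size, dim) outputs of the text and audio encoders,
    # where row i of each tensor comes from the same utterance (a positive pair).
    # Normalize so the dot product below is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares text i with audio j.
    logits = text_emb @ audio_emb.t() / temperature

    # The matching audio for each text sits at the same batch index.
    targets = torch.arange(text_emb.size(0), device=text_emb.device)

    # Symmetric loss over the text-to-audio and audio-to-text directions.
    loss_t2a = F.cross_entropy(logits, targets)
    loss_a2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2a + loss_a2t) / 2

if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    text = torch.randn(8, 256)
    audio = torch.randn(8, 256)
    print(cross_modal_contrastive_loss(text, audio).item())

In a pre-training setup of this kind, such a loss would typically be optimized alongside the generation objective, with aligned text and audio segments from the same utterance serving as positive pairs and other in-batch combinations as negatives.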
