Abstract

Emotion perception and the diversity of generated responses are two key factors in multimodal dialogue generation, yet prior work in this area has not considered them jointly. In our model, we first extract features from each modality of the multimodal dialogue context, and use a heterogeneous graph neural network to represent the large graph composed of the dialogue history, audio, video, and the speakers' emotional states. We then use a conditional variational autoencoder to generate coherent and diverse responses. Extensive experiments on two multimodal datasets show that our model not only generates responses with appropriate emotion, but also achieves coherence and controllability, significantly outperforming previous state-of-the-art models.
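To make the response-generation stage concrete, the following is a minimal sketch of a conditional variational autoencoder in PyTorch, where the conditioning vector `c` stands in for the fused heterogeneous-graph readout of the multimodal context. The module names, dimensions, and the vector-reconstruction decoder are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Sketch of a CVAE: a latent z is sampled conditioned on the context c
    (e.g., a graph-encoded multimodal dialogue representation) and the
    response representation x, then decoded back conditioned on c."""

    def __init__(self, x_dim=256, c_dim=256, z_dim=64, h_dim=256):
        super().__init__()
        # Recognition network q(z | x, c), used during training
        self.recog = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.Tanh())
        self.recog_mu = nn.Linear(h_dim, z_dim)
        self.recog_logvar = nn.Linear(h_dim, z_dim)
        # Prior network p(z | c), used at inference to sample diverse responses
        self.prior = nn.Sequential(nn.Linear(c_dim, h_dim), nn.Tanh())
        self.prior_mu = nn.Linear(h_dim, z_dim)
        self.prior_logvar = nn.Linear(h_dim, z_dim)
        # Decoder p(x | z, c); here it simply reconstructs the response vector
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + c_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x, c):
        # Posterior parameters from the recognition network
        h_q = self.recog(torch.cat([x, c], dim=-1))
        mu_q, logvar_q = self.recog_mu(h_q), self.recog_logvar(h_q)
        # Prior parameters conditioned on the context only
        h_p = self.prior(c)
        mu_p, logvar_p = self.prior_mu(h_p), self.prior_logvar(h_p)
        # Reparameterisation trick: sample z from the posterior
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)
        recon = self.decoder(torch.cat([z, c], dim=-1))
        # Closed-form KL(q(z|x,c) || p(z|c)) for diagonal Gaussians
        kl = 0.5 * torch.sum(
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
            - 1.0,
            dim=-1,
        )
        return recon, kl
```

At inference time one would sample z from the prior network given only the context, which is what yields diverse yet context-conditioned responses; the training loss would combine the reconstruction term with the KL term above.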
