Nonverbal behaviours are an integral part of human social interaction. Equipping social robots with human nonverbal communication skills has been an active research area for decades, and data-driven, end-to-end learning approaches have become predominant in recent years, offering scalability and generalisability. However, most current work considers only the social signals of a single person, modelling co-speech gestures in non-interactive settings. To address this shortcoming, this paper introduces a context-aware Generative Adversarial Network intended to produce social cues for robots. The approach captures both intra- and interpersonal social signals of two interlocutors to model body gestures in dyadic interaction. We conducted a series of experiments to validate the proposed solution under different interaction settings. First, experiments on the JESTKOD dataset demonstrate the contribution of encoding context, namely the behaviour of the interaction partner, to predicting the target person's gestures in agreement situations. Second, experiments on the new LISI-HHI dataset show that combining the Discriminator and the Context Encoder yields a gesture generation framework that is effective across various social communication contexts.
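To make the conditioning on interpersonal context concrete, the sketch below shows one possible way such a context-aware GAN could be organised: a Context Encoder summarises the interaction partner's behaviour, a Generator produces the target person's gesture sequence from their own features plus that context, and a Discriminator judges real versus generated sequences. This is a hypothetical illustration only; the class names, GRU-based encoders, feature dimensions, and noise handling are assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of a context-aware gesture GAN (not the authors' code).
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes the interaction partner's behaviour (interpersonal context)."""
    def __init__(self, partner_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(partner_dim, hidden_dim, batch_first=True)

    def forward(self, partner_feats):                 # (B, T, partner_dim)
        _, h = self.gru(partner_feats)
        return h[-1]                                  # (B, hidden_dim)

class Generator(nn.Module):
    """Generates the target person's gestures from their own features,
    the encoded partner context, and GAN noise."""
    def __init__(self, own_dim, partner_dim, pose_dim,
                 hidden_dim=128, noise_dim=16):
        super().__init__()
        self.noise_dim = noise_dim
        self.context_encoder = ContextEncoder(partner_dim, hidden_dim)
        self.gru = nn.GRU(own_dim + hidden_dim + noise_dim,
                          hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, own_feats, partner_feats):      # (B, T, own_dim), (B, T, partner_dim)
        B, T, _ = own_feats.shape
        ctx = self.context_encoder(partner_feats)     # (B, hidden_dim)
        ctx = ctx.unsqueeze(1).expand(B, T, -1)       # broadcast context over time
        z = torch.randn(B, T, self.noise_dim, device=own_feats.device)
        h, _ = self.gru(torch.cat([own_feats, ctx, z], dim=-1))
        return self.out(h)                            # (B, T, pose_dim)

class Discriminator(nn.Module):
    """Scores whether a gesture sequence is real or generated."""
    def __init__(self, pose_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, poses):                         # (B, T, pose_dim)
        _, h = self.gru(poses)
        return self.out(h[-1])                        # (B, 1) real/fake logit
```

Under these assumptions, the Generator and Discriminator would be trained with a standard adversarial objective, with the Context Encoder supplying the interpersonal signal that distinguishes this setup from single-person co-speech gesture models.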