Linguistic Descriptions of Human Motion with Generative Adversarial Seq2Seq Learning

Yusuke Goutsu, Tetsunari Inamura

doi:10.1109/icra48506.2021.9561519

Abstract

In this paper, we propose a generative model that learns a sequence-to-sequence (Seq2Seq) translation between human whole-body motions and natural language descriptions. Our model combines the Seq2Seq model with the training strategy of sequence generative adversarial nets (SeqGAN), which extends the GAN framework to address the problem that the gradient cannot be passed back to the generator network. The model treats the generator as a stochastic parameterized policy and trains it with a policy gradient method. Within the policy gradient, we employ a Monte Carlo (MC) search to obtain the final reinforcement learning (RL) reward from the discriminator. The proposed generative network is trained on the KIT Motion-Language Dataset, one of the few large-scale datasets available, which includes 3,911 human motions and 6,278 natural language descriptions. In the experiments, we evaluate the effectiveness of our model by comparing various configurations and parameter settings. Our model outperforms an existing state-of-the-art method under the same dataset split, which allows a fair comparison. In addition, the qualitative results of the motion-to-language translation demonstrate that our model can generate semantically and grammatically correct sentences with detailed linguistic descriptions of human motions.
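To make the MC search step concrete, the following is a minimal sketch of how a partial sentence might be scored: the unfinished sequence is completed several times by sampling from the generator policy, each completion is scored by the discriminator, and the scores are averaged. It is an illustration under stated assumptions, not the authors' implementation. The names `mc_rollout_reward`, `uniform_policy`, and `toy_discriminator`, the four-word vocabulary, and the reference sentence are all hypothetical stand-ins; in the actual model the policy would be the Seq2Seq generator conditioned on a motion and the discriminator would be a trained network.

```python
import random

def mc_rollout_reward(prefix, sample_next, discriminator, max_len,
                      n_rollouts=16, rng=None):
    """Estimate the reward of a partial sentence by Monte Carlo search.

    The prefix is completed n_rollouts times by sampling tokens from the
    policy, and the discriminator scores of the completions are averaged.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < max_len:
            seq.append(sample_next(seq, rng))
        total += discriminator(seq)
    return total / n_rollouts

# Hypothetical toy stand-ins, used only for illustration.
VOCAB = ["a", "person", "walks", "forward"]
REFERENCE = ["a", "person", "walks", "forward"]

def uniform_policy(seq, rng):
    # Stand-in for the generator: ignores the prefix and samples uniformly.
    return rng.choice(VOCAB)

def toy_discriminator(seq):
    # Stand-in for the discriminator: fraction of positions that match
    # the reference sentence, so the score lies in [0, 1].
    return sum(t == r for t, r in zip(seq, REFERENCE)) / len(REFERENCE)

reward_good = mc_rollout_reward(["a", "person"], uniform_policy,
                                toy_discriminator, max_len=4, n_rollouts=200)
reward_bad = mc_rollout_reward(["forward", "a"], uniform_policy,
                               toy_discriminator, max_len=4, n_rollouts=200)
reward_full = mc_rollout_reward(REFERENCE, uniform_policy,
                                toy_discriminator, max_len=4)
```

In this toy setting a prefix that agrees with the reference receives an averaged reward of at least 0.5, while a prefix that disagrees receives at most 0.5. In a policy gradient update, such intermediate rewards would weight the log-probability of each sampled token.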

Full Text