Abstract

Retrieving video moments from an untrimmed video given a natural language query is a challenging task in both academia and industry. Although much effort has been made to address this issue, traditional video moment ranking methods are unable to generate reasonable video moment candidates, and video moment localization approaches are not applicable to large-scale retrieval scenarios. How to combine ranking and localization into a unified framework, so that they overcome each other's drawbacks and reinforce one another, is rarely considered. Toward this end, we contribute a novel solution that thoroughly investigates the video moment retrieval problem under the adversarial learning paradigm. The key of our solution is to formulate the video moment retrieval task as an adversarial learning problem with two tightly connected components. Specifically, a reinforcement learning agent is employed as a generator to produce a set of candidate video moments. Meanwhile, a pairwise ranking model is utilized as a discriminator to rank the generated video moments against the ground truth. Finally, the generator and the discriminator are mutually reinforced in the adversarial learning framework, which jointly optimizes the performance of both video moment ranking and video moment localization. Extensive experiments on two well-known datasets verify the effectiveness and rationality of our proposed solution.
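To make the interplay of the two components concrete, below is a minimal NumPy sketch of such an adversarial loop, not the authors' implementation: a softmax policy (the reinforcement-learning generator) samples candidate moments, and a linear pairwise ranker (the discriminator) is trained with a hinge loss to score the ground-truth moment above the generated one, while its score serves as the generator's reward. All names, the one-hot toy features, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not the paper's): a video split into N candidate
# moments; one-hot features keep the mechanics of the loop easy to follow.
N = 8
feats = np.eye(N)        # candidate-moment features (toy one-hot vectors)
gt = 2                   # index of the ground-truth moment for the query
w_g = np.zeros(N)        # generator (policy) parameters
w_d = np.zeros(N)        # discriminator (pairwise ranker) parameters
lr = 0.1

def policy(w):
    """Softmax distribution over candidate moments (the RL generator)."""
    logits = feats @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

for _ in range(200):
    p = policy(w_g)
    a = rng.choice(N, p=p)                    # generator samples a moment
    # Discriminator: pairwise hinge loss, ground truth vs. generated moment.
    if 1.0 + feats[a] @ w_d - feats[gt] @ w_d > 0:
        w_d += lr * (feats[gt] - feats[a])    # push GT above the sample
    # Generator: REINFORCE update; reward is the discriminator's score of
    # the sampled moment, with the expected score as a variance baseline.
    r = feats @ w_d
    advantage = r[a] - p @ r
    w_g += lr * advantage * (feats[a] - p @ feats)  # advantage * grad log pi(a)
```

As the loop runs, the discriminator learns to rank the ground truth above generated candidates, and the generator, rewarded by the discriminator, shifts its probability mass toward high-scoring moments, which is the mutual reinforcement the abstract describes.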
