Video moment retrieval (VMR) aims to localize the target moment in an untrimmed video according to a given natural language query. Existing algorithms typically rely on clean annotations to train their models. However, annotations produced by human labelers may contain considerable noise, so video moment retrieval models are often not well trained in practice. In this article, we present a simple yet effective end-to-end video moment retrieval framework based on a bottom-up schema that is robust to training with noisy labels. Specifically, we extract multimodal features with syntactic graph convolutional networks and multihead attention layers, and fuse them through cross gates and a bilinear approach. We then construct feature pyramid networks to encode rich scene relationships and capture high-level semantics. Furthermore, to mitigate the effects of noisy annotations, we devise multilevel losses at two levels: a frame-level loss that improves noise tolerance and an instance-level loss that reduces the adverse effects of negative instances. At the frame level, we adopt Gaussian smoothing to treat noisy labels as soft labels through partial fitting. At the instance level, we exploit a pair of structurally identical models that teach each other during training iterations. The resulting robust video moment retrieval model significantly outperforms state-of-the-art approaches on the standard public datasets ActivityCaption and textually annotated cooking scenes (TACoS). We also evaluate the proposed approach under different levels of manual annotation noise to further demonstrate its effectiveness.
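To make the frame-level idea concrete, the following is a minimal sketch (not the authors' code) of how a noisy hard boundary annotation can be converted into a Gaussian-smoothed soft label and used in a soft cross-entropy loss; the function names and the `sigma` parameter are illustrative assumptions.

```python
# Minimal sketch of the frame-level soft-label idea, assuming per-frame
# boundary classification; names and hyperparameters are hypothetical.
import numpy as np

def gaussian_soft_labels(num_frames: int, annotated_frame: int, sigma: float = 2.0) -> np.ndarray:
    """Turn a noisy hard frame annotation into a Gaussian-smoothed soft label distribution."""
    frames = np.arange(num_frames)
    soft = np.exp(-((frames - annotated_frame) ** 2) / (2.0 * sigma ** 2))
    return soft / soft.sum()  # normalize so the soft labels sum to one

def soft_cross_entropy(pred_logits: np.ndarray, soft_labels: np.ndarray) -> float:
    """Frame-level loss: cross-entropy between predicted frame scores and soft labels."""
    log_probs = pred_logits - np.log(np.exp(pred_logits).sum())
    return float(-(soft_labels * log_probs).sum())

# Example: a 64-frame video whose boundary is (perhaps noisily) annotated at frame 20.
labels = gaussian_soft_labels(64, annotated_frame=20)
loss = soft_cross_entropy(np.random.randn(64), labels)
```

Because the Gaussian spreads probability mass over neighboring frames, a slightly misplaced annotation still contributes a useful training signal instead of a sharply wrong hard target, which is the partial-fitting behavior described above.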