Temporal sentence localization (TSL) aims to localize a target segment in a video according to a given sentence query. Although existing works have achieved promising results on this task, they rely heavily on abundant yet expensive manual annotations for training. Moreover, these data-dependent models usually cannot generalize well to unseen scenarios because of the inherent domain shift. To address this issue, in this paper we target a more practical but challenging setting: unsupervised domain adaptive temporal sentence localization (UDA-TSL), which explores whether localization knowledge can be transferred from a fully annotated data domain (source domain) to a new, unannotated data domain (target domain). In particular, we propose an effective and novel baseline for UDA-TSL that bridges the multi-modal gap across domains and learns the potential correspondence between video-query pairs in the target domain. We first develop separate modality-specific domain adaptation modules to smoothly reduce the domain shifts in the cross-dataset video and query domains. Then, to fully exploit the semantic correspondence of both modalities in the target domain for unsupervised localization, we devise a mutual information learning module that adaptively aligns the video-query pairs most likely to be relevant in the target domain, yielding more reliably aligned target pairs and ensuring the discriminability of target features. In this way, our model learns domain-invariant and semantically aligned cross-modal representations. Three sets of cross-domain transfer experiments show that our model achieves competitive performance compared to existing methods.
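The abstract does not give implementation details, so the following is only an illustrative sketch of the two ingredients it describes: per-modality domain adaptation and a mutual-information objective for aligning target video-query pairs. The use of a gradient-reversal layer with domain classifiers, an InfoNCE-style contrastive loss, and all module names, dimensions, and hyperparameters below are assumptions for illustration, not the authors' actual method.

```python
# Illustrative sketch only (assumptions, not the paper's implementation):
# - gradient reversal + per-modality domain classifiers stand in for the
#   "modality-specific domain adaptation modules";
# - an InfoNCE-style contrastive loss stands in for the
#   "mutual information learning module" on target video-query pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


class DomainClassifier(nn.Module):
    """Predicts source (0) vs. target (1) domain from one modality's features."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, feat, lambd=1.0):
        return self.net(grad_reverse(feat, lambd))


def info_nce_alignment(video_feat, query_feat, temperature=0.07):
    """InfoNCE lower bound on mutual information between paired video/query features.

    video_feat, query_feat: (B, D) tensors; row i of each is assumed to form a pair.
    """
    v = F.normalize(video_feat, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    logits = v @ q.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    # Symmetric contrastive loss over video->query and query->video directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Usage sketch with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    B, D = 8, 512
    src_v, tgt_v = torch.randn(B, D), torch.randn(B, D)   # video features
    src_q, tgt_q = torch.randn(B, D), torch.randn(B, D)   # query features

    video_dc, query_dc = DomainClassifier(D), DomainClassifier(D)
    dom_labels = torch.cat([torch.zeros(B), torch.ones(B)]).long()

    # Modality-specific domain-adversarial losses (video and query handled separately).
    loss_vid = F.cross_entropy(video_dc(torch.cat([src_v, tgt_v])), dom_labels)
    loss_qry = F.cross_entropy(query_dc(torch.cat([src_q, tgt_q])), dom_labels)

    # Mutual-information-style alignment of target video-query pairs; the paper's
    # adaptive weighting of likely-relevant pairs is omitted in this sketch.
    loss_mi = info_nce_alignment(tgt_v, tgt_q)

    total = loss_vid + loss_qry + loss_mi
    print(float(total))
```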