To retrieve a video via a multimedia search engine, a textual query is usually created by the user and then used to perform the search. Recent state-of-the-art cross-modal retrieval methods learn a joint text–video embedding space using contrastive loss functions, which maximize the similarity of positive pairs while decreasing that of negative pairs. Although the choice of these pairs is fundamental to the construction of the joint embedding space, the selection procedure is usually driven by the relationships found within the dataset: a positive pair is commonly formed by a video and its own caption, whereas unrelated video–caption pairs represent the negative ones. We hypothesize that this choice results in a retrieval system with limited semantic understanding, as the standard training procedure requires the system to discriminate the ground-truth caption from negatives even when there is no difference in their semantics. Therefore, unlike previous approaches, in this paper we propose a novel strategy for the selection of both positive and negative pairs that takes into account both the annotations and the semantic content of the captions. By doing so, the selected negatives no longer share semantic concepts with the positive pair, and it also becomes possible to discover new positives within the dataset. Based on our hypothesis, we provide a novel design of two popular contrastive loss functions and explore their effectiveness on four heterogeneous state-of-the-art approaches. An extensive experimental analysis conducted on four datasets, EPIC-Kitchens-100, MSR-VTT, MSVD, and Charades, validates the effectiveness of the proposed strategy, with gains of more than +20% nDCG on EPIC-Kitchens-100, for example. Furthermore, these results are corroborated by qualitative evidence that both supports our hypothesis and explains why the proposed strategy effectively overcomes the identified limitation.
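As a minimal sketch of the pair-selection idea described above, the snippet below shows an InfoNCE-style text–video contrastive loss in which positives and negatives are drawn from a caption-relevance matrix rather than from the usual "same index = positive" rule. The `relevance` matrix, the thresholds, and the function name are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Sketch (PyTorch): semantics-aware positive/negative selection for a
# contrastive text-video loss. Assumes `relevance[i, j]` scores how
# semantically related caption j is to video i (e.g. derived from
# annotations or caption overlap); thresholds are illustrative.
import torch
import torch.nn.functional as F


def semantic_contrastive_loss(video_emb, text_emb, relevance,
                              tau=0.07, pos_thr=0.8, neg_thr=0.2):
    """
    video_emb, text_emb: (B, D) embeddings from any text/video backbone.
    relevance:           (B, B) semantic relevance matrix (assumed input).
    Captions with relevance >= pos_thr become additional positives; only
    captions with relevance <= neg_thr are kept as negatives, so that
    semantically equivalent captions are never pushed apart.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.t() / tau                          # (B, B) similarity logits

    pos_mask = (relevance >= pos_thr).float()      # ground-truth + discovered positives
    neg_mask = (relevance <= neg_thr).float()      # semantically unrelated negatives
    valid = pos_mask + neg_mask                    # ambiguous pairs are ignored

    # Log-softmax restricted to valid (positive or clearly negative) columns;
    # each row is assumed to contain at least its own ground-truth positive.
    logits = sim.masked_fill(valid == 0, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average the log-probability over all positives of each video.
    log_prob = torch.where(pos_mask.bool(), log_prob, torch.zeros_like(log_prob))
    loss = -log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```

Under this formulation, setting `pos_thr = 1` and `neg_thr < 1` with `relevance` equal to the identity matrix recovers the standard index-based contrastive loss, which makes the selection strategy a drop-in modification of existing objectives.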