Interactive multimodal video search: an extended post-evaluation for the VBS 2022 competition

Konstantin Schall, Werner Bailer, Kai-Uwe Barthel, Fabio Carrara, Jakub Lokoč, Ladislav Peška, Klaus Schoeffmann, Lucia Vadicamo, Claudio Vairo

doi:10.1007/s13735-024-00325-9

Abstract

CLIP-based text-to-image retrieval proved very effective at the interactive video retrieval competition Video Browser Showdown 2022, where all three top-scoring teams had implemented a variant of a CLIP model in their system. Because the performance of these three systems was quite close, this post-evaluation was designed to gain better insight into the differences between the systems and to compare the CLIP-based text-query retrieval engines by introducing slight modifications to the original competition settings. An extended analysis of the overall results and the retrieval performance of all systems’ functionalities shows that a strong text retrieval model certainly helps, but it has to be coupled with extensive browsing capabilities and other query modalities to consistently solve known-item search tasks in a large-scale video database.

Full Text