Given Natural Language (NL) text descriptions, NL-based vehicle retrieval aims to extract target vehicles from a multi-view, multi-camera traffic video pool. Owing to the inherent differences between textual and visual data, this is a challenging multi-modal retrieval task that requires robust feature extractors (e.g., neural networks) to align the abstract representations of texts and images within a shared domain. However, solutions are challenged not only by the complexity of multi-view, multi-camera visual data and the diversity of textual descriptions, but also by the lack of large-scale datasets in this relatively new field and a pronounced domain gap between training and test sets. To cope with these issues, many existing approaches build computationally expensive models that extract the language and vision subspaces separately before blending them into a shared representation space, thereby focusing on single-modal information and ignoring much of the available multi-modal information. Hence, we propose a Domain Adaptive Knowledge-based Retrieval System (DAKRS) to effectively and efficiently align multi-modal knowledge in a limited-label setting. Our contributions are threefold: (i) an efficient extension of Contrastive Language-Image Pre-training (CLIP) transfer learning into a baseline text-to-image multi-modular vehicle retrieval framework; (ii) a data enhancement module that creates pseudo-vehicle tracks from the traffic video pool by leveraging the robustness of the baseline retrieval model combined with background subtraction; and (iii) a Semi-Supervised Domain Adaptation (SSDA) scheme to engineer pseudo-labels for adapting model parameters to the target domain distribution. Experimental results are benchmarked on the CityFlow-NL dataset, demonstrating competitive effectiveness and efficiency against state-of-the-art methods without requiring further post-processing or ensembling.
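To make contribution (i) concrete, the following is a minimal sketch of how a CLIP-style encoder pair can rank candidate vehicle crops against a natural-language query by cosine similarity. It is not the authors' implementation: it assumes the open-source `clip` package (https://github.com/openai/CLIP), and the function and variable names (`rank_tracks`, `crop_paths`) are hypothetical illustrations only.

```python
# Hypothetical sketch of CLIP-based text-to-image retrieval (not the DAKRS code):
# embed a text query and candidate vehicle crops, then rank crops by cosine similarity.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rank_tracks(query: str, crop_paths: list) -> list:
    """Return (crop_path, similarity) pairs sorted by similarity to the text query."""
    with torch.no_grad():
        # Encode and L2-normalize the text query.
        text_feat = model.encode_text(clip.tokenize([query]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # Encode and L2-normalize the candidate vehicle crops.
        images = torch.stack([preprocess(Image.open(p)) for p in crop_paths]).to(device)
        img_feat = model.encode_image(images)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

        # Cosine similarity between each crop and the query.
        sims = (img_feat @ text_feat.T).squeeze(1)

    order = sims.argsort(descending=True)
    return [(crop_paths[i], sims[i].item()) for i in order]
```

Under this kind of baseline, the pseudo-track and SSDA modules described in contributions (ii) and (iii) would operate on top of the retrieval scores rather than modifying the encoders directly; the details are given in the full paper.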