A text-based vehicle search is a system in which users find vehicles or route information by entering textual queries. The primary objective of text-based vehicle search is to identify the most relevant vehicle in a given dataset using a natural language description as the query. This approach leverages natural language processing (NLP) to understand and interpret description queries and return relevant results. Despite significant progress, the task still faces several challenges due to the complexity and diversity of natural language, as well as inherent difficulties in the vision domain. Moreover, few studies have focused on tracked-vehicle retrieval, where vehicle tracklets are considered instead of single images. In this paper, we propose a novel framework for natural language-based tracked-vehicle retrieval built on the CLIP model, one of the most effective models for the image-text matching task. The framework leverages both appearance and motion information to improve the matching accuracy of vehicle tracklet retrieval. Experiments are conducted on the CityFlow-NL dataset, provided by the 6th AI City Challenge, an annual competition. The results are comparable to state-of-the-art methods, achieving an MRR score of 46.63%, Rank@5 of 67.02%, and Rank@10 of 81.82%.
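The retrieval metrics reported above are standard: MRR is the mean reciprocal rank of the ground-truth tracklet, and Rank@k is the fraction of queries whose ground-truth tracklet appears in the top k results. As a minimal sketch (not the paper's implementation; the function and variable names are illustrative), assuming each query and tracklet has been embedded and a query-by-tracklet similarity matrix computed, the metrics can be derived as follows:

```python
import numpy as np

def retrieval_metrics(sim, gt, ks=(5, 10)):
    """Compute MRR and Rank@k from a query-by-tracklet similarity matrix.

    sim: (num_queries, num_tracklets) similarity scores, e.g. cosine
         similarity between text and tracklet embeddings.
    gt:  index of the ground-truth tracklet for each query.
    """
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    # 1-based rank of the ground-truth tracklet for each query
    ranks = np.array([int(np.where(order[i] == gt[i])[0][0]) + 1
                      for i in range(sim.shape[0])])
    mrr = float(np.mean(1.0 / ranks))
    rank_at = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, rank_at

# Toy example: 3 queries, 4 candidate tracklets (hypothetical scores)
sim = np.array([[0.9, 0.1, 0.3, 0.2],   # ground truth 0 ranked 1st
                [0.2, 0.4, 0.8, 0.1],   # ground truth 1 ranked 2nd
                [0.5, 0.6, 0.7, 0.4]])  # ground truth 3 ranked 4th
gt = np.array([0, 1, 3])
mrr, rank_at = retrieval_metrics(sim, gt, ks=(1, 3))
```

In this toy example the ground-truth ranks are 1, 2, and 4, so MRR = (1 + 1/2 + 1/4) / 3 ≈ 0.583, Rank@1 = 1/3, and Rank@3 = 2/3.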