Abstract

Effective vehicle detection is vital for monitoring traffic in cities. However, existing vehicle recognition models offer limited functionality and have not been effectively integrated with language models to support urban vehicle monitoring. To address these challenges, this paper proposes a semi-supervised mixture model with detection, segmentation, and question-answering capabilities, referred to as the Multi-task Visual Language Model (MTVLM), for vehicle recognition. In addition, a vehicle dataset of 8425 images spanning 13 categories, collected across various urban scenes and weather conditions, is introduced. MTVLM consists of a detector, a language model, and a segmentation model. The detector, YOLODS, is based on an improved You Only Look Once algorithm (YOLOv4) and achieves higher detection accuracy with fewer parameters than the baseline. The detection results of YOLODS are then used as prompts for a pre-trained language model, a distilled version of Bidirectional Encoder Representations from Transformers (DistilBERT), to answer questions about the detected vehicles. Furthermore, the bounding-box coordinates produced by YOLODS are passed to the Segment Anything Model (SAM) as prompts to segment vehicles from the images. Experimental results show that MTVLM achieves a mean average precision (mAP) of 97.12% on the vehicle dataset, surpassing other state-of-the-art (SOTA) detectors. The high-precision detector also ensures accurate prompts for DistilBERT and SAM, allowing the potential and performance of these pre-trained models to be fully exploited for the question-answering and segmentation tasks.
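
To make the described prompting pipeline concrete, the sketch below shows one plausible way to wire a detector's outputs into an off-the-shelf DistilBERT question-answering model and SAM, using the public Hugging Face Transformers and segment-anything APIs. It is a minimal illustration, not the paper's implementation: the detect_vehicles stub (standing in for YOLODS), the checkpoint path, and the textual prompt format are assumptions.

```python
import numpy as np
from transformers import pipeline
from segment_anything import sam_model_registry, SamPredictor

def detect_vehicles(image: np.ndarray):
    """Placeholder for the trained YOLODS detector (not provided here).

    Assumed to return a list of (box, label) pairs with boxes in
    XYXY pixel coordinates, e.g. [((x0, y0, x1, y1), "bus"), ...].
    """
    raise NotImplementedError("Replace with the trained YOLODS detector")

# Off-the-shelf DistilBERT QA head (SQuAD-distilled checkpoint),
# used as a stand-in for the paper's language model.
qa_model = pipeline("question-answering",
                    model="distilbert-base-cased-distilled-squad")

# Off-the-shelf SAM predictor; the checkpoint path is a placeholder.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def run_pipeline(image: np.ndarray, question: str):
    detections = detect_vehicles(image)

    # 1) Turn detection results into a text prompt (context) and let
    #    DistilBERT answer the question about the detected vehicles.
    context = " ".join(
        f"A {label} is located at ({x0}, {y0}, {x1}, {y1})."
        for (x0, y0, x1, y1), label in detections
    )
    answer = qa_model(question=question, context=context)["answer"]

    # 2) Feed each bounding box to SAM as a box prompt to segment
    #    the corresponding vehicle from the image (RGB, HWC, uint8).
    predictor.set_image(image)
    masks = [
        predictor.predict(box=np.array([x0, y0, x1, y1]),
                          multimask_output=False)[0]
        for (x0, y0, x1, y1), _ in detections
    ]
    return answer, masks
```

In this sketch the quality of both downstream outputs hinges entirely on the detector, which mirrors the abstract's point that a high-precision detector yields accurate prompts for DistilBERT and SAM.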
