Abstract

Because modern deep-learning-based object detectors require large-scale annotated datasets for training, zero-shot object detection, which aims to simultaneously localize and recognize unseen objects that are not observed during training, has become an important research field. To improve zero-shot detection performance, recent state-of-the-art methods tend to make complicated modifications to modern object detectors in terms of model structure, loss function, and training procedure. They typically treat a simple modification as a baseline and assume it is inferior to more complicated methods. In contrast, we find that a simple modification can achieve better performance. Considering that redundant modifications may increase the risk of over-fitting to seen classes and reduce generalization to unseen classes, we propose a succinct visual-language zero-shot object detection framework, which only replaces the classification branch of a modern object detector with a lightweight visual-language network. Since zero-shot object detection is a classic multi-modal learning problem involving a visual feature space and a language space, our visual-language network learns the visual-language alignment from the image and language data of seen classes and transfers this alignment to detect unseen objects. In line with Occam's razor, the principle that entities should not be multiplied unnecessarily, extensive experimental results show that our succinct framework surpasses all existing zero-shot object detection methods on several benchmarks and sets a new state of the art.
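To make the core idea concrete, the following is a minimal sketch of the kind of visual-language classification head the abstract describes: region features are projected into a language embedding space and scored against class-name embeddings, so that swapping in unseen-class embeddings at test time yields zero-shot classification. All names, dimensions, and the temperature parameter are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualLanguageHead(nn.Module):
    """Hypothetical sketch: a lightweight visual-language network that
    replaces a detector's fixed-class classification branch."""

    def __init__(self, visual_dim: int, embed_dim: int, temperature: float = 0.07):
        super().__init__()
        # Lightweight projection from the detector's visual feature space
        # into the language (word-embedding) space.
        self.proj = nn.Linear(visual_dim, embed_dim)
        self.temperature = temperature

    def forward(self, region_feats: torch.Tensor, class_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, visual_dim) pooled region features.
        # class_embeds: (num_classes, embed_dim) language embeddings of class
        #   names; at test time these can be unseen-class embeddings, which is
        #   how the learned alignment transfers to unseen objects.
        v = F.normalize(self.proj(region_feats), dim=-1)
        t = F.normalize(class_embeds, dim=-1)
        # Cosine-similarity logits over the (seen or unseen) class set.
        return v @ t.t() / self.temperature
```

Under this sketch, training would use seen-class embeddings with a standard classification loss, and inference would simply substitute the embedding matrix for the unseen classes; no other part of the detector changes.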
