Abstract

For multi-label classification tasks, vision transformers suffer from two fundamental issues. First, ViT cannot provide a clear per-class saliency region for multi-label classification tasks, which greatly restricts its interpretability in this setting. Second, when processing images containing multiple categories, the single feature vector of ViT, i.e., the class token, gathers features of foreground regions from different categories. The class token thus interfuses features of multiple objects and cannot well distinguish the key features of each object, which restricts network performance. To alleviate these issues, we present a Multi-Class-Tokens-based vision transformer (MCT-ViT) for multi-label image classification. MCT-ViT assigns each class token to a specific category and generates a corresponding class-level attention map through a cross-attention module, thereby providing a per-class saliency region (distinguishable feature). The patch tokens that have high attention scores with a specific class token form the saliency region of that class, which improves interpretability and also boosts performance on tasks where multiple categories must be detected and identified. Besides, we use a novel non-parametric scoring method instead of the fully connected classifier in ViT. Specifically, since each class token actually performs binary classification, we can directly compute its \(\ell_2\) norm to obtain the classification score of the corresponding category. Experimental results on multi-label classification show that our MCT-ViT achieves superior performance over the state of the art on popular benchmark datasets while enjoying per-class interpretability without extra training.

Keywords: Vision transformer, Visual interpretability, Multi-class classification
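The abstract does not give implementation details, but the two key mechanisms it describes, per-category class tokens that cross-attend to patch tokens and \(\ell_2\)-norm scoring in place of a fully connected classifier, can be illustrated with a minimal PyTorch-style sketch. The module name `MCTCrossAttentionSketch` and all parameters below are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): per-category class tokens cross-attend
# to patch tokens, yielding a class-level attention map per category; each
# updated class token is scored by its L2 norm (non-parametric, no FC classifier).
import torch
import torch.nn as nn

class MCTCrossAttentionSketch(nn.Module):  # hypothetical name
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # One learnable token per category (the "multi-class tokens").
        self.class_tokens = nn.Parameter(torch.randn(1, num_classes, dim) * 0.02)
        self.q = nn.Linear(dim, dim)        # queries from class tokens
        self.kv = nn.Linear(dim, 2 * dim)   # keys/values from patch tokens
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, dim) patch features from the ViT backbone
        B = patch_tokens.size(0)
        q = self.q(self.class_tokens.expand(B, -1, -1))       # (B, C, dim)
        k, v = self.kv(patch_tokens).chunk(2, dim=-1)          # (B, N, dim) each
        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, C, N)
        attn = attn.softmax(dim=-1)                            # per-class attention map
        class_feats = attn @ v                                 # (B, C, dim)
        # Non-parametric scoring: the L2 norm of each class token is its score.
        scores = class_feats.norm(p=2, dim=-1)                 # (B, C)
        return scores, attn

# Usage: per-class saliency comes from reshaping a class's attention row to the patch grid.
# model = MCTCrossAttentionSketch(dim=768, num_classes=80)
# scores, attn = model(torch.randn(2, 196, 768))  # attn[:, c] -> saliency for class c
```

Because each class token only scores its own category, the attention row for class c doubles as that class's saliency map, which is the per-class interpretability the abstract refers to.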
