Graph matching has recently emerged as an attractive technique for various computer vision tasks. Graph-matching-based multi-label image classification, in particular, treats each image as a bag of instances and reformulates classification as an instance-label matching selection problem, achieving state-of-the-art results on diverse benchmarks. However, the generalization and scalability of such learned models cannot be well guaranteed, owing to their manually predetermined graph structure and the high-dimensional embedding of dense connections between instances and labels. To address these limitations, we propose a novel Transformer Driven Matching Selection framework for Multi-Label Image Classification (C-TMS), in which instance structural relationships, class-wise global dependencies, and the co-occurrence possibility of varying instance-label assignments are considered simultaneously in a unified and adaptive manner. Moreover, the parallelization capability of the Transformer enables efficient computation, making our model scalable to large-scale datasets. Specifically, we first represent instances and labels as nodes in the visual space and the label space, respectively, and then compute the hidden representation of each node in its own space by applying self-attention over its entire neighborhood. Subsequently, cross-attention is adopted to uncover the correct assignments between instances and labels and to interpret how classifying each label depends on the instances within an image and on its interactions with other labels. Finally, an asymmetric focal loss is designed to optimize the instance-label correspondence and to read out image-level category confidences. Extensive experiments on various multi-label image datasets demonstrate the superiority of the proposed method.
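The pipeline described above (self-attention within each space, cross-attention between labels and instances, and an asymmetric focal loss on the read-out confidences) can be illustrated with a minimal NumPy sketch. This is not the C-TMS implementation: the single-head attention without learned projections, the random features, the hypothetical readout vector `w`, and the particular loss form (separate focusing exponents `gamma_pos` and `gamma_neg` for positive and negative labels) are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    # Single-head scaled dot-product attention.
    # Assumption: no learned projection matrices, for brevity.
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights


def asymmetric_focal_loss(probs, targets, gamma_pos=0.0, gamma_neg=2.0, eps=1e-8):
    # Assumed form: binary focal loss with different focusing
    # exponents for positive and negative labels.
    p = np.clip(probs, eps, 1.0 - eps)
    pos = targets * (1.0 - p) ** gamma_pos * np.log(p)
    neg = (1.0 - targets) * p ** gamma_neg * np.log(1.0 - p)
    return float(-(pos + neg).mean())


rng = np.random.default_rng(0)
num_instances, num_labels, dim = 5, 3, 8

# Hypothetical node features: instances in the visual space,
# labels in the label space (random stand-ins here).
instances = rng.normal(size=(num_instances, dim))
labels = rng.normal(size=(num_labels, dim))

# Self-attention inside each space: every node attends to all others.
inst_h, _ = attention(instances, instances, instances)
label_h, _ = attention(labels, labels, labels)

# Cross-attention: each label queries the instances of the image.
# assign[i, j] is the soft assignment weight of instance j to label i.
matched, assign = attention(label_h, inst_h, inst_h)

# Hypothetical linear readout to image-level category confidences.
w = rng.normal(size=dim)
probs = 1.0 / (1.0 + np.exp(-(matched @ w)))

targets = np.array([1.0, 0.0, 1.0])  # made-up ground-truth labels
loss = asymmetric_focal_loss(probs, targets)
```

Each row of `assign` sums to one, so it can be read as a soft selection over the instances for a given label. With `gamma_neg` larger than `gamma_pos`, confidently rejected negatives contribute less to the loss than positives do, which is one common way to handle the imbalance between present and absent labels.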