Multi-label image recognition is a fundamental but challenging task in computer vision and multimedia. Great progress has been achieved by exploiting correlations among the multiple labels associated with a single image, which is the most crucial issue for the task. In this paper, to model label correlations explicitly, we propose a unified deep learning framework that Disentangles, Embeds and Ranks (DER) the corresponding label cues. Specifically, we first obtain class-aware disentangled maps (CADMs) by reforming deep activations in accordance with the class-specific recognition weights. Then, after transforming the CADMs into the corresponding label vectors, we propose an embedding operation, formulated from a metric learning perspective, that pulls relevant label vectors together and pushes irrelevant label vectors away. Furthermore, we employ a ranking operation that aims to measure the similarity and dissimilarity of these label vectors accurately and robustly. Our model can be trained end-to-end with only image-level supervision, during which the proposed embedding and ranking operations contribute to learning the CADMs through back-propagation. In addition, the obtained CADMs are aggregated and used as an essential feature stream for the final multi-label classification. We conduct extensive experiments on three commonly used multi-label benchmark datasets. Quantitative results show that our model significantly and consistently outperforms previous competitive methods. Moreover, qualitative analysis also demonstrates the effectiveness of the proposed DER model.
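The disentangle and embed steps summarized above can be illustrated with a minimal NumPy sketch. It is not the formulation used in the paper, which the abstract does not specify. It assumes that CADMs are obtained by reweighting each activation channel with the classifier weight of the corresponding class (in the spirit of class activation mapping), that label vectors come from global average pooling, and that "relevant" means labels present in the image. The function names, the cosine similarity, and the `margin` parameter are hypothetical choices made for illustration, and the ranking operation is omitted.

```python
import numpy as np

def class_aware_maps(activations, class_weights):
    """Reweight activations per class (assumed CAM-style disentangling).

    activations: (C, H, W) feature maps; class_weights: (K, C).
    Returns an array of shape (K, C, H, W).
    """
    return class_weights[:, :, None, None] * activations[None, :, :, :]

def label_vectors(maps):
    """Global-average-pool each class's maps into one C-dim vector: (K, C)."""
    return maps.mean(axis=(2, 3))

def pairwise_cosine(vectors, eps=1e-8):
    """Cosine similarity between all pairs of row vectors: (K, K)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, eps)
    return unit @ unit.T

def pull_push_loss(vectors, labels, margin=0.5):
    """Hypothetical metric-learning loss over label vectors.

    Pairs of present labels are pulled together (1 - similarity);
    present/absent pairs are pushed apart with a hinge on the margin.
    labels: (K,) binary indicator of which labels the image carries.
    """
    sim = pairwise_cosine(vectors)
    present = np.asarray(labels).astype(bool)
    off_diagonal = ~np.eye(len(present), dtype=bool)
    pos_mask = np.outer(present, present) & off_diagonal
    neg_mask = np.outer(present, ~present)
    pull = (1.0 - sim[pos_mask]).sum()
    push = np.maximum(0.0, sim[neg_mask] - margin).sum()
    n_pairs = int(pos_mask.sum() + neg_mask.sum())
    return (pull + push) / max(n_pairs, 1)
```

Under these assumptions, summing the per-class maps over the channel axis recovers a standard class activation map for each label. In an end-to-end setting the loss would be written in an automatic differentiation framework so that its gradient reaches the maps through back-propagation.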