Traditional text-based person re-identification relies on identity labels. However, annotating large datasets is often impractical, since identity annotation is expensive and time-consuming. Weakly supervised text-based person re-identification, where only text-image pairs are available without identity annotations, is therefore highly practical in real-world settings. In the weakly supervised setting, two issues must be addressed: the difficulty of alignment caused by the gap between modalities, and the cross-modal matching ambiguity caused by the lack of identity labels. In this paper, we propose a Similarity Regulation and Calibration Alignment (SRCA) framework, which consists of two unimodal encoders, one for images and one for text, and a multi-modal encoder for the masked language modelling task. First, a Similarity Regulation (SR) strategy relaxes the strict one-to-one constraints on the local similarities between different pairs by introducing a novel soft objective. The soft objective adjusts the hard objectives to achieve soft cross-modal alignment by establishing a many-to-many relationship between the two modalities. Second, a Calibration Alignment (CA) module improves intra-class compactness by modelling pseudo-label assignment as an optimal transport problem. The ambiguity of cross-modal matching is reduced by aligning the features and pseudo-labels of the different modalities and gradually calibrating the distribution of pseudo-labels. Experimental results show that our method achieves clear improvements over existing methods and performs competitively with fully supervised methods.
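To illustrate what modelling pseudo-label assignment as optimal transport can look like, the following is a minimal sketch using Sinkhorn-Knopp iterations with uniform marginals. It is not the CA module itself: the function name `sinkhorn_pseudo_labels`, the parameters `epsilon` and `n_iters`, and the assumption of equally sized identity clusters are all hypothetical choices made for this example.

```python
import math


def sinkhorn_pseudo_labels(scores, epsilon=0.5, n_iters=200):
    """Convert an N x K score matrix into soft pseudo-labels.

    scores[i][j] is the similarity of sample i to identity cluster j
    (hypothetical input). The result is an entropy-regularised transport
    plan in which each sample spreads unit mass over the clusters and,
    by assumption, every cluster receives the same total mass N / K.
    """
    n, k = len(scores), len(scores[0])
    # Gibbs kernel; subtracting the maximum keeps exp() numerically stable.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]

    for _ in range(n_iters):
        # Column step: every cluster gets total mass 1 / K.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every sample gets total mass 1 / N.
        row = [sum(r) for r in q]
        q = [[v / (row[i] * n) for v in r] for i, r in enumerate(q)]

    # Rescale so each row is a probability distribution over clusters.
    return [[v * n for v in r] for r in q]


if __name__ == "__main__":
    # Toy similarities for four samples and two clusters (made-up numbers).
    toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.6], [0.2, 0.3]]
    for soft_label in sinkhorn_pseudo_labels(toy_scores):
        print([round(p, 3) for p in soft_label])
```

The balanced-cluster constraint is what prevents every sample from collapsing onto one pseudo-label. A smaller `epsilon` gives sharper, closer to one-hot assignments at the cost of slower convergence.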