Unsupervised domain adaptation (UDA) aims to make full use of labeled source-domain data to classify unlabeled target-domain data. With the success of the Transformer in various vision tasks, existing UDA methods adopt the strong Transformer framework to learn globally domain-invariant feature representations at the domain level or the category level. Among its components, cross-attention plays a key role in cross-domain feature alignment, owing to its robustness. Intriguingly, we find that this robustness makes the model insensitive to the sub-grouping property within the same category across the source and target domains, known as the subdomain structure. This is because robust attention treats some fine-grained information as noise and discards it. To overcome this shortcoming, we propose an end-to-end Cross-Subdomain Transformer framework (CSTrans) that exploits the transferability of subdomain structures and the robustness of cross-attention to calibrate inter-domain features. Specifically, this paper makes two innovations. First, we devise an efficient Index Matching Module (IMM) to compute cross-attention between same-category samples from different domains and learn domain-invariant representations. This not only simplifies the traditionally daunting image-pair selection but also paves a safer way to preserve fine-grained subdomain information, because the IMM implements reliable feature confusion. Second, we introduce discriminative clustering to mine the subdomain structures within the same category and further learn subdomain-level discrimination. The two components cooperate with each other, requiring fewer training stages. We perform extensive studies on five benchmarks; the experimental results show that, compared with existing UDA siblings, CSTrans attains remarkable results, with average classification accuracies of 94.3%, 92.1%, and 85.4% on Office-31, ImageCLEF-DA, and Office-Home, respectively.
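The abstract does not include code, so the following is only a minimal sketch of the general idea behind index-matched cross-attention as described above: target queries attend to source features of the same (pseudo-)category, selected by matching label indices rather than by exhaustive image-pair search. All names, shapes, and the matching rule here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of index-matched cross-attention (not the authors' code).
# Assumption: same-category source/target samples are paired by label index,
# then fused with standard scaled dot-product cross-attention.
import torch
import torch.nn.functional as F

def index_match(src_labels: torch.Tensor, tgt_pseudo_labels: torch.Tensor) -> torch.Tensor:
    """For each target sample, return the index of a source sample sharing
    its (pseudo-)category, or -1 if no same-category source sample exists."""
    match = torch.full_like(tgt_pseudo_labels, -1)
    for i, c in enumerate(tgt_pseudo_labels.tolist()):
        candidates = (src_labels == c).nonzero(as_tuple=True)[0]
        if len(candidates) > 0:
            match[i] = candidates[0]  # take the first same-category source sample
    return match

def cross_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product cross-attention."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Usage: target patch tokens query matched same-category source tokens.
B, N, D = 4, 16, 64                       # batch, tokens per image, feature dim
src_feat = torch.randn(B, N, D)           # source patch features
tgt_feat = torch.randn(B, N, D)           # target patch features
src_labels = torch.tensor([0, 1, 2, 1])   # ground-truth source labels
tgt_pseudo = torch.tensor([1, 1, 0, 2])   # pseudo-labels for target samples

idx = index_match(src_labels, tgt_pseudo)
valid = idx >= 0                          # skip targets with no category match
fused = tgt_feat.clone()
fused[valid] = cross_attention(tgt_feat[valid], src_feat[idx[valid]], src_feat[idx[valid]])
```

In this sketch, the index matching reduces pair selection to a label lookup, which is the property the abstract credits with simplifying image-pair selection; how CSTrans actually forms and weights these pairs is detailed only in the full paper.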