Abstract

Unsupervised Domain Adaptation (UDA) is a popular machine learning technique for reducing the distribution discrepancy between domains. Previous UDA methods use either convolutional neural networks (CNNs) or vision transformers (ViTs) as the backbone, and are therefore subject to the inherent biases of a single architecture. In fact, CNNs are biased towards local textures, while ViTs are inclined to learn the shape information of images. In this paper, we argue that a model combining these two complementary biases is closer to the human visual system and achieves more robust performance. We propose a novel mutual distillation method for UDA that requires no auxiliary networks. Mutual distillation between two backbones with complementary properties (i.e., a CNN and a ViT) allows them to promote each other, leading to better domain knowledge transfer. Additionally, traditional domain-mixup approaches can only mix limited cross-domain information through linear interpolation. To encourage more cross-domain interaction between the two backbones and bridge the domain gap, we propose a patch-mixup method that builds a series of mixed intermediate domains composed of augmented patches. Furthermore, a cross-domain semantic alignment loss is proposed to align the semantic information across these domains. Extensive experiments show that our method achieves state-of-the-art results on several standard UDA benchmarks, such as Office-31, Office-Home, ImageCLEF-DA and VisDA-2017.
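The abstract does not give the exact formulations of patch-mixup or the mutual distillation objective, so the following is only a minimal PyTorch sketch of the two ideas as described above; the patch size, mix ratio, temperature, and the symmetric KL form of the distillation loss are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F


def patch_mixup(source, target, patch_size=16, mix_ratio=0.5):
    """Mix source and target images at the patch level.

    Randomly replaces a fraction of source patches with the corresponding
    target patches, producing an intermediate-domain image.
    `patch_size` and `mix_ratio` are assumed defaults for illustration.
    """
    b, c, h, w = source.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_h, n_w = h // patch_size, w // patch_size
    # Bernoulli mask over the patch grid: 1 = take the target patch.
    mask = (torch.rand(b, 1, n_h, n_w, device=source.device) < mix_ratio).float()
    mask = F.interpolate(mask, scale_factor=patch_size, mode="nearest")
    mixed = source * (1 - mask) + target * mask
    return mixed, mask


def mutual_distillation_loss(logits_cnn, logits_vit, temperature=2.0):
    """Symmetric KL distillation between the CNN and ViT branches (assumed form)."""
    log_p_cnn = F.log_softmax(logits_cnn / temperature, dim=1)
    log_p_vit = F.log_softmax(logits_vit / temperature, dim=1)
    loss = F.kl_div(log_p_cnn, log_p_vit.exp(), reduction="batchmean") \
         + F.kl_div(log_p_vit, log_p_cnn.exp(), reduction="batchmean")
    return (temperature ** 2) * loss
```

In this sketch the mixed images would be fed to both backbones, with the distillation loss applied to their predictions so that the texture-biased CNN and the shape-biased ViT supervise each other on the intermediate domains.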
