Abstract

Vision transformers (ViTs) suffer from the over-smoothing problem, which reduces their capacity by mapping input patches to similar latent representations. Existing methods introduce regularization terms to alleviate over-smoothing but often increase computational cost. To address this, this paper proposes PatchSkip, a novel and flexible dropout variant that alleviates the over-smoothing problem of ViTs in a lightweight manner. Specifically, PatchSkip draws inspiration from the observation that the analogous over-smoothing problem in GNNs is primarily caused by static adjacency matrices, which confine message passing between nodes to a single mode. PatchSkip constructs graphs from patch embeddings and analyzes the resulting adjacency matrices in ViTs. By randomly selecting patch embeddings to bypass transformer blocks, PatchSkip is proven to generate varied adjacency matrices and thus acts as a multi-mode message-passing engine, providing diverse modes of message passing between patches. The effectiveness of PatchSkip in preventing over-smoothing is demonstrated through theoretical proofs and empirical visualizations. Furthermore, PatchSkip is evaluated on various datasets and backbones, showing significant performance improvements while reducing computational cost. For example, when trained on Tiny-ImageNet from scratch, PatchSkip improves the performance of the vanilla CrossViT by 3.85% while reducing computational cost by over 20%.
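As a rough illustration of the mechanism the abstract describes, the following minimal PyTorch sketch shows one way patch tokens could randomly bypass a transformer block. This is not the authors' implementation: the class name `PatchSkip`, the `skip_prob` parameter, and the choice to share one skip mask across the batch are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PatchSkip(nn.Module):
    """Hypothetical sketch: wraps a transformer block so that a random
    subset of patch tokens bypasses the block during training."""

    def __init__(self, block: nn.Module, skip_prob: float = 0.25):
        super().__init__()
        self.block = block          # any module mapping (B, N, D) -> (B, N, D)
        self.skip_prob = skip_prob  # assumed hyperparameter, not from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        if not self.training or self.skip_prob == 0.0:
            return self.block(x)
        n = x.shape[1]
        # Sample one skip mask shared across the batch so the kept tokens
        # form a rectangular tensor (a simplification for this sketch).
        keep = torch.rand(n, device=x.device) >= self.skip_prob
        if keep.all():
            return self.block(x)
        out = x.clone()
        # Only kept tokens pass through the block; skipped tokens are
        # carried over unchanged. This shortens the attention sequence
        # (saving compute) and changes which patches exchange messages,
        # i.e., it perturbs the effective adjacency matrix each step.
        out[:, keep] = self.block(x[:, keep])
        return out
```

Details such as per-sample masks, how `skip_prob` is scheduled across depth, and how class tokens are handled are design choices the abstract does not specify, so they are left out of this sketch.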
