Abstract

This work investigates the singularity issue latent in global attention-based Transformers. Empirical and theoretical analyses show that interrelationships among token channels give rise to singularities that impede the training of attention weights. Concretely, similar neighboring pixels within image patches form intercorrelated channels after flattening, and images dominated by a single color can likewise possess correlated channels. Furthermore, the fixed global connection architecture preserves these correlation relationships, so the singularities persist. High singularity risks reducing Transformers’ performance and robustness. Based on the singularity analysis, we propose the Token Singularity Removal (TSR) strategy. It combines a Dual-Tree Complex Wavelet Transform (DTCWT) stem with a Feature Decorrelation (FD) loss to encourage Transformers to learn tokens with uncorrelated channels, thereby eliminating singularities. Experiments across multiple image classification datasets and corrupted-image benchmarks demonstrate improved accuracy and robustness for Transformers trained with the TSR strategy. Our code is publicly available at https://github.com/wdanc/TSR.
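To make the FD loss concrete, the sketch below shows one plausible reading of a decorrelation penalty: it measures the correlation matrix of token channels and penalizes its off-diagonal entries, pushing channels toward being mutually unrelated. This is a minimal illustration of the idea described in the abstract, not the authors' implementation; the function name, normalization, and weighting are assumptions (the exact formulation is in the linked repository).

```python
# Hypothetical sketch of a feature-decorrelation penalty in the spirit of
# the FD loss; names and normalization details are assumptions, not the
# authors' code (see https://github.com/wdanc/TSR for the actual method).
import torch

def feature_decorrelation_loss(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, num_tokens, channels) token embeddings."""
    b, n, c = tokens.shape
    x = tokens.reshape(b * n, c)
    x = x - x.mean(dim=0, keepdim=True)           # center each channel
    x = x / (x.std(dim=0, keepdim=True) + 1e-6)   # unit variance per channel
    corr = (x.T @ x) / (b * n - 1)                # (c, c) channel correlation
    off_diag = corr - torch.diag(torch.diagonal(corr))
    return (off_diag ** 2).sum() / (c * (c - 1))  # penalize cross-channel correlation

# Usage (assumed): add to the task loss with a weighting coefficient,
# e.g. total_loss = task_loss + lambda_fd * feature_decorrelation_loss(tokens)
```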
