Abstract
Transformers, originally designed for natural language processing, have recently been explored for computer vision. Vision Transformers (ViTs) play an increasingly important role in fields such as computer vision, multimodal fusion, and multimedia analysis. However, to obtain promising performance, most existing ViTs rely on artificially filtered high-quality images, leaving them vulnerable to inherent noise, and such well-curated images are not always available in practice. To this end, we propose a Robust ViT (RViT) that focuses on learning relevant and robust representations for image classification tasks. Specifically, we first develop a novel Denoising VTUnet module, which models non-robust noise as uncertainty under variational conditions. Furthermore, we design a fusion transformer backbone with a tailored fusion attention mechanism to perform image classification effectively based on the extracted robust representations. To demonstrate the superiority of our model, we conduct comparative experiments on several popular datasets. Benefiting from the sequence regularity of the Transformer and the captured robust features, the proposed method outperforms competing Transformer-based models on visual tasks.