Abstract

Vision Transformers (ViTs) have recently demonstrated state-of-the-art performance in various vision tasks, replacing convolutional neural networks (CNNs). However, because ViTs have a different architectural design than CNNs, they may behave differently. To investigate whether ViTs differ from CNNs in performance or robustness, we tested both under various imaging conditions in practical vision tasks. We confirmed that for most image transformations, ViT’s robustness was comparable to or even better than that of CNN. For contrast enhancement, however, ViT performed particularly poorly. We show that this is because the positional embedding in ViT’s patch embedding can work improperly when the color scale changes. We demonstrate that PreLayerNorm, a modified patch embedding structure, ensures consistent behavior of ViT. Our results show that ViT with PreLayerNorm is more robust in contrast-varying environments.
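The scale sensitivity described above can be illustrated with a small sketch. It rests on assumptions the abstract does not state: that PreLayerNorm means layer normalization applied to each flattened patch before the linear projection, that the normalization has no learned scale or shift, and that a contrast change can be approximated by uniformly scaling pixel values. The names `patch_embed`, `layer_norm`, and `linear` are hypothetical helpers, not the authors' code.

```python
import math
import random


def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight):
    """Project vec (length d_in) with a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(d_out)]


def patch_embed(patch, weight, pos, pre_layer_norm=False):
    """Linear patch projection plus positional embedding.

    With pre_layer_norm=True the patch is normalized first (assumed PreLayerNorm).
    """
    x = layer_norm(patch) if pre_layer_norm else patch
    return [p + q for p, q in zip(linear(x, weight), pos)]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)
d_in, d_out = 12, 4  # toy sizes, chosen only for illustration
patch = [random.random() for _ in range(d_in)]
weight = [[random.gauss(0, 0.1) for _ in range(d_out)] for _ in range(d_in)]
pos = [random.gauss(0, 0.1) for _ in range(d_out)]

# Crude stand-in for a contrast change: scale every pixel value by 3.
scaled = [3.0 * p for p in patch]

# Without normalization, the projection scales with the input while the
# positional embedding stays fixed, so their relative weight shifts.
plain_shift = max_abs_diff(patch_embed(patch, weight, pos),
                           patch_embed(scaled, weight, pos))

# With normalization first, the scaled patch maps to nearly the same embedding.
pre_ln_shift = max_abs_diff(patch_embed(patch, weight, pos, pre_layer_norm=True),
                            patch_embed(scaled, weight, pos, pre_layer_norm=True))

print(f"shift without PreLayerNorm: {plain_shift:.6f}")
print(f"shift with PreLayerNorm:    {pre_ln_shift:.6f}")
```

In this toy setting the normalized variant is almost unchanged by the scaling, while the plain variant's embedding moves relative to the fixed positional term. Real contrast enhancement is not a pure scaling, so this shows only the intuition, not the reported results.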
