Abstract

This paper proposes several techniques that enable Vision Transformers (ViT) to learn small-size datasets from scratch successfully. ViT, which applies the Transformer architecture to image classification, has recently outperformed convolutional neural networks. However, the high performance of ViT rests on pre-training with large-size datasets, and this dependence on large data stems from its low locality inductive bias. In addition, conventional ViT cannot effectively attend to the target class because a rather high, constant temperature factor produces redundant attention. To improve the locality inductive bias of ViT, this paper proposes a novel tokenization (Shifted Patch Tokenization: SPT) using shifted patches and a position encoding (CoordConv Position Encoding: CPE) using 1×1 CoordConv. To address the poor attention, we further propose a new self-attention mechanism (Locality Self-Attention: LSA) based on a learnable temperature and self-relation masking. SPT, CPE, and LSA are intuitive techniques, yet they successfully improve the performance of ViT even on small-size datasets. We qualitatively show that each technique attends to more important regions and contributes to a flatter loss landscape. Moreover, the proposed techniques are generic add-on modules applicable to various ViT backbones. Our experiments show that, when learning Tiny-ImageNet from scratch, the proposed scheme based on SPT, CPE, and LSA increases the accuracy of ViT backbones by +3.66 on average and by up to +5.7. The performance improvements of ViT backbones on ImageNet-1K classification, on COCO when learning from scratch, and in transfer learning on classification datasets further verify the excellent generalization ability of the proposed method.
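To make two of the described mechanisms more concrete, the sketch below illustrates how Shifted Patch Tokenization and Locality Self-Attention could look in PyTorch. It is a minimal sketch based only on the abstract, not the authors' reference implementation: the class names, the cyclic half-patch shift, the four diagonal shift directions, the head count, and the sqrt(d)-style initialization of the learnable temperature are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class ShiftedPatchTokenization(nn.Module):
    """Sketch of SPT: shift the image in four diagonal directions (shift size
    and cyclic rolling are assumptions), concatenate with the original along
    the channel axis, then patch-embed the result."""

    def __init__(self, in_chans=3, embed_dim=192, patch_size=16):
        super().__init__()
        self.shift = patch_size // 2  # assumed shift of half a patch
        # 5 images (original + 4 shifted copies) are concatenated before projection.
        self.proj = nn.Conv2d(in_chans * 5, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        s = self.shift
        shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]  # four diagonal shifts
        shifted = [torch.roll(x, shifts=sh, dims=(2, 3)) for sh in shifts]
        x = torch.cat([x] + shifted, dim=1)            # (B, 5*C, H, W)
        x = self.proj(x)                               # (B, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)               # (B, N, D)
        return self.norm(x)


class LocalitySelfAttention(nn.Module):
    """Sketch of LSA: standard multi-head self-attention with (1) a learnable
    temperature replacing the constant scaling factor and (2) self-relation
    masking, i.e. the diagonal of the attention score matrix is suppressed."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # Learnable temperature, initialized to the usual sqrt(head_dim) (assumption).
        self.temperature = nn.Parameter(torch.tensor(head_dim ** 0.5))
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.temperature
        # Self-relation masking: each token may not attend to itself.
        diag = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(diag, float('-inf'))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Illustrative usage (shapes only; 224x224 input with 16x16 patches gives 196 tokens).
tokens = ShiftedPatchTokenization(embed_dim=192)(torch.randn(2, 3, 224, 224))
out = LocalitySelfAttention(dim=192)(tokens)           # (2, 196, 192)
```

Masking the diagonal forces each token to spread its attention over the other tokens, which, together with the learnable temperature, counteracts the smoothed, redundant attention that the abstract attributes to a fixed, rather high temperature factor.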
