Abstract

Over the past three years, a variety of vision transformers (ViTs) have been proposed and applied to many vision tasks. However, on x86 CPU platforms, current ViTs fail to balance inference efficiency and accuracy effectively, because they are designed specifically for high-end GPU platforms or mobile platforms. In this paper, we explore high-performance transformer architectures for CPU platforms via human-machine collaborative design. First, we identify CPU-unfriendly design choices in existing ViTs through several controlled experiments. Guided by these observations, we then manually design two CPU-efficient attention units as building blocks for transformers. Finally, on top of these basic units, we present a CPU-Aware network search method that automatically searches for the global architectures of transformers. Through this human-machine collaborative process, we obtain a family of ViT instances. Benefiting from the human-designed basic units and the automatically searched global structures, our ViT instances achieve a more favorable trade-off between CPU latency and accuracy, and are more competitive than their CNN-based and transformer-based counterparts on x86 CPUs. This suggests that the proposed human-machine collaborative solution is a feasible way to design vision transformers tailored for specific hardware.
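The abstract does not include the paper's benchmarking code; the following minimal sketch illustrates how a controlled CPU-latency comparison of the kind alluded to above might look in PyTorch. All details here are our assumptions rather than the paper's setup: the tensor shapes, the 14×14 token grid, the thread count, and the choice of baselines (a standard multi-head attention block versus a depthwise convolution of comparable cost).

```python
# Illustrative sketch only: times a standard multi-head self-attention block
# against a depthwise convolution on CPU. Shapes and layers are assumptions.
import time
import torch
import torch.nn as nn

torch.set_num_threads(4)   # fix thread count for reproducible CPU timing

B, N, C = 1, 196, 384      # batch, tokens (14x14 grid), embedding dim (assumed)
x_seq = torch.randn(B, N, C)
x_img = torch.randn(B, C, 14, 14)

attn = nn.MultiheadAttention(embed_dim=C, num_heads=6, batch_first=True).eval()
conv = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C).eval()  # depthwise

def cpu_latency_ms(fn, warmup=10, iters=100):
    """Average forward-pass latency in milliseconds on the CPU."""
    with torch.no_grad():
        for _ in range(warmup):
            fn()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
    return (time.perf_counter() - start) / iters * 1e3

print(f"self-attention: {cpu_latency_ms(lambda: attn(x_seq, x_seq, x_seq)):.2f} ms")
print(f"depthwise conv: {cpu_latency_ms(lambda: conv(x_img)):.2f} ms")
```

Comparing such per-operator latencies across input resolutions and embedding widths is one plausible way to surface the CPU-unfriendly design choices the authors describe.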
