Abstract

Deploying a deep learning model relies on highly optimized implementations of all tensor operators in the model, especially on mobile devices with limited hardware resources. To relieve the burden of manual optimization, researchers have proposed deep learning compilers that automatically optimize operators with auto-tuning and code-generation techniques. The auto-tuning system typically constructs a large tuning space, from which various tensor programs are sampled and evaluated to find the best implementation. Unfortunately, this process is quite time-consuming, often requiring hours for a single operator. To address this issue, this paper presents XFC, a framework that enables fast performance tuning through operator synthesis. The key idea of XFC is to abstract the hand-tuning process and generate tensor programs by bottom-up, hierarchical construction. We implemented XFC on top of TVM and conducted extensive experiments to verify its effectiveness. The experiments show that, across various operators and mobile devices, XFC significantly reduces tuning time from hours to seconds (over 700× speedup) while achieving comparable execution performance for the compiled operators (11.5% performance loss on average).
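The abstract contrasts sample-and-measure auto-tuning with XFC's direct construction. As a rough illustration of why the former is slow, the following is a minimal sketch of a generic auto-tuning loop; the function names and structure here are hypothetical simplifications, not XFC's or TVM's actual API, and real tuners use cost models and guided search rather than pure random sampling.

```python
import random


def autotune(candidates, measure, budget):
    """Toy auto-tuning loop: sample candidate tensor programs from a
    large tuning space and keep the fastest one measured so far.

    Illustrative only: each call to `measure` stands in for an
    expensive on-device benchmark, which is why thousands of trials
    can add up to hours for a single operator.
    """
    best_prog, best_cost = None, float("inf")
    for _ in range(budget):
        prog = random.choice(candidates)   # sample from the tuning space
        cost = measure(prog)               # expensive hardware measurement
        if cost < best_cost:
            best_prog, best_cost = prog, cost
    return best_prog, best_cost
```

The tuning cost scales with `budget` times the per-measurement latency; approaches like XFC avoid this loop by constructing a good program directly instead of searching for one.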
