Abstract

Deep learning compilers with auto-tuners can generate high-performance programs, particularly tensor programs on accelerators. However, the performance of these tensor programs is sensitive to both tensor shapes and hardware resources. When tensor shapes are known only at runtime rather than at compile time, auto-tuners must tune the tensor programs for every possible shape, incurring significant time and cost overhead. Moreover, a tensor program tuned for one device often loses performance when deployed on a different device. To address these challenges, we propose HAOTuner, a hardware-adaptive deep learning operator auto-tuner designed for dynamic shape tensors. We adopt micro-kernels as the unit of task allocation and observe that micro-kernel size greatly affects performance. HAOTuner therefore determines micro-kernel sizes based not only on tensor shapes but also on the available hardware resources. Specifically, we present an algorithm that selects hardware-friendly micro-kernels as candidates, reducing tuning time. We also design a hardware-resource-aware cost model that supports various hardware architectures. Furthermore, we provide a model transfer solution that enables fast deployment of the cost model on different hardware platforms. We evaluate HAOTuner on six different types of GPUs. The experiments demonstrate that HAOTuner surpasses the state-of-the-art dynamic-shape tensor auto-tuner by an average of 26% in running time and 25% in tuning time. HAOTuner also outperforms the state-of-the-art padding-based compiler by an average of 39% in running time and 6× in tuning time.
