This paper introduces NLTSP, a deep learning-based cost model designed to optimize tensor program performance in deep learning compilers. NLTSP, short for Nested Loop Tree Structure Processing, facilitates tensor program tuning by extracting information directly from the nested loop tree structure of sampled programs. NLTSP extracts features upstream in the compilation flow and eliminates the need for complex feature engineering. By utilizing a unified format for CPU and GPU architectures and extracting simple high-level features, NLTSP significantly accelerates feature extraction while maintaining prediction accuracy. We integrated NLTSP into Ansor, a leading search framework in the TVM compiler, and evaluated it experimentally. Compared with TenSet MLP, the state-of-the-art cost model utilizing Ansor features as inputs, NLTSP extracts features on average 97.9 times faster on CPU and 41.4 times faster on GPU, and reduces the average search time for CPU and GPU workloads by factors of 2.50 and 4.11, respectively. It is worth noting that NLTSP is not specifically designed for Ansor. Any auto-tuning framework capable of representing scheduled tensor programs as nested loop trees can potentially benefit from using NLTSP to achieve superior performance. The code is available at https://github.com/xhq0/NLTSP.
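To make the idea of operating directly on the loop nest concrete, the minimal sketch below shows one possible way a scheduled tensor program could be encoded as a nested loop tree and flattened into simple per-node feature vectors. The class names, fields, and annotation set are illustrative assumptions for exposition only, not NLTSP's actual representation or feature set.

```python
# Hypothetical nested-loop-tree encoding and feature extraction sketch.
# All names and fields here are assumptions, not the paper's implementation.
from dataclasses import dataclass, field
from typing import List

# Illustrative mapping from loop annotations to integer ids.
ANNOTATION_IDS = {"serial": 0, "parallel": 1, "vectorize": 2, "unroll": 3}

@dataclass
class LoopNode:
    extent: int                                    # loop trip count
    annotation: str = "serial"                     # e.g. "parallel", "vectorize"
    children: List["LoopNode"] = field(default_factory=list)

def extract_features(node: LoopNode, depth: int = 0) -> List[List[float]]:
    """Pre-order traversal emitting one simple feature vector per loop node."""
    vec = [float(depth), float(node.extent),
           float(ANNOTATION_IDS.get(node.annotation, 0))]
    features = [vec]
    for child in node.children:
        features.extend(extract_features(child, depth + 1))
    return features

# Example: a two-level loop nest with a parallel outer loop and vectorized inner loop.
tree = LoopNode(extent=128, annotation="parallel",
                children=[LoopNode(extent=16, annotation="vectorize")])
print(extract_features(tree))  # [[0.0, 128.0, 1.0], [1.0, 16.0, 2.0]]
```

Because such features are read directly off the loop tree rather than computed from low-level program analysis, extraction stays cheap and the same format can describe both CPU and GPU schedules.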