Abstract
The past several years have witnessed tremendous success of transformer models in natural language processing (NLP), and their current landscape is increasingly diverse. Although GPUs have gradually become the dominant workhorse and de facto standard for deep learning, there are still many scenarios in which CPUs remain a prevalent choice. Recently, ARM many-core processors have begun migrating into cloud computing and high-performance computing, making them promising platforms for deploying transformer inference. In this paper, we identify several performance bottlenecks of existing inference runtimes on many-core CPUs, including low core usage, isolated thread configuration, inappropriate implementations of general matrix multiply (GEMM), and redundant computation for variable-length inputs. To tackle these problems, we conduct full-stack, cross-layer optimizations spanning the deep learning service level down to the neural network operator level. We explore multi-instance parallelization at the service level to improve CPU core usage. To improve the parallel efficiency of the inference runtime, we design NUMA-aware thread scheduling and a look-up table for optimal parallel configurations. The GEMM implementation is tailored for critical modules such as the self-attention module to exploit the characteristics of the transformer workload. To eliminate redundant computation, a novel storage format is designed and implemented to pack sparse data, and a load-balancing strategy is proposed for tasks with different sparsity. Experimental results show that our implementation outperforms existing solutions by 1.1x to 6x for different transformer-based models with fixed-length inputs. For variable-length inputs, it achieves 1.9x to 6x speedups on the Kunpeng 920 processor and 3x to 8x speedups on the Ampere Altra processor, depending on sequence length and batch size.
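The padding-free idea behind the variable-length optimization can be illustrated with a small sketch. The C++ snippet below is illustrative only: the `PackedBatch` layout and the `pack` helper are our own assumptions, not the storage format proposed in the paper. It packs a padded batch into a contiguous buffer plus per-sequence offsets so that downstream GEMMs touch only valid tokens.

```cpp
// Minimal sketch (not the paper's actual storage format) of padding-free batching:
// variable-length sequences are concatenated into one dense buffer, and per-sequence
// offsets replace padded rows, so redundant computation on padding is avoided.
#include <cstdio>
#include <vector>

struct PackedBatch {
    std::vector<float> data;     // concatenated token embeddings, no padding
    std::vector<int>   offsets;  // offsets[i] = first row of sequence i; size = batch + 1
    int hidden;                  // embedding width
};

// Pack a padded batch [batch, max_len, hidden], keeping only the first
// lengths[i] tokens of each sequence.
PackedBatch pack(const std::vector<float>& padded,
                 const std::vector<int>& lengths,
                 int max_len, int hidden) {
    PackedBatch out;
    out.hidden = hidden;
    out.offsets.push_back(0);
    for (size_t b = 0; b < lengths.size(); ++b) {
        const float* seq = padded.data() + b * max_len * hidden;
        out.data.insert(out.data.end(), seq, seq + lengths[b] * hidden);
        out.offsets.push_back(out.offsets.back() + lengths[b]);
    }
    return out;
}

int main() {
    // Two sequences of length 2 and 3, padded to max_len = 4, hidden = 2.
    std::vector<int> lengths = {2, 3};
    std::vector<float> padded(2 * 4 * 2, 0.0f);
    for (int i = 0; i < 2 * 2; ++i) padded[i] = 1.0f;          // sequence 0
    for (int i = 0; i < 3 * 2; ++i) padded[4 * 2 + i] = 2.0f;  // sequence 1

    PackedBatch packed = pack(padded, lengths, /*max_len=*/4, /*hidden=*/2);
    std::printf("packed rows: %d\n", packed.offsets.back());  // 5 of 8 padded rows survive
    return 0;
}
```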