Abstract

Model inference efficiency is one of the most important concerns for modern machine learning systems. Finding the optimal configuration in a huge search space is a major challenge, as combinations of kernel fusion, memory tiling, and thread-allocation strategies result in highly variable and hard-to-predict inference performance. The problem is particularly pronounced in models with large parameter matrices, such as Transformers. In this paper, we develop a general and powerful inference-optimization framework, called NIOT, that achieves desirable efficiency for prevailing Transformer-like models on CPUs. To take full advantage of modern CPU features such as SIMD units and the cache hierarchy, NIOT employs a variety of methods to derive promising optimization strategies tailored to the target Transformer model. Our C++ implementation of NIOT shows significant performance improvements over popular, well-optimized model-serving runtimes such as PyTorch and ONNXRuntime.
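To illustrate one axis of the configuration space the abstract refers to, the sketch below shows a cache-tiled matrix multiplication in C++ with a tunable tile size. This is not NIOT's actual implementation or API; it is only a minimal, hypothetical example of how a single tuning parameter (the tile size) interacts with the cache hierarchy and SIMD auto-vectorization, and hence why exhaustive configuration search is expensive.

```cpp
// Illustrative sketch only (not part of NIOT): a cache-tiled matrix
// multiplication whose tile size is one example of a performance-critical
// configuration knob, alongside fusion and thread-allocation choices.
#include <algorithm>
#include <cstddef>
#include <vector>

void tiled_matmul(const std::vector<float>& A,   // M x K, row-major
                  const std::vector<float>& B,   // K x N, row-major
                  std::vector<float>& C,         // M x N, row-major, zero-initialized
                  std::size_t M, std::size_t K, std::size_t N,
                  std::size_t TILE /* tuning parameter, e.g. 32 or 64 */) {
    for (std::size_t i0 = 0; i0 < M; i0 += TILE) {
        for (std::size_t k0 = 0; k0 < K; k0 += TILE) {
            for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
                const std::size_t iMax = std::min(i0 + TILE, M);
                const std::size_t kMax = std::min(k0 + TILE, K);
                const std::size_t jMax = std::min(j0 + TILE, N);
                // Work on a block small enough to stay resident in cache;
                // the innermost j-loop walks contiguous memory, which lets
                // the compiler auto-vectorize it with SIMD instructions.
                for (std::size_t i = i0; i < iMax; ++i) {
                    for (std::size_t k = k0; k < kMax; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < jMax; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

The best value of `TILE` depends on matrix shapes and cache sizes, and it interacts with how operators are fused and how work is split across threads, which is why the combined search space described above is large and its performance hard to predict.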
