Abstract
The widespread use of Deep Neural Networks (DNNs) has motivated the design of domain-specific hardware known as Artificial Intelligence (AI) processors. Such novel hardware is difficult to optimize for without a performance model that reveals its working details and their performance implications. This paper presents Verrocchio, a performance model for the Huawei DaVinci AI Core that predicts the execution time of real-world DaVinci kernels. We propose specially crafted micro-benchmarks to identify contention sources, runtime behaviors, and bandwidth sharing, all of which significantly affect performance. Because the DaVinci Core uses a binary semaphore mechanism for synchronization, Verrocchio treats each instruction as a discrete event and schedules its execution time according to the program logic. In our evaluation, Verrocchio achieves average error rates of 2.62% and 2.30% on sample kernels for single-core and double-core execution, respectively. We also demonstrate an optimization process for matrix multiplications guided by Verrocchio, achieving speedups of 1.70× for operators and 1.53× for applications, with error rates of 5.06% and 5.25%, respectively.
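To make the discrete-event idea concrete, the following is a minimal Python sketch of how instruction timing could be propagated across pipelines that synchronize through binary semaphores. It is an illustration under stated assumptions, not Verrocchio's actual implementation: the `Instr` type, the `simulate` function, the pipe names, and the cycle counts are all hypothetical, and the sketch assumes that setting a flag that is still pending stalls the setter until the flag is consumed. It also ignores contention and bandwidth sharing, which the abstract identifies as significant factors.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    kind: str        # "exec", "set", or "wait" (hypothetical instruction kinds)
    cycles: int = 0  # duration, used by "exec" only
    flag: str = ""   # semaphore name, used by "set" and "wait"


def simulate(pipes):
    """Return the finish time (in cycles) of each in-order pipe."""
    clock = {name: 0 for name in pipes}  # local time of each pipe
    pc = {name: 0 for name in pipes}     # next instruction index per pipe
    pending = {}                         # flag -> time at which it was set
    cleared = {}                         # flag -> time at which it was last consumed
    progressed = True
    while progressed:
        progressed = False
        for name, prog in pipes.items():
            while pc[name] < len(prog):
                ins = prog[pc[name]]
                if ins.kind == "exec":
                    clock[name] += ins.cycles
                elif ins.kind == "set":
                    if ins.flag in pending:
                        break  # assumption: binary flag still set, setter stalls
                    clock[name] = max(clock[name], cleared.get(ins.flag, 0))
                    pending[ins.flag] = clock[name]
                elif ins.kind == "wait":
                    if ins.flag not in pending:
                        break  # flag not set yet, waiter stalls
                    clock[name] = max(clock[name], pending.pop(ins.flag))
                    cleared[ins.flag] = clock[name]
                else:
                    raise ValueError(f"unknown instruction kind: {ins.kind}")
                pc[name] += 1
                progressed = True
    if any(pc[name] < len(prog) for name, prog in pipes.items()):
        raise RuntimeError("deadlock: a pipe is stalled on a flag forever")
    return clock


# Hypothetical two-pipe kernel: a load pipe feeds a compute pipe.
pipes = {
    "load": [Instr("exec", cycles=10), Instr("set", flag="ready")],
    "compute": [Instr("wait", flag="ready"), Instr("exec", cycles=20)],
}
finish = simulate(pipes)
```

Here the compute pipe cannot start until the load pipe sets `ready` at cycle 10, so the predicted kernel time is the maximum finish time over all pipes. A fuller model would additionally scale each `exec` duration by measured contention and bandwidth-sharing effects.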