Abstract

The highly diversified deep learning (DL) frameworks and target hardware architectures pose significant challenges for deploying DL models in industrial production. Continuous efforts have been made to develop DL compilers, with several state-of-the-art systems now available, e.g., TVM, Glow, nGraph, PlaidML, and Tensor Comprehensions (TC). Unlike traditional compilers, DL compilers take a DL model built by a DL framework as input and generate optimized code for a particular target device as output. Like other software, DL compilers are error-prone: buggy DL compilers can generate incorrect code and lead to unexpected model behaviors. To better understand the current status and common bug characteristics of DL compilers, we performed a large-scale empirical study of five popular DL compilers, namely TVM, Glow, nGraph, PlaidML, and TC, collecting a total of 2,717 actual bug reports submitted by users and developers. We made substantial manual effort to investigate these bug reports and classified them by root cause, identifying five root causes: environment, compatibility, memory, document, and semantic. After labeling the bug types, we further examined the consequences of each type and analyzed the correlation between bug types and their impacts. In addition, we studied the time required to fix each type of bug in DL compilers. We eventually obtained seven important findings and provide practical implications for both DL compiler developers and users.
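To illustrate the compilation workflow described above, the following is a minimal sketch using TVM's Relay API; the model file name, input name, and input shape are hypothetical placeholders, and this is an illustrative example rather than the exact setup used in the study.

```python
# Minimal sketch: a DL compiler (TVM) takes a framework-built model as input
# and produces optimized code for a particular target device.
# Assumptions: "model.onnx" is a hypothetical placeholder model with a single
# input named "input" of shape (1, 3, 224, 224); the target is a generic
# x86 CPU ("llvm").
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")              # DL model built by a DL framework
shape_dict = {"input": (1, 3, 224, 224)}          # input name/shape assumed for illustration
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile the high-level model into optimized code for the target device.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

lib.export_library("compiled_model.so")           # deployable artifact for the target
```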
