Abstract

State-of-the-art convolutional neural network (CNN) structures exhibit growing irregularity in their layer connections, arising from innovative manual designs and recently proposed neural architecture search approaches. Such irregular structures improve recognition accuracy but also bring challenges for hardware deployment, especially on CNN accelerators with regular architectures: 1) the complicated data dependency makes it nontrivial to decide the data reuse strategy between layers and 2) since the execution order of each network is not unique, the choice of layer scheduling, memory allocation, and loop tiling strategies greatly impacts hardware performance. These challenges cannot be solved by existing CNN schedulers, which mainly focus on the dataflow of a single layer. In this work, we propose a comprehensive framework to analyze and solve the mapping of an arbitrarily connected CNN to specific hardware accelerators. We propose: 1) a dynamic programming and node-clustering-based DAG partitioning approach to efficiently exploit interlayer data reuse and 2) a subgraph scheduling and on-chip memory allocation strategy to find the optimal execution order. With the modeling of CNN accelerators, we also propose a loop tiling approach for fused layers. An automated framework is established to generate binary machine codes from original CNN models produced by mainstream deep learning frameworks, and it can process large-scale CNNs with more than 1000 layers in only a few minutes. Experiments based on state-of-the-art accelerators (e.g., NVDLA) show that our techniques greatly reduce the external data transfer caused by interlayer dependencies and bring significant performance improvement over existing approaches.
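To illustrate the flavor of the dynamic-programming partitioning mentioned above, the sketch below solves a deliberately simplified instance: the network is a plain chain of layers (no branches), fusing a consecutive group keeps that group's intermediate feature maps on-chip, and a group is feasible only if those intermediates fit in the on-chip buffer. All sizes, the capacity model, and the cost model (DRAM traffic = each group's input plus output) are illustrative assumptions, not the paper's actual formulation, which handles arbitrary DAGs.

```python
def min_external_transfer(in_size, feat, cap):
    """Minimum total off-chip (DRAM) traffic for a chain of layers.

    Hypothetical model: feat[k] is the size of layer k's output feature
    map, in_size is the network input size, and cap is the on-chip buffer
    capacity available for a fused group's intermediate feature maps.
    Fusing layers i..j-1 into one group costs the group's input plus its
    final output; intermediates feat[i..j-2] must fit on-chip.
    """
    n = len(feat)
    INF = float("inf")
    dp = [INF] * (n + 1)  # dp[j]: best DRAM traffic for the first j layers
    dp[0] = 0
    for j in range(1, n + 1):
        for i in range(j):                  # try fusing layers i .. j-1
            if sum(feat[i:j - 1]) > cap:    # intermediates must stay on-chip
                continue
            group_in = in_size if i == 0 else feat[i - 1]
            dp[j] = min(dp[j], dp[i] + group_in + feat[j - 1])
    return dp[n]
```

With three layers of feature-map size 4 and a buffer of 8, fusing all three is feasible and only the network input and final output cross DRAM (`min_external_transfer(4, [4, 4, 4], 8) == 8`), whereas with no buffer every intermediate spills, tripling the traffic.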
