Abstract

State-of-the-art convolutional neural network (CNN) structures exhibit growing irregularity in their layer connections, arising from innovative manual designs and recently proposed neural architecture search approaches. Such irregular structures improve recognition accuracy but also bring challenges for hardware deployment, especially on CNN accelerators with regular architectures: 1) the complicated data dependency makes it nontrivial to decide the data reuse strategy between layers and 2) since the execution order of each network is not unique, the choice of layer scheduling, memory allocation, and loop tiling strategies greatly impacts hardware performance. These challenges cannot be solved by existing CNN schedulers, which mainly focus on the dataflow of a single layer. In this work, we propose a comprehensive framework to analyze and solve the mapping of an arbitrarily connected CNN to specific hardware accelerators. We propose: 1) a dynamic programming and node-clustering-based DAG partitioning approach to efficiently exploit interlayer data reuse and 2) a subgraph scheduling and on-chip memory allocation strategy to find the optimal execution order. With the modeling of CNN accelerators, we also propose a loop tiling approach for fused layers. An automated framework is established to generate binary machine codes from original CNN models produced by mainstream deep learning frameworks, and it can process large-scale CNNs with more than 1000 layers in only a few minutes. Experiments based on state-of-the-art accelerators (e.g., NVDLA) show that our techniques greatly reduce the external data transfer caused by interlayer dependencies and bring significant performance improvement over existing approaches.
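To illustrate the flavor of the dynamic-programming partitioning mentioned above, the sketch below solves a deliberately simplified instance: the network is a plain chain of layers (no branches), fusing a consecutive group keeps that group's intermediate feature maps on-chip, and a group is feasible only if those intermediates fit in the on-chip buffer. All sizes, the capacity model, and the cost model (DRAM traffic = each group's input plus output) are illustrative assumptions, not the paper's actual formulation, which handles arbitrary DAGs.

```python
def min_external_transfer(in_size, feat, cap):
    """Minimum total off-chip (DRAM) traffic for a chain of layers.

    Hypothetical model: feat[k] is the size of layer k's output feature
    map, in_size is the network input size, and cap is the on-chip buffer
    capacity available for a fused group's intermediate feature maps.
    Fusing layers i..j-1 into one group costs the group's input plus its
    final output; intermediates feat[i..j-2] must fit on-chip.
    """
    n = len(feat)
    INF = float("inf")
    dp = [INF] * (n + 1)  # dp[j]: best DRAM traffic for the first j layers
    dp[0] = 0
    for j in range(1, n + 1):
        for i in range(j):                  # try fusing layers i .. j-1
            if sum(feat[i:j - 1]) > cap:    # intermediates must stay on-chip
                continue
            group_in = in_size if i == 0 else feat[i - 1]
            dp[j] = min(dp[j], dp[i] + group_in + feat[j - 1])
    return dp[n]
```

With three layers of feature-map size 4 and a buffer of 8, fusing all three is feasible and only the network input and final output cross DRAM (`min_external_transfer(4, [4, 4, 4], 8) == 8`), whereas with no buffer every intermediate spills, tripling the traffic.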
