The Deep Learning Processor Unit (DPU) is a highly configurable CNN accelerator that supports a variety of CNNs and can be instantiated multiple times on the same FPGA. Many applications execute different CNNs concurrently, and in such a setting an execution time predictor can help optimize the DPU configurations to meet the performance requirements of the different tasks. We characterize CNN execution on DPUs and reduce the variability in execution time caused by interference from the operating system. We then propose a machine learning-based framework, EXPRESS, to predict the execution time of any given CNN on a DPU configuration, taking CNN, DPU, and bus characteristics into account. We extend EXPRESS to support heterogeneous CNNs in EXPRESS-2.0 by making the features independent of the number of CNNs. All experiments are based on data collected from a real FPGA board for 16 standard CNNs. Our frameworks, EXPRESS and EXPRESS-2.0, significantly outperform the state of the art, achieving average execution time prediction errors of 2.2% and 0.7%, respectively. We illustrate how this low prediction error enables design space exploration, which is valuable for embedded system application developers.
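The abstract does not specify the model or feature set EXPRESS uses; as a rough illustration of the approach it describes, the sketch below trains a regressor that maps per-(CNN, DPU-configuration) features to execution time and reports an average percentage error. The feature names, the random-forest model choice, and the synthetic training data are all assumptions for illustration, not the authors' implementation; in the paper, training data comes from measurements on a real FPGA board.

```python
# Minimal sketch (not the authors' implementation): a regression model that
# maps hypothetical CNN/DPU/bus features to execution time, in the spirit of
# the EXPRESS predictor. Features, model, and data are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(0)

# Hypothetical feature vector per (CNN, DPU configuration) pair:
# [total MACs, total weights (MB), num layers,     # CNN characteristics
#  DPU peak ops/cycle, clock MHz, num DPU cores,   # DPU configuration
#  bus width (bits), bus bandwidth (GB/s)]         # bus characteristics
X = rng.uniform(size=(500, 8))
# Synthetic stand-in targets; real targets would be execution times
# measured on the FPGA board.
y = X @ rng.uniform(size=8) + 0.05 * rng.standard_normal(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Same style of metric the abstract cites: average prediction error.
print(f"MAPE: {mean_absolute_percentage_error(y_te, model.predict(X_te)):.1%}")
```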