Abstract

Deep convolutional neural networks (CNNs) provide state-of-the-art (SotA) results in a variety of machine learning (ML) applications, such as image classification and object detection. The trends towards higher-resolution input images and large-scale model structures have brought drastic performance improvements for these tasks. Yet, these trends also place increased stress on hardware resources. Recent work proposed an advanced layer fusion method relying on line-buffer depth-first (LBDF) processing to reduce memory usage and off-chip memory accesses in high-resolution CNN processing. However, when deploying such an LBDF approach for residual networks, the required external memory accesses or on-chip memory capacity increase drastically again, canceling many of the benefits of the depth-first approach. This paper therefore outlines and analyzes possible methods for handling residual connections (RC) in combination with LBDF. With the proposed ReLU-based compression and tiling techniques, we minimize the overhead of RC on on-chip memory requirements and external memory accesses. Moreover, a proposed hybrid strategy, which optimally combines the different methods across the residual blocks of a network, reduces inference energy cost by 24.7% for ResNet20 and 45.1% for SqueezeNet, and reduces execution latency by 19.4% and 46.7% respectively, compared to the best single method under certain on-chip memory budgets. In addition, compared to the SotA, our hybrid strategy in the best case consumes only 43% of the energy and 29.9% of the latency with the same on-chip memory, or requires only 27.4% of the on-chip memory to achieve the same performance.
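As an illustrative sketch only (not the exact scheme of the paper), the intuition behind ReLU-based compression of residual-connection data can be shown with a simple zero run-length encoding: because post-ReLU activations are non-negative and typically sparse, the feature-map lines buffered for a skip connection can be stored as (zero-run, value) pairs instead of dense arrays. All function names and the encoding below are assumptions made for illustration.

import numpy as np

def zrle_compress(line: np.ndarray):
    """Encode a 1-D post-ReLU activation line as (zero_run_length, value) pairs."""
    assert (line >= 0).all(), "expects post-ReLU (non-negative) activations"
    pairs = []
    zero_run = 0
    for v in line:
        if v == 0:
            zero_run += 1
        else:
            pairs.append((zero_run, float(v)))  # flush the zero run before a non-zero value
            zero_run = 0
    if zero_run:
        pairs.append((zero_run, 0.0))           # trailing zeros, marked with sentinel value 0
    return pairs

def zrle_decompress(pairs, length: int) -> np.ndarray:
    """Reconstruct the original activation line from (zero_run_length, value) pairs."""
    out = []
    for run, v in pairs:
        out.extend([0.0] * run)
        if v != 0.0:
            out.append(v)
    out.extend([0.0] * (length - len(out)))     # pad any implicit trailing zeros
    return np.array(out, dtype=np.float32)

# Example: a sparse post-ReLU line round-trips losslessly and needs far fewer stored entries.
line = np.maximum(np.random.randn(64).astype(np.float32), 0.0)
packed = zrle_compress(line)
assert np.allclose(zrle_decompress(packed, line.size), line)

In an LBDF setting, such a compressed representation would only need to hold the skip-connection lines on-chip until the corresponding output lines of the residual block are produced, which is where the memory-overhead reduction claimed in the abstract would come from.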
