Abstract

A common inefficiency in traditional Convolutional Neural Network (CNN) architectures is that they do not adapt to variations in inputs. Not all inputs require the same amount of computation to be correctly classified, and not all of the weights in the network contribute equally to generating the output. Recent work introduces the concept of iterative inference, enabling per-input approximation. Such an iterative CNN architecture clusters weights based on their importance and saves significant power by incrementally fetching weights from off-chip memory until the classification result is accurate enough. Unfortunately, this comes at the cost of increased execution time, since some inputs must go through multiple rounds of inference, negating the energy savings. We propose Cache Reuse Approximation for Neural Iterative Architectures (CRANIA) to overcome this inefficiency. We recognize that the re-execution and clustering built into these iterative CNN architectures unlock significant temporal data reuse and spatial value reuse, respectively. CRANIA introduces a lightweight cache+compression architecture customized to the iterative clustering algorithm, enabling up to 9× energy savings and speeding up inference by 5.8× with only 0.3% area overhead.
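A minimal sketch of the iterative-inference loop the abstract describes, under illustrative assumptions: a toy linear "network" stands in for the CNN, and the importance-ordered cluster decomposition and confidence threshold are hypothetical, not CRANIA's actual clustering or exit criterion. Weight clusters are fetched one at a time, inference is re-run after each fetch, and execution stops early once the top-class confidence clears the threshold.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def iterative_inference(x, weight_clusters, threshold=0.9):
    """Classify x by incrementally adding weight clusters (most important
    first) and re-running inference until the top-class confidence exceeds
    the threshold. Illustrates the per-input early exit of an iterative
    CNN architecture; the linear model here is a stand-in for the network."""
    w = np.zeros_like(weight_clusters[0])
    probs = None
    for cluster in weight_clusters:
        w += cluster                  # "fetch" the next weight cluster from off-chip memory
        probs = softmax(w @ x)        # re-run inference with the refined weights
        if probs.max() >= threshold:  # result accurate enough: stop fetching
            break
    return int(np.argmax(probs)), float(probs.max())

# Toy usage: a 4-class linear classifier on an 8-dim input, split into
# three importance-ordered pieces (a hypothetical decomposition).
rng = np.random.default_rng(0)
full_w = rng.normal(size=(4, 8))
clusters = [0.6 * full_w, 0.3 * full_w, 0.1 * full_w]
label, conf = iterative_inference(rng.normal(size=8), clusters)
print(label, round(conf, 3))
```

Easy inputs exit after the first cluster, while harder ones trigger the extra rounds of inference whose repeated weight traffic CRANIA's cache+compression design targets.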
