Convolutional neural networks (CNNs) demand high performance and energy efficiency for real-time inference, and costly off-chip memory accesses place an additional burden on their execution. To avoid off-chip accesses, we propose ALAMNI, a novel near-memory architecture that accelerates CNNs in the logic layer of the Hybrid Memory Cube (HMC). We exploit intra- and inter-vault parallelism to speed up the highly parallel CNN operations. ALAMNI replaces costly CNN multiplications with lookaside memory (LAM) based searches. An adaptive LAM update policy eliminates the overhead of data pre-profiling, making ALAMNI effective on unseen data. The ALAMNI controller keeps the most frequent triplets of weight (W), activation (A), and multiplication result (M), <W, A, M>, in the LAM to eliminate redundant computations. As an optimization, we incorporate bitmasking to raise the LAM hit rate and further amortize computation. We also study the relation between the degree of bitmasking and the classification-accuracy loss of popular ConvNets, and keep bitmasking as a reconfigurable feature of the ALAMNI units so that a desired classification accuracy can be achieved. Experimental results show substantial improvements in system performance and energy efficiency over both the baseline and state-of-the-art designs.
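To make the LAM idea concrete, the following is a minimal software sketch of a lookaside table of <W, A, M> triplets with bitmasked lookup and a frequency-based adaptive update. The table capacity, the number of masked low-order bits, the least-frequently-hit eviction rule, and the unsigned fixed-point operand assumption are all illustrative choices for this sketch, not the paper's exact hardware policy.

```python
from collections import Counter

class LookasideMemory:
    """Sketch of a LAM holding <W, A, M> triplets (illustrative, not the paper's design)."""

    def __init__(self, capacity=256, mask_bits=2):
        self.capacity = capacity              # number of <W, A, M> triplets held
        self.mask = ~((1 << mask_bits) - 1)   # clears mask_bits low-order bits
        self.table = {}                       # (W', A') -> M, with masked operands
        self.hits = Counter()                 # hit counts, used for eviction

    def multiply(self, w, a):
        # Assumes unsigned fixed-point operands; masking low-order bits makes
        # nearby operand pairs share an entry, raising the hit rate at the
        # cost of an approximate product (the accuracy trade-off in the text).
        key = (w & self.mask, a & self.mask)
        if key in self.table:                 # hit: reuse the stored product
            self.hits[key] += 1
            return self.table[key]
        m = key[0] * key[1]                   # miss: compute on masked operands
        if len(self.table) >= self.capacity:
            # Adaptive update: evict the least frequently hit triplet so the
            # LAM tracks the current data without any pre-profiling pass.
            victim, _ = min(self.hits.items(), key=lambda kv: kv[1])
            del self.table[victim], self.hits[victim]
        self.table[key] = m
        self.hits[key] = 1
        return m
```

In this sketch, `mask_bits` plays the role of the reconfigurable bitmasking knob: widening the mask raises the hit rate (fewer real multiplications) while increasing the approximation error that degrades classification accuracy.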